This morning the World Health Organization released downloadable, machine-readable data (yay!) on the number of Ebola cases at the county level (available here). This release is particularly special because it includes both the number of cases according to the situation reports and the counts as reported by the patient database. The patient database (also known as a line list) is usually considered the gold standard for outbreak data. Until this release, the public had no data from the database - the situation reports were the only resource.
The good news is that this data is immensely useful for epidemiologists, modelers, public health responders - pretty much everyone involved with Ebola work. The bad news is the situation reports are apparently fairly unreliable. Ideally the two data sources would match up very closely. This is not the case.
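Checking how well the two sources match is a quick pandas exercise. Here is a minimal sketch with made-up district names and counts (the real WHO file's column names will differ - these are assumptions for illustration):

```python
import pandas as pd

# Toy data standing in for the WHO release: one column of counts per source.
# District names and numbers are invented for this example.
sitrep = pd.DataFrame({"district": ["Bo", "Kenema"],
                       "cases_sitrep": [120, 300]})
patientdb = pd.DataFrame({"district": ["Bo", "Kenema"],
                          "cases_db": [95, 210]})

# Merge on district and compute the disagreement between the two sources
merged = sitrep.merge(patientdb, on="district")
merged["discrepancy"] = merged["cases_sitrep"] - merged["cases_db"]
```

If the sitreps were reliable, the discrepancy column would hover near zero; large, systematic gaps are exactly the bad news described above.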
It's disappointing that the sitreps are not what we thought they were. But ultimately I'm really glad that this information is being made available, because now we can adjust accordingly. I think we should commend the Ministries of Health and the WHO for releasing this data - I value open data and transparency very highly, even when it brings some surprising results.
[Graphics below the break]
One year ago, I complained about the state of public data. I specifically pointed out that over 100 years of Nationally Notifiable Disease Surveillance System data were trapped in PDF, basically unusable. Thankfully, Project Tycho has since come to the rescue by releasing NNDSS data from 1888 to 2013.
Not only can you retrieve machine readable data from their website, but they have an API! Although wonderful, APIs can sometimes be a bit of a pain to get to know, so I wrote a python wrapper to make life a little easier. I've named it pycho, and you can find it on github here.
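The wrapper pattern itself is simple: build the query URL from an API key and parameters, fetch, and parse the JSON. The sketch below is hypothetical - the base URL, resource names, and parameters are assumptions for illustration, not pycho's actual interface (see the github link for that):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

class TychoClient:
    """Minimal sketch of an API wrapper. The endpoint and parameter
    names here are hypothetical, not the real Project Tycho API."""

    BASE = "https://www.tycho.pitt.edu/api"  # hypothetical base URL

    def __init__(self, apikey):
        self.apikey = apikey

    def build_url(self, resource, **params):
        # Attach the API key and serialize parameters into a query string
        params["apikey"] = self.apikey
        return "{}/{}?{}".format(self.BASE, resource,
                                 urlencode(sorted(params.items())))

    def get(self, resource, **params):
        # Fetch the resource and parse the JSON response
        with urlopen(self.build_url(resource, **params)) as resp:
            return json.load(resp)

client = TychoClient("KEY")
url = client.build_url("cases", disease="measles")
```

A wrapper like this hides the URL bookkeeping so an analysis script can just ask for a disease and get back parsed data.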
I went on an Epic Quest today to find publicly-available line listings. I found a grand total of three machine readable ones (one of which is my own), and a couple more non-machine readable ones.
In honor of Open Access Week, here are some relevant posts I've published:
"Send me your data - PDF is fine," said no one ever - Learn how most data shares are useless, and how to share more effectively.
Let's make data a civic right - Here's why you should care about the abysmal state of public data sharing
Scholarly impact of open access journals - Did you know that two thirds of open access journals don't have publishing fees? Or that open access journals are catching up to traditional, closed journals in impact?
Why I support open access science - Here's why I care. Why do you?
Join the conversation with #oaweek on Twitter, or find an event that interests you.
Unrelated: Interested in epidemiology, python or MERS-CoV? Please check out my new project, a visualization of disease clusters and other contagions.
What if I told you a brand new public library is coming to your town. It's going to be really well stocked with great books - but you can't open them, you can only look at the covers. That is the current state of our public data right now. Here's why you should care.
There's a serious problem with the current state of shared data - it is almost completely unusable! Here are some ideas for sharing more effectively.
I often have a question I'd like to answer for which I know data are available. Most recently I wanted to look up the incidence (number of new cases) of various infectious diseases over the last decade. This should be easy - the CDC publishes the Morbidity and Mortality Weekly Report with just that. Well, the data are indeed available - but only in PDF. Why even bother with computers? They might as well mail around a printout. If I wanted to actually analyze it, I would first need to enter a decade's worth of data by hand. Ain't nobody got time for that.
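For contrast, here is what the same task looks like when data are shared in a machine-readable format. The numbers below are made up for illustration, but the point stands: if MMWR tables were published as CSV, loading a decade of counts would be one line instead of weeks of hand entry.

```python
import io
import pandas as pd

# Stand-in for a downloadable MMWR-style CSV; the values are invented.
csv_text = io.StringIO(
    "week,disease,cases\n"
    "1,pertussis,210\n"
    "2,pertussis,198\n"
)

# One line to load, then analysis can begin immediately
df = pd.read_csv(csv_text)
total = df["cases"].sum()
```

That is the entire gap between "available" and "usable": a PDF requires transcription, a CSV requires `read_csv`.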
This post was originally written on 1/26/2013 and updated on 3/18/2013.
After I wrote about the apparent decline in interest in open access/science, one commenter suggested that search volume may be declining as the concepts become more mainstream. Here are those trends again, without open science to obscure the lower search volume terms.
It’s a classic research conundrum - is the effect we are observing real?
I looked into it more using data specifically on open access and open science. I downloaded a list of open access journals from the Directory of Open Access Journals (DOAJ). I also downloaded a spreadsheet of 2011 impact data from Journal Metrics, an offshoot of Scopus that assesses journal impact. Journal Metrics provides two impact measures: Source Normalized Impact per Paper (SNIP) and SCImago Journal Rank (SJR). I will mostly be using SNIP score for this analysis.
According to their FAQ, SNIP “measures a source’s contextual citation impact. It takes into account characteristics of the source's subject field, especially the frequency at which authors cite other papers in their reference lists, the speed at which citation impact matures, and the extent to which the database used in the assessment covers the field’s literature. SNIP is the ratio of a source's average citation count per paper, and the ‘citation potential’ of its subject field. It aims to allow direct comparison of sources in different subject fields.”
All analyses were done in an ipython notebook and relied heavily on pandas. You can view the notebook here, and the regular .py code here. You can also download figures or obtain a doi at figshare.
There are quite a few more analyses available in the ipython notebook (no coding skills necessary, it's a webpage). Here is the link again.
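The core of this kind of analysis is a join between the DOAJ journal list and the Journal Metrics spreadsheet, then a group comparison of SNIP scores. The sketch below uses made-up journal titles, scores, and column names (the real files' columns differ), purely to show the shape of the computation:

```python
import pandas as pd

# Toy stand-ins for the two downloads; titles, scores, and column
# names are invented for illustration.
doaj = pd.DataFrame({"title": ["PLOS ONE", "PeerJ"]})
metrics = pd.DataFrame({
    "title": ["PLOS ONE", "PeerJ", "Cell"],
    "snip": [1.0, 0.9, 6.0],
})

# Flag journals that appear in the DOAJ list as open access,
# then compare mean SNIP between the two groups
metrics["open_access"] = metrics["title"].isin(doaj["title"])
mean_snip = metrics.groupby("open_access")["snip"].mean()
```

In practice the hard part is matching journal titles between the two sources (spelling and punctuation rarely agree), but once the flag is in place the comparison is a one-line groupby.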
A comparison of "open access", "open source", and "open science" shows that interest in the movement, as measured by search volume, has been in steady decline since 2005.
"Open access" and "open source" are tanking. "Open science" is mostly holding steady, but its search volume is tiny compared to the other two terms. Even Aaron Swartz's death, which Times Higher Education deems (somewhat distastefully) an "unexpected martyr", has done little to turn the tide.
These data show a clear trend, but I still wonder what is going on. Could it be that open access advocates are a loud minority, while interest in the movement stagnates among the silent majority?
Or worse, could it be that prospective participants have given OA a try, but deemed it unworkable? Given my own less than pleasurable experiences with data.gov, I wouldn't blame anyone who struggled to maintain enthusiasm after wrestling with an .xls file that uses white space to denote category hierarchies. Especially when there are trailing white spaces, too... (I'm looking at you, American Time Use Survey). Or perhaps it was data from the CDC, not available for download, that soured some on the idea.
Clearly those are not examples of OA done right (and they are both specifically open government initiatives), but something somewhere seems to be causing a decline in interest in OA.
How do we reverse the trend, and bring open access into the mainstream?
A lot has been written about open access in the last week, and although the circumstances of the publicity are heart wrenching, I'm glad that the movement is receiving the attention it deserves. Though the justification for open access science has already been eloquently covered by many others, my nascent blog is as good a place as any to enumerate my reasons for supporting OA.
Because most science is publicly funded, it only makes sense that results be made available to the public. Even projects that receive private funding benefit from state and federal dollars if the research is conducted at a public institution. How unfortunate that science writers, non-academic researchers, and the interested public all contribute their tax dollars and would benefit from access to published findings, yet currently lack it. As a researcher in training, I dislike the idea of all my hard work being hidden away from those who need it.
Even those who are in academic research can't always get what they need. Although most educational institutions subscribe to a variety of academic journals, some are always missing. Virginia Tech, for example, does not subscribe to online access to the New England Journal of Medicine. Epidemiologists on campus despair. It's usually possible to order articles through the library, but you have to know exactly which article you want, and it takes some time for the request to be fulfilled. Given that you can't know how useful an article will be without having read it, most don't bother with the hassle.
Which brings me to the most important reason of all: science is done incrementally, with each research effort drawing from and building on the work that came before it. It becomes that much more difficult to take the next step forward without access to the research corpus. Science cannot progress at an optimal pace when foundational work is behind a paywall.