During the summer I began converting Ebola situation reports (sitreps) from PDF to a text format. The reports had critical information like the number of new Ebola cases, how many people were in treatment, details about contact tracing, and more. I needed the data for my research on modeling infectious disease, and since the data were completely unusable in PDF, every day I did the painful manual conversion into a standardized machine-readable format*.
I figured that if I needed that data, other people probably did too, so I pushed each day's sitreps to GitHub. Turns out I was correct on that point, and by October I found myself maintaining a data repository that received a couple thousand hits each day and had a cadre of active contributors. Although I expected some people would find it useful, I had no idea it would receive the attention it did. That shortsightedness meant I had not done a lot of the work needed to make the repository maximally useful.
Here is what I would do if I could start over:
1. Use GitHub (this one I got right). I like Figshare and have used it to publish several data sets, but it's not convenient for daily data updates or for accepting crowdsourced contributions. Pushing updates to GitHub is super simple, and there are well-defined mechanisms for users to contribute. It is a little tricky for non-technical users, but the pros outweigh the cons.
2. Include a license. There are a lot of options out there - probably too many. I did not have a license at first, but eventually contributors helped me pick a super simple one:
This data release is licenced as follows: You may copy and redistribute the data. You may make derivative works from the data. You may use the data for commercial purposes. You may not sublicence the data when redistributing it. You may not redistribute the data under a different license. Source attribution on any use of this data: Must refer source.
3. Generate a citation, and include it in the README. I did not do this at all, and it left people scrambling to figure out how to cite the data. GitHub now has a mechanism to provide digital object identifiers, but apparently DOIs are for a specific release, making them imperfect for evolving data. If I were to go back in time, I would probably just generate my own citation format (which is not ideal and I'll probably get flak for it, but it seems like a simple, workable solution).
4. Make it really simple for people to contribute. I provided links to the source data and templates for the machine-readable data format, but would-be contributors were still really confused. What I would do next time is write a script to download the PDFs from the source and push them to GitHub. I would also make a screencast and a detailed writeup of my exact process, leaving nothing to the imagination. Expecting contributors to get the gist of it from my vague explanations did not work out well, and I'm sure it deterred some people.
4b. One outstanding problem with the process described in 4 is that it is not clear which data have already been converted and which remain undone. A digital platform where contributors can claim a PDF and have it removed from the pool would be excellent. Ideally the platform would also have several submission mechanisms, so that non-GitHub users have options too.
5. Make your limits clear. I received a lot of emails from people asking me for different slices of the data. They would want it as one big file, or only certain columns, etc. I turned them all away, until someone suggested I change my README to ask that people not write to me with these requests. That worked, and I should have done it sooner. In hindsight though, I could also have posted a pricing scheme to accommodate these requests and receive some remuneration for my efforts.
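On the problem in 4b: even without a dedicated claiming platform, a small script could at least list which PDFs still lack a converted file. Here is a minimal sketch in Python. It assumes, purely for illustration, that every converted file shares its PDF's base name (say `2014-10-02.pdf` and `2014-10-02.csv`); the function name and that naming convention are hypothetical, not the layout the repository actually used.

```python
from pathlib import PurePath


def pending_conversions(pdf_names, converted_names):
    """Return the PDFs that have no converted file with the same base name.

    Both arguments are iterables of file names. The shared-base-name
    convention is an assumption made for this sketch.
    """
    # Base names (file name minus extension) of everything already converted.
    done = {PurePath(name).stem for name in converted_names}
    # Keep only the PDFs whose base name has no converted counterpart.
    return sorted(name for name in pdf_names if PurePath(name).stem not in done)


if __name__ == "__main__":
    pdfs = ["2014-10-01.pdf", "2014-10-02.pdf", "2014-10-03.pdf"]
    converted = ["2014-10-01.csv", "2014-10-03.csv"]
    for name in pending_conversions(pdfs, converted):
        print(name)
```

The output of a script like this could be pasted into the README each day, so would-be contributors can see at a glance what is still open. It does not solve the claiming problem (two people can still pick the same PDF), but it removes the guesswork about what is left.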
In the end, I had to stop updating the repo in December so I could focus on finishing my dissertation. It was a wonderful experience, and perhaps one of the biggest community contributions I've made in my nascent scientific career. I met a lot of people, learned so much, and would not hesitate to do it again. But next time, I would do it better!
*Yes, Tabula, scraping, etc. Been there, done that. Doesn't always work for all situations. No really, trust me on this.