By Deborah Fitchett, Digital Access Coordinator at Lincoln University.
The movement to open up research data is gaining momentum. Both publishers and funders are starting to require researchers to publish their data at the same time as the results and conclusions they’ve derived from it. Most recently, the National Science Challenges’ Request for Proposals (pdf) requires fundees to explain how they will comply with the New Zealand Government Open Access and Licensing Framework and provide for access to and re-use of data generated (p.13).
Because publication of research data is still relatively new, it looks complicated and scary. And it can be complicated if you want it to be. A really large project might make it worth creating its own website to host the resulting dataset(s) and present them in a custom set of interfaces. (The National Science Challenges let you apply funding to provide for access to and re-use of data.) Look at GeoNet (on the off-chance you haven’t already!) for an example: it provides raw data from the quake drums, and both initial and confirmed calculations about the size and location of quakes. Data is presented pre-digested on maps for the casual visitor, and in open data formats for the more sophisticated reuser.
But publication can also be as easy as spending a minute creating a figShare account and another minute uploading your file and adding a title and subject keywords. Click a button and your data has a permanent home and DOI.
Perhaps midway are subject-specific repositories (over 600 listed on re3data alone): these tend to have more metadata/documentation requirements or other forms of quality control. Dryad has recently instigated a small data publishing charge. A number of journals are paying this charge for their authors: perhaps for the good of science, perhaps recognising that the data publication citation advantage is good for their impact factors.
Publishing is a good start, but it’s not the same as making the data open for reuse. Peter Desmet’s “illegal bullfrogs” example demonstrates how we lose out when data reuse is restricted, whether by intent or neglect.
Fortunately, a lot of data publication venues support or require published data to be open data. FigShare and Dryad, for example, both require a Creative Commons Zero licence. This lets people use the data in any application without even the need for attribution, which is useful if their application pulls together data from dozens, hundreds, or thousands of sources.
It’s important to note that this doesn’t affect the scholarly norm of citing your sources, any more than the expiration of copyright means you no longer need to cite Aristotle, Murasaki, or Marie Curie. DataCite, among many others, is working on data citation standards, but the main principle is that data should be cited just as articles and books are.
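To make the principle concrete, here is a minimal sketch of building a one-line data citation. It assumes a "Creator (Year): Title. Publisher. Identifier" pattern of the kind DataCite has recommended; the exact punctuation, the function name `format_data_citation`, and every value in the example (names, title, repository, DOI) are hypothetical placeholders, not a real dataset.

```python
def format_data_citation(creators, year, title, publisher, identifier):
    """Build a one-line citation: Creators (Year): Title. Publisher. Identifier"""
    names = "; ".join(creators)
    return f"{names} ({year}): {title}. {publisher}. {identifier}"


if __name__ == "__main__":
    # All values below are made-up placeholders for illustration only.
    print(
        format_data_citation(
            creators=["Researcher, A.", "Colleague, B."],
            year=2013,
            title="Example stream survey counts",
            publisher="Example Repository",
            identifier="https://doi.org/10.1234/example",
        )
    )
```

Check the repository or journal you are using for its preferred citation style; most repositories display a ready-made citation alongside the DOI.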
So publishing data is easy, and publishing data openly is easy. Publishing open data well is the hard thing — just as it is for any human skill or endeavour. You start with the basics and level up according to your capacity and needs. The ODI Open Data Certification process is a friendly way both to recognise what level you’re at and to let you know what direction you can develop in next.
You might, for example, start by publishing in a proprietary format like Excel, then later level up to publishing in the open CSV format, or machine-readable XML, or a chart with the data embedded behind it using a tool like Datawrapper, or a bundle of formats for different uses plus interactive visualisations.
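As a rough illustration of that first level-up, the sketch below writes the same small table out as CSV and as simple XML using only Python's standard library. The field names, the sample rows, and the `<dataset>`/`<record>` element names are invented for the example; a real project would read its rows from the original spreadsheet and might follow a discipline-specific XML schema instead.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical observations standing in for a spreadsheet's rows.
FIELDS = ["site", "date", "count"]
ROWS = [
    {"site": "A", "date": "2013-05-01", "count": 12},
    {"site": "B", "date": "2013-05-01", "count": 7},
]


def to_csv(rows, fields):
    """Serialise rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_xml(rows, fields):
    """Serialise rows as a simple XML document, one <record> per row."""
    root = ET.Element("dataset")
    for row in rows:
        record = ET.SubElement(root, "record")
        for name in fields:
            ET.SubElement(record, name).text = str(row[name])
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_csv(ROWS, FIELDS))
    print(to_xml(ROWS, FIELDS))
```

Both outputs carry the same data; CSV is the easier format for most people to open, while XML leaves room for nested structure.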
Your first time, you might realise you don’t have consent from your human participants to publish their data, so limit yourself to publishing only the data gathered by other means — then on your next research project you might plan for data publication right from the ethics approval stage.
Planning for data
The words “Data Management Plan” strike fear in the researcher’s heart, conjuring up the spectre of voluminous forms full of bureaucratese and technical specifications. But all it means is to ask yourself:
- what data you intend to gather;
- how you’ll analyse and document it;
- what ethical questions it raises and how you’ll deal with those;
- how you’ll store it safely and securely while you’re carrying out your research;
- and what you’ll do with it afterwards — whether to publish all or part of it, archive it privately for a period, or in some cases destroy it.
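The five questions above can be kept as a tiny checklist that tells you what you have not thought about yet. This is only a sketch: the short keys, the `DataManagementPlan` class, and the sample answer are hypothetical, and a funder's or institution's template will ask for more detail.

```python
from dataclasses import dataclass, field

# The five questions from the list above, keyed by short hypothetical names.
QUESTIONS = {
    "gather": "What data do you intend to gather?",
    "analyse": "How will you analyse and document it?",
    "ethics": "What ethical questions does it raise, and how will you deal with them?",
    "storage": "How will you store it safely and securely during the research?",
    "afterwards": "What will you do with it afterwards (publish, archive, destroy)?",
}


@dataclass
class DataManagementPlan:
    answers: dict = field(default_factory=dict)

    def answer(self, key, text):
        """Record an answer to one of the known questions."""
        if key not in QUESTIONS:
            raise KeyError(f"unknown question: {key}")
        self.answers[key] = text.strip()

    def unanswered(self):
        """Return the questions that still have no (non-empty) answer."""
        return [q for key, q in QUESTIONS.items() if not self.answers.get(key)]


if __name__ == "__main__":
    plan = DataManagementPlan()
    plan.answer("gather", "Monthly invertebrate counts from four stream sites.")
    for question in plan.unanswered():
        print("Still to decide:", question)
```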
The UK’s Digital Curation Centre’s 2009 checklist for a data management plan (pdf) consisted of over 80 questions, but its most recent version (pdf) has only 24 (or just 13, depending on how you count). The idea is simply to make sure you’ve thought about these things before beginning your project, so you don’t get caught by surprise when the journal you submit to at the end of it requires you to publish your data.
There’s no such thing.
The best data
Ranganathan’s Five Laws of Library Science could easily be reapplied here, beginning: Data is for use.
The best data is the data that someone finds when they need it. They might have to convert it or tidy it or reanalyse it — but if they have the data they can do all these things.
If you can thoroughly document your data, do; but if not, don’t let that stop you putting your data up. If you can convert it into an open machine-readable format, do; but if not, don’t let that be why it languishes on your hard drive. If you can publish all your data under a Creative Commons Zero licence, do; but if not, publish some of it, under whatever licence you can.
Every dataset you publish, no matter how small or incomplete or imperfect, is a wheel someone else doesn’t have to reinvent. Start small, but start; and keep the wheels of science turning.