CMS: Understanding the data and code publication process

Since 2016, CMS has been helping the Centre’s researchers and students publish their data using the NCI data services. We also recently started a software collection on the Zenodo platform. We wanted to demystify the publication process and highlight its benefits to researchers. The best way seemed to be to ask researchers to talk about their own experiences of publishing their data and code at NCI and Zenodo with help from the CMS team. Thank you to Sanaa Hobeichi, Steefan Contractor, Annette Hirsch and Nic Pittman for taking the time to answer our questions at short notice.

What did you publish?

Sanaa:
DOLCE, a combination of six evapotranspiration products
LORA, an aggregated runoff product
CLASS, a monthly, 0.5-degree, global dataset of the water and energy budget variables

Steefan: Two versions of REGEN, a new global land-based daily precipitation dataset.

Annette: Two GitHub repositories containing the analysis code required for a publication.

Nic: The source code and source data, which can be used to reproduce a number of figures in my paper. I also published a final reprocessed dataset of chlorophyll estimates for the Tropical Pacific Ocean.

Why did you publish?

Steefan: To make it available to other researchers. We developed a new rainfall product that required a lot of work, and I wanted a way for others to be able to access it. I personally feel it’s very important, as good research practice, to make available the data and the code you used to produce your research, with proper documentation attached. Having a DOI also helps to identify your dataset and get recognition for it.

Annette: AGU journals now require the analysis code to be published with its own DOI to comply with the new FAIR rules. The journal wouldn’t even move the paper into production until this was complete, and statements in the acknowledgements similar to ‘contact the corresponding author’ will no longer be accepted!

Nic: Most journals require accompanying datasets to be published with papers. I decided to publish both the source code and a usable dataset for other scientists/users. This helps the reproducibility of my science and will hopefully help other people to use these products. I might even get more citations out of publishing easy-to-use code and data.

How was the publishing process?

Annette: So much easier than I expected. Basically, just create the GitHub repo, upload your code, write a README and ask CMS to publish via Zenodo!

Nic: The process to get these datasets published was surprisingly simple. Firstly, I made sure that the datasets were in the formats I wanted, that I was comfortable with publishing all the code and data, and that permissions had been received and cited correctly. Paola then looked over the source code and NetCDF files to check that there were no obvious mistakes and that they were of an acceptable quality to be published. This included some minor CF convention changes to the NetCDF file variables. The GitHub source code was then released and published through Zenodo, and the larger dataset was stored on NCI, had a DOI minted, and is now available online through THREDDS.
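As an illustration of the kind of CF convention check mentioned above, here is a minimal sketch in Python. It is not the procedure the CMS team actually ran; it only assumes that each variable should carry a `units` attribute and at least one of `standard_name` or `long_name`, which the CF conventions ask for. The function name and the example variables are hypothetical, and the attributes are held in plain dictionaries so the sketch runs without a NetCDF library.

```python
from typing import Dict, List, Mapping


def missing_cf_attributes(
    variables: Mapping[str, Mapping[str, str]]
) -> Dict[str, List[str]]:
    """Report, per variable, which descriptive CF attributes are absent.

    `variables` maps a variable name to its attribute dictionary.
    Only variables with at least one missing attribute are returned.
    """
    problems: Dict[str, List[str]] = {}
    for name, attrs in variables.items():
        missing: List[str] = []
        # CF expects units on variables that represent dimensional quantities.
        if "units" not in attrs:
            missing.append("units")
        # CF recommends at least one of these two descriptive names.
        if "standard_name" not in attrs and "long_name" not in attrs:
            missing.append("standard_name or long_name")
        if missing:
            problems[name] = missing
    return problems


# Hypothetical attribute sets, standing in for what a NetCDF file would hold.
example = {
    "chlorophyll": {"units": "mg m-3", "long_name": "chlorophyll concentration"},
    "sst": {"long_name": "sea surface temperature"},
    "flag": {},
}

report = missing_cf_attributes(example)
for var_name, missing in report.items():
    print(f"{var_name}: missing {', '.join(missing)}")
```

With a real file, the attribute dictionaries would be read from the NetCDF variables rather than typed by hand; the check itself would stay the same.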

Why did you choose to publish with the Centre rather than with your institution?

Sanaa: I had to publish with UNSW RDA storage, but there are no quality checks run on the files there, whereas I know that when publishing in the CLEX collections my files will be checked and more valuable information will be added to them. I use a lot of datasets and I appreciate it when they are well documented.

Steefan: At the start, I wasn’t aware I could publish with UNSW, but looking back now, the publication process at NCI is a lot more mature. The data management plan (DMP) is detailed but easy to fill out, and it makes you think about the information you are providing.

Nic: After a little research, it appeared easier to publish through NCI and Zenodo rather than through my institution. These datasets now have a larger reach and visibility because of this choice. Paola also made it very easy to make the data available online.

What did you get out of the process?

Steefan: The CMS team helps you to transform the first draft of your data management plan and your files into something that is user-friendly and satisfies NCI requirements. I found this process very valuable since I want others to be able to use my data.

Nic: I have learned a lot about reproducible science by publishing my own source code and data. It can be frightening and challenging to open yourself up to criticism. I was a little scared of the possibility of minor bugs being found by others checking my work; however, this makes for more robust science in the end. I will definitely organise my code better during the analysis phase next time, to make the publishing process even more streamlined and reproducible for my future research.