How to archive research data from a modeling study

Why is it important to archive data properly? 

Good research practice requires that the data and methods described in a published article remain accessible after publication. There are several reasons to keep a secure copy of all relevant data. With the model source, input files, and model output, it is possible to show that the data is genuine and not fabricated. Years after publication, the authors or other scientists may also need part of the data or scripts for a follow-up study. A secure archive can prevent data loss caused by hardware failure or human error. Even while conducting the research, it is a good idea to keep archiving in mind and to document and organize data files and scripts accordingly. Placing the data of a completed study in an archive also frees space on more expensive storage systems such as /stornext/field/ on Voima.

What to archive

The exact list of items to archive depends on the specific study, but some general guidelines can be given for archiving the data of a modeling study. This list is written with ECHAM-HAMMOZ in mind.

  • A general README file of the archived data set that explains what is included. A sample README file is provided at the end of this document.
  • Additional README files can be placed in the subfolders with the actual data files to provide more information (metadata).
  • Input files required to make the model runs. With ECHAM-HAMMOZ, it is probably overkill to save the standard input files for each study, as they take a lot of space. Instead, make only symbolic links to the standard input files so they can be found easily if simulations have to be repeated later. Archive only the input files that are specific to your study. If these study-specific input files are currently used by others and are available elsewhere on Voima, it is sufficient to add symbolic links pointing to them.
  • Scripts used for post-processing, other data analysis, and figures. Ideally, the scripts used to make the figures in the publication should be clearly identified. Again, if you want to save disk space and the scripts are in current use and available elsewhere on Voima, it is sufficient to add symbolic links pointing to the scripts.
  • If the model source code is in a model repository (e.g. HAMMOZ Redmine), document the model revision in the README file. If the model used is not in a repository, archive the source code of the model version(s) used in the study.
  • Working directories of the model runs. These automatically contain links to input files, model run logs, first-order post-processing logs of model output, namelists, and other information that may be needed in the future. The working directories also take fairly little space, so it is easiest to archive them entirely, provided that they contain no redundant input files or unnecessary rerun files.
  • Scripts from the co-authors, so that all relevant information about the study is archived in the same place
  • If disk space allows, archive post-processed model output data and their derivatives.
  • Tables or other forms of data that are only partly present in the manuscript
  • Any figures that are not present in the manuscript, but are either referenced in the text or otherwise important to the study 
  • Data or figures for the co-authors
  • Anything else that is crucial for reproducing the study or might be useful for follow-up studies by you or others
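
The symbolic links suggested above for shared input files and scripts can be created by hand, but a small script makes the step repeatable. The following is only a minimal sketch in Python: the function name link_shared_files and any paths passed to it are hypothetical and not part of an existing tool on Voima.

```python
from pathlib import Path


def link_shared_files(sources, archive_dir):
    """Create symbolic links in archive_dir that point to shared files.

    Hypothetical helper for illustration. Missing source files are skipped
    so that the archive never ends up with a dangling link, and existing
    entries in archive_dir are never overwritten.
    """
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for src in map(Path, sources):
        if not src.exists():
            continue  # nothing to link to
        link = archive_dir / src.name
        if link.is_symlink() or link.exists():
            continue  # keep whatever is already archived under this name
        # Link to the absolute path so the link stays valid if the
        # archive folder itself is moved.
        link.symlink_to(src.resolve())
        created.append(link)
    return created
```

The function returns the links it created, so the list can be pasted into the README of the archive. Keep in mind that a symbolic link only records a location: if the shared file is later moved or deleted, the link breaks, so files that exist nowhere else should be copied instead.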

Archive data on stornext

A folder /stornext/field/kuopiodata was created on Voima to store common data for the ISI modelling group. For the atmospheric and ocean modelling group (FMI Helsinki), a folder /stornext/field/hel551data was created.

Note, however, that files stored there count against the file owner's quota. To make a secure archive accessible and easy for others to find, create a file named LastName_et_al_PublicationYear.txt in /stornext/field/kuopiodata/publication_archive. In the text file, write the full reference of the publication and the exact location of the data archive for that publication (see below for details).
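
As an illustration of the pointer file described above, the sketch below writes a text file following the LastName_et_al_PublicationYear.txt naming pattern. The function name write_pointer_file and the layout of the file contents ("Reference:" and "Archive location:" headings) are assumptions made for this example; the guide itself only asks for the full reference and the exact archive location.

```python
from pathlib import Path


def write_pointer_file(index_dir, last_name, year, reference, archive_location):
    """Write LastName_et_al_PublicationYear.txt into index_dir.

    Hypothetical helper for illustration. The two headings in the file
    body are an assumed layout, not a required format.
    """
    path = Path(index_dir) / f"{last_name}_et_al_{year}.txt"
    if path.exists():
        # Refuse to overwrite a pointer file that someone already created.
        raise FileExistsError(f"{path} already exists")
    text = (
        "Reference:\n"
        f"{reference}\n"
        "\n"
        "Archive location:\n"
        f"{archive_location}\n"
    )
    path.write_text(text, encoding="utf-8")
    return path
```

For the group archive, index_dir would be /stornext/field/kuopiodata/publication_archive, and archive_location would be the folder that actually holds the tar files.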

There is also a tape drive available on stornext. It offers a cheaper alternative to /stornext/field for long-term data storage. To get a folder on the tape drive, contact Lasse Jalava. The tape drive is not backed up. Therefore, to create a secure, duplicated data archive, ask for two different locations on the tape drive. When using the tape drive, there are a few things to keep in mind:

  • Combine files into tar archives. This is especially important for small files such as text, figures, or scripts. There is no need to compress the files as this is done automatically on this drive.
  • If you delete a file on the tape drive, the space is not freed. Therefore, try to copy the data to the tape drive only once and do not modify it afterwards.
  • Keep the number of individual files low in order to make data retrieval faster.
  • Keep the size of a single tar archive below 99 GB, as larger files are saved across several tapes. However, a file size of up to 198 GB is still manageable if absolutely necessary.
  • Ask Lasse Jalava for further instructions if necessary.
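
The tar-related points above can be sketched in Python with the standard tarfile module. This is a minimal sketch under stated assumptions, not an official procedure: the helper names plan_batches and write_tars are hypothetical, "99 GB" is read as 99 × 10^9 bytes, and the size check counts only file contents, so the small overhead of the tar headers is ignored.

```python
import tarfile
from pathlib import Path

# Assumption: "99 GB" is interpreted as decimal gigabytes.
MAX_TAR_BYTES = 99 * 10**9


def plan_batches(paths, max_bytes=MAX_TAR_BYTES):
    """Greedily group files so that each group holds at most max_bytes.

    A single file larger than max_bytes still gets a group of its own.
    """
    batches, current, current_size = [], [], 0
    for path in sorted(Path(p) for p in paths):
        size = path.stat().st_size
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches


def write_tars(paths, out_dir, prefix, max_bytes=MAX_TAR_BYTES):
    """Write one uncompressed tar file per batch and return the tar paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    tar_paths = []
    for index, batch in enumerate(plan_batches(paths, max_bytes), start=1):
        tar_path = out_dir / f"{prefix}_{index:02d}.tar"
        # Mode "w" writes a plain tar without compression, since the
        # tape drive compresses data automatically.
        with tarfile.open(tar_path, "w") as tar:
            for path in batch:
                # Files are stored under their base name only, so the
                # names within one batch must be unique.
                tar.add(path, arcname=path.name)
        tar_paths.append(tar_path)
    return tar_paths
```

It is probably safest to build the tar files on regular disk first and copy them to the tape drive in one go, since space on the tape drive is not freed when a file is deleted.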

Sample README file

Author: Antti-Ilari Partanen

Date: 25/02/2015

This folder contains the archived data and scripts of the following publication:

Partanen, A.-I., Dunne, E. M., Bergman, T., Laakso, A., Kokkola, H., Ovadnevaite, J., Sogacheva, L., Baisnée, D., Sciare, J., Manders, A., O'Dowd, C., de Leeuw, G., and Korhonen, H.: Global modelling of direct and indirect effects of sea spray aerosol using a source function encapsulating wave state, Atmos. Chem. Phys., 14, 11731-11752, doi:10.5194/acp-14-11731-2014, 2014.

Monthly mean files of the ECHAM-HAMMOZ simulations are stored in files monthlymeans_EXP.tar where EXP is the name of each experiment. Temporal and spatial averages and other post-processed files are stored in files final_EXP.tar. 

The names of the simulations in the data correspond to those in the publication as follows:

Name in the data    Name in the publication
ctrl                control
orig-salt           default-salt
ossa-highdep        ossa-high-ics
ossa-highflux       ossa-highflux
ossa-lowdep         ossa-low-ics
ossa-lowflux        ossa-lowflux
ossa-ref            ossa-ref
ossa-salt           ossa-salt

Other files are explained below.

aerocom.tar

Anthropogenic aerosol emission files used in the simulations.

DataTables.tar.gz         

Contains tabular data of forcings, in-situ comparison, AOD comparison with PARASOL data, and sea spray budget.

ossa-matlab-scripts.tar.gz 

Matlab scripts used to process and visualize data, including scripts to make the publication figures.

voima-scripts.tar

The 'prepare' folder contains the scripts used to initialize the working directories of all the runs (i.e., generating run and post-processing scripts, and creating links to input files). The package also contains the scripts used for the initial processing of wave height and chlorophyll-a data. Further processing was done using the Matlab scripts in ossa-matlab-scripts.tar.gz.

chlorophyll-a.tar

Final and interpolated chlorophyll-a data used for model input. The dates on file names refer to the start date of each 8-day cycle.

ossa-analysis-data.tar.gz

In-situ and PARASOL data used in the study.

post-scripts.tar

Scripts used to calculate monthly mean files from model output and to do further post-processing, such as calculating means over the whole simulation period.

rundata.tar

Working directories of each simulation. These contain run and post-processing logs, links to input files, model binary, namelists, and restart files to rerun the experiments from 1.1.2006.

waveheight.tar

Final and interpolated wave height data used as model input.