Leibniz MMS Days 2023 - Abstract

Eggert, Anja

Reproducible bioinformatics workflows: A case study with software containers and interactive notebooks

(joint work with Pål O. Westermark)
Reproducible specification of bioinformatics workflows is challenging given their complexity. We developed a new statistical method in the field of circadian rhythmicity that makes it possible to rigorously determine whether measured quantities such as gene expression are not rhythmic. The statistical method itself is implemented in the R package "HarmonicRegression", available from the CRAN repository. However, the bioinformatics workflow is much larger than the statistical test. For instance, to validate the statistical method, we simulated data sets of 20,000 gene expression profiles over a large range of parameter combinations (e.g. sampling interval, fraction of rhythmic genes, number of outliers). We now demonstrate the use of Jupyter notebooks to document and distribute our statistical method and its application to both simulated and experimental data sets. The notebook runs inside a Docker software container, which ensures complete long-term reproducibility of the workflow.
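
As an illustration of the kind of simulation step described above, the following is a minimal sketch in Python, not the authors' code and not the API of the "HarmonicRegression" R package. It simulates expression time series for a given sampling interval and fraction of rhythmic genes, then fits a single-harmonic (cosine) model by ordinary least squares. All function names, the 24-hour period, the 48-hour duration, the noise level, and the default amplitudes are assumptions chosen for illustration; outliers and the test for non-rhythmicity itself are not modelled here.

```python
import math
import random


def simulate_gene(times, period=24.0, amplitude=1.0, phase=0.0,
                  mesor=5.0, noise_sd=0.2, rng=None):
    """Simulate one expression profile: mean level + cosine + Gaussian noise."""
    rng = rng or random.Random(0)
    return [
        mesor + amplitude * math.cos(2 * math.pi * t / period - phase)
        + rng.gauss(0.0, noise_sd)
        for t in times
    ]


def simulate_dataset(n_genes, fraction_rhythmic, sampling_interval,
                     duration=48.0, seed=1):
    """Simulate n_genes profiles; the first share of genes is rhythmic."""
    rng = random.Random(seed)  # fixed seed so the simulation is repeatable
    times = [i * sampling_interval
             for i in range(int(duration / sampling_interval))]
    n_rhythmic = round(n_genes * fraction_rhythmic)
    data = []
    for g in range(n_genes):
        amplitude = 1.0 if g < n_rhythmic else 0.0  # amplitude 0: not rhythmic
        phase = rng.uniform(0.0, 2 * math.pi)
        data.append(simulate_gene(times, amplitude=amplitude,
                                  phase=phase, rng=rng))
    return times, data


def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with pivoting."""
    n = 3
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / m[r][r]
    return x


def fit_harmonic(times, values, period=24.0):
    """Least-squares fit of y = m + a*cos(wt) + b*sin(wt), w = 2*pi/period."""
    rows = [[1.0,
             math.cos(2 * math.pi * t / period),
             math.sin(2 * math.pi * t / period)] for t in times]
    # Normal equations (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
           for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, values)) for i in range(3)]
    m, a, b = solve3(xtx, xty)
    return {"mesor": m,
            "amplitude": math.hypot(a, b),
            "phase": math.atan2(b, a)}


if __name__ == "__main__":
    times, data = simulate_dataset(n_genes=5, fraction_rhythmic=0.4,
                                   sampling_interval=4.0)
    for profile in data:
        print(fit_harmonic(times, profile))
```

In a notebook, a fixed seed such as the one used here makes the simulated data repeatable between runs, which is one small ingredient of the reproducibility the container is meant to provide.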