Methodological School - 3 R’s of Trustworthy Science
Margherita Calderan
May 18, 2026
What is reproducible science?
Reproducibility can be considered the most fundamental prerequisite for replication in science.
“…obtaining consistent results using the same input data, computational steps, methods, and code, and conditions of analysis.” National Academies of Sciences, Engineering, and Medicine (2019)
This means that someone else, or even you in the future, can reproduce your results from your materials: your data, your code, your documentation.
Keys to reproducible science
Data: organize, document, and share your datasets.
Code: write analysis scripts that are clean, transparent, and reusable.
Literate programming: combine code and text in the same document, so your reports are dynamic and replicable.
Version Control and Sharing: track changes, collaborate, and make your work openly available using tools like GitHub and OSF (and/or Zenodo).
Our job is hard
Running experiments
Analyzing data
Managing trainees
Writing papers
Responding to reviewers
Reproducibility helps!
Organizes your workflow.
Saves time by documenting steps.
Builds trust in your findings.
Enables others to reproduce and extend your work.
Outline
1. Data
2. Code
3. R projects
4. Literate Programming
5. Version Control
Data
Data types in research
Raw Data: Original, unprocessed (e.g., survey responses).
Processed Data: Cleaned, digitized, or compressed.
Analyzed Data: Summarized in tables, charts, or text.
Collaborative: Add collaborators and manage projects
Citable: Every file gets a unique URL for citing, project gets a DOI.
Comprehensive: Automate version control, preregister research, and share preprints.
Long-term: Guaranteed 50+ years of read access hosting
GitHub Integration: Easily preserve your GitHub repositories.
Bad data sharing example
Imagine this scenario: you read a paper that seems really relevant to your research. At the end, you’re excited to see they’ve shared their data on OSF. You go to the repository, and there’s one file…
You download it, open it, and you see this…
        x1        x2 x3 x4         x5         x6         x7
 0.3981105 13.912435  a  0 -0.6775811  0.8759740 -0.2051604
-0.1434733  1.093743  c  0  0.7055193  0.2521987  1.8816947
-0.2526000  4.898035  c  0  0.4744651 -0.5628840  0.3245589
-1.2272588 14.717053  b  0 -0.5132792 -1.1368242 -0.1355150
-0.4360417  8.547025  c  1 -0.1736804 -0.7120962 -1.2714320
What do these variables mean? What’s x3?
What do 0 and 1 represent?
How are missing values coded?
Is x6 a z-score or raw data?
Good data sharing practices
Use plain-text formats (e.g., .csv, .txt).
Include a data dictionary with variable descriptions.
Findable: Use metadata and DOIs to make data easy to locate.
Accessible: Ensure data is retrievable via open repositories.
Interoperable: Use standard formats (e.g., .csv, .txt) for compatibility.
Reusable: Include clear documentation and open licenses.
Data dictionary
A data dictionary is a document that outlines the structure, content, and variable definitions for a dataset (harvard/datamanagement).
It is critical for reproducibility because it explains what all the variable names and values in your spreadsheet really mean (osf/datadictionary).
Data dictionary
From OSF guide
Variable names
Human-readable variable names
Measurement units for the variable
Allowed values for the variable
Definition of the variable
Data dictionary: let’s try
Imagine we are collecting data (n = 12) to explore the relationship between anxiety (measured via the State-Trait Anxiety Inventory) and education levels:
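A minimal sketch of what such a dataset and its data dictionary could look like in R. The variable names (id, stai_trait, education), the random seed, and the simulated values are assumptions made for illustration, not the original materials. The 20 to 80 range follows from the STAI trait scale having 20 items scored from 1 to 4.

```r
# Simulated example data (values are made up for illustration)
set.seed(123)
anxiety_data <- data.frame(
  id = 1:12,
  stai_trait = sample(20:80, size = 12, replace = TRUE),
  education = factor(
    rep(c("Bachelor", "Master", "PhD"), each = 4),
    levels = c("Bachelor", "Master", "PhD")
  )
)

# Data dictionary: one row per variable in the dataset
data_dictionary <- data.frame(
  variable = c("id", "stai_trait", "education"),
  label = c(
    "Participant identifier",
    "STAI trait anxiety score",
    "Highest degree obtained"
  ),
  units = c("none", "sum score", "none"),
  allowed_values = c("1 to 12", "20 to 80", "Bachelor, Master, PhD"),
  definition = c(
    "Anonymous number assigned to each participant",
    "Sum of the 20 trait items of the State-Trait Anxiety Inventory",
    "Highest degree the participant has completed"
  )
)

# Save both as plain-text files so they can be shared together
write.csv(anxiety_data, file.path(tempdir(), "anxiety_data.csv"), row.names = FALSE)
write.csv(data_dictionary, file.path(tempdir(), "data_dictionary.csv"), row.names = FALSE)
```

The columns of the dictionary mirror the elements listed in the OSF guide: variable name, human-readable name, units, allowed values, and definition.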
A README file is the first thing someone sees when they open your dataset (or project folder). It should answer basic questions like:
What is this dataset?
How was it collected?
What are the variables?
What is the structure of the project?
README
Anxiety and Education Example Dataset
This repository contains a small simulated dataset with 12 observations, designed to explore the relationship between anxiety and education level.
Anxiety is measured using the State-Trait Anxiety Inventory (STAI), specifically the trait scale. Education level is recorded as the highest degree obtained (Bachelor, Master, PhD).
A license tells others what they can and can’t do with your data. If you don’t include one, legally speaking, people might not be allowed to use it, even if you meant to share it openly.
Data licensing
Common licenses for documents, data, and other non‑software:
CC BY: Permits reuse of documents, data, and other non‑software works with attribution.
CC0: Places documents, data, or other works in the public domain; no restrictions on reuse.
Data licensing
Common licenses for software:
GNU GPL‑3.0: Applies to software and guarantees freedom to run, study, share, and modify software, requiring modified versions be distributed under the same license.
AGPL‑3.0: Extends the GNU GPL by requiring source code to be made available if a modified version runs on a publicly accessible server.
Code
What are the alternatives?
There are several excellent open-source software options based on R, such as:
library(ggplot2)
# Load data
data <- read.csv("data.csv")
# Filter age
data <- data[data$age >= 18, ]
# Analyze
summary(lm(score ~ condition, data = data))
# Make plot
ggplot(data, aes(x = condition, y = score)) +
  geom_boxplot()
Change one line, rerun, and everything updates.
Why scripting?
Scripting ensures transparent and reproducible workflows.
Reproducible: You can rerun them.
Documented: You can see what you did and when.
Shareable: Others can inspect and reproduce your analysis.
R and RStudio
R: Free, open-source, with thousands of packages for analysis.
RStudio: Intuitive interface for coding, plotting, and debugging.
Vibrant community for support and resources.
Writing better code
Name descriptively: Use snake_case or camelCase for readability.
Comment clearly: Document your logic for clarity.
Organize scripts: Load packages and data upfront.
Use descriptive names
# Bad
x1 <- c("UNIPD psychology", "university of padova medicine", "unito_biology")
# Better
uniDep <- c("unipdPsy", "unipdMed", "unitoBio")
Comments, comments and comments…
Write the code for your future self and for others, not for yourself right now.
library(dplyr)
study_final <- study_raw |>
  # 1. Focus on participants who completed the primary outcome
  filter(!is.na(anxiety_score)) |>
  # 2. Convert raw scores to clinical categories
  mutate(severity = if_else(anxiety_score > 15, "High", "Low")) |>
  # 3. Drop pilot-phase data (pre-2024)
  filter(date >= as.Date("2024-01-01"))
Try opening an old, poorly documented script after a couple of years and you will understand :)
Functions are the primary building blocks of your program. You write small, reusable, self-contained functions that do one thing well, and then you combine them.
Avoid repeating the same operation multiple times in the script. The rule is, if you are doing the same operation more than two times, write a function.
A function can be reused and tested, and changing it once updates every place in the project where it is used.
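A small sketch of this rule, using the built-in mtcars dataset. The function name standardize is hypothetical and chosen only for this illustration; the operation (a z-score) is just an example of something that tends to get copy-pasted.

```r
# Repeated code: the same rescaling is written out for every column
mpg_z <- (mtcars$mpg - mean(mtcars$mpg)) / sd(mtcars$mpg)
hp_z  <- (mtcars$hp - mean(mtcars$hp)) / sd(mtcars$hp)
wt_z  <- (mtcars$wt - mean(mtcars$wt)) / sd(mtcars$wt)

# Function: the logic lives in one place
standardize <- function(x) {
  (x - mean(x)) / sd(x)
}

mpg_z2 <- standardize(mtcars$mpg)
hp_z2  <- standardize(mtcars$hp)
wt_z2  <- standardize(mtcars$wt)

# Both approaches give the same numbers
all.equal(mpg_z, mpg_z2)
```

If the definition ever needs to change (for example, to handle missing values), only the body of standardize() has to be edited.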
Functional programming, example…
We have a dataset (mtcars) and we want to calculate the mean, median, minimum and maximum of each column and store the result in a table.
The standard (~imperative) option is to use a for loop: iterate through the columns, calculate the values, and store them in another data structure.
ncols <- ncol(mtcars) # number of columns
# create vectors of length ncols with 0s
means <- medians <- mins <- maxs <- rep(0, ncols)
# loop over the columns (variables) and fill the vectors
for (i in 1:ncols) {
  means[i] <- mean(mtcars[[i]])
  medians[i] <- median(mtcars[[i]])
  mins[i] <- min(mtcars[[i]])
  maxs[i] <- max(mtcars[[i]])
}
# combine everything into a df
results <- data.frame(means, medians, mins, maxs)
results$col <- names(mtcars) # add variable names
head(results, n = 3) # display 3 rows
      means medians mins  maxs  col
1  20.09062    19.2 10.4  33.9  mpg
2   6.18750     6.0  4.0   8.0  cyl
3 230.72188   196.3 71.1 472.0 disp
Functional programming
The main idea is to decompose the problem: write a function that summarizes one column, then loop over the columns of the data frame:
summ <- function(x) {
  data.frame(
    means = mean(x), # given an input compute stats
    medians = median(x),
    mins = min(x),
    maxs = max(x)
  )
} # return a df
ncols <- ncol(mtcars) # number of columns
dfs <- vector(mode = "list", length = ncols) # empty list
for (i in 1:ncols) {
  dfs[[i]] <- summ(mtcars[[i]]) # each element of the list is a df with the summary stat
}
[[1]]
means medians mins maxs
1 20.09062 19.2 10.4 33.9
[[2]]
means medians mins maxs
1 6.1875 6 4 8
[[3]]
means medians mins maxs
1 230.7219 196.3 71.1 472
[[4]]
means medians mins maxs
1 146.6875 123 52 335
[[5]]
means medians mins maxs
1 3.596563 3.695 2.76 4.93
[[6]]
means medians mins maxs
1 3.21725 3.325 1.513 5.424
[[7]]
means medians mins maxs
1 17.84875 17.71 14.5 22.9
[[8]]
means medians mins maxs
1 0.4375 0 0 1
[[9]]
means medians mins maxs
1 0.40625 0 0 1
[[10]]
means medians mins maxs
1 3.6875 4 3 5
[[11]]
means medians mins maxs
1 2.8125 2 1 8
Functional programming
# combine list to obtain data.frame
results <- do.call(rbind, dfs)
results$var <- names(mtcars) # add variable names
head(results, n = 3) # display 3 rows
      means medians mins  maxs  var
1  20.09062    19.2 10.4  33.9  mpg
2   6.18750     6.0  4.0   8.0  cyl
3 230.72188   196.3 71.1 472.0 disp
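Because the summary logic lives in a function, the same steps can be reused on a different dataset. The sketch below assumes the built-in USArrests dataset, which is an inference from the variable names in the output that follows. The helper name summ_all is hypothetical, and summ() is repeated so the block runs on its own.

```r
summ <- function(x) {
  data.frame(means = mean(x), medians = median(x), mins = min(x), maxs = max(x))
}

# Same loop as before, wrapped so that any all-numeric data frame can be passed in
summ_all <- function(data) {
  dfs <- vector(mode = "list", length = ncol(data))
  for (i in seq_along(dfs)) {
    dfs[[i]] <- summ(data[[i]])
  }
  results <- do.call(rbind, dfs) # stack the one-row data frames
  results$var <- names(data)     # add variable names
  results
}

head(summ_all(USArrests), n = 3) # display 3 rows
```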
    means medians mins  maxs      var
1   7.788    7.25  0.8  17.4   Murder
2 170.760  159.00 45.0 337.0  Assault
3  65.540   66.00 32.0  91.0 UrbanPop
Functional programming, *apply📦
The *apply family is one of the best tools in R. The idea is pretty simple: apply a function to each element of a list.
The powerful part is that many R objects can be treated as a list: a vector behaves like a list of single elements, a data frame is a list of columns, and so on.
Internally, R still loops over the elements, but the verbose part (preallocation, choosing the iterator, indexing) is encapsulated in the *apply function.
means <- rep(0, ncol(mtcars))
for (i in 1:length(means)) {
  means[i] <- mean(mtcars[[i]])
}
# the same with sapply
means <- sapply(mtcars, mean)
The *apply family
Apply your function…
results <-lapply(mtcars, summ)
Now results is a list of data frames, one per column.
We can stack them into one big data frame:
results_df <- do.call(rbind, results)
head(results_df, n = 5)
          means medians  mins   maxs
mpg   20.090625  19.200 10.40  33.90
cyl    6.187500   6.000  4.00   8.00
disp 230.721875 196.300 71.10 472.00
hp   146.687500 123.000 52.00 335.00
drat   3.596563   3.695  2.76   4.93
Using sapply, vapply, and apply
lapply() always returns a list.
sapply() tries to simplify the result into a vector or matrix.
vapply() is like sapply() but safer (you specify the return type).
apply() is for applying functions over rows or columns of a matrix or data frame.
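A short sketch of the four functions side by side on the built-in mtcars dataset. The object names are hypothetical and used only for this illustration.

```r
# lapply(): always a list, one element per column
col_means_list <- lapply(mtcars, mean)
class(col_means_list)

# sapply(): simplified to a named numeric vector here
col_means <- sapply(mtcars, mean)

# vapply(): same result, but it errors if any result is not a single number
col_means_safe <- vapply(mtcars, mean, FUN.VALUE = numeric(1))

# apply(): MARGIN = 1 works over rows, MARGIN = 2 over columns
row_max <- apply(mtcars, MARGIN = 1, FUN = max)
col_max <- apply(mtcars, MARGIN = 2, FUN = max)
```

In scripts and functions, vapply() is often the safer default, because a surprising return type stops the code with an error instead of silently changing the shape of the result.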
Are for loops bad?
for loops are at the core of most operations in R (and in every programming language). For complex operations they can be more readable and effective than *apply. In R, however, we need extra care to write efficient for loops.
Slower, no preallocation (the vector is regrown at every iteration):
res <- c()
for (i in 1:1000) {
  # do something
  res[i] <- i^2
}
Faster, with preallocation:
res <- rep(0, 1000)
for (i in 1:length(res)) {
  # do something
  res[i] <- i^2
}
microbenchmark📦
library(microbenchmark)
microbenchmark(
  grow_in_loop = {
    res <- c()
    for (i in 1:10000) {
      res[i] <- i^2
    }
  },
  preallocated = {
    res <- rep(0, 10000)
    for (i in 1:length(res)) {
      res[i] <- i^2
    }
  },
  times = 100
)[1:2, 1:2]
Unit: microseconds
         expr      min       lq     mean   median       uq      max neval
 grow_in_loop 1526.963 1526.963 1526.963 1526.963 1526.963 1526.963     1
 preallocated  811.472  811.472  811.472  811.472  811.472  811.472     1
Going further: custom function lists
Let’s define a list of functions:
funs <-list(mean = mean, sd = sd, min = min, max = max, median = median)
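One way such a list could be used, sketched under the assumption that the goal is again to summarize every column of mtcars. The nested sapply() call and the name summary_matrix are illustrative choices, not necessarily what the original slides did.

```r
funs <- list(mean = mean, sd = sd, min = min, max = max, median = median)

# Outer sapply() loops over the functions, inner sapply() over the columns.
# The result is a matrix with one row per column of mtcars
# and one column per function in funs.
summary_matrix <- sapply(funs, function(f) sapply(mtcars, f))

round(head(summary_matrix, n = 3), 2)
```

Adding a new statistic then only requires adding one element to funs.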
$dbl_empty
numeric(0)
$dbl_single
[1] 1.5
$dbl_mutliple
[1] 1.5 2.5 3.5
$dbl_with_na
[1] 1.5 2.5 NA
$dbl_single_na
[1] NA
$dbl_all_na
[1] NA NA NA
Why functional programming?
We can write less code, and the code we write is reusable: it can be shared and used in multiple projects.
The scripts are more compact, easier to modify and less error-prone (imagine that you want to improve the summ function: you only need to change it once instead of touching the for loop).
Functions can be easily and consistently documented (see roxygen documentation) improving the reproducibility and readability of your code.
Import your functions
You can write some R scripts only with functions and source() them into the global environment.
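A minimal sketch of this pattern. To keep the example self-contained, the functions-only script is written to a temporary file first; in a real project it would simply be a file you maintain, for instance in a folder such as R/ (an assumed layout, not a requirement).

```r
# Write a small functions-only script to a temporary file
functions_file <- file.path(tempdir(), "functions.R")
writeLines(
  c(
    "summ <- function(x) {",
    "  data.frame(means = mean(x), medians = median(x), mins = min(x), maxs = max(x))",
    "}"
  ),
  functions_file
)

# In the analysis script: load the functions, then use them
source(functions_file)
summ(mtcars$mpg)
```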
To make this even “easier”, you can use the rrtools package to create what’s called a reproducible research compendium.
… the goal is to provide a standard and easily recognisable way for organising the digital materials of a project to enable others to inspect, reproduce, and extend the research… (Marwick, Boettiger, and Mullen 2018)
rrtools::create_compendium("compendium") builds the basic structure for a research compendium.
Another challenge for reproducibility is package versions.
You write some code today using dplyr 1.1.2.
In six months, dplyr gets updated… 😢
renv helps you create reproducible environments for your R projects!
What does renv do?
It records all the packages you use, with versions, in a lockfile
It installs them in a project-specific library
It ensures that anyone who runs your code gets exactly the same environment
Project specific library
install.packages("renv")
renv::init()
install.packages('bayesplot')
# These packages will be installed into
# "~/repro-pre-school/example-renv/renv/library/macos/R-4.4/aarch64-apple-darwin20".
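To make the idea of recording packages together with their versions concrete, here is a rough base-R sketch that lists the versions of a few packages in the current session. This is not how renv works internally and it does not replace the lockfile; the package names are only examples that ship with R.

```r
# Packages to record (examples only; these ship with every R installation)
pkgs <- c("stats", "utils", "graphics")

# Look up the installed version of each package
versions <- vapply(
  pkgs,
  function(p) as.character(packageVersion(p)),
  FUN.VALUE = character(1)
)

# A small table of package names and versions, plus the R version itself
version_table <- data.frame(package = pkgs, version = versions, row.names = NULL)
version_table
R.version.string
```

With renv, the equivalent record is kept in the lockfile: renv::snapshot() updates it and renv::restore() reinstalls the recorded versions.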
For example, Jupyter notebooks, R Markdown, and now Quarto are literate programming frameworks that integrate code and text.
Literate Programming, the markup language
The markup language is the core element of a literate programming framework. When you write in a markup language, you’re writing plain text while also giving instructions for how to generate the final result.
Markdown is one of the most popular markup languages for several reasons:
easy to write and read compared to LaTeX and HTML
easy to convert from Markdown to basically every other format using Pandoc
Quarto
Quarto (https://quarto.org/) is the evolution of R Markdown: it integrates a programming language with the Markdown markup language. It is very simple but quite powerful.
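To show what a Quarto source file contains, the sketch below writes a minimal .qmd document from R: a YAML header, a Markdown heading and sentence, and one R code chunk. The file name, title, and text are hypothetical. Rendering it requires Quarto to be installed, so that step is left as a comment.

```r
fence <- strrep("`", 3) # three backticks, which open and close a code chunk

qmd_lines <- c(
  "---",
  "title: \"Example report\"",
  "format: html",
  "---",
  "",
  "## Results",
  "",
  "The average fuel consumption is computed below.",
  "",
  paste0(fence, "{r}"),
  "mean(mtcars$mpg)",
  fence
)

qmd_file <- file.path(tempdir(), "report.qmd")
writeLines(qmd_lines, qmd_file)

# Rendering requires Quarto to be installed, e.g. from a terminal:
#   quarto render report.qmd
```

When the document is rendered, the chunk is executed and its output is placed in the report, so the number in the text always matches the code that produced it.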
APA Quarto is a Quarto extension that makes it easy to write documents in APA 7th edition style, with automatic formatting for title pages, headings, citations, references, tables, and figures.
Git tracks your project on your computer. GitHub is the online platform where you can:
Back up your project safely in the cloud
Share it publicly or privately with others
Collaborate without overwriting each other’s work
Track issues and project progress
Git workflow
Files move through three local stages before reaching GitHub:
📁 Working Directory (edit files here) → git add →
📋 Staging Area (choose what to commit) → git commit →
💾 Local Repository (snapshot saved locally) → git push → / ← git pull ←
☁️ GitHub (shared online)
Remember:
New files are untracked until you run git add.
GitHub in practice
git init                        # turn folder into a repo
git add analysis.R              # stage file for commit
git commit -m "Initial commit"  # save a snapshot locally
git remote add origin <URL>     # link to a GitHub repo
git push -u origin main         # first push (sets upstream)
git push                        # all subsequent pushes
git pull                        # download others' commits
You can also do most of this from RStudio’s Git pane.
Branching & merging 🌱
By default, you work on the main branch. A new branch is an independent line of development where you can experiment safely, without touching the working version.
Try out new features without breaking main
Fix bugs in isolation
Let multiple people work in parallel
When the work is ready, you merge it back into main.
Branching in practice
git checkout -b new-feature   # 1. Create & switch branch
git add analysis.R            # 2. Commit your changes
git commit -m "Add new plot"
git checkout main             # 3. Switch back to main
git merge new-feature         # 4. Merge branch in
GitHub + RStudio Integration
You don’t have to use the terminal: RStudio has a built-in Git panel, where you can:
Clone a repo: File → New Project → Version Control
Stage, commit, push, pull, browse history: use the Git tab
New project with Git: tick “Create a git repository” at setup
If Git and GitHub feel too technical, or if your collaborators are less technical, the OSF is a fantastic alternative or complement.
Upload data, code, and documents
Create public or private projects
Add collaborators
Create preregistrations
Generate DOIs for citation
Track changes
You can also connect OSF to GitHub.
Integrated workflow 🛠️
Develop your analysis using R and Quarto.
Track code and scripts using Git.
Host your code on GitHub (public or private).
Upload your data and materials to OSF, including a data dictionary.
Link your GitHub repository to your OSF project.
Use renv for reproducible R environments.
Share the OSF project and cite it in your paper.
Reproducibility
It’s about credibility and transparency.
Reproducible science is not about being perfect.
It’s about showing your work so that others can follow, understand, and build upon it.
Start simple, don’t wait until you’re “ready”, and teach what you learn!
THANK YOU!
References
Marwick, Ben, Carl Boettiger, and Lincoln Mullen. 2018. “Packaging Data Analytical Work Reproducibly Using R (and Friends).” The American Statistician 72 (1): 80–88. https://doi.org/10.1080/00031305.2017.1375986.
National Academies of Sciences, Engineering, and Medicine. 2019. Reproducibility and Replicability in Science. Washington, D.C.: National Academies Press. https://doi.org/10.17226/25303.
Wilkinson, Mark D., Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, et al. 2016. “The FAIR Guiding Principles for Scientific Data Management and Stewardship.” Scientific Data 3 (1): 160018. https://doi.org/10.1038/sdata.2016.18.