Post-doctoral researcher in Cognitive Psychology, University of Padova.
Research: Computational modeling of cognitive and learning processes, Bayesian hypothesis testing.
PhD in Psychological Science, completed March 6, 2025.
Passionate about reproducible science after struggling with disorganized datasets in my early research!
Our job is hard 🔥
Running experiments
Analyzing data
Managing trainees
Writing papers
Responding to reviewers
Reproducibility helps!
Organizes your workflow.
Saves time by documenting steps.
Builds trust in your findings.
Enables others to reproduce and extend your work.
What is reproducible science?
At its core, reproducible science means that someone else (or you, in the future) can reproduce your results from your materials: your data, your code, your documentation.
It means your workflow is transparent.
Keys to reproducible science 🔐
Data: organize, document, and share your datasets in ways that are usable by others and understandable by you (even years later).
Code: write analysis scripts that are clean, transparent, and reusable.
Literate programming: combine code and text in the same document, so your reports are dynamic and replicable.
Version Control and Sharing: track changes, collaborate, and make your work openly available using tools like GitHub and OSF.
So… Is reproducible science even harder?
At first, yes - but then…🧯🔥
Helps you stay organized.
Makes it easier to remember what you did.
Allows others to understand, reproduce, and build on your work.
Learning the tools takes effort, but once you do, your workflow becomes smoother, clearer, and more reliable.
Outline
Data
Code
R projects
Literate Programming
Version Control
Data
Data types in research
Raw Data: Original, unprocessed (e.g., survey responses).
Processed Data: Cleaned, digitized, or compressed.
Analyzed Data: Summarized in tables, charts, or text.
The Open Science Framework (OSF): a free platform to organize, document, and share research.
Supports preregistration, archiving, and collaboration.
Integrates with GitHub, Dropbox, Google Drive.
Bad data sharing example
Imagine this scenario: you read a paper that seems really relevant to your research. At the end, you’re excited to see they’ve shared their data on OSF. You go to the repository, and there’s one file…
Bad data sharing example
You download it, open it, and you see this:
        x1        x2 x3 x4         x5         x6         x7
 0.3981105 13.912435  a  0 -0.6775811  0.8759740 -0.2051604
-0.1434733  1.093743  c  0  0.7055193  0.2521987  1.8816947
-0.2526000  4.898035  c  0  0.4744651 -0.5628840  0.3245589
-1.2272588 14.717053  b  0 -0.5132792 -1.1368242 -0.1355150
-0.4360417  8.547025  c  1 -0.1736804 -0.7120962 -1.2714320
What do these variables mean? What’s x3?
What do 0 and 1 represent? How are missing values coded?
Is x6 a z-score or raw data?
Good data sharing practices
Use plain-text formats (e.g., .csv, .txt).
Include a data dictionary with variable descriptions.
Follow the FAIR principles:
Findable: Use metadata and DOIs to make data easy to locate.
Accessible: Ensure data is retrievable via open repositories.
Interoperable: Use standard formats (e.g., .csv, .txt) for compatibility.
Reusable: Include clear documentation and open licenses.
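A data dictionary does not need to be fancy: a small table with one row per variable is enough. Here is a minimal sketch in R (the variable names and codings are made up for illustration):

```r
# A minimal data-dictionary sketch: one row per variable, saved as plain text
# next to the data. All names and codings below are hypothetical examples.
dict <- data.frame(
  variable    = c("age", "condition", "score"),
  description = c("Participant age in years",
                  "Experimental condition (a = control, b = treatment)",
                  "Accuracy, proportion correct (0-1); NA = missing trial"),
  type        = c("integer", "character", "numeric")
)
write.csv(dict, "data_dictionary.csv", row.names = FALSE)
```

Sharing this file alongside the dataset answers exactly the questions raised by the bad example above.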
Data licensing 🔒
A license tells others what they can and can’t do with your data. If you don’t include one, legally speaking, people might not be allowed to use it, even if you meant to share it openly.
GNU GPL: guarantees end users the freedom to run, study, share, and modify the software, while requiring that all modified versions and derivative works also be distributed under the same license. ❤️
Code
Why scripting?
The SPSS Workflow
Click menu items to run analysis
“exclude <18”
Click through everything again
Forget a step? Round differently?
Stressful, error-prone, and undocumented.
R Workflow
# Load data
data <- read.csv("data.csv")
# Filter age
data <- data[data$age >= 18, ]
# Analyze
summary(lm(score ~ condition, data = data))
# Make plot
ggplot(data, aes(x = condition, y = score)) +
  geom_boxplot()
One line change, rerun, and everything updates.
Why scripting?
Scripting ensures transparent and reproducible workflows.
Reproducible: You can rerun them.
Documented: You can see what you did and when.
Shareable: Others can inspect and reproduce your analysis.
Functions are the primary building blocks of your program. You write small, reusable, self-contained functions that do one thing well, and then you combine them.
Avoid repeating the same operation multiple times in a script. A common rule of thumb: if you perform the same operation more than twice, write a function.
A function can be reused and tested, and changed in one place, with the change propagating to the whole project.
Functional Programming, example…
We have a dataset (mtcars) and we want to calculate the mean, median, standard deviation, minimum and maximum of each column and store the result in a table.
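One way to sketch such a summary function (here called summ, matching the lapply() call used later in these slides; the exact implementation may differ) is to have it return a one-row data frame:

```r
# A sketch of a per-column summary function: takes a numeric vector,
# returns a one-row data frame with the five statistics we want.
summ <- function(x) {
  data.frame(
    mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE),
    min    = min(x, na.rm = TRUE),
    max    = max(x, na.rm = TRUE)
  )
}

summ(mtcars$mpg)  # one row with the five statistics for mpg
```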
The *apply family is one of the best tools in R. The idea is pretty simple: apply a function to each element of a list.
The powerful side is that in R everything can be considered as a list. A vector is a list of single elements, a dataframe is a list of columns etc.
Internally, R is still using a for loop but the verbose part (preallocation, choosing the iterator, indexing) is encapsulated into the *apply function.
means <- rep(0, ncol(mtcars))
for (i in 1:length(means)) {
  means[i] <- mean(mtcars[[i]])
}

# the same with sapply
means <- sapply(mtcars, mean)
The *apply Family
Apply your function…
results <-lapply(mtcars, summ)
Now results is a list of data frames, one per column.
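That list can then be bound into the single table we wanted. A minimal sketch, assuming a summ() that returns a one-row data frame (a reduced two-statistic version is defined here so the snippet is self-contained):

```r
# Reduced sketch of summ(): one row per input vector
summ <- function(x) data.frame(mean = mean(x), sd = sd(x))

results <- lapply(mtcars, summ)   # list of one-row data frames
tab <- do.call(rbind, results)    # stack them; row names = column names
tab["mpg", ]                      # summary row for mpg
```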
sapply() tries to simplify the result into a vector or matrix.
vapply() is like sapply() but safer (you specify the return type).
apply() is for applying functions over rows or columns of a matrix or data frame.
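A quick side-by-side sketch of the three variants, all computing the same column means of mtcars:

```r
# Same result, three ways:
sapply(mtcars, mean)               # simplifies to a named numeric vector
vapply(mtcars, mean, numeric(1))   # same, but errors if any call does not
                                   # return a single numeric value
apply(as.matrix(mtcars), 2, mean)  # MARGIN = 2 means "over columns"
```

vapply() is the safest choice inside functions and packages, because a surprising return type fails loudly instead of silently changing the result's shape.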
Are for loops bad?
for loops are the core of most operations in R (and in every programming language). For complex operations they are more readable and effective than *apply. In R, we need extra care to write efficient for loops.
Extremely slow, no preallocation:
res <- c()
for (i in 1:1000) {
  # do something
  res[i] <- i^2
}
Very fast:
res <- rep(0, 1000)
for (i in 1:length(res)) {
  # do something
  res[i] <- i^2
}
microbenchmark📦
library(microbenchmark)
microbenchmark(
  grow_in_loop = {
    res <- c()
    for (i in 1:10000) {
      res[i] <- i^2
    }
  },
  preallocated = {
    res <- rep(0, 10000)
    for (i in 1:length(res)) {
      res[i] <- i^2
    }
  },
  times = 100
)
Unit: microseconds
         expr      min        lq      mean    median        uq      max neval cld
 grow_in_loop 1168.090 1234.7355 1446.4919 1276.2275 1393.3235 6538.393   100  a
 preallocated  655.508  672.2975  717.7563  685.7045  715.2245 2393.990   100   b
Going further: custom function lists
Let’s define a list of functions:
funs <- list(mean = mean, sd = sd, min = min, max = max, median = median)
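Because functions are ordinary R objects, we can iterate over this list just like over data. A minimal sketch, applying every function in the list to one column:

```r
# Functions stored in a list, then applied to the same input
funs <- list(mean = mean, sd = sd, min = min, max = max, median = median)

out <- sapply(funs, function(f) f(mtcars$mpg))
out  # named numeric vector: mean, sd, min, max, median of mpg
```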
$dbl_empty
numeric(0)
$dbl_single
[1] 1.5
$dbl_multiple
[1] 1.5 2.5 3.5
$dbl_with_na
[1] 1.5 2.5 NA
$dbl_single_na
[1] NA
$dbl_all_na
[1] NA NA NA
Why functional programming?
We can write less, more reusable code that can be shared and used across multiple projects.
The scripts are more compact, easier to modify, and less error prone (imagine you want to improve the summ function: you only need to change it once instead of touching every for loop).
Functions can be easily and consistently documented (see roxygen documentation) improving the reproducibility and readability of your code.
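For instance, a roxygen-style header sits directly above the function as specially marked comments (a sketch; the tags shown are standard roxygen2 tags):

```r
#' Summarise a numeric vector
#'
#' @param x A numeric vector.
#' @return A one-row data frame with the mean and sd of x.
#' @examples
#' summ(c(1, 2, 3))
summ <- function(x) {
  data.frame(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))
}
```

The documentation travels with the code, so anyone reading the script sees what the function expects and returns.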
Functional programming in the wild
You can write some R scripts only with functions and source() them into the global environment.
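A self-contained sketch of that pattern (in practice the functions would live in a file such as R/functions.R; square() is a made-up example):

```r
# Write a function-only script to a temporary file, then source() it.
fun_file <- tempfile(fileext = ".R")
writeLines("square <- function(x) x^2", fun_file)

source(fun_file)  # definitions now live in the global environment
square(4)
```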
Instead of hardcoding paths, we want to use projects with relative paths.
R Projects
An R Project (.Rproj) is a file that defines a self-contained workspace.
When you open an R Project, your working directory is automatically set to the project root, no need to use setwd() ever again.
To make this even “easier”, you can use the rrtools package to create what’s called a reproducible research compendium.
… the goal is to provide a standard and easily recognisable way for organising the digital materials of a project to enable others to inspect, reproduce, and extend the research… (Marwick et al., 2018)
Research compendium rrtools📦
Organize its files according to the prevailing conventions.
Maintain a clear separation of data (original data is untouched!), method, and output.
Specify the computational environment that was used for the original analysis
rrtools::create_compendium("compendium") builds the basic structure for a research compendium.
Another challenge for reproducibility is package versions.
You write some code today using dplyr 1.1.2.
In six months, dplyr gets updated… 😢
renv helps you create reproducible environments for your R projects!
What does renv do?
It records all the packages you use, with versions, in a lockfile
It installs them in a project-specific library
It ensures that anyone who runs your code gets exactly the same environment
Project specific library
install.packages("renv")
renv::init()
install.packages('bayesplot')
These packages will be installed into "~/repro-pre-school/example-renv/renv/library/macos/R-4.4/aarch64-apple-darwin20".
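After initializing, the typical renv cycle is snapshot and restore (shown here as comments, since these calls only make sense inside an renv-enabled project):

```r
# Inside an renv-enabled project, the typical cycle is:
# renv::snapshot()  # record the packages you use, with versions, in renv.lock
# renv::restore()   # later, or on another machine: reinstall exactly those versions
```

Commit renv.lock together with your code, and collaborators can rebuild your exact package environment.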
For example, Jupyter notebooks, R Markdown, and now Quarto are literate programming frameworks that integrate code and text.
Literate Programming, the markup language
The markup language is the core element of a literate programming framework. When you write in a markup language, you’re writing plain text while also giving instructions for how to generate the final result.
Markdown is one of the most popular markup languages for several reasons:
easy to write and read compared to LaTeX and HTML
easy to convert from Markdown to basically every other format using pandoc
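A tiny sketch of what Markdown looks like (the results and numbers below are invented for illustration):

```markdown
# Results

Reaction times were **slower** in the *hard* condition:

- easy: 450 ms
- hard: 620 ms
```

Headings, emphasis, and lists are plain punctuation; pandoc turns the same source into HTML, PDF, or Word.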
Quarto
Quarto (https://quarto.org/) is the evolution of R Markdown: it integrates a programming language with the Markdown markup language. It is very simple but quite powerful.
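A minimal .qmd document sketch: YAML options at the top, then Markdown text with executable R chunks (the title is a placeholder):

````markdown
---
title: "My analysis"
format: html
---

The mean mpg in `mtcars` is:

```{r}
mean(mtcars$mpg)
```
````

Rendering the file runs the chunk and inserts its output into the report, so numbers in the text can never drift out of sync with the analysis.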
APA Quarto is a Quarto extension that makes it easy to write documents in APA 7th edition style, with automatic formatting for title pages, headings, citations, references, tables, and figures.
# 1. Initialize a Git repository in your current project folder
git init
# 2. Stage a file to be tracked (e.g., your script)
git add analysis.R
# 3. Save a snapshot of your work with a message
git commit -m "Initial commit"
# 4. Connect your local project to a GitHub repo (change the URL)
git remote add origin https://github.com/yourname/repo.git
# 5. Upload your commits to GitHub
git push -u origin main
Branching & merging 🌱
Try out new features
Fix bugs safely
Work on different versions in parallel
# Create and switch to a new branch called 'new-feature'
git checkout -b new-feature
# (Make your changes in code, then stage and commit them)
# Save those changes with a descriptive message
git commit -m "Add new plot"
# Switch back to the main branch
git checkout main
# Merge the changes from 'new-feature' into 'main'
git merge new-feature
Use branches to keep your main branch clean.
Handling conflicts
Sometimes, Git can’t automatically merge changes. This happens when two branches modify the same line in a file.
Git will insert conflict markers directly into the file:

<<<<<<< HEAD
plot(data)
=======
plot(data, col = "blue")
>>>>>>> new-feature
The code between <<<<<<< HEAD and ======= is from the current branch (e.g., main)
The code between ======= and >>>>>>> new-feature is from the other branch you’re merging (e.g., new-feature)
Handling conflicts
To resolve the conflict, choose the correct version (or combine them), delete the markers, and save the file.
For example:
plot(data, col = "blue")  # resolved version
Then:
git add file.R
git commit -m "Resolve merge conflict in file.R"
GitHub + RStudio Integration
Clone repos with File → New Project → Version Control
Start small. Use Git for one script. Then grow your skills from there.
If Git and GitHub feel too technical, or if your collaborators are less technical, the OSF is a fantastic alternative or complement.
Upload data, code, and documents
Create public or private projects
Add collaborators
Create preregistrations
Generate DOIs for citation
Track changes
You can also connect OSF to GitHub.
Integrated workflow 🛠️
Develop your analysis using R and Quarto.
Track code and scripts using Git.
Host your code on GitHub (public or private).
Upload your data and materials to OSF, including a data dictionary.
Link your GitHub repository to your OSF project.
Use renv for reproducible R environments.
Share the OSF project and cite it in your paper.
Reproducibility
It’s about credibility and transparency.
Reproducible science is not about being perfect.
It’s about showing your work so that others can follow, understand, and build upon it.
Start simple, don’t wait until you’re “ready”, and teach what you learn!
THANK YOU!
References
Marwick, B., Boettiger, C., & Mullen, L. (2018). Packaging data analytical work reproducibly using R (and friends). The American Statistician, 72(1), 80–88. https://doi.org/10.1080/00031305.2017.1375986
Comments, comments and comments…
Write the code for your future self and for others, not for yourself right now.
Try to open a (not well documented) old coding project after a couple of years and you will understand :)
Invest time in writing more comprehensible and documented code for you and others.