class: center, middle, inverse, title-slide

# Reproducible data analysis
## Why you should learn to code
### Gabriel Mateus Bernardo Harrington
### Keele University
### CDT Joint Conference - 2021/06/15
(updated: 2022-02-07)

---

## Science has a reproducibility problem

- Lots of [scientists agree](https://www.nature.com/articles/533452a)
- Cancer biology: [~80% of studies not reproducible](https://doi.org/10.1093/nsr/nwy021)
- None of the hundreds of machine learning Covid-19 papers assessed in one review were suitable for clinical use due to [methodological flaws and/or underlying biases](https://www.nature.com/articles/s42256-021-00307-0)
- Ubiquitous [P-Hacking](https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002106)
- [The natural selection of bad science](https://royalsocietypublishing.org/doi/10.1098/rsos.160384)
- [Is the staggeringly profitable business of scientific publishing bad for science?](https://www.theguardian.com/science/2017/jun/27/profitable-business-scientific-publishing-bad-for-science)
    - Spoiler: it is

## My take on the issue:

1. [The Sovereign of Science](https://gmbernardoharrington.netlify.app/post/the-sovereign-of-science/)
    - Scientists are not incentivised to produce quality research
2. [How might science be done better?](https://gmbernardoharrington.netlify.app/post/how-might-science-be-done-better/)
    - Exploration of ideas to improve transparency and reproducibility

- An article version of this talk: [Reproducible data analysis](https://oskor.netlify.app/post/reproducible-data-analysis/)

---

# Your current tools?

--

- Excel, SPSS, GraphPad, etc.

--

- Characterised by:
    1. Limited and clunky graphical user interface (GUI)
    1. Mixture of **data entry**, analysis, and visualisation

--

- Question: Have you ever asked if these tools are good for what you use them for?
- Or have you ever wondered if there might be a better way?

![It's what I know](https://media.giphy.com/media/SwUqgyCG8mJZxHtnKS/giphy.gif)

---

class: center, middle

## Sounds great?

![Sounds great!](https://media.giphy.com/media/26tk03guzOxhjkL6w/giphy.gif)

---

## So what's the problem you damn nerd?
--

- A mistaken click or drag action can lead to errors or the overwriting of data

--

- No way to see history of changes

--

- Doesn't scale to larger data, as our [glorious government learnt the hard way](https://www.theguardian.com/politics/2020/oct/05/how-excel-may-have-caused-loss-of-16000-covid-tests-in-england) during covid times.

--

- You have to pay for the software
- It also isn't supported across all platforms (e.g. Microsoft Office on Linux)
- You don't have true ownership of your files; the licence holder can take away your ability to create/view/edit them at any time for any reason
- They are closed-source
    - No ability to inspect underlying code, potential bugs and inconsistencies

--

- An investigator cannot guarantee that the claims made in a study are correct
- Reproducibility is important not because it ensures that the results are correct, but rather because it ensures **transparency** and gives us confidence in understanding *exactly what was done*.

???

- Easy to confuse cells that contain raw data with those that are the product of analysis
- potential backwards compatibility issues

---

# Example:

.pull-left[
.center[[The Excel mistake heard round the world](https://www.marketplace.org/2013/04/17/economy/excel-mistake-heard-round-world/)]

- 2 economists published a paper suggesting countries with debt over 90% of GDP would enter recession
- Had a huge impact on the US budget for years
- They forgot to include 5 countries when calculating averages in Excel...
- Their inclusion changes the results significantly

## So what could we use instead?
]

.pull-right[
![](https://media.giphy.com/media/Sb9KqeeymLlESGWZyE/giphy.gif)
]

???
- the economists convinced US senators they couldn't afford stimulus and should use austerity

---

.pull-left[
![](https://media.giphy.com/media/xT1XGzXhVgWRLN1Cco/giphy.gif)
<br/>
<!-- ![](https://media.giphy.com/media/LmNwrBhejkK9EFP504/giphy.gif) -->
<img src="https://media.giphy.com/media/LmNwrBhejkK9EFP504/giphy.gif" width="80%"/>
]

.pull-right[
<img src="https://media.giphy.com/media/Wsju5zAb5kcOfxJV9i/giphy.gif" width="60%"/>
<img src="https://media.giphy.com/media/MdA16VIoXKKxNE8Stk/giphy.gif" width="60%"/>
]

---

## Why a programming language?

--

- Code is just a set of instructions for a computer to follow
- Thus the steps of data manipulation/analysis are clearly laid out in the code, in order.
- This makes it possible for colleagues and collaborators to know what you've done and **replicate** it, even after you're gone

--

- No risk of messing up original data
    - Raw experimental data should be **sacrosanct!!**
- Easier to experiment with different methods
- Avoids creation of monster spreadsheets

--

- Allows for easy use of the same method with new data
- Increases your versatility as a scientist by giving you the ability to use:
    - Machine learning/AI
    - Omics data
    - And anything else you can think of
- Code can necessarily do everything a GUI can do, and a whole lot more besides

---

## And it makes you more employable!

.center[![](https://media.giphy.com/media/xTiIzFiGj0C3RFKjHa/giphy.gif)]

.center[You: "Sounds great! Is there anything else I can do to make my research more reproducible?"]

.center[Me: "**I'm glad you asked**"]

.center[You: "I was being sarcastic you stupid pri-"]

---

## What is "version control", and why should you care?

--

.left-column[![git logo](https://torbjornzetterlund.com/wp-content/uploads/2015/06/git-logo.png)]

--

- System that records changes to a file or set of files over time

--

- You: "Can't I just back up my data by copying it somewhere else?"
--

- This is very error-prone, so some nerds made a program called [RCS](https://www.gnu.org/software/rcs/) many years ago to solve this

--

- New problem: I would like to collaborate with others on the development of my software/paper/thesis/crap fan-fiction

--

- Nerds returned and made various flavours of **Distributed Version Control Systems**, of which **Git** is the most popular

---

## So what does Git do?

- Tracks changes to the files in a local "repository" (AKA a Git repo) each time you commit them
    - A repo is just a folder with Git initialised in it
- A service like [GitHub](https://github.com/), [GitLab](https://about.gitlab.com/) or [Bitbucket](https://bitbucket.org/product) can then be used to host a copy of your local repository
    - It doesn't matter which you use; the basic functionality is the same

--

- This allows any collaborators to "clone" the repository to their machine and get all the files **and** the history of changes to said files

--

- Any changes a collaborator makes can then be pushed back to the shared repo
- Git allows you to see who has made what changes to each file
- Git also asks you to write a message when you make a commit, giving a summary of what each change did

.pull-left[
![](https://logz.io/wp-content/uploads/2016/03/github.png)
]

.pull-right[
<img src="https://cdn-images-1.medium.com/max/1600/1*j9hbjszo0zXS32yhvSkdAQ.jpeg" width="70%" />
]

---

## Going even further down the rabbit hole

--

- Now that your data manipulation and analysis are transparent and reproducible, why not your writing as well?
--

- Documents made with a suite like Microsoft Office need special software to open and use
    - If these tools are no longer supported, the company goes bust, or the price increases and your overlords decide they won't pay for it, you might not be able to access your work

--

- Tools like Markdown allow you to make documents any computer can view and edit

--

- RMarkdown allows you to integrate your new reproducible data workflow with your writing
- You can produce professional .pdf or sleek and modern .html documents (and .doc files if you have to work with a Luddite...)
- It also allows you to make [websites](https://gmbernardoharrington.netlify.app/) and slides

--

- Example: [clinical trial reports by Frank Harrell](http://hbiostat.org/R/hreport/report.html#)

---

## Licensing considerations

.pull-left[
- You don't need to be a lawyer, but you should be aware of some common licences to watch out for
- Stick to using free and open-source software as much as possible
- Work that relies on paid licences **isn't** reproducible!
]

.pull-right[
<img src="https://altc.alt.ac.uk/oesig/wp-content/uploads/sites/1119/2016/07/creative-commons-783531_1280.png" width="80%" />
]

- Also worth thinking about licensing your work
- [Creative Commons licences](https://creativecommons.org/licenses/) are generally very permissive, but there are multiple versions to match different concerns:
    - Attribution (give me credit)
    - ShareAlike (new creations must share licence terms)
    - NoDerivs (no sharing in adapted form)
    - NonCommercial (self-explanatory...)
- Alternatives: [GPL](https://www.gnu.org/licenses/gpl-3.0.en.html), [MIT](https://www.mit.edu/~amini/LICENSE.md), etc.

---

## Containerisation: bringing it all together

- So what if someone actually wants to download your data, script/s and writing and try to reproduce them?
.pull-left[
- They may run into the fun of [dependency hell](https://www.browserlondon.com/blog/2020/09/02/dependency-hell-how-to-avoid-it/)
    - They may have different versions of software, or be missing a dependency that's no longer available
- Solution: Containerisation
    - "A standardised unit of software"
    - Packages up code and dependencies to run consistently in different computing environments
    - [Docker](https://www.docker.com/resources/what-container) is a popular approach
]

.pull-right[
<img src="https://assets.browserlondon.com/app/uploads/2020/08/XKCD-dependency.png" width="100%" />
]

---

### Further resources

- How-to guide for reproducible research: [The Turing Way](https://the-turing-way.netlify.app/welcome.html)
- For R:
    - [R for Data Science](https://r4ds.had.co.nz/)
    - [R video tutorial](https://www.youtube.com/watch?v=32o0DnuRjfg)
- For Git:
    - [Happy Git with R](https://happygitwithr.com/)
    - [Comprehensive book (you don't need to read the whole thing!)](https://git-scm.com/book/en/v2)
- For RMarkdown:
    - [The holy bible](https://bookdown.org/yihui/rmarkdown/)
    - [Advanced RMarkdown](https://arm.rbind.io/)
    - [Referencing with Zotero](https://www.r-bloggers.com/bibliography-with-knitr-cite-your-references-and-packages/)
    - [Bookdown guide - good for writing a thesis](https://bookdown.org/yihui/bookdown/)

---

#### Links to this slide deck:

The rendered slides can be found at this address:

- [https://h-mateus.github.io/presentations/reproducible_research_2021-04-30/index.html#1](https://h-mateus.github.io/presentations/reproducible_research_2021-04-30/index.html#1)

.center[
![Link to rendered slides](repo_res_qr.png)
]

The raw files are on GitHub [here](https://github.com/H-Mateus/presentations) under the GPL licence

#### Acknowledgements

.pull-left[
- Thanks to humans: Karina Wright, Paul Cool, Charlotte Hulme & the rest of the [OsKOR team](https://oskor.netlify.app/)
- Thanks for the money: [EPSRC](https://epsrc.ukri.org/)
]

.pull-right[
<img
src="../logos/EPSRC+logo.png" width="80%" /> ]