Wednesday, November 25, 2015

Preventing statistical reporting errors by integrating writing and coding

tl;dr: Using RMarkdown with knitr is a nice way to decrease statistical reporting errors.

How often are there statistical reporting errors in published research? Using a new automated method for scraping APA-formatted stats out of PDFs, Nuijten et al. (2015) found that over 10% of p-values were inconsistent with the reported details of the statistical test, and 1.6% were what they called "grossly" inconsistent, i.e., the difference between the reported p-value and the test statistic meant that one implied statistical significance and the other did not (another summary here). Here are two key figures, first for proportion inconsistent by article and then for proportion of articles with an inconsistency:

These graphs are upsetting news. Around half of articles had at least one error by this analysis, which is not what you want from your scientific literature.* Daniel Lakens has a nice post suggesting that three errors account for many of the problems: incorrect use of < instead of =, use of one-sided tests without clear reporting as such, and errors in rounding and reporting.
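To make the notion of an "inconsistent" p-value concrete, here is a minimal sketch in base R of the kind of check involved: recompute the p-value from the reported test statistic and degrees of freedom, then compare it with the reported p-value at the same rounding precision. The helper name `check_t_report` and the example numbers are made up for illustration; this is not statcheck's code or API, and it deliberately ignores the one-sided tests and "<" reporting that Lakens points to.

```r
# Hypothetical helper (not part of statcheck): recompute a two-sided p-value
# from a reported t statistic and df, and compare it with the reported p-value.
check_t_report <- function(t_value, df, reported_p, digits = 2, alpha = .05) {
  computed_p <- 2 * pt(abs(t_value), df = df, lower.tail = FALSE)

  # Inconsistent: the two p-values differ once rounded to the reported precision.
  inconsistent <- round(computed_p, digits) != round(reported_p, digits)

  # "Grossly" inconsistent: they also fall on opposite sides of alpha.
  gross <- inconsistent && ((computed_p < alpha) != (reported_p < alpha))

  list(computed_p = computed_p, inconsistent = inconsistent, gross = gross)
}

# Made-up reports, both claiming p = .04 with 28 degrees of freedom.
r1 <- check_t_report(t_value = 2.20, df = 28, reported_p = .04)
r2 <- check_t_report(t_value = 1.70, df = 28, reported_p = .04)
print(r1)
print(r2)
```

The second report is the worrying kind: the recomputed p-value is well above .05 while the reported one is below it, so the two disagree about statistical significance.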

Speaking for myself, I'm sure that some of my articles have errors of this type, almost certainly from copying and pasting results from an analysis window into a manuscript (say Matlab in the old days or R now).**  The copy-paste thing is incredibly annoying. I hate this kind of slow, error-prone, non-automatable process.

So what are we supposed to do? Of course, we can and should just check our numbers, and maybe run statcheck (the R package Nuijten et al. created) on our own work as well. But there is a much better technical solution out there: write statistics into the manuscript in one executable package that automatically generates the figures, tables, and statistical results. In my opinion, doing this used to be almost as much of a pain as doing the cutting and pasting (and this is spoken as someone who writes academic papers in LaTeX!). But now the tools for writing text and code together have gotten so good that I think there's no excuse not to. 

In particular, the integration of the knitr package with RStudio and RPubs means that it is essentially trivial to create a well-formatted document that includes text, code, and data inside it. I've posted a minimal working example to RPubs; you can see the source code here. Critically, this functionality allows you to do things like this:

which eliminates the cut and paste step.*** And even more importantly, you can get out fully-formatted results tables:
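As a rough sketch of both ideas (this is not the code from the RPubs example), the R below runs a test once and derives every reported number from the resulting object. The data are simulated, and `report_t` is a hypothetical helper written for this illustration; `knitr::kable` is the real knitr function for rendering tables.

```r
# Simulated example data (an assumption for illustration, not from the post).
set.seed(1)
scores <- data.frame(
  group = rep(c("a", "b"), each = 20),
  score = c(rnorm(20, mean = 0), rnorm(20, mean = 1))
)

# Run the test once; everything reported below comes from this object.
tt <- t.test(score ~ group, data = scores)

# Hypothetical helper: format a t-test result as an APA-style string.
report_t <- function(test) {
  p <- test$p.value
  p_txt <- if (p < .001) {
    "p < .001"
  } else {
    # Drop the leading zero, as APA style does for p-values.
    paste0("p = ", sub("^0", "", sprintf("%.3f", p)))
  }
  sprintf("t(%.2f) = %.2f, %s", test$parameter, test$statistic, p_txt)
}

print(report_t(tt))

# A formatted table of group means; kable renders it when the document is knit.
group_means <- aggregate(score ~ group, data = scores, FUN = mean)
if (requireNamespace("knitr", quietly = TRUE)) {
  print(knitr::kable(group_means, digits = 2,
                     col.names = c("Group", "Mean score")))
}
```

In an RMarkdown document, the setup would live in a code chunk, and the sentence itself would contain an inline expression such as `r report_t(tt)`, so the number in the text is regenerated every time the document is knit.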

You can even use bibtex for references (shown in the full example). Kyle MacDonald, Dan Yurovsky, and I recently wrote a paper together on the role of social cues in cross-situational word learning (the manuscript is under review at the moment). Kyle did the entire thing in RMarkdown using this workflow (repository here), and then did journal formatting using a knitr style that he bundled into his own R package.

The RStudio knitr integration is such that it's really trivial to get started using this workflow (here's a good initial guide), and it's pretty interactive to re-knit and see the output. Occasionally debugging is still a bit tricky, but you can easily switch back to the REPL to debug more complex code blocks. Perhaps the strongest evidence of how easy it is to work this way is that, more and more, I've found myself turning to this workflow as the starting point of data analysis, rather than as a separate packaging step at the end of a project.

Often we tend to think of there being a tension between the principles of open science and the researcher's own incentive to work quickly. In contrast, this is a case where I think that there is no tension at all: a better, easier, and faster workflow leads to both a lower risk of errors and more transparency.

* There are some potential issues in the automated extraction procedure that Nuijten et al. used. In particular, they have a very inflexible schema for reporting: if authors included an effect size, formatted their statistical results in a single parenthetical, or used any other common formatting alternative, the package would not extract the appropriate stat (in practice, they get around 68% of tests). This kind of thing would be easy to improve on using modern machine reading packages (e.g., I'm thinking of DeepDive's extractors). But they also report a validation study in the Appendix that looks pretty good, so I'm not hugely worried about this aspect of the work.

**  Actually, I doubt the statcheck package that Nuijten et al. used would find many of my stats at all, though: at this point, I do relatively few t-tests, chi-squareds, or ANOVAs. Instead I prefer to use regression or other models to try and describe the set of quantitative trends across an entire dataset – more like the approach that Andrew Gelman has advocated for.

*** You can of course still make coding errors here. But that was true before. You just don't have to copy and paste the output of your error into a separate window.

Nuijten MB, Hartgerink CH, van Assen MA, Epskamp S, & Wicherts JM (2015). The prevalence of statistical reporting errors in psychology (1985-2013). Behavior Research Methods. PMID: 26497820


  1. Hi Michael,

    great post on an important topic. I have been working on a similar package myself. It provides an Rmd-template for APA manuscripts and several functions that automate reporting of analysis results (such as t-tests, correlations, regression and anova). I write all my papers using this package, although it's still an early development version and not on CRAN, yet. Maybe this is useful for some of your readers, too.

    Best regards,

  2. Thanks, Frederik! This is extremely helpful. We will check it out.

  3. We switched to using knitr (and Sweave before that) years ago, and it has served us well. The first published paper where we used Sweave appeared in 2011:

    Titus von der Malsburg and Shravan Vasishth. What is the scanpath signature of syntactic reanalysis? Journal of Memory and Language, 65:109-127, 2011.

    Now we use knitr, and I have recently started working with Aust's papaja package. I like it, but perhaps due to my old-fashioned ways, I prefer writing papers in Rnw format, not Rmd. One weird problem with R markdown in Beamer slides is that I can't get the page numbers to show---if anyone knows how to fix that, please let me know.

    As Michael points out, however, errors will still creep in due to coding errors, and so I don't think that using knitr etc is going to solve the problem. What is more important is that people release their data and code publicly when their paper is published. We've been doing this for years now, but I don't know anyone else who does (except other people in Potsdam).

    Another major issue with all this reproducible stuff is that code often does not reproduce on another machine/environment. I can't even get my code to reproduce 5-10 years later because the software (e.g., lme4) will have fundamental changes in it over time. In one instance, we could not reproduce our analyses 12 months later (on our own machines!), once we updated lme4. So one has to take this reproducibility idea with a grain of salt; it only works in a very limited way in practice. However, releasing your code, whether it works or not, along with your data should be a matter of course in our field. I am amazed that almost nobody does this yet.

    1. This comment has been removed by the author.

    2. I have had similar problems reproducing my analyses with different lme4 versions. I think packages like packrat and checkpoint can help with these kinds of problems.

    3. Thanks for these tips; I will make these packages part of the workflow. Great job with papaja, by the way!

  4. @Shravan I suspect more people do than you're aware of. For example, I independently ended up with similar tooling when writing my MSc dissertation. Incidentally, later in the process I dropped a lot of my knitr in favour of LaTeX includes and a Makefile. This speeded up builds of the whole document and allowed me to develop particularly fiddly tables in isolation. So this arrangement sounds like a pattern to me. A significant proportion of (psychology) papers are (or should be) software so there's a thriving ecosystem of tools and workflow to support these use cases.

    1. Yes, I'm sure that lots of people must be using knitr. I was only talking about my narrow field of psycholinguistics. I know a few people who use it, but very few. I think that releasing data and code (such as it is) is a much more important to-do for our field than using knitr (although that's great too).

  5. @Shravan I agree with Paul that more people are doing this sort of thing than we know. But also there are still far fewer than should be. My lab releases lots of code and data - if you look at our pubs page, there are repo links for many of our papers. Should be all of them but we are slow in some cases.

    1. That's really great that you release data and code! But I am sure you agree that you are a rare scientist in your field in that respect.

      I think that in the long run if people can be convinced to release data and code as a matter of course ("of course it's online!"), we will make much more progress in our fields. Right now I have limited success in obtaining published data from others; this can be easily changed.

      I'm also facing your problem of not being quick enough to put up data. It takes a long time to assemble the data and code, sometimes too long.