Monday, June 30, 2014

Revisions to APA

Following a recent Twitter thread, I've been musing on the APA publication and citation formats. Here are a few modest suggestions for modification of APA standards.
  • Place figures and captions in the text. Back when we used typewriters and early word processors, it was not possible for people to put the figures in text. That's no longer the case. Flipping back and forth between figures, captions, and text is cognitively challenging for reviewers and serves no current typesetting purpose.
  • Get rid of addresses for publishers. There is no clear format for these – do you put City, State, Country? Or City, State? Or City, Country? Does it depend on whether it's a US city or not? It's essentially impossible to find out what city you should put for most major publishers anyway, since they are all currently international conglomerates. 
  • Do away with issue numbers. The current idea is that you are supposed to know how the page numbers are assigned in a volume to decide whether to include the issue number. That is absolutely crazy. 
  • Require DOIs. This recommendation is currently not enforced in any systematic way because it's a bit of a pain. But requiring DOIs would deal with many of the minor, edge-case ambiguities caused by removing addresses and issue numbers, as well as making many bibliometric analyses far easier.
  • Come up with a better standard for conference proceedings. For those of us who cite papers at Cognitive Science or Computer Science conferences like CogSci, NIPS, or ACL, it can be a total pain to invent or track down volume, page, editors, etc. for digital proceedings that don't really have this information.
Update in response to questions: I have had several manuscripts returned without review for failing to put figures and captions at the end of the manuscript, both from APA journals (JEP:G and JEP:LMC), although other journals explicitly ask you to put figures inline.

Taxonomies for teaching and learning

(This post is joint with Pat Shafto and Noah Goodman; Pat is the first author and you can also find it on his blog).

Earlier this year, we read an article in the journal Behavioral and Brain Sciences (BBS) by Kline, entitled How to learn about teaching: An evolutionary framework for the study of teaching behavior in humans and other animals. The article starts out from the premise that there are major debates about what constitutes teaching behavior, both in human communities and in ethological analyses of non-human behavior. Kline then proposes a functionalist definition of teaching, that teaching is "behavior that evolved to facilitate learning in others," and outlines a taxonomy of teaching behaviors that includes:
  • Teaching by social tolerance, where you let someone watch you do something;
  • Teaching by opportunity provisioning, where you create opportunities for someone to try something;
  • Teaching by stimulus or local enhancement, where you highlight the relevant aspects of a problem for a learner;
  • Teaching by evaluative feedback, which is what it sounds like; and
  • Direct active teaching, which is teaching someone verbally or by demonstration.
We were interested in this taxonomy because it intersects with work we've done on understanding teaching in the context of the inferences that learners make. In addition, Kline didn't seem to have considered things from the perspective of the learner, which is what we thought made the most sense in our work – since adaptive benefits of teaching typically accrue to the learner, not the teacher. We wrote a proposal to comment on the Kline piece (BBS solicits such proposals), but our commentary was rejected. So we're posting it here.

In brief, we argue that evolutionary benefits of teaching are driven by the benefits to learners. Thus, an evolutionary taxonomy should derive from the inferential affordances that teaching allows for learners: what aspects of the input they can learn from, what they can learn, and hence what the consequences are for their overall fitness. In our work, we have outlined a taxonomy of social learning that distinguishes three levels of learning based on the inferences that can be made in different teaching situations:
  • Learning from physical evidence, where the learner cannot make any inference stronger than that a particular action causes a particular result;
  • Learning from rational action, where the learner can make the inference that a particular action is the best one to take in order to obtain a particular result, modulo constraints on the actor's action; and
  • Learning from communicative action, where the learner can infer that a teacher chose this example because it is the best or maximally useful example for them to learn from.
Critically, our work provides a formal account of these inferences such that their assumptions and predictions are explicit (Shafto, Goodman, & Frank, 2012). Our taxonomy distinguishes between cases where an individual chooses information intended to facilitate another’s learning (social goal-directed action in our terminology, or “direct active teaching” in Kline’s taxonomy) and cases where an individual engages in merely goal-directed behavior and is observed by a learner (“teaching by social tolerance” in Kline’s taxonomy). We've found that this framework lines up nicely with a number of results on teaching and learning in developmental psychology, including "rational imitation" findings and Pat's work on the "double-edged sword of pedagogy" as well as other empirical data (Goodman et al., 2009; Shafto et al., 2014).
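To make the contrast between learning from evidence and learning from communicative action concrete, here is a toy sketch. It is not the model from Shafto, Goodman, & Frank (2012); the two concepts, the uniform prior, and all function names are assumptions made up for illustration. A learner who treats an example as a random draw from the concept ends up less certain than a learner who assumes a teacher picked the example to be helpful.

```python
# Toy sketch only: two hypothetical concepts over items 1 and 2, uniform prior.
HYPOTHESES = {"A": {1}, "B": {1, 2}}
EXAMPLES = [1, 2]


def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    return {key: value / total for key, value in weights.items()}


def sampled_likelihood(example, hypothesis):
    """Probability of an example if it is drawn at random from the concept."""
    members = HYPOTHESES[hypothesis]
    return 1 / len(members) if example in members else 0.0


def evidence_posterior(example):
    """Learner who treats the example as plain evidence (no teacher intent)."""
    return normalize({h: sampled_likelihood(example, h) for h in HYPOTHESES})


def teacher_choice(hypothesis):
    """Teacher picks examples in proportion to how well they point the
    evidence-only learner toward the true concept."""
    return normalize({d: evidence_posterior(d)[hypothesis] for d in EXAMPLES})


def communicative_posterior(example):
    """Learner who assumes the example was chosen by such a teacher."""
    return normalize({h: teacher_choice(h)[example] for h in HYPOTHESES})


print(evidence_posterior(1)["A"])       # belief in concept A from evidence alone
print(communicative_posterior(1)["A"])  # belief in concept A from a chosen example
```

In this toy case, seeing item 1 leaves the evidence-only learner at two-thirds belief in concept A, while the learner who assumes a helpful teacher reaches four-fifths: the same example licenses a stronger inference once the learner reasons about why it was chosen.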

The remaining distinctions proposed by Kline – teaching by opportunity provisioning, teaching by stimulus enhancement, and teaching by evaluative feedback – also fit neatly into our framework for social learning. Opportunity provisioning is a case of non-social learning, where the possibilities have been reduced to facilitate learning. Stimulus enhancement is a form of social goal-directed learning where the informant chooses information to facilitate learning (much as in direct active teaching). Teaching by evaluative feedback is a classic form of non-social learning known as supervised learning.

On our account, these distinctions correspond to qualitatively different inferential opportunities for the learner. As such, the different cases have different evolutionary implications – through qualitative differences in knowledge propagation within groups over time. If the goal is to understand teaching as an adaptation, we argue that it is critical to analyze social learning situations in terms of the differential learning opportunities that they provide. Any taxonomy of teaching or social learning must distinguish among these possibilities by focusing on the inferential implications for learners, not just by characterizing the circumstances of teaching.


Bonawitz, E. B., Shafto, P., Gweon, H., Goodman, N. D., Spelke, E. & Schulz, L. (2011). The double-edged sword of pedagogy: Instruction affects spontaneous exploration and discovery. Cognition, 120, 322-330.

Goodman, N. D., Baker, C. L., & Tenenbaum, J. B. (2009). Cause and intent: Social reasoning in causal learning. Proceedings of the Thirty-First Annual Conference of the Cognitive Science Society.

Shafto, P., Goodman, N. D., & Griffiths, T. L. (2014). Rational reasoning in pedagogical contexts. Cognitive Psychology.

Shafto, P., Goodman, N. D., & Frank, M. C. (2012). Learning from others: The consequences of psychological reasoning for human learning. Perspectives on Psychological Science, 7, 341-351.

Kline, M. (2014). How to learn about teaching: An evolutionary framework for the study of teaching behavior in humans and other animals. Behavioral and Brain Sciences, 1-70. DOI: 10.1017/S0140525X14000090

First word?

(The somewhat cubist Brown Bear that may or may not have been the referent of M's first word.) 

M appears to have had a first word. As someone who studies word learning, I suppose I should have been prepared. But as excited as I was, I still found the whole thing very surprising. Let me explain.

M has been babbling ba, da, and ma for quite a while now. But at about 10.5 months, she started producing a very characteristic sequence: "BAba." This sequence showed up in the presence of "Brown Bear, Brown Bear, What Do You See," a board book by Eric Carle that she loves and that she reads often both at daycare and at home.

I was initially skeptical that this form was really a word, but three things convinced me:

  1. Consistency of form. The intonation was descending, the stress was on the first syllable, and there was a hint of rounding ("B(r)Ab(r)a"). It felt very word-y.
  2. Consistency of context. We heard this again and again when we would bring the book to her. 
  3. Low frequency in other contexts. We pretty much only heard it when "Brown Bear" was present, with the exception of one or two potential false alarms when another book was present.
Even sillier, M stopped using it around 3 weeks later. Now we think she's got "mama" and "dada" roughly associated with us, but we haven't heard "BAba" in a while, even with repeated prompting. 

This whole trajectory highlights a feature of development that I find fascinating: its non-linearity. M's growth – physically, motorically, and cognitively – proceeds in fits and starts, rather than continuously. We see new developments, and then a period of consolidation. We may even see what appears to be regression. 

It's easy to read about non-linearities in development. But observing one myself made me think again about the importance of microgenetic studies, where you sample a single child's development in depth across a particular transition point for the behavior you're interested in. As readers of the developmental literature, we forget this kind of thing sometimes; as parents, we are the original microgeneticists. 

Monday, June 2, 2014

Shifting our cultural understanding of replication

tl;dr - I agree that replication is important – very important! But the way to encourage it as a practice is to change values, not to shame or bully.


The publication of the new special issue on replications in Social Psychology has prompted a huge volume of internet discussion. For one example, see the blogpost by Simone Schnall, one of the original authors of a paper in the replication issue – much of the drama has played out in the comment thread (and now she has posted a second response). My favorite recent additions to this conversation have focused on how to move the field forward. For example, Betsy Levy Paluck has written a very nice piece on the issue that also happens to recapitulate several points about data sharing that I strongly believe in. Jim Coan also has a good post on the dangers of negativity.

All of this discussion has made me consider two questions: First, what is the appropriate attitude towards replication? And second, can we make systematic cultural changes in psychology to encourage replication without the kind of negative feelings that have accompanied the recent discussion? Here are my thoughts:
  1. The methodological points made by proponents of replication are correct. Something is broken.
  2. Replication is a major part of the answer; calls for direct replication may even understate its importance if we focus on cumulative, model-driven science.
  3. Replication must not depend on personal communications with authors. 
  4. Scolding, shaming, and "bullying" will not create the cultural shifts we want. Instead, I favor technical and social solutions. 
I'll expand on each of these below.

1. Something is broken in psychology right now.

The Social Psychology special issue and the Reproducibility Project (which I'm somewhat involved in) both suggest that there may be systematic issues in our methodological, statistical, and reporting standards. My own personal experiences confirm this generalization. I teach a course based on students conducting replications. A paper describing this approach is here, and my syllabus – along with lots of other replication education materials – is here.

In class last year, we conducted a series of replications of an ad-hoc set of findings that the students and I were interested in. Our reproducibility rate was shockingly low. I coded our findings on a scale from 0 to 1, with 1 denoting full replication (a reliable significance test on the main hypothesis of interest) and .5 denoting partial replication (a trend towards significance, or a replication of the general pattern but without a predicted interaction or with an unexpected moderator). We reproduced 8.5 / 19 results (about 45%), with a somewhat higher probability of replication for more "cognitive" effects (~75%, N=6) and a somewhat lower probability for more "social" effects (~30%, N=11). Alongside the obvious possible explanation for our failures – that some of the findings we tried to reproduce were spurious to begin with – there are many other very plausible explanations. Among others: we conducted our replications on the web, there may have been unknown moderators, we may have made methodological errors, and we could have been underpowered (though we tried for 80% power relative to the reported effects).*
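For readers who haven't planned a study for 80% power, here is a rough sketch of the arithmetic. It assumes a simple two-group, between-subjects comparison and uses the normal approximation to the t-test, so exact software may ask for a participant or two more per group. The function name and the effect sizes are illustrative, not the ones from our class projects.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size_d, power=0.80, alpha=0.05):
    """Approximate participants needed per group for a two-sided,
    two-sample comparison, given a standardized effect size (Cohen's d).

    Normal approximation: n = 2 * ((z_(1 - alpha/2) + z_power) / d) ** 2
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_power = standard_normal.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size_d) ** 2)


# Hypothetical reported effects: a "medium" and a "large" d.
for d in (0.5, 0.8):
    print(d, n_per_group(d))
```

One caveat worth keeping in mind: published effect sizes tend to be overestimates, so a sample planned this way against the reported effect can still be underpowered for the true one.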

To state the obvious: Our numbers don't provide an estimate of the probability that these findings were incorrect. But they do estimate the probability that a tenacious graduate student could reproduce the finding effectively for a project, given the published record and an email – sometimes unanswered – to the original authors. Although our use of Mechanical Turk as a platform was for convenience, preliminary results from the RP suggest that my course's estimate of reproducibility isn't that far off base.

When "prep" – the probability of replication – is so low (the true prep, not the other one), we need to fix something. If we don't, we run a large risk that students who want to build on previous work will end up wasting tremendous amounts of time and resources trying to reproduce findings that – even if they are real – are nevertheless so heavily moderated or so finicky that they will not form a solid basis for new work.

2. Replication is more important than even the replicators emphasize.  

Much of the so-called "replication debate" has been about the whether, how, who, and when of doing direct replications of binary hypothesis tests. These hypothesis tests are used in paradigms from papers that support particular claims (e.g. cleanliness is related to moral judgment, or flags prime conservative attitudes). This NHST approach – even combined with a meta-analytic effect-size estimation approach, as in the Many Labs project – understates the importance of replication. That's because these effects typically aren't used as measurements supporting a quantitative theory.

Our goal as scientists (psychological, cognitive, or otherwise) should be to construct theories that make concrete, quantitative predictions. While verbal theories are useful up to a point, formal theories are a more reliable method for creating clear predictions; these formal theories are often – but don't have to be – instantiated in computational models. There is some more discussion of this viewpoint, which I call "models as formal theories," here and here. If our theories are going to make quantitative predictions about the relationship between measurements, we need to be able to validate and calibrate our measurements. This validation and calibration is where replication is critical.

Validation. In the discussion to date (largely surrounding controversial findings in social psychology), it has been assumed that we should replicate simply to test the reliability of previous findings. But that's not why every student methods class performs the Stroop task. They are not checking to see that it still works. They are testing their own reliability – validating their measurements.

Similarly, when I first set up my eye-tracker, I set out to replicate the developmental speedup in word processing shown by Anne Fernald and her colleagues (reviewed here). I didn't register this replication, and I didn't run it by her in advance. I wasn't trying to prove her wrong; as with students doing Stroop as a class exercise, I was trying to validate my equipment and methods. I believed so strongly in Fernald's finding that I figured that if I failed to replicate it, then I was doing something wrong in my own methods. Replication isn't always adversarial. This kind of bread and butter replication is – or should be – much more common.

Calibration. If we want to make quantitative predictions about the performance of a new group of participants in tasks derived from previous work, we need to calibrate our measurements to those of other scientists. Consistent and reliable effects may nevertheless be scaled differently due to differences in participant populations. For example, studies of reading time among college undergraduates at selective institutions may end up finding overall faster reading than studies conducted among a sample with a broader educational background.

As one line of my work, I've studied artificial language learning in adults as a case study of language learning mechanisms that could have implications for the study of language learning in children. I've tried to provide computational models of these sorts of learning phenomena (e.g. here, here, and here). Fitting these models to data has been a big challenge because typical experiments only have a few datapoints - and given the overall scaling differences in learning described above, a model needs to have 1 - 2 extra parameters (minimally an intercept but possibly also a slope) to integrate across experiment sets from different labs and populations.
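As a sketch of what those extra parameters do, the snippet below fits a separate intercept and slope per lab by ordinary least squares, mapping one shared set of model predictions onto each lab's observed accuracies. This is only an illustration of the idea, not the fitting procedure from the papers; the function name and all of the numbers are made up.

```python
def fit_calibration(predicted, observed):
    """Least-squares intercept and slope mapping model predictions onto
    one population's observed scores: observed ~ intercept + slope * predicted."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    slope = cov / var_p
    intercept = mean_o - slope * mean_p
    return intercept, slope


# Hypothetical data: one set of model predictions, two participant populations.
model_predictions = [0.2, 0.4, 0.6, 0.8]
observed_by_lab = {
    "lab_a": [0.55, 0.65, 0.75, 0.85],  # higher baseline, compressed range
    "lab_b": [0.30, 0.50, 0.70, 0.90],  # lower baseline, full range
}

for lab, observed in observed_by_lab.items():
    intercept, slope = fit_calibration(model_predictions, observed)
    print(lab, round(intercept, 2), round(slope, 2))
```

With only four conditions per experiment, two of the four degrees of freedom go to calibration, which is exactly why so few datapoints make model fitting hard.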

As a result, I ended up doing a lot of replication studies of artificial language experiments so that I could vary parameters of interest and get quantitatively-comparable measures. I believed all of the original findings would replicate – and indeed they did, often precisely as specified. If you are at all curious about this literature, I replicated (all with adults): Saffran et al. (1996a; 1996b); Aslin et al. (1998); Marcus et al. (1999); Gomez (2002); Endress, Scholl, & Mehler (2005); and Yu & Smith (2007). All of these were highly robust. In addition, in all cases where there were previous adult data, I found differences in the absolute level of learning from the prior report (as you might expect, considering I was comparing participants on Mechanical Turk or at MIT with whatever population the original researchers had used). I wasn't surprised or worried by these differences. Instead, I just wanted to get calibrated – find out what the baseline measurements were for my particular participant population.

In other words, even – or maybe even especially – when you assume that the binary claims of a paper are correct, replication plays a role by helping to validate empirical measurements and calibrate those measurements against prior data.

3. Replications can't depend on contacting the original authors for details. 

As Andrew Wilson argues in his nice post on the topic, we need to have the kind of standards that allow reproducibility – as best as we can – without direct contact of the initial authors. Of course, no researcher will always know perfectly what factors matter to their finding, especially in complex social scenarios. But how are we ever supposed to get anything done if we can't just read the scientific literature and come up with new hypotheses and test them? Should we have to contact the authors for every finding we're interested in, to find out whether the authors knew about important environmental moderators that they didn't report? In a world where replication is commonplace and unexceptional – where it is the typical starting point for new work rather than an act of unprovoked aggression – the extra load caused by these constant requests would be overwhelming, especially for authors with popular paradigms.

There's a different solution. Authors could make all of their materials (at least the digital ones) directly and publicly accessible as part of publication. Psycholinguists have been attaching their experimental stimulus items as an appendix to their papers for years – no reason not to do this more ubiquitously. For most studies, posting code and materials will be enough. In fact, for most of my studies – which are now all run online – we can just link to the actual HTML/javascript paradigm so that interested parties can try it out. If researchers believe that their effects are due to very specific environmental factors, then they can go the extra mile to take photos or videos of the experimental circumstances. The sharing of materials and data (whether using the Open Science Framework, github, or other tools) is free and costs almost nothing in terms of time. Used properly, these tools can even improve the reliability of our own work along with its reproducibility by others.

I don't mean to suggest that people shouldn't contact original authors, merely that they shouldn't be obliged to. Original authors are – by definition – experts in a phenomenon, and can be very helpful in revealing the importance of particular factors, providing links to followup literature both published and unpublished, and generally putting work in context. But a requirement to contact authors prior to performing a replication emphasizes two negative values: the possibility for perceived aggressiveness in the act of replication, and the incompleteness of methods reporting. I strongly advocate for the reverse default. We should be humbled and flattered when others build on our work by assuming that it is a strong foundation, and they should assume our report is complete and correct. Neither of these assumptions will always be true, but good defaults breed good cultures.

4. The way to shift to a culture of replication is not by shaming the authors of papers that don't replicate. 

No one likes it when people are mean to one another. There's been some considerable discussion of tone on the SPSP blog and on twitter, and I think this is largely to the good. It's important to be professional in our discussion or else we alienate many within the field and hurt our reputation outside it. But there's a larger reason why shaming and bullying shouldn't be our MO: they won't bring about the cultural changes we need. For that we need two ingredients. First, technical tools that decrease the barriers to replication; and second, role models who do cool research that moves the field forward by focusing on solid measurement and quantitative detail, not flashy demonstrations. 

Technical tools. One of the things I have liked about the – otherwise somewhat acrimonious – discussion of Schnall et al.'s work is the use of the web to post data, discuss alternative theories, and iterate in real time on an important issue (three links here, here, and here, with meta-analysis here). If nothing else comes of this debate, I hope it convinces its participants that posting data for reanalysis is a good thing. 

More generally, my feeling is that there is a technical (and occasionally generational) gap at work in some of this discussion. On the data side, there is a sense that if all we do are t-tests on two sets of measurements from 12 people, then no big deal, no one needs to see your analysis. But datasets and analyses are getting larger and more sophisticated. People who code a lot accept that everyone makes errors. In order to fight error, we need to have open review of code and analysis. We also need to have reuse of code across projects. If we publish models or novel analyses, we need to give people the tools to reproduce them. We need sharing and collaborating, open-source style – enabled by tools like github and OSF. Accepting these ideas about data and analyses means that replication on the data side should be trivial: a matter of downloading and rerunning a script. 

On the experimental side, reproducibility should be facilitated by a combination of web-based experiments and code-sharing. There will always be costly and slow methods – think fMRI or habituating infants – but standard issue social and cognitive psychology is relatively cheap and fast. With the advent of Mechanical Turk and other online study venues, often an experiment is just a web page, perhaps served by a tool like PsiTurk. And I think it should go without saying: if your experiment is a webpage, then I would like to see the webpage if I am reading your paper. That way if I want to reproduce your findings I should be able to make a good start by simply directing people – online or in the lab – to look at your webpage, measuring their responses, and rerunning your analysis code. Under this model, if I have $50 and a bit of curiosity about your findings, I can run a replication. No big deal.** 

We aren't there yet. And perhaps we will never be for more involved social psychological interventions (though see PERTS, the Project for Education Research that Scales, for a great example of such interventions in a web context). But we are getting closer and closer. The more open we are with experimental materials, code, and data, the easier replication and reanalysis will be and the less we will have to imagine replication as a last-resort, adversarial move, and the more collecting new data will be part of a general ecosystem of scientific sharing and reuse.

Role models. These tools will only catch on if people think they are cool. For example, Betsy Levy Paluck's research on high-schoolers suggests something that we probably all know intuitively. We all want to be like the cool people, so the best way to change a culture is by having the cool kids endorse your value of choice. In other words, students and early-career psychologists will flock to new approaches if they see awesome science that's enabled by these methods. I think of this as a new kind of bling: Instead of being wowed by the counterintuitiveness or unlikeliness of a study's conclusions, can we instead praise how thoroughly it nailed the question? Or the breadth and scope of its approach?


For what it's worth, some of the rush to publish high-profile tests of surprising hypotheses has to be due to pressures related to hiring and promotion in psychology. Here I'll again follow Betsy Levy Paluck and Brian Nosek in reiterating that, in the search committees I've sat on, the discussion over and over turns to how thorough, careful, and deep a candidate's work is – not how many publications they have. Our students have occasionally been shocked to see that a candidate with a huge, stellar CV doesn't get a job offer, and have asked "what more does someone need to do in order to get hired?" My answer (and of course this is only my personal opinion, not the opinion of anyone else in the department): Engage deeply with an interesting question and do work on that question that furthers the field by being precise, thorough, and clearly thought out. People who do this may pay a penalty in terms of CV length - but they are often the ones who get the job in the end.

I've argued here that something really is broken in psychology. It's not just that some of our findings don't (easily) replicate, it's also that we don't think of replication as core to the enterprise of making reliable and valid measurements to support quantitative theorizing. In order to move away from this problematic situation, we are going to need technical tools to support easier replication, reproduction of analyses, and sharing more generally. We will also need the role models to make it cool to follow these new scientific standards.

Thanks very much to Noah Goodman and Lera Boroditsky for quick and helpful feedback on a previous draft. (Minor typos fixed 6/2 afternoon).

* I recently posted about one of the projects from that course, an unsuccessful attempt to replicate Schnall, Benton, & Harvey (2008)'s cleanliness priming effect. As I wrote in that posting, there are many reasons why we might not have reproduced the original finding – including differences between online and lab administration. Simone Schnall wrote in her response that "If somebody had asked me whether it makes sense to induce cleanliness in an online study, I would have said 'no,' and they could have saved themselves some time and money." It's entirely possible that cleanliness priming specifically is hard to achieve online. That would surprise me given the large number of successes in online experimentation more broadly (including the IAT and many subtle cognitive phenomena, among other things – I also just posted data from a different social priming finding that does replicate nicely online). In addition, while the effectiveness of a prime should depend on the baseline availability of the concept being primed, I don't see why this extra noise would completely eliminate Schnall et al.'s effect in the two large online replications that have been conducted so far.

** There's been a lot of talk about why web-based experiments are "just as good" as in-lab experiments. In this sense, they're actually quite a bit better! The fact that a webpage is so easily shown to others around the world as an exemplar of your paradigm means that you can communicate and replicate much more easily.