Monday, November 24, 2014

The piecemeal emergence of language

It's been a while since I last wrote about M. She's now 16 months, and it's remarkable to see the trajectory of her early language. On the one hand, she still produces relatively few distinct words that I can recognize; on the other, her vocabulary in comprehension is quite large and she clearly understands a number of different speech acts (declaratives, imperatives, questions) and their corresponding constructions.

Some observations on production:

  • She still doesn't say "mama." She does say "mamamamamama" to express need, a pattern that Clark 1973 noted is common. She definitely knows what "mama" means, and even does funny things like pointing to me and saying "dada" then pointing to her mother and opening her mouth. 
  • I have nevertheless heard her make un-cued productions of "scissors," "bulldozer," and "motorcycle" (though not with great reliability). Motorcycle translated to something like "dodo SY-ku" – a kind of indistinct prosodic foot and then a second heavily stressed foot. Her production vocabulary is extremely idiosyncratic compared with her comprehension, precisely the pattern identified by Mayor & Plunkett (2014) in a very cool recent paper. 
  • "BA ba" (repeated over and over again) seems to mean "let's sing a song" – or especially, let's watch inane internet children's song videos. We don't do this last all that often, but it has made an outsize impression on her, perhaps because she's seen so little TV in her short life. This is also the first time that she's taken to repeating a single word / label over and over again, so as to emphasize the point. 
And on comprehension:
  • Our life got vastly better when M learned how to say "yes" to yes/no questions. For about a month now, we've been able to say things like "would you like to go outside?" and she will reply "da!" (she is Russian, apparently). "Da" has very recently morphed into "yah" but it's very clearly a strong affirmative. M will occasionally turn her head away and wrinkle her nose if she doesn't like the suggestion. This response feels a lot like a generalization of her "I don't want to eat that bite" face. 
  • Other types of questions have been slower. Maybe unsurprisingly, "or" is still not a success – she either stays silent or responds to the second option, even if she knows how to produce a word for one or both options. "Where" questions have been emerging in the last week or so. This morning, M was very clear in directing me when I asked her "where should we go?" "What's this" is uneven – occasionally I'll get a "ba" or "da" (ball/dog) type production. And "what do you want" has only gotten a successful production once or twice (bottle, I think). 
  • M understands and responds to simple imperatives just fine: "take the cup to baby" gets a positive response, though her accuracy on less plausible sentences is low.
  • Explanations seem to hold a lot of water with her. I don't think she understands the explanation at all, but if we need to give something to someone, or leave something behind that she's holding, we ask her and then explain. For example, telling her why we can't bring her favorite highlighter pen in the car with us seems to convince her to put it down. What's going through her mind here? Maybe just our seriousness about the idea – something like "wow, they used a lot of words, they must really mean it."
  • She is remarkably good at negation (at least when she wants to be). A few days ago we were headed out the door to the playground, and M tried to drag a big stroller blanket out the door. I said "We're not going to bring our blanket outside." She headed back over to the stroller, and dropped the blanket. Of course, then she headed back towards the door, turned back, and grabbed a smaller blanket. There was a lot of contextual support to this sequence, but understanding my sentence still took some substantial sophistication. The negation "we're not" is embedded in the sentence, and wasn't supported by too much in the way of prosody. This success was very striking to me, given the failures of much older toddlers to understand more decontextualized negations in some research that Ann Nordmeyer and I have been doing.
Overall, I am still struck by how hard production is for M, compared with comprehension. A new word, say "playground," might start as something resembling "PAI-go" but merge back into "BA-ba" by the end of a few repetitions. M has never been a big babbler, and so I suspect that she is slow to produce language because the skills of production are simply not as well-practiced. There are some kids who babble up a storm, and I imagine all of the motor routines are much easier for them. In contrast, M just doesn't have the sounds of language in her mouth yet.

Wednesday, November 19, 2014

Musings on the "file drawer" effect

tl;dr: Even if you love science, you don't have to publish every experiment you conduct.

I was talking with a collaborator a few days ago and discussing which of a series of experiments we should include in our writeup. In the course of this conversation, he expressed uncertainty about whether we were courting ethical violation by choosing to exclude from a potential publication a set of generally consistent but somewhat more poorly executed studies. Publication bias is a major problem in the social sciences (and elsewhere).* Could we be contributing to the so-called "file drawer problem," in which meta-analytic estimates of effects are inflated by the failure to publish negative findings?

I'm pretty sure the answer is "no."

Some time during my first year of graduate school, I had run a few studies that produced positive findings (e.g., statistically significant differences between groups).  I went to my advisor and started saying all kinds of big things about how I would publish them and they'd be in this paper and that paper and the other; probably it came off as quite grandiose. After listening for a while, he said, "we don't publish every study we run."

His point was that a publishable study – or set of studies – is not one that produces a "significant" result. A publishable study is one that advances our knowledge, whether the result is statistically significant or not. If a study is uninteresting, it may not be worth publishing. Of course, the devil is in the details of what "worth publishing" means, so I've been thinking about how you might assess this. Here's my proposal:
It is unethical to avoid publishing a result if a knowledgeable and adversarial reviewer could make a reasonable case that your publication decision was due to a theoretical commitment to one outcome over another. 
I'll walk through both sides of this proposal below. If you have feedback, or counterexamples, I'd be eager to hear them. 

When it's fine not to publish. First, not everyone has an obligation to publish scientific research. For example, I've supervised some undergraduate honors theses that were quite good, but the students weren't interested in a career in science. I regret that they didn't do the work to write up their data for publication, but I don't think they were being unethical, at least from the perspective of publication bias (if they had discovered a lifesaving drug, the analysis might be different).

Second, publication has a cost. The cost is mostly in terms of time, but time is translatable directly into money (whether from salary or from research opportunity cost). Under the current publication system, publishing a peer-reviewed paper is extremely slow. In addition to the authors' writing time, a paper takes hours of time from editors and reviewers, and much thought and effort in responding to reviews. A discussion of the merits of peer review is a topic for another post (spoiler: I'm in favor of it).** But even the most radical alternatives – think generalized arXiv – do not eliminate the cost of writing a clear, useful manuscript. 

So on a cost-benefit analysis, there is a lot of work that shouldn't be written up. For example, cases of experimenter error are pretty clear cut. If I screw up my stimuli and Group A's treatment was contaminated with items that Group B should have seen, then what do we learn? The generalizable knowledge from that kind of experiment is pretty thin. It seems uncontroversial that results of this sort aren't worth publishing.

What about correct but boring experiments? What if I show that the Stroop effect is unaffected by font choice – or perhaps I show a tiny, statistically significant but not meaningful, effect of serif fonts on the Stroop effect?*** For either of these experiments, I imagine I could find someone to publish them. In principle, if they were well-executed, PLoS ONE would be a viable venue, since they do not referee for impact. But I am not sure why anyone would be particularly interested, and I don't think it'd be unethical not to publish them.

When it's NOT fine not to publish. First, when a finding is "null" – meaning, not statistically significant despite your expectation that it would be. Someone who held an alternative position (e.g. that the finding would not be predicted to yield a significant result) could say that you were biasing the literature due to your theoretical commitment. This is probably the most common case of publication bias.

Second, if your finding is inconsistent with a particular theory, this fact also should not be used in the decision about publication. Obviously, an adversarial critic could argue – rightly – that you suppressed the finding, which in turn leads to an exaggeration in the degree of published evidence for your preferred theory.

Third, when a finding (finding #1) is contradictory to another finding (finding #2) that you do intend to publish. Here, just imagine that your reviewer knew about #1 as well. Could you justify not publishing #1 on a priori grounds that are independent of the theory? In my experience, the only time that is possible is if #1 is clearly a flawed experiment and does not have any evidential value for the question you're interested in.****
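To make the first case concrete, here is a minimal simulation sketch of the file-drawer mechanism: many small two-group studies of the same modest effect, of which only the "significant" ones get written up. Everything in it is an illustrative assumption rather than data from any real literature: the function name `simulate_file_drawer`, the true effect of 0.2 SD, the 20 participants per group, and the z > 1.96 publication rule (a normal approximation to a t-test).

```python
import random
import statistics


def simulate_file_drawer(true_effect=0.2, n_per_group=20, n_studies=2000, seed=1):
    """Simulate two-group studies and 'publish' only significant positive ones.

    Returns (mean effect across all studies,
             mean effect across published studies,
             number of published studies).
    All parameter values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    all_effects, published = [], []
    for _ in range(n_studies):
        treatment = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        diff = statistics.mean(treatment) - statistics.mean(control)
        pooled_sd = ((statistics.variance(treatment)
                      + statistics.variance(control)) / 2) ** 0.5
        d = diff / pooled_sd                      # standardized effect size
        z = d / (2 / n_per_group) ** 0.5          # approximate test statistic
        all_effects.append(d)
        if z > 1.96:                              # the "file drawer" filter
            published.append(d)
    return statistics.mean(all_effects), statistics.mean(published), len(published)


all_mean, published_mean, n_published = simulate_file_drawer()
print(f"mean d, all studies:       {all_mean:.2f}")
print(f"mean d, published studies: {published_mean:.2f} ({n_published} published)")
```

Under these assumptions, every published study has to clear the significance threshold, so the published average necessarily sits well above the true effect. That is the inflation a meta-analysis of the published record would inherit.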

Conclusions. Publication bias is a significant issue, and we need to use a variety of tools to combat it. Funnel plots are a useful tool, and some new work by Simonsohn et al. uses p-curve analysis. But the solution is certainly not to assume that researchers should publish all their experiments – that solution might be as bad as the problem, in terms of the cost for scientific productivity. Instead, to determine if they are suppressing evidence due to their own biases, researchers should consider applying an ethical test like the one I proposed above.

(The footnotes here got a little out of control).

* A recent, high impact study used TESS (Time-Sharing Experiments in the Social Sciences, a resource for doing pre-peer reviewed experiments with large, representative samples) to estimate publication bias in the social sciences. I like this study a lot, but I am not sure how general the bias estimates are, because TESS is a special case. TESS is a limited resource, and experiments submitted to TESS undergo substantial additional scrutiny due to TESS's pre-data collection review. They are relatively more well-vetted for potential theoretical impact, and substantially less likely to have basic errors, compared with a one-off study using a convenience sample. I suspect – based on no data except my own experience – that relatively more data is left unpublished than the TESS study's estimate, but also that relatively less of it should be published.

** You could always say, hey, we should just put all our data online. We actually do something sort of like that. But you can't just go to and easily find out whether we conducted an experiment on your theoretical topic of choice. Reporting experiments is not just about putting the data out there – you need description, links to relevant literature, etc.

*** Actually, someone has done Stroop for fonts, though that's a different and slightly more interesting experiment.

**** Here's a trickier one. If a finding is consistent with a theory, could this consistency be grounds to avoid publishing it? A Popperian falsificationist scientist should never publish data that are simply consistent with a particular theory, because those data have no value. But basically no one operates in this way – we all routinely make predictions from theory and are excited when they are satisfied. For a Bayesian scientist of this type, data consistent with a theory are important. But some data may be consistent with many theories and hence provide little evidential value. Other data may be consistent with a theory, but that theory is already so well-supported that the experiments make little change in our overall degree of belief – consider the case of experiments supportive of Newton's laws, or of further Stroop replications. These cases also potentially work under the adversarial reviewer test, but only if we include the cost-benefit analysis above, and the logic is dicier. A reviewer could accuse you of bias against the Stroop effect, but you might respond that you just didn't think the incremental evidence was worth the effort. Nevertheless, this balance seems less straightforward. Reflecting this complexity, perhaps the failure to publish confirmatory evidence actually does matter. In a talk I heard last spring, John Ioannidis made the point that there are basically no medical interventions out there with d (standardized effect size) > 3 or so (I forget the exact number). I think this is actually a case of publication bias against confirmation of obvious effects. For example, I can't find a clinical trial of the rabies vaccine anywhere after Pasteur – because the mortality rate without the vaccine is apparently around 99%, and with the vaccine most people survive. The effect size there is just enormous – so big that you should just treat people! So actually the literature does have systematic bias against really big effects.

Monday, November 10, 2014

Comments on "reproducibility in developmental science"

A new article by Duncan et al. in the journal Developmental Psychology highlights best practices for reproducibility in developmental research. From the abstract:
Replications and robustness checks are key elements of the scientific method and a staple in many disciplines. However, leading journals in developmental psychology rarely include explicit replications of prior research conducted by different investigators, and few require authors to establish in their articles or online appendices that their key results are robust across estimation methods, data sets, and demographic subgroups. This article makes the case for prioritizing both explicit replications and, especially, within-study robustness checks in developmental psychology. 
I'm very interested in this topic in general and think that the broader message is on target. Nevertheless, I was surprised by the specific emphasis in this article on what they call "robustness checking" practices. In particular, all three of the robustness practices they describe – multiple estimation techniques, multiple datasets, and subgroup analyses – seem to be most useful for non-experimental studies that involve large correlational datasets (e.g. from nationally representative studies).

Multiple estimation techniques refers to the use of several different statistical models (e.g. standard regression, propensity matching, instrumental variable regression) to estimate the same effect. This is not a bad practice, but it is much more important when there are many different ways of controlling for confounders (e.g. in a large observational dataset). In a two-condition experiment, the menu of options is more limited. Similarly, subgroup estimation – estimating models on smaller populations within the main sample – is typically only possible with a large, multi-site dataset. And the use of multiple datasets presupposes that there are many datasets that bear on the question of interest, something that is not usually true when you are making experimental tests of a new theoretical question.

So all this means that the primary empirical claim of the article – that developmental psych is behind other disciplines (like applied economics) in these practices – is a bit unfair. Here's the key table from the article:

The main point we're supposed to take away from this table is that the econ articles are doing many more robustness checks than the developmental psych articles. But I'd bet that most of the developmental psych journals are filled with novel empirical studies that don't afford comparison with large, pre-existing datasets; subgroup analyses; or use of multiple estimation techniques. And I'm not sure that's a bad thing – at the very least, causal inference is far more straightforward in randomized experiments than large-scale observational studies.

I think I have the same goals as the authors: making developmental (and other) research more reproducible. But I would start with a different set of recommendations to the developmental psych community. Here are three simple ones:
  • Larger samples. It is still common in the literature on infancy and early childhood to have extremely small sample sizes. N=16 is still the accepted standard in infancy research, believe it or not. Given the evidence that looking time is a quantitative variable (e.g. here and here), we need to start measuring it with precision. Infants are expensive, but not as expensive as false positives. And preschoolers are cheap, so there's really no excuse for tiny cell sizes.
  • Internal replication. There are many papers – again especially in infant research but also in work with older children – where the primary effect is demonstrated in Study 1 and then the rest of the reported findings are negative controls. A good practice for these studies is to pair each control with a de novo replication. This facilitates statistical comparison (e.g., equating for small aspects of population or testing setup that may change between studies) and also ensures robustness of the effect. 
  • Developmental comparison. This recommendation probably should go without saying. For developmental research – that is, work that tries to understand mechanisms of growth and change – it's critical to provide developmental comparisons and not just sample a single convenient age group. Developmental comparison groups also provide an important opportunity for internal replication. If 3-year-olds are above chance on your task and 4- and 5-year-olds aren't, then perhaps you've discovered an amazing phenomenon; but it's also possible you have a false positive. Our baseline hypotheses about development provide useful constraints on the pattern of results we expect, meaning that developmental comparison groups can provide both new data and a useful sanity check.
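As a rough illustration of the sample-size point, here is a back-of-the-envelope power calculation. It is a sketch under stated assumptions, not a description of any particular study: it treats N=16 as 16 participants per cell in a two-group, between-subjects comparison, assumes a medium effect (d = 0.5), and uses a normal approximation rather than the exact noncentral t distribution. The function name `approx_power_two_group` is made up for this example.

```python
from statistics import NormalDist


def approx_power_two_group(d, n_per_group, alpha=0.05):
    """Normal-approximation power for a two-sided, two-group comparison.

    d           -- assumed true standardized effect size (hypothetical)
    n_per_group -- participants in each of the two groups
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the assumed effect.
    noncentrality = d * (n_per_group / 2) ** 0.5
    # Probability of landing in either rejection region.
    return norm.cdf(noncentrality - z_crit) + norm.cdf(-noncentrality - z_crit)


for n in (16, 32, 64, 128):
    print(f"n per group = {n:3d}: approx. power = {approx_power_two_group(0.5, n):.2f}")
```

Under these assumptions, power at 16 per group is well below one half, and it takes roughly four times as many participants per group to reach the conventional 80%. Within-subject designs, which are common in infant work, will do better than this sketch suggests.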
Perhaps this all just reflects my different orientation towards the field than Duncan et al.; but a quick flip through a recent issue of Child Development suggests that the modal article is not a large observational study but a much smaller-scale set of experiments. The recommendations Duncan et al. make are certainly reasonable, but we need to supplement them with guidelines for experimental research as well.

Duncan GJ, Engel M, Claessens A, & Dowsett CJ (2014). Replication and robustness in developmental research. Developmental Psychology, 50(11), 2417-25. PMID: 25243330

(HT: Dan Yurovsky)

Monday, November 3, 2014

Is mind-reading automatic? A replication and investigation

tl;dr: We have a new paper casting doubt on previous work on automatic theory of mind. Musings on open science and replication.

Do we automatically represent other people's perspective and beliefs, just by seeing them? Or is mind-reading effortful and slow? An influential paper by Kovács, Téglás, and Endress (2010; I'll call them KTE) argued for the automaticity of mind-reading based on an ingenious and complex paradigm.  The participant watched an event – a ball moving back and forth, sometimes going behind a screen – at the same time as another agent (a Smurf) watched part of the same event but sometimes missed critical details. So, for example, the Smurf might leave and not see the ball go behind the screen.

When participants were tested on whether the ball was really behind the screen, they appeared to be faster when their beliefs lined up with the Smurf's. This experiment – and a followup with infants – gave apparently strong evidence for automaticity. Even though the Smurf was supposed to be completely "task-irrelevant" (see below), participants apparently couldn’t help "seeing the world through the Smurf’s eyes." They were slower to detect the ball, even when they themselves expected the ball, if the Smurf didn’t expect it to be there. (If this short description doesn't make everything clear, then take a look at our paper or KTE's original writeup. I found the paradigm quite hard to follow the first time I saw it.)

I was surprised and intrigued when I first read KTE's paper. I don't study theory of mind, but a lot of my research on language understanding intersects with this domain and I follow it closely. So a couple of years later, I added this finding to the list of projects to replicate for my graduate methods class (my class is based on the idea – originally from Rebecca Saxe – that students learning experimental methods should reproduce a published finding). Desmond Ong, a grad student in my department, chose the project. I found out later that Rebecca had also added this paper to her project list.

One major obstacle to the project, though, was that KTE had repeatedly declined to share their materials – in direct conflict with the Science editorial policy, which requires this kind of sharing. I knew that Jonathan Phillips (Yale) and Andrew Surtees (Birmingham) had worked together to create an alternative stimulus set, so Desmond got connected with them and they generously shared their videos. Rebecca's group created their own Smurf videos from scratch. (Later in the project, we contacted KTE again and even asked the Science editors to intervene. After the intervention, KTE promised to get us the materials but never did. As a result, we still haven't run our experiments with their precise stimuli, something that is very disappointing from the perspective of really making sure we understand their findings, though I would stress that because of the congruence between the two new stimulus sets in our paper, we think the conclusions are likely to be robust across low-level variations.)

After we got Jonathan's stimulus set, Desmond created an MTurk version of the KTE experiment and collected data in a high-power replication, which reproduced all of their key statistical tests. We were very pleased, and got together with Jonathan to plan followup experiments. Our hope was to use this very cool paradigm to test all kinds of subtleties about belief representation, like how detailed the participants' encoding was and whether it respected perceptual access. But then we started taking a closer look at the data we had collected and noticed that the pattern of findings didn't quite make sense. Here is that graph – we scraped the values from KTE's figures and replotted their SEM as 95% CIs:

The obvious thing is the difference in intercept between the two studies, but we actually don't have a lot to say about that – RT calculations depend on when you start the clock, and we don't know when KTE started the clock in their study. We also did our study on the internet, and though you can get reliable RTs online, they may be a bit slower for all kinds of unimportant reasons.

We also saw some worrisome qualitative differences between the two datasets, however. KTE's data look like people are slower when the Smurf thinks the ball is absent AND when they themselves think the ball is absent too. In contrast, we see a crossover interaction – people are slow when they and the Smurf think the thing is absent, but they are also slow when they and the Smurf think the thing is present. That makes no sense on KTE's account – that should be the fastest condition. What's more, we can't be certain that KTE wouldn't have seen that result, because their overall effects were so much smaller and their relative precision given those small effects seemed lower.

I won't go through all the ways that we tried to make this crossover interaction go away. Suffice it to say, it was a highly replicable finding, across labs and across all kinds of conditions that shouldn't have produced it. Here's Figure 3 from the paper:

Somewhere during this process, we joined forces with Rebecca, and found that they saw the crossover as well (panels marked "1c: Lab" and "2b: Lab 2AFC"). So Desmond and Jonathan then led the effort to figure out the source of the crossover.

Here's what they found. The KTE paradigm includes an "attention check" in which participants have to respond that they saw the Smurf leave the room. But the timing of this attention check is not the same across belief conditions – in other words, it's confounded with condition. And the timing of the attention check actually looks a lot like the crossover we observed: In the two conditions where we observed the slowest RTs, the attention check is closer in time to the actual decision that participants have to make.

There's an old literature showing that making two reaction time decisions right after one another makes the second one slower. We think this is exactly what's going on in KTE's paper, and we believe our experiments demonstrate it pretty clearly. When we don't have an attention check, we don't see the crossover; when we do have the check, but it doesn't have a person in it at all (just a lightbulb), we still see the crossover; and when we control the timing of the check, we eliminate the crossover again.

Across studies, we were able to produce a double dissociation between belief condition and the attention check. To my mind, this dissociation provides strong evidence that the attention check – and not the agent's belief – is responsible for the findings that KTE observed. In fact, KTE and collaborators actually also see a flat RT pattern in a recent article that didn't use the attention check (check their SI for the behavioral results)! So their results are congruent with our own – this also partially mitigates our concern about the stimulus issue. In sum, we don't think the KTE paradigm provides any evidence on the presence of automatic theory of mind.

Thoughts on this process. 

1. Several people who are critical of the replication movement more generally (e.g. Jason Mitchell, Tim Wilson) have suggested that we pursue "positive replications," where we identify moderator variables that control the effects of interest. That's what we did here. We "debugged" the experiment – figured out exactly what went wrong and led to the observed result. Of course, the attention check wasn't a theoretically-interesting moderator, but I think we did exactly what Mitchell and Wilson are talking about.

But I don't think "positive replication" is a sustainable strategy more generally. KTE's original N=24 experiment took us 11 experiments and almost 900 participants to "replicate positively," though we knew much sooner that it wouldn't be the paradigm we'd use for future investigations (what we might have learned from the first direct replication, given that the RTs didn't conform to the theoretical predictions). 

The burden in science can't fall this hard on the replicators all the time. Our work here was even a success by my standards, in the sense that we eventually figured out what was going on! There are other experiments I've worked on, both replications and original studies, where I've never figured out what the problem was, even though I knew there was a problem. So we need to acknowledge that replication can establish – at least – that some findings are not robust enough to build on, or do not reflect the claimed process, without ignoring the data until the replicators figure out exactly what is going on.

2. The replication movement has focused too much for my taste on binary effect size estimation or hypothesis testing, rather than model- or theory-based analysis. There's been lots of talk about replication as a project of figuring out if the original statistical test is significant, or if the effect size is comparable. That's not what this project was about – I should stress that KTE's original paradigm did prove replicable. All of the original statistical tests were statistically significant on basically every replication we did. The problem was that the overall pattern of data still wasn't consistent with the proposed theory. And that's really what the science was about.

This sequence of experiments was actually a bit of a reductio ad absurdum with respect to null-hypothesis statistical testing more generally. Our paper includes 11 separate samples. Despite having planned to have >80% power for all of the tests we did, the sheer scope means that a good number of them would not come out statistically significant, just by chance. So we were in a bit of a quandary – we had predictions that "weren't satisfied" in individual experiments, but we'd strongly expect that to be the case just by chance! (The probability of 11 independent statistical tests, each with 80% power, all coming out significant is less than .1.)
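The arithmetic behind that parenthetical is short enough to check directly. The sketch below assumes the 11 tests are independent and that each has exactly 80% power, which is a simplification of the ">80%" planned in the paper.

```python
power = 0.80      # assumed power of each individual test
n_tests = 11      # number of separate samples in the paper

# Probability that every one of the independent tests is significant.
p_all_significant = power ** n_tests

# Expected number of tests that miss significance despite a true effect.
expected_misses = n_tests * (1 - power)

print(f"P(all {n_tests} significant) = {p_all_significant:.3f}")
print(f"Expected non-significant tests = {expected_misses:.1f}")
```

So even if every effect were real, a couple of "failed" predictions would be the expected outcome, not a surprise.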

So rather than looking at whether all the p-values were independently below .05, we decided to aggregate the RT effect size on the key effect using meta-analysis. This analysis allowed us to see which account best predicted the RT differences across experiments and conditions. We aggregated the RT coefficients for the crossover interaction (panel A below) and the RT differences for the key contrast in the automatic theory of mind hypothesis (panel B below).  You can see the result here:

The attention check hypothesis clearly differentiates between conditions where we see a big crossover effect and conditions where we don't. In contrast, the automatic theory of mind hypothesis doesn't really differentiate the experiments, and the meta-analytic effect estimate goes in the wrong direction. So the combined evidence across our studies supports the attention check being the source of the RT effect.

Although this analysis isn't full Bayesian model comparison, it's a reasonable method for doing something similar in spirit – comparing the explanatory value of two different hypotheses. Overall, this experience has shifted me much more strongly to more model-driven analyses for large study sets, since individual statistical tests are guaranteed to fail in the limit – and that limit is much closer than I expected.
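For readers who haven't done this kind of aggregation, here is a minimal sketch of one standard way to pool an effect across experiments: a fixed-effect, inverse-variance-weighted average. This is a generic textbook method and not necessarily the exact model used in our paper, and the numbers in the usage example are invented purely for illustration. The function name `fixed_effect_meta` is likewise hypothetical.

```python
import math


def fixed_effect_meta(estimates, std_errors):
    """Pool per-experiment effect estimates with inverse-variance weights.

    estimates  -- one effect estimate per experiment (e.g., an RT difference in ms)
    std_errors -- the standard error of each estimate
    Returns (pooled estimate, pooled standard error, approximate 95% CI).
    """
    weights = [1.0 / se ** 2 for se in std_errors]   # more precise studies count more
    total_weight = sum(weights)
    pooled = sum(w * est for w, est in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
    return pooled, pooled_se, ci


# Made-up RT differences (ms) and standard errors for three hypothetical experiments.
pooled, pooled_se, ci = fixed_effect_meta([12.0, -5.0, 3.0], [6.0, 8.0, 5.0])
print(f"pooled effect = {pooled:.1f} ms, 95% CI [{ci[0]:.1f}, {ci[1]:.1f}]")
```

The appeal of this approach for a large study set is that no single experiment has to cross the .05 line; what matters is whether the pooled estimate, and its pattern across conditions, matches what each hypothesis predicts.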

3. Where does this leave theory of mind, and the infant literature in particular? As we are at pains to say in our paper, we don't know. KTE's infant experiments didn't have the particular problem we describe here, and so they may still reflect automatic belief encoding. On the other hand, the experience of investigating this paradigm has made me much more sensitive to the issues that come up when you try to create complex psychological state manipulations while holding constant participants' low-level perceptual experience. It's hard! Cecilia Heyes has recently written several papers making this same kind of point (here and here). So I am uncertain about this question more generally.

That's a low note to end on, but this experience has taught me a tremendous amount about care in task design, the flaws of the general null-hypothesis statistical approach as we get to "slightly larger data" (not "big data" even!) in psychology, and replication more broadly. All of our data, materials, and code – as well as a link to our actual experiments, in case you want to try them out – are available on github. We'd welcome any feedback; I believe KTE are currently writing a response as well, and we look forward to seeing their views on the issue.

(Minor edit: I got Cecilia Heyes' name wrong, thanks to Dale Barr for the correction).