Thursday 16 January 2014

Tough love for fMRI: questions and possible solutions

Let me get this out of the way at the beginning so I don’t come across as a total curmudgeon. I think fMRI is great. My lab uses it. We have grants that include it. We publish papers about it. We combine it with TMS, and we’ve worked on methods to make that combination better. It’s the most spatially precise technique for localizing neural function in healthy humans. The physics (and sheer ingenuity) that makes fMRI possible is astonishing.

But fMRI is a troubled child. On Tuesday I sent out a tweet: “fMRI = v expensive method + chronically under-powered designs + intense publication pressure + lack of data sharing = huge fraud incentive.” This was in response to the news that a post doc in the lab of Hans Op de Beeck has admitted fraudulent behaviour associated with some recently retracted fMRI work. This is a great shame for Op de Beeck, who it must be stressed is entirely innocent in the matter. Fraud can strike at the heart of any lab, seemingly at random. The thought of unknowingly inviting fraud into your home is the stuff of nightmares for PIs. It scares the shit out of me.

I got some interesting responses to my tweet, but the one I want to deal with here is from Nature editor Noah Gray, who wrote: “I'd add ‘too easily over-interpreted.’ So what to do with this mess? Especially when funding for more subjects is crap?”

There is a lot we can do. We got ourselves into this mess. Only we can get ourselves out. But it will require concerted effort and determination from researchers and the positioning of key incentives by journals and funders.

The tl;dr version of my proposed solutions: work in larger research teams to tackle bigger questions, raise the profile of a priori statistical power, pre-register study protocols and offer journal-based pre-registration formats, stop judging the merit of science by the journal brand, and mandate sharing of data and materials.

Problem 1: Expense. The technique is expensive compared to other methods. In the UK it costs about £500 per hour of scanner time, sometimes even more.

Solution in brief: Work in larger research teams to divide the cost.

Solution in detail: It’s hard to make the technique cheaper. The real solution is to think big. What do other sciences do when working with expensive techniques? They group together and tackle big questions. Cognitive neuroscience is littered with petty fiefdoms doing one small study after another – making small, noisy advances. The IMAGEN fMRI consortium is a beautiful example of how things could be if we worked together.

Problem 2: Lack of power. Evidence from structural brain imaging implies that most fMRI studies have insufficient sample sizes to detect meaningful effects. This means they not only have little chance of detecting true positives, there is also a high probability that any statistically significant differences are false. It comes as no surprise that the reliability of fMRI is poor.
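To make the link between low power and false findings concrete, here is a minimal sketch of the standard positive predictive value calculation. The function name and the prior of 0.2 (the assumed share of tested hypotheses that are actually true) are illustrative assumptions, not figures from any particular study.

```python
def positive_predictive_value(power, prior, alpha=0.05):
    """Probability that a statistically significant result reflects a true effect.

    power: chance of detecting an effect that really exists
    prior: assumed fraction of tested hypotheses that are true (illustrative guess)
    alpha: false positive rate of the test
    """
    true_positives = power * prior            # true effects that reach significance
    false_positives = alpha * (1.0 - prior)   # null effects that reach significance
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    for power in (0.2, 0.8):
        ppv = positive_predictive_value(power=power, prior=0.2)
        print(f"power={power:.1f} -> PPV={ppv:.2f}")
```

Under these assumptions, a significant result from a study with 20% power is about as likely to be false as true (PPV of 0.50), whereas at 80% power the PPV rises to 0.80.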

Solution in brief: Again, work in larger teams, combining data across centres to furnish large sample sizes. We need to get serious about statistical power, taking some of the energy that goes into methods development and channeling it into developing a priori power analysis techniques.

Solution in detail: Anyone who uses null hypothesis significance testing (NHST) needs to care about statistical power. Yet if we take psychology and cognitive neuroscience as a whole, how many studies motivate their sample size according to a priori power analysis? Very few, and you could count the number of basic fMRI studies that do this on the head of a pin. There seem to be two reasons why fMRI researchers don’t care about power. The first is cultural: to get published, the most important thing is for authors to push a corrected p value below .05. With enough data mining, statistical significance is guaranteed (regardless of truth) so why would a career-minded scientist bother about power? The second is technical: there are so many moving parts to an fMRI experiment, and so many little differences in the way different scanners operate, that power analysis itself is very challenging. But think about it this way: if these problems make power analysis difficult then they necessarily make the interpretation of p values just as difficult. Yet the fMRI community happily embraces this double standard because it is p<.05, not power, that gets you published.
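For the simplest possible case, an a priori power analysis can be sketched in a few lines. This is a rough illustration only: it assumes a two-sided test on a single summary statistic with a standardized effect size (Cohen's d), uses the normal approximation (which slightly underestimates the sample size an exact t-test calculation would give), and ignores everything that makes fMRI hard, such as voxel-wise multiple comparisons and scanner differences. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def required_sample_size(effect_size, alpha=0.05, power=0.8, two_groups=False):
    """Approximate sample size needed to detect a standardized effect.

    effect_size: assumed Cohen's d (must be chosen before seeing the data)
    alpha: two-sided significance threshold
    power: desired probability of detecting the effect if it exists
    two_groups: if True, return the size needed per group in a two-group comparison
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    n = ((z_alpha + z_power) / effect_size) ** 2
    if two_groups:
        n *= 2  # each group needs twice the one-sample figure
    return math.ceil(n)


if __name__ == "__main__":
    print(required_sample_size(0.5))                   # one-sample or paired design
    print(required_sample_size(0.5, two_groups=True))  # per group
```

With a medium effect (d = 0.5), 80% power and alpha of .05, this gives 32 participants for a one-sample design and 63 per group for a two-group design. The hard part is justifying the effect size in advance, not the arithmetic.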

Problem 3: Researcher ‘degrees of freedom’. Even the simplest fMRI experiment will involve dozens of analytic options, each which could be considered legal and justifiable. These researcher degrees of freedom provide an ambiguous decision space for analysts to try different approaches and see what “works” best in producing results that are attractive, statistically significant, or fit with prior expectations. Typically only the outcome that "worked" is then published. Exploiting these degrees of freedom also enables researchers to present “hypotheses” derived from the data as though they were a priori, a questionable practice known as HARKing. It’s ironic that the fMRI community has put so much effort into developing methods that correct for multiple comparisons while completely ignoring the inflation of Type I error caused by undisclosed analytic flexibility. It’s the same problem in different form.
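The scale of the inflation is easy to sketch. The snippet below assumes, purely for illustration, that each alternative analysis pipeline is an independent test of a true null hypothesis at alpha = .05. Real pipelines are correlated, so the true inflation will be smaller than this, but the direction is the same. The function name is hypothetical.

```python
def chance_of_false_positive(n_analyses, alpha=0.05):
    """Probability that at least one of n independent analyses is significant
    when there is no true effect (independence is a simplifying assumption)."""
    return 1.0 - (1.0 - alpha) ** n_analyses


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        print(f"{n:2d} analyses tried -> {chance_of_false_positive(n):.0%} chance of a false positive")
```

Under the independence assumption, trying ten analytic variants and reporting the one that "worked" gives roughly a 40% chance of a false positive, not the nominal 5%.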

Solution in brief: Pre-registration of research protocols so that readers can distinguish hypothesis testing from hypothesis generation, and thus confirmation from exploration.

Solution in detail: By pre-specifying our hypotheses and analysis protocol we protect the outcome of experiments from our own bias. It’s a delusion to pretend that we aren’t biased, that each of us is somehow a paragon of objectivity and integrity. That is self-serving nonsense. To incentivize pre-registration, all journals should offer pre-registered article formats, such as Registered Reports at Cortex. This includes prominent journals like Nature and Science, which have a vital role to play in driving better science. At a minimum, fMRI researchers should be encouraged to pre-register their designs on the Open Science Framework. It’s not hard to do. Here’s an fMRI pre-registration from our group.

Arguments for pre-registration should not be seen as arguments against exploration in science – instead they are a call for researchers to care more about the distinction between hypothesis testing (confirmation) and hypothesis generation (exploration). And to those critics who object to pre-registration, please don’t try to tell me that fMRI is necessarily “exploratory” and “observational” and that “science needs to be free, dude” while in the same breath submitting papers that state hypotheses or present p values. You can't have it both ways.

Problem 4: Pressure to publish. In our increasingly chickens-go-in-pies-come-out culture of academia, “productivity” is crucial. What exactly that means or why it should be important in science isn’t clear – far less proven. Peter Higgs made one of the most important discoveries in physics yet would have been marked as unproductive and sacked in the current system. As long as we value the quantity of science that academics produce we will necessarily devalue quality. It’s a seesaw. This problem is compounded in fMRI because of the problems above: it’s expensive, the studies are underpowered, and researchers face enormous pressure to convert experiments into positive, publishable results. This can only encourage questionable practices and fraud.

Solution in brief: Stop judging the quality of science and scientists by the number of publications they spew out, the “rank” of the journal, or the impact factor of the journal. Just stop.

Solution in detail: See Solution in brief.

Problem 5: Lack of data sharing. fMRI research is shrouded in secrecy. Data sharing is unusual, and the rare cases where it does happen are often made useless by researchers carelessly dumping raw data without any guidance notes or consideration of readers. Sharing of data is critical to safeguard research integrity – failure to share makes it easier to get away with fraud.

Solution in brief: Share and we all benefit. Any journal that publishes fMRI should mandate the sharing of raw data, processed data, analysis scripts, and guidance notes. Every grant agency that funds fMRI studies should do likewise.

Solution in detail: Public data sharing has manifold benefits. It discourages and helps unmask fraud, it encourages researchers to take greater care in their analyses and conclusions, and it allows for fine-grained meta-analysis. So why isn’t it already standard practice? One reason is that we’re simply too lazy. We write sloppy analysis scripts that we’d be embarrassed for our friends to see (let alone strangers); we don’t keep good records of the analyses we’ve done (why bother when the goal is p<.05?); we whine about the extra work involved in making our analyses transparent and repeatable by others. Well, diddums, and fuck us – we need to do better.

Another objection is the fear that others will “steal” our data, publishing it without authorization and benefiting from our hard work. This is disingenuous and tinged by dickishness. Is your data really a matter of national security? Oh, sorry, did I forget how important you are? My bad.

It pays to remember that data can be cited in exactly the same way papers can – once in the public domain others can cite your data and you can cite theirs. Funnily enough, we already have a system in science for using the work of others while still giving them credit. Yet the vigor with which some people object to data sharing for fear of having their soul stolen would have you think that the concept of “citation” is a radical idea.

To help motivate data sharing, journals should mandate sharing of raw data, and crucially, processed data and analysis scripts, together with basic guidance notes on how to repeat analyses. It’s not enough just to share the raw MR images – the Journal of Cognitive Neuroscience tried that some years ago and it fell flat. Giving someone the raw data alone is like handing them a few lumps of marble and expecting them to recreate Michelangelo’s David.


What happens when you add all of these problems together? Bad practice. It begins with questionable research practices such as p-hacking and HARKing. It ends in fraud, not necessarily by moustache-twirling villains, but by desperate young scientists who give up on truth. Journals and funding agencies add to the problem by failing to create the incentives for best practice.

Let me finish by saying that I feel enormously sorry for anyone whose lab has been struck by fraud. It's the ultimate betrayal of trust and loss of purpose. If it ever happens to my lab, I will know that yes the fraudster is of course responsible for their actions and is accountable. But I will also know that the fMRI research environment is a damp unlit bathroom, and fraud is just an aggressive form of mould.


  1. You forgot what is likely the biggest problem in fMRI - the data. The data and the processing of the data - AKA the methods. For example, subject motion is a huge problem that will not be solved by any of your proposed solutions. Start with the data first.

  2. In problem 3 you stated that "Even the simplest fMRI experiment will involve dozens of analytic options, each which could be considered legal and justifiable." The problem is that researchers in this field are not aware that your statement is incorrect. Most of the preprocessing steps are not "justifiable" in that many are known to be wrong in principle and many are simply unverified.

    1. Thanks that's a good point. To be clear, I didn't say they were justifiable - I said they could be considered to be justifiable. Big difference. And clearly they are considered to be justifiable by many scientists (and peer reviewers) or we wouldn't have a problem.

    2. Thanks for the clarification, Chris.

  3. This is a great post. Just wanted to give a heads up that it didn't load correctly for me: the text is aligned left so that a good chunk of it is over the dark background and almost unreadable. Anyone else have that problem? I'm using chrome.

    1. Damn, sorry about that. I don't use Chrome myself (Firefox on Mac). Blogspot sucks.

    2. Resize your window to closely frame the text- then the text should be entirely on the lighter background.

      Hope that helps!

  4. Great post. But I'm not sure I understand the solution to problem 4. We currently have a faculty search with 200 applicants. How can we efficiently create a shortlist of applicants without using heuristics? Read the papers? In practice, no. Most of us went into science because we love our research, mentoring and teaching, not because we want to spend all our evenings and weekends evaluating others. I agree on the problem, but I'm not sure I see an easy solution. I'd love to hear other ideas.

    1. Thanks for this comment. You make an important point.

      I'm not arguing against the use of metrics per se. The fact is that metrics are essential for judging some aspects of science (and scientists), particularly by non-specialists. But we need to recognise the limitations of metrics and choose the best possible ones. Journal level metrics are terrible indicators. There is no correlation, for instance, between journal impact factor (IF) and the citation rates of individual articles - but there is a correlation between IF and retraction rates due to fraud.

      In terms of shortlisting down from 200 applicants, then for research potential I would focus on article level metrics, h-index, and the m-value (the rate of increase in h-index). I might also ask candidates at the initial application stage to write a short section on how often, and in what contexts, their work has been independently replicated by other research groups.

      These aren't perfect indicators by any stretch. There really is no substitute for having a specialist read the work, but article level metrics are much better than assessing candidates based on how often they publish in prestigious high IF journals that, more than anything, are slaves to publication bias.
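For readers unfamiliar with these metrics, here is a minimal sketch of how they are computed. It assumes the m-value mentioned above is Hirsch's m quotient (h-index divided by the number of years since the first publication). The function names and the citation counts are made up for illustration.

```python
def h_index(citations):
    """Largest h such that h papers each have at least h citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank   # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h


def m_value(citations, years_since_first_paper):
    """h-index divided by career length in years (assumed definition of m)."""
    if years_since_first_paper <= 0:
        raise ValueError("years_since_first_paper must be positive")
    return h_index(citations) / years_since_first_paper


if __name__ == "__main__":
    example = [10, 8, 5, 4, 3]    # hypothetical per-paper citation counts
    print(h_index(example))       # 4
    print(m_value(example, 8))    # 0.5
```

Dividing by career length is what makes the m-value more comparable than the raw h-index across candidates at different career stages.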

  5. related to Problem 4:
    Einstein published 300 scientific papers... 60 years ago!!!
    So stop talking about publication pressure, pls...!

    1. Just look here: a good chunk of his papers are quite short. Not saying that they are lacking in content, but physics is quite different from neuroscience.

  6. Hi Chris--

    Really nice post. I tweeted it, along with a question--did you think some of the same arguments can be made about my method of choice, EEG/ERPs? I think so, but I wondered if you have given this any thought?

    1. Thanks Michael. I think all of these arguments (except #1) hold for EEG. You might be interested in this comment by Matt Craddock:

      Others (including Dorothy Bishop) have also written about this.

    2. This post was on the tendency to do multi-way ANOVA in ERP studies without correcting for the number of effects and interactions. I've seen as many as 5- or 6-way ANOVAs in that field, which is really setting yourself up for finding spurious effects.
      The processing pipeline flexibility also applies in ERP: it's accepted practice by many to select your filter, time window, electrode for analysis, method for identifying peaks etc after scrutinising the data. Referencing method can also dramatically alter the results. It gets worse still if people start analysing frequency bands, where results can depend heavily on things like method of wavelet analysis, and there are lots of ways of defining frequency bands. This paper says a bit about this kind of issue in the context of MMN: A lot of people in the ERP field really don't recognise the problem: I've been asked by reviewers (and editors), for instance, to analyse a different electrode because 'it looks like something is going on there', when instead I've based my selection on prior literature.

  7. The first sentence in "Problem 2" does not make sense. You say that 'Evidence from structural brain imaging implies that most fMRI studies have insufficient sample sizes....' You then cite a paper that included a large number of neuroscience studies (many of them structural imaging studies). Structural and functional brain data are not modeled or analyzed in a similar way. The techniques are much different. I have no doubt that you could make an argument that fMRI studies are underpowered (the argument has been made before), but your current point is a little disingenuous.

  8. The biggest problem is the error in the data. The next biggest problem is the error generated by processing of the data. Statistical problems are tertiary.

    Fixing the data will mean more expensive scanners and peripheral hardware. We have to stop buying our hardware right off the shelf of diagnostic imaging manufacturers. Building custom hardware to meet the specifications necessary for sufficient sensitivity is the norm in science.

    Also, fixing the data may mean that the subjects of investigation be restricted to a much smaller subset - ones that can tolerate head restraint sufficient to decrease motion-generated error sufficiently. What is sufficient? That needs much investigation.

  9. "if we worked together". Yes, and no. First, fewer and bigger teams and huge projects can also mean that we are putting all our eggs in one basket. The US has tons of megateams, for example, but if you look at the data, the UK is doing much better than the US in terms of money-spent/productivity ratio. In fact, the UK is doing better than anyone else. So, arguing that we have to save money by having fewer, bigger teams etc is misleading. What is needed is more funding. Second, we have greed and a credit allocation problem. The big fish want to be even bigger. Since they are big, they feel they can impose their wishes on everyone else. Therefore, there is no incentive for smaller and creative teams to join a mega team who would push to get control and credit for the entire project and subsequent funding stream. Third, what the Government, and most people, don't get is that science as a whole is the best algorithm we have to gather new knowledge about the world. What societies should fund is the algorithm itself, not select some of the little units (i.e., scientists) that implement the algorithm. All units are necessary, as a whole, to implement knowledge discovery. Think of it as the algorithm implemented by a colony of ants to forage. The algorithm works as a whole, even though most explorer ants discover nothing at all. It's not their fault, but in doing their share, they contribute to the implementation of the algorithm. Societies need to understand that most experiments do not work in reality, but the system works as a whole. Some scientists will be lucky and will run into something important. Most won't. And it's nobody's fault, and it should not determine promotions, or redundancies, especially not on a short time scale. Without a system that understands this basic fact, anything else is just a band-aid.

    1. Thanks, great comment. There is much to agree with here. The main downside I see with small studies is low power, which limits the ability to answer any questions at all. So there is a trade off between, on the one hand, preserving a tapestry of creativity and innovation by supporting lots of small groups, and on the other hand answering anything at all. Answering questions is what big studies do best. But of course, whether those questions are the right questions - and whether large groups stifle innovation - is another issue entirely. I'm not sure they do, but I agree it is a question worth asking.

  10. Problem 2 -- fMRI reliability isn't that poor, the way people look at reliability is!
    in a paper last year we show that you can get decent reliability (and yes I think it is worth looking at 'raw' data, beta, T, thresholded maps rather than only one; and ICC as the last useful measure)
    anyhow your link points to something of interest - as in our paper, some paradigms have low reliability, some don't... and the causes can be different

  11. Sorry, I'm about two months late to come across this post. Web-wise, I'm still in the late 90's.

    I agree with most of your points, although I share the reservations about fewer larger-scale studies that other commenters pointed out. However, I was a bit surprised to see that a search for the word "theory" on this page returned zero hits. I think the statistical power and multiple comparisons issues will only ever get worse, and boosting the N won't really help. Consider that in a small number of years, a standard single-subject dataset may include 1+ million voxels sampled at 2 Hz, and standard data analyses will include mass-univariate, connectivity, space-based analyses like MVPA, and probably some others. Thus, the "effective power" will get much smaller even if the number of subjects increases. I write "effective" power because, to my knowledge, power analyses are done using a single statistic (e.g., t-test), but let's be realistic here: If you (don't worry, not the accusatory "you," the general "you") want a finding that doesn't show up in a mass-univariate analysis, you'll also try connectivity (perhaps PPI, perhaps DCM), MVPA, etc. Thus, I would argue that boosting the N won't really help address the issue of the unreliability, low power, gooey interpretability, and large multiple-comparisons problem.

    Instead, I believe the problem is theoretical. Cognitive neuroscience is largely (though not entirely) deprived of useful and precise theories. A "soft" theoretical prediction that brain area A should be involved in cognitive process X is easy to confirm and difficult to reject. The level of neuroscience in cognitive neuroscience has not increased much in the past two decades, despite some amazing discoveries in neuroscience. FMRI data will become richer and more complex, and the literature will receive more and more sophisticated data analyses. If theories are not improved to be more precise and more neurobiologically grounded, issues of low power and multiple comparisons will only increase.

    Thanks for keeping this blog, btw

    1. Thanks Mike, great comment. In full agreement.

  12. Another latecomer to this discussion. Interesting points raised but not all of them are unique to neuroimaging (more on that below).

    Cost: The idea that fMRI is expensive is often raised, but if you've ever opened a bioscience supplier catalogue you'd realize that in relative terms, fMRI is not all that costly (certainly not anywhere near the costs of particle physics, for instance), and indeed the bulk of most grant funding is in staff, not scanning. There is however a problem with traditional ways to charge for scanner time, in that most centres set prices high enough to recover their costs on the assumption that studies use only a small number of subjects - and doing so actively discourages researchers from including large enough samples. A much better model would be to charge a fixed price for a given project and within that allow unlimited (within reason) scanning toward that project. Such an approach would go some way toward addressing the lack of power. After all, a scanner costs no more to run than to keep on standby, and many MRI scanners are chronically underused (for example at night). Unfortunately I don't know of any scanning centre in the UK or elsewhere that uses this model.

    Looking at the rest of the points, none of them really have anything to do specifically with fMRI. For example, there is no doubt that pressures to publish provide massive incentives toward fraud but as far as I can tell it is no more widespread in the fMRI community than elsewhere (but your point about the pressures to publish in high-impact journals is absolutely right and is definitely a major problem).

    In fact, most of the critique above relates to who does the research and the way they do it, not the technique itself. And much of it can be summarised as simply being "poor science" or "science poorly done". Pre-registration to me sounds like admitting that neuroimaging scientists are unable to do honest science unless they are shamed into it and sets a slightly disturbing precedent (ie if you choose not to pre-register your study, then that must mean you are a fraud). Surely it would be better to train people how to do statistics properly?

    But I think Mike Cohen made a very good point that the lack of neuroscience underpinning much of fMRI work is at the core of the problem. On that note, I was a bit surprised that there was no mention of what in my mind is by far the biggest issue with the method - that the signal measured is only indirectly related to neural activity, represents a population average, and is very difficult to link with more direct measurements - issues that no amount of data sharing or pre-registration will address. There is an urgent need for more research in this field, but unfortunately the bulk of fMRI researchers seem only too happy to ignore these issues.

    But even if one had a good and reliable way of linking fMRI data to neural activity, the application of cognitive science models to studying brain function is only going to be as useful as the extent to which those models actually map on to neural processing mechanisms. This implies that there needs to be a willingness to recognise that many of those models are likely to be fundamentally incorrect (other than as purely descriptive ones). Indeed it is probably not much of an exaggeration to say that the majority of the poor fMRI studies that people focus on are just those that blindly set out to test some favourite cognitive science model. Conversely, the best imaging tends to be that which is tightly linked to neuroscience. Fundamentally, this is a problem with psychology itself which needs to embrace neuroscience rather than turning its back on it (the neuron 'envy' that Ramachandran talks about), and that psychologists need to learn and understand neuroscience (but conversely, neuroscientists need to get over their knee-jerk disdain for neuroimaging as a method and accept that it has some small virtues).
