Wednesday 18 July 2012

The Dirty Dozen: A wish list for psychology and cognitive neuroscience

It’s been quite a month in science. 

On the bright side, we probably discovered the Higgs boson (or at least something that smells pretty Higgsy), and in the last few days the UK Government and EU Commission have made a strong commitment to supporting open-access publishing. In two years, so they say, all published science in Britain will be freely available to the public rather than being trapped behind corporate paywalls. This is a tremendous move and I applaud David Willetts for his political courage and long-term vision.

On the not-so-bright side, we’ve seen a flurry of academic fraud cases. Barely a day seems to pass without yet another researcher caught spinning yarns that, on reflection, did sound pretty far-fetched in the first place. What’s that? Riding up rather than down an escalator makes you more charitable? Dirty bus stops make you more racist? Academic fraudsters are more likely to have ground-floor offices? Ok, I made that last one up (or rather, Neuroskeptic did) but if such findings sound like bullshit to you, well funnily enough they actually are. Who says science isn’t self-correcting?

We owe a great debt to Uri Simonsohn, the one-man internal affairs bureau, for judiciously uncovering at least three cases of fraudulent practice in psychological research. So far his investigations have led to two resignations and counting. Bravo. This is a thankless task that will win him few friends, and for that alone I admire him.

And as if to remind us that fraud is by no means unique to psychology, enter the towering Godzilla of mega-fraud – Japanese anaesthesiologist Yoshitaka Fujii, who has achieved notoriety by becoming the most fraudulently productive scientist ever known.

(As an aside, has anyone ever noticed how the big frauds in science always seem to be perpetrated by men? Are women more honest or do they just make savvier fraudsters?)

Along with all the talk of fraud in psychology, we have had to tolerate the usual line-up of ‘psychology isn’t science’ rants from those who ought to learn something before setting hoof to keyboard. Fortunately we have Dave Nussbaum to sort these guys out, which he does with a steady hand and a sharp blade. Thank you, Dave!

With psychological science facing challenges and shake-ups on so many different fronts, the time seems ripe for some self-reflection. I used to believe we had a firm grasp on methodology and best practice. Lately I’ve come to think otherwise.

So here’s a dirty dozen of suggested fixes for psychology and cognitive neuroscience research that I’ve been mulling over for some time. I want to stress that I deserve no credit for these ideas, which have all been proposed by others.

1.     Mandatory inclusion of raw data with manuscript submissions

No ifs. No buts. No hiding behind the lack of ethics approval, which can be readily obtained, or the vagaries of the Data Protection Act. Everyone knows data can be anonymised.

2.     Random data inspections

We should conduct fraud checks on a random fraction of submitted data, perhaps using the methodology developed by Uri Simonsohn (once it is peer reviewed and judged statistically sound – as I write this, the technique hasn’t yet been published). Any objective test for fraud must have a very low false discovery rate because the very worst thing would be for an innocent scientist to be wrongly indicted. Fraudsters tend to repeat their behaviour, so the likelihood of false positives in multiple independent data sets from the same researcher should (hopefully) be infinitesimally small.
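To put a rough number on that last point (my own back-of-the-envelope illustration, not anything from Simonsohn's method): if each independent data set from an honest researcher triggers a false alarm with some small probability p, the chance of being flagged across k independent data sets falls off as p^k.

```python
def prob_all_flagged(p: float, k: int) -> float:
    """Chance that k independent data sets from an honest researcher
    all trigger a false alarm, given per-dataset false-positive rate p."""
    return p ** k

# Even a 1% per-dataset false-positive rate shrinks fast across
# independent data sets: 1e-2, 1e-4, 1e-6 for one, two, three flags.
for k in (1, 2, 3):
    print(f"p=0.01, k={k}: {prob_all_flagged(0.01, k):.0e}")
```

The assumption doing the work here is independence between data sets; if the same quirk of honest practice trips the detector every time, the multiplication no longer applies, which is exactly why the method needs peer review first.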

3.     Registration of research methodology prior to publication

Some time ago, Neuroskeptic proposed that all publishable research should be pre-registered prior to being conducted. That way, we would at least know from the absence of published studies how big the file-drawer is. My first thoughts on reading this were: why wouldn’t researchers just game the system, “pre” registering their research after the experiments are conducted? And what about off-the-cuff experiments conjured up over a beer in the pub?

As Neuroskeptic points out, the first problem could be solved by introducing a minimum 6-month delay between pre-registration and data submission. Also, all prospective co-authors of a pre-registration submission would need to co-sign a letter stating that the research has not yet been conducted.

The second problem is more complicated, but also tractable. My favourite solution is one posed by Jon Brock. Empirical publications could be divided into two categories, Experiments and Observations. Experiments would be the gold standard of hypothesis-driven research. They would be pre-registered, with methods (including sample size) and proposed analyses pre-reviewed and unchangeable without further re-review. Observations would be publishable but have a lower weight. They could be submitted without pre-registration, and to protect against false positives, each experiment from which a conclusion is drawn would be required to include a direct internal replication.

4.     Greater emphasis on replication

It’s a tired cliché, but if we built aircraft the way we do psychological research, every new plane would start life exciting and interesting before ending in an equally exciting fireball. Replication in psychology is dismally undervalued, and I can’t really figure out why this is when everyone, even journal editors, admits how crucial it is. It’s as though we’re trapped in some kind of groupthink and can’t get out. One solution, proposed by Nosek, Spies and Motyl, is the development of a metric called the Replication Value (RV). The RV would tell us which effects are most worth replicating. To quote directly from their paper, which I highly recommend:

Metrics to identify what is worth replicating. Even if valuation of replication increased, it is not feasible – or advisable – to replicate everything. The resources required would undermine innovation. A solution to this is to develop metrics for identifying Replication Value (RV)– what effects are more worthwhile to replicate than others? The Open Science Collaboration (2012b) is developing an RV metric based on the citation impact of a finding and the precision of the existing evidence of the effect. It is more important to replicate findings with a high RV because they are becoming highly influential and yet their truth value is still not precisely determined. Other metrics might be developed as well. Such metrics could provide guidance to researchers for research priorities, to reviewers for gauging the “importance” of the replication attempt, and to editors who could, for example, establish an RV threshold that their journal would consider as sufficiently important to publish in its pages.
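To make the quoted idea concrete, here is a toy sketch of how an RV might combine citation impact with the precision of the existing evidence. The functional form, names, and numbers below are my own illustration; the Open Science Collaboration's actual metric may differ in every detail.

```python
def replication_value(citations_per_year: float, total_n: int) -> float:
    """Toy RV: influence divided by a crude precision proxy (square root
    of the total sample size accumulated across studies of the effect)."""
    return citations_per_year / max(total_n, 1) ** 0.5

# A heavily cited effect resting on small samples outranks a modestly
# cited effect backed by a mountain of data.
hot_but_thin = replication_value(citations_per_year=120, total_n=40)
cold_but_solid = replication_value(citations_per_year=15, total_n=2000)
assert hot_but_thin > cold_but_solid
```

Whatever the exact formula, the ordering is the point: influence up, precision down, replication priority up.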

I think this is a great idea. As part of the manuscript reviewing process, reviewers could assign an RV to specific experiments. Then, on a rolling basis, the accepted studies that are assigned the highest weightings would be collated and announced. Journals could have special issues focusing on replication of leading findings, with specific labs invited to perform direct replications and the results published regardless of the outcome. This method could also bring in adversarial collaborations, in which labs with opposing agendas work together in an attempt to reproduce each other’s results.

5.     Standardise acceptable analysis practices

Neuroimaging analyses have too many moving parts, and it is easy to delude ourselves that the approach which ends up ‘working’ (after countless reanalyses) is the one we originally intended. Psychological analyses have fewer degrees of freedom but this is still a major problem. We need to formulate a consensus view on gold standard practices for excluding outliers, testing and reporting covariates, and inferential approaches in different situations. Where multiple legitimate options exist, supplementary information should include analyses of them all, and raw data should be available to readers (see point 1).

6.     Institute standard practices for data peeking

Data peeking isn't necessarily bad, but if we do it then we need to correct for it. Uncorrected peeking runs riot in psychology and neuroimaging because the pressure to publish and the dependence of publication on significant results have made chasing p-values the norm. We can see it in other areas of science too. Take the Higgs. Following initial hints at 3-sigma last year, the physicists kept adding data until they reached 5-sigma. The fact that their alpha is so stringent in the first place provides reassurance that they have genuinely discovered something. But if they peeked and chased then it simply isn’t the 5-sigma discovery that was advertised. (As a side note: how about we ditch Fisher-based stats altogether and go Bayesian? That way we can actually test that pesky null hypothesis.)
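It's easy to see how badly uncorrected peeking inflates the false-positive rate with a small simulation. This is a sketch under parameters I've chosen for illustration: pure null data, a z-test repeated after every batch of 10 participants, stopping at the first ‘significant’ result.

```python
import math
import random

random.seed(1)

ALPHA_Z = 1.96           # two-sided 5% critical value for a z-test
BATCH, MAX_N, SIMS = 10, 100, 4000

def peeks_to_significance() -> bool:
    """Collect null data (mean 0, sd 1) in batches, test after every
    batch, and stop at the first 'significant' z -- uncorrected peeking."""
    data = []
    while len(data) < MAX_N:
        data.extend(random.gauss(0, 1) for _ in range(BATCH))
        z = (sum(data) / len(data)) * math.sqrt(len(data))
        if abs(z) > ALPHA_Z:
            return True   # researcher stops early and 'publishes'
    return False

false_positive_rate = sum(peeks_to_significance() for _ in range(SIMS)) / SIMS
print(f"Nominal alpha: 0.05; realised with peeking: {false_positive_rate:.3f}")
```

With ten looks at the data, the realised Type I error rate comes out far above the advertised 5% – classic sequential-testing results put it near 20% – which is exactly why peeking demands a correction such as a sequential design with adjusted thresholds.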

7.     Officially recognise quality of publications over quantity

Everyone agrees that quality of publications is paramount, but we still chase quantity and value ‘prolific’ researchers. So how about setting a cap on the number of publications each researcher or lab can publish per year? That way we would truly have an incentive to be sure of our results before publishing them. It would also encourage us to publish single papers with multiple experiments and more definitive conclusions.

8.     Ditch impact factor and let us never speak of it again

As scientists who purportedly know something about numbers, we should be collectively ashamed of ourselves for being conned by journal impact factors (IF). Nowhere is the ludicrous doublethink of the IF culture more apparent than in the current REF, where universities tell us that “IF of journals is not taken into account in assessing quality of your REF submissions” while simultaneously advising us to “ensure that your four submissions are from the highest impact journals”. Complete with helpful departmental emails reminding us which journals are going up in IF (which is all of them, as far as I can tell), the situation really is quite stupid and embarrassing. Here’s a fact shown by Bjorn Brembs: IF correlates better with retraction rate than with citation rate. We should replace IF with article-specific merits such as post-publication ratings, article citation count, or – shock horror – considered assessment of the article after reading the damn thing.

9. Open access publication

Much has been said and written in the last few days about open access, with the Government making important steps toward an open scientific future in the UK (I recommend following the blogs of Stephen Curry and Mike Taylor for the latest developments and analysis).  For my part, I think the sooner we eliminate corporate publishers the better. I simply don’t see what value they add when all of the reviewing and editing is done by us at zero cost.

10. Stop conflating research inputs with research outputs

Getting a research grant is great, but we need to stop counting grants as outputs. They are inputs. We need to start assessing the quality of science by balancing outputs against inputs, not by adding them together.

11. Rethink authorship

Academic authorship is antiquated and not designed for collaborative teams. By rank-ordering authors from first to last, we make it impossible for multiple co-authors to make a genuinely equal contribution (Ah, I hear you cry, what about that little asterisk that flags equal contributions? Well, sorry, but…um…nobody really takes much notice of those).

I think a better approach would be to list authors alphabetically on all papers and simply assign % contributions to different areas, such as experimental design, analysis, data collection, interpretation of results, and manuscript preparation. Some journals already do this in some form, but I would like to see this completely replace the current form of authorship.

12. Revise the peer review system

Independent peer review may be the best mechanism we currently have for triaging science, but it still sucks. For one thing, it’s usually not independent. I often get asked to review papers by scientists I know or have even worked with. I’ve even been asked to review my own papers on occasion, and was once asked to review my own grant application! (You’ll be glad to know I declined all such instances of self-review). The review process is random and noisy, and based on such a pitifully small sample of comments that the notion of it providing meaningful information is, statistically speaking, quite ridiculous.

I personally favour the idea of cutting down on the number of detailed reviewers per manuscript and instead calling on a larger number of ‘speed reviewers’, who would simply rate the paper according to various criteria, without having to write any comments. As a reviewer, I often find that I can form an opinion of an article relatively quickly – it is writing the review that takes the most time.

Last week, Paul Knoepfler wrote a provocative blog post proposing an innovation in peer review in which authors review the reviewers. Could this help improve quality of reviews? Unfortunately, I don’t think Paul’s system would work (see my comment on his post here), but perhaps some kind of independent meta-review of reviewers could also be a good idea in a limited number of cases. 

What do you think? Got better ideas? Please leave any comments below. 

** Update 18/7/12, 14:30: On the issue of the gender imbalance in academic fraud, Mark Baxter has kindly reminded me of this case involving Karen M. Ruggiero. 


  1. Great post! I especially like the division between experiments and observations. I believe strongly in exploring data, which leads to invaluable observations. Obviously, science would be completely lifeless without exploration! But such observations could certainly be evaluated using different criteria than those borrowed from hypothesis testing (though, of course, correcting for multiple comparisons is already a partial solution, if done properly). Also, any interesting observations should be subjected to further replication using the less flexible experiment paradigm (including pre-registration). This could also deal with the issue of parameter fitting in imaging. While it may be OK to explore lots of ways of looking at your data, the final approach should also work with new unseen data – that is, be replicated using exactly the same data acquisition and analysis procedure.

    1. Yes I agree, exploratory work is crucial - just look at all the major discoveries that stemmed from serendipity. By splitting up the categories of publications like this we can have the best of both worlds...

  2. A great list, Chris!

    The big question to me is how do we implement these changes. One important starting place is to discuss and debate them and build consensus around which changes are necessary and how they are best executed. Posts like this one are an important step in that direction.

    One of the interesting things that Simmons, Nelson, and Simonsohn bring up is that even if journals, universities, and funding agencies move slowly to adopt reforms, researchers themselves can announce in their own papers the steps they've taken (obviously you can't randomly inspect your own data, but you can determine your sample size, outlier cutoffs, and data analysis strategy in advance). The hope is to create a social norm -- an environment in which it is understood that these are the expectations for good research, leading more people to adopt them and empowering reviewers to ask whether these steps have been followed.

    Lastly, I find your speed-reviewing technique intriguing. I have to admit I would have some reservations about simply introducing it. Still, it would be really interesting to do some pilot testing. For example, why not have normal review serve as a control group and compare the resulting acceptances, rejections, and ratings to a sample of speed reviews. If they come out the same then you've got a decent argument for considering a switch.

    1. Thanks Dave. I was very much inspired by your piece, which readers can find here:

      I wonder if any journals would consider trialling a speed-reviewing system. In the first instance, they could keep their initial approach untouched and simply contact additional speed reviewers. It would be interesting to see if the final decision (rendered purely based on the detailed reviews) was predictable from a sample of speedy reviews.

      I couldn't agree more that we need data before making any systematic changes to the peer review system. At the same time, my instinct is that we would gain more by having a greater number of less in-depth reviews than under the current system (which, let's face it, often results in fairly superficial reviews anyway).

  3. Hi Chris - very nice and thoughtful post. I agree on all 12 points. Of course, the hard part is getting these changes implemented. Many of the ideas you proposed have been around since the 1960s, and we have had the technology to implement them for over a decade, and yet we are still a long way from reform. We face a classic collective action problem: While the system as a whole would benefit from these changes, individual actors/journals/authors do not benefit from being the first mover.

    By far the best way to solve a collective action problem is if an external force changes the incentive structure so that individual actors benefit from reforming. It is clear to me that only the granting agencies can apply this force, by rewarding scientists (using grant preferences) who follow good practices and who submit to good-practice journals.

    1. Hi Chris, I just read your post and tweeted it. Very well said, I agree that these pressures need to come from outside, and what better external force than funding agencies! As you say, we've already proven as a community that we're incapable of making these changes on our own.

  4. A nice post, Chris, as always.

    Here's my two cents' worth:

    1. Agreed
    2. Agreed, but as you say, we must be very careful to guard against false accusations. It would be far better, in my opinion, to let fraudulent data be published (and it might then be subject to failed replications etc.) than to accuse someone innocent.
    3. Agreed
    4. Agreed – but sometimes this might be tricky in cases involving patients. What if patients of this particular type are very rare? And/or what if the patient deteriorates or recovers after the first testing session (and how would you prove that, rather than reporting that your replication attempt had failed)? It’s not really practical in every case to provide a replication, in-house or otherwise.
    5. YES! I have no problem with data exploration, but as said above we should replicate the “successful” results in a new set of participants, and justify arbitrary decisions (e.g. outlier removal) and provide supplemental data confirming that the results were qualitatively the same if different decisions were made.
    However, I’m not sure about making raw data freely available to all. This could mean my data being re-analysed to answer a question for which it is not suited, without due acknowledgement of where the data came from, and perhaps in competition with my own on-going work. By all means submit the raw data with the publication – perhaps reviewers should be encouraged (required?) to check the analyses/alternatives. Individual readers should be allowed to request raw data, but perhaps only for particular uses or with permission from the original author(s). Or perhaps we could make raw data available only after some delay so that the original authors get first crack at any further analyses they may want to conduct (perhaps using their published data as a control group for some on-going work?).
    6. Agreed
    7. Your proposed cap makes me nervous. A PI might decide to (strategically) publish papers on certain aspects of their lab’s work over others. Perhaps they have a grant they’re about to submit and want to give the impression that their lab is expert in this area. Or perhaps they’d have to choose between publishing grander work from their senior postdoc, who is further ahead in their project, and work from a new PhD student. I think this suggestion has the potential to disproportionately hurt junior scientists (PhDs/postdocs).
    8. Agreed
    9. Agreed
    10. Agreed
    11. I see your point, but I don’t really have a problem with the current system. I think it’s clear and does the job most of the time (and my surname does not begin with an A or a Z!).
    12. Reviewing does take time, and perhaps the current system isn’t perfect. But the speed reviewing you suggest might hurt the scientific process. I have (almost) always found reviewers’ suggestions to be helpful and to IMPROVE my work. Speed-reviewing, as you describe it, could mean that I’d lose this. And junior researchers who are still finding their feet might lose out the most if we lost reviewers’ comments. I’d prefer to see reviewers acknowledged in some way for their time and contribution to the paper (some journals, e.g. Frontiers, already do something like this).

    1. Wow, thanks for the detailed feedback. Re the points of disagreement/discussion:

      2 - I agree; this is essentially Blackstone's principle in criminal law, and it is basically irrefutable in my book:

      4 - Good point. Perhaps there could be exceptions, as you point out, where Observations could be published without internal replication (provided the original experiment provides strong enough evidence)

      5 - This is just my personal view, but I think that if we're going to release data then we need to surrender ownership of it once published (I realise that is an extreme view that many will disagree with). I just feel that data should be in the public domain for all to see and use as they see fit. It's for reviewers to decide if the data was used appropriately, and naturally the source should be acknowledged whenever the data is used. To me this seems no different from citing other people's published papers. Using someone's data without acknowledging them would be tantamount to data plagiarism and would be academic misconduct.

      7 - That's a fair point. The last thing we want to do is disadvantage young scientists. So perhaps the cap could be per staff member/PhD student rather than per lab. Without imposing something that limits the quantity of output, we're never going to be able to give quality the attention it deserves.

      12 - Yes, reviewers can (and often do) help improve manuscripts. But they also get things wrong (often) and make mistakes which lead to erroneous rejections. By sampling a small number of error-prone reviews we guarantee a noisy selection mechanism. I'd like to see a combination of in-depth review and rapid-review. For instance, perhaps at first submission a larger number of speed reviewers could give ratings (including on replication value) and then if the average is high enough the paper is selected for in-depth review by one or two of those reviewers who - crucially - gave ratings that were closest to the mean.

  5. Great read! Thanks.
    My two cents:
    1. Top journals should make their names not through high IFs, but rather through openness. I propose they transform into databases where each article is an entry with the following contents: the paper itself, possibly reviewer comments, the raw data, and replication study reports.
    Furthermore, they should sponsor replication studies: replications are actually perfect material for masters' theses, or even for graduate students in their first year learning experimentation skills. Why not award prizes or small funds to such people?

    2. I have reservations against the speed reviewing as well. On the other hand, there should be some changes in the review system. First of all, not all reviewers are selected randomly. Nice articles (fashionable topics; written by high profile researchers; ...) seem to "attract" different reviewers than not-so-nice articles. That is a bias on the editor's side.
    2b. Review/evaluation of reviewers, as suggested by others, is a great idea. And such evaluations should become an integral part of track records, next to the output you deliver as an author.

    1. Hi Tim, thanks for commenting.

      1 - couldn't agree more. Excellent idea to tie in with Masters research. As you say, we really need a dedicated research fund to sponsor direct replications. I'll keep this in mind, I'm sure it is something we can push!

      2 - I expected speed-reviewing to be the most controversial suggestion. As scientists we've come to both love and hate in-depth review. We love it when a reviewer helps us improve a manuscript; and we hate it when they make fatal errors and kill our paper unfairly. What I'm suggesting is a kind of middle ground where we trade off some of this in-depth reviewing for crowd wisdom.

      Re 2b - I really like the idea of reviewers being reviewed, but it needs to be done carefully and independently. Reviewers already do this as a favour and it would be easy for a mechanism of meta-assessment to deter them from doing anything at all. At the same time, such assessment is important!

  6. one-man internal affairs bureau

    Even reading this line makes me shudder. Has his paper outlining his method come out yet? Even if he is 3/3 in catching real fraud it's not a tenable situation that he be the sole arbiter of which studies are investigated. I will admire his work when he is open about it.

    1. As far as I know Simonsohn's method hasn't been published yet. Ed Yong's article (linked above) mentions that the paper on the technique will soon be submitted, so I imagine it will be at least a couple of months before we see it in accepted form - and that's assuming it is accepted quickly.

      I think your reservations are completely sensible. In Simonsohn's defence, the 'one-man IRB' quip is me (being facetious) rather than any kind of position he's given himself.

      Will be interesting to see how it all pans out.

  7. Great post Chris!
    I agree with most of the points you raised, except for the speed review thing. I can't see a point & click system working. How about limiting the length of reviews to one page with clear sections to fill in?
    I can't see the MCQ version working if reviewers don't justify their choice (even briefly).

    I really enjoyed reading it!

  8. It is great to see the increased awareness about problems in psychological research and suggestions for improvements.

    The problems are not new (Sterling, 1959) and one solution is also quite old, namely to increase statistical power (Cohen, 1962).

    Whereas a priori power analysis can reduce the need for data fudging, post-hoc power analysis can be used to detect data fudging.

  9. How would these items, particularly 1 and 4, work for qualitative studies? Making these datasets anonymous can be difficult, and sometimes impossible, for example. I am worried these kinds of stringent measures would work towards further blocking this important side of psychological research, though it does merit consideration. Perhaps we should work towards fostering an atmosphere of responsibility rather than trying to police it. I'm not sure of the stats, but what is the ratio of known research frauds committed today to research articles published, compared to years past?