NeuroChambers: The Dirty Dozen: A wish list for psychology and cognitive neuroscience

Wednesday, 18 July 2012

It’s been quite a month in science.

On the bright side, we probably discovered the Higgs boson (or at least something that smells pretty Higgsy), and in the last few days the UK Government and EU Commission have made a strong commitment to supporting open-access publishing. In two years, so they say, all published science in Britain will be freely available to the public rather than being trapped behind corporate paywalls. This is a tremendous move and I applaud David Willetts for his political courage and long-term vision.

On the not-so-bright side, we’ve seen a flurry of academic fraud cases. Barely a day seems to pass without yet another researcher caught spinning yarns that, on reflection, did sound pretty far-fetched in the first place. What’s that? Riding up rather than down an escalator makes you more charitable? Dirty bus stops make you more racist? Academic fraudsters are more likely to have ground-floor offices? Ok, I made that last one up (or rather, Neuroskeptic did) but if such findings sound like bullshit to you, well funnily enough they actually are. Who says science isn’t self-correcting?

We owe a great debt to Uri Simonsohn, the one-man internal affairs bureau, for judiciously uncovering at least three cases of fraudulent practice in psychological research. So far his investigations have led to two resignations and counting. Bravo. This is a thankless task that will win him few friends, and for that alone I admire him.

And as if to remind us that fraud is by no means unique to psychology, enter the towering Godzilla of mega-fraud – Japanese anaesthesiologist, Yoshitaka Fujii, who has achieved notoriety by becoming the most fraudulently productive scientist ever known.

(As an aside, has anyone ever noticed how the big frauds in science always seem to be perpetrated by men? Are women more honest or do they just make savvier fraudsters?)

Along with all the talk of fraud in psychology, we have had to tolerate the usual line-up of ‘psychology isn’t science’ rants from those who ought to learn something before setting hoof to keyboard. Fortunately we have Dave Nussbaum to sort these guys out, which he does with a steady hand and a sharp blade. Thank you, Dave!

With psychological science facing challenges and shake-ups on so many different fronts, the time seems ripe for some self-reflection. I used to believe we had a firm grasp on methodology and best practice. Lately I’ve come to think otherwise.

So here’s a dirty dozen of suggested fixes for psychology and cognitive neuroscience research that I’ve been mulling over for some time. I want to stress that I deserve no credit for these ideas, which have all been proposed by others.

1. Mandatory inclusion of raw data with manuscript submissions

No ifs. No buts. No hiding behind the lack of ethics approval, which can be readily obtained, or the vagaries of the Data Protection Act. Everyone knows data can be anonymised.

2. Random data inspections

We should conduct fraud checks on a random fraction of submitted data, perhaps using the methodology developed by Uri Simonsohn (once it is peer reviewed and judged statistically sound – as I write this, the technique hasn’t yet been published). Any objective test for fraud must have a very low false discovery rate because the very worst thing would be for an innocent scientist to be wrongly indicted. Fraudsters tend to repeat their behaviour, so the likelihood of false positives in multiple independent data sets from the same researcher should (hopefully) be infinitesimally small.
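The arithmetic behind that last point can be sketched in a few lines. This is only an illustration under assumptions that are not in the post: a hypothetical per-data-set false positive rate `alpha`, and data sets that are genuinely independent, so that the chance of an honest researcher being flagged on all `k` of them is `alpha ** k`. The function name is made up for the example.

```python
def prob_all_flagged(alpha: float, k: int) -> float:
    """Chance that an honest researcher is flagged on every one of k
    independent data sets, given a per-data-set false positive rate alpha.

    Assumes independence between data sets (an idealisation)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be a probability between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    return alpha ** k


# Hypothetical 1% false positive rate per data set.
for k in (1, 2, 3, 5):
    print(f"{k} data set(s): {prob_all_flagged(0.01, k):.2e}")
```

The independence assumption does the heavy lifting here. If the same harmless quirk recurs across an honest researcher's data sets, the real probability would be higher than this sketch suggests.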

3. Registration of research methodology prior to publication

Some time ago, Neuroskeptic proposed that all publishable research should be pre-registered prior to being conducted. That way, we would at least know from the absence of published studies how big the file-drawer is. My first thoughts on reading this were: why wouldn’t researchers just game the system, “pre” registering their research after the experiments are conducted? And what about off-the-cuff experiments conjured up over a beer in the pub?

As Neuroskeptic points out, the first problem could be solved by introducing a minimum 6-month delay between pre-registration and data submission. Also, all prospective co-authors of a pre-registration submission would need to co-sign a letter stating that the research has not yet been conducted.

The second problem is more complicated, but also tractable. My favourite solution is one posed by Jon Brock. Empirical publications could be divided into two categories, Experiments and Observations. Experiments would be the gold standard of hypothesis-driven research. They would be pre-registered with methods (including sample size) and proposed analyses pre-reviewed and unchangeable without further re-review. Observations would be publishable but have a lower weight. They could be submitted without pre-registration, and to protect against false positives, each experiment from which a conclusion is drawn would be required to include a direct internal replication.

4. Greater emphasis on replication

It’s a tired cliché, but if we built aircraft the way we do psychological research, every new plane would start life exciting and interesting before ending in an equally exciting fireball. Replication in psychology is dismally undervalued, and I can’t really figure out why this is when everyone, even journal editors, admits how crucial it is. It’s as though we’re trapped in some kind of groupthink and can’t get out. One solution, proposed by Nosek, Spies and Motyl, is the development of a metric called the Replication Value (RV). The RV would tell us which effects are most worth replicating. To quote directly from their paper, which I highly recommend:

Metrics to identify what is worth replicating. Even if valuation of replication increased, it is not feasible – or advisable – to replicate everything. The resources required would undermine innovation. A solution to this is to develop metrics for identifying Replication Value (RV)– what effects are more worthwhile to replicate than others? The Open Science Collaboration (2012b) is developing an RV metric based on the citation impact of a finding and the precision of the existing evidence of the effect. It is more important to replicate findings with a high RV because they are becoming highly influential and yet their truth value is still not precisely determined. Other metrics might be developed as well. Such metrics could provide guidance to researchers for research priorities, to reviewers for gauging the “importance” of the replication attempt, and to editors who could, for example, establish an RV threshold that their journal would consider as sufficiently important to publish in its pages.

I think this is a great idea. As part of the manuscript reviewing process, reviewers could assign an RV to specific experiments. Then, on a rolling basis, the accepted studies that are assigned the highest weightings would be collated and announced. Journals could have special issues focusing on replication of leading findings, with specific labs invited to perform direct replications and the results published regardless of the outcome. This method could also bring in adversarial collaborations, in which labs with opposing agendas work together in an attempt to reproduce each other’s results.

5. Standardise acceptable analysis practices

Neuroimaging analyses have too many moving parts, and it is easy to delude ourselves that the approach which ends up ‘working’ (after countless reanalyses) is the one we originally intended. Psychological analyses have fewer degrees of freedom but this is still a major problem. We need to formulate a consensus view on gold standard practices for excluding outliers, testing and reporting covariates, and inferential approaches in different situations. Where multiple legitimate options exist, supplementary information should include analyses of them all, and raw data should be available to readers (see point 1).

6. Institute standard practices for data peeking

Data peeking isn't necessarily bad, but if we do it then we need to correct for it. Uncorrected peeking runs riot in psychology and neuroimaging because the pressure to publish and the dependence of publication on significant results have made chasing p-values the norm. We can see it in other areas of science too. Take the Higgs. Following initial hints at 3-sigma last year, the physicists kept adding data until they reached 5-sigma. The fact that their alpha is so stringent in the first place provides reassurance that they have genuinely discovered something. But if they peeked and chased then it simply isn’t the 5-sigma discovery that was advertised. (As a side note: how about we ditch Fisher-based stats altogether and go Bayesian? That way we can actually test that pesky null hypothesis.)
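To see why uncorrected peeking matters, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the null hypothesis is true (the data are pure noise), the test is a two-sided z-test with known unit variance, and the hypothetical researcher checks for significance after every batch of 10 observations, stopping as soon as p < .05.

```python
import random
from statistics import NormalDist


def is_significant(sample, alpha=0.05):
    """Two-sided z-test of 'mean == 0', assuming known unit variance."""
    n = len(sample)
    z = (sum(sample) / n) * n ** 0.5
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return p < alpha


def false_positive_rate(peek, n_sims=2000, batch=10, n_batches=10, seed=1):
    """Fraction of simulated null experiments that end up 'significant'.

    peek=True  : test after every batch and stop at the first p < alpha.
    peek=False : test once, after all the data are in."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        data = []
        significant = False
        for _ in range(n_batches):
            data.extend(rng.gauss(0.0, 1.0) for _ in range(batch))
            if peek and is_significant(data):
                significant = True
                break
        if not peek:
            significant = is_significant(data)
        hits += significant
    return hits / n_sims


fixed_rate = false_positive_rate(peek=False)
peek_rate = false_positive_rate(peek=True)
print(f"single test at the end: {fixed_rate:.3f}")
print(f"peeking after each batch: {peek_rate:.3f}")
```

With a single test at the end, the false positive rate should sit near the nominal 5%. With ten uncorrected looks it comes out substantially higher, which is exactly why peeking needs a correction.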

7. Officially recognise quality of publications over quantity

Everyone agrees that quality of publications is paramount, but we still chase quantity and value ‘prolific’ researchers. So how about setting a cap on the number of publications each researcher or lab can publish per year? That way we would truly have an incentive to make sure of results before publishing them. It would also encourage us to publish single papers with multiple experiments and more definitive conclusions.

8. Ditch impact factor and let us never speak of it again

As scientists who purportedly know something about numbers, we should be collectively ashamed of ourselves for being conned by journal impact factors (IF). Nowhere is the ludicrous doublethink of the IF culture more apparent than in the current REF, where the advice from universities amounts to “IF of journals is not taken into account in assessing quality of your REF submissions” while simultaneously advising us to “ensure that your four submissions are from the highest impact journals”. Complete with helpful departmental emails reminding us which journals are going up in IF (which is all of them as far as I can tell), the situation really is quite stupid and embarrassing. Here’s a fact shown by Bjorn Brembs: IF correlates better with retraction rate than citation rate. We should replace IF with article-specific merits such as post-publication ratings, article citation count, or – shock horror – considered assessment of the article after reading the damn thing.

9. Open access publication

Much has been said and written in the last few days about open access, with the Government making important steps toward an open scientific future in the UK (I recommend following the blogs of Stephen Curry and Mike Taylor for the latest developments and analysis). For my part, I think the sooner we eliminate corporate publishers the better. I simply don’t see what value they add when all of the reviewing and editing is done by us at zero cost.

10. Stop conflating research inputs with research outputs

Getting a research grant is great, but we need to stop counting grants as outputs. They are inputs. We need to start assessing the quality of science by balancing outputs against inputs, not by adding them together.

11. Rethink authorship

Academic authorship is antiquated and not designed for collaborative teams. By rank-ordering authors from first to last, we make it impossible for multiple co-authors to make a genuinely equal contribution (Ah, I hear you cry, what about that little asterisk that flags equal contributions? Well, sorry, but…um…nobody really takes much notice of those).

I think a better approach would be to list authors alphabetically on all papers and simply assign % contributions to different areas, such as experimental design, analysis, data collection, interpretation of results, and manuscript preparation. Some journals already do this in some form, but I would like to see this completely replace the current form of authorship.
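As a rough sketch of what such a record might look like (the author names, the percentages and the `build_byline` helper are all hypothetical, and no journal's actual scheme is implied), each area's contributions could be required to sum to 100% and the byline generated alphabetically:

```python
AREAS = (
    "experimental design",
    "data collection",
    "analysis",
    "interpretation of results",
    "manuscript preparation",
)


def build_byline(contributions):
    """Return the authors in alphabetical order.

    contributions maps author -> {area: percent}. Raises ValueError
    unless the shares for every area add up to exactly 100."""
    for area in AREAS:
        total = sum(shares.get(area, 0) for shares in contributions.values())
        if total != 100:
            raise ValueError(f"shares for '{area}' sum to {total}, not 100")
    return sorted(contributions, key=str.casefold)


# Hypothetical three-author paper, entered in no particular order.
example = {
    "Chen": {"experimental design": 60, "interpretation of results": 40,
             "manuscript preparation": 70},
    "Adams": {"experimental design": 40, "analysis": 50,
              "interpretation of results": 30, "manuscript preparation": 30},
    "Baker": {"data collection": 100, "analysis": 50,
              "interpretation of results": 30},
}

for author in build_byline(example):
    print(author, example[author])
```

The byline order then carries no information at all; the contribution table does all the work.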

12. Revise the peer review system

Independent peer review may be the best mechanism we currently have for triaging science, but it still sucks. For one thing, it’s usually not independent. I often get asked to review papers by scientists I know or have even worked with. I’ve even been asked to review my own papers on occasion, and was once asked to review my own grant application! (You’ll be glad to know I declined all such instances of self-review). The review process is random and noisy, and based on such a pitifully small sample of comments that the notion of it providing meaningful information is, statistically speaking, quite ridiculous.

I personally favour the idea of cutting down on the number of detailed reviewers per manuscript and instead calling on a larger number of ‘speed reviewers’, who would simply rate the paper according to various criteria, without having to write any comments. As a reviewer, I often find that I can form an opinion of an article relatively quickly – it is writing the review that takes the most time.

Last week, Paul Knoepfler wrote a provocative blog post proposing an innovation in peer review in which authors review the reviewers. Could this help improve quality of reviews? Unfortunately, I don’t think Paul’s system would work (see my comment on his post here), but perhaps some kind of independent meta-review of reviewers could also be a good idea in a limited number of cases.

What do you think? Got better ideas? Please leave any comments below.

** Update 18/7/12, 14:30: On the issue of the gender imbalance in academic fraud, Mark Baxter has kindly reminded me of this case involving Karen M. Ruggiero.

15 comments:

StokesBlog, 18 July 2012 at 14:33
Great post! I especially like the division between experiments and observations. I believe strongly in exploring data, which leads to invaluable observations. Obviously, science would be completely lifeless without exploration! But such observations could certainly be evaluated using different criteria from those borrowed from hypothesis testing (though, of course, correcting for multiple comparisons is already a part solution, if done properly). Also, any interesting observations should be subjected to further replication using the less flexible experiment paradigm (including pre-registration). This could also deal with the issue of parameter fitting in imaging. While it may be OK to explore lots of ways of looking at your data, the final approach should also work with new unseen data. I.e., replicated using exactly the same data acquisition and analysis procedure.
Dave Nussbaum, 18 July 2012 at 14:49
A great list, Chris!

The big question to me is how do we implement these changes. One important starting place is to discuss and debate them and build consensus around which changes are necessary and how they are best executed. Posts like this one are an important step in that direction.

One of the interesting things that Simmons, Nelson, and Simonsohn bring up is that even if journals, universities, and funding agencies move slowly to adopt reforms, researchers themselves can announce in their own papers the steps they've taken (obviously you can't randomly inspect your own data, but you can determine your sample size, outlier cutoffs, and data analysis strategy in advance). The hope is to create a social norm -- an environment in which it is understood that these are the expectations for good research, leading more people to adopt them and empowering reviewers to ask whether these steps have been followed.

Lastly, I find your speed-reviewing technique intriguing. I have to admit I would have some reservations about simply introducing it. Still, it would be really interesting to do some pilot testing. For example, why not have normal review serve as a control group and compare the resulting acceptances, rejections, and ratings to a sample of speed reviews. If they come out the same then you've got a decent argument for considering a switch.
Anonymous, 18 July 2012 at 15:30
Hi Chris - very nice and thoughtful post. I agree on all 12 points. Of course, the hard part is getting these changes implemented. Many of the ideas you proposed have been around since the 1960s, and we have had the technology to implement them for over a decade, and yet we are still a long way from reform. We face a classic collective action problem: While the system as a whole would benefit from these changes, individual actors/journals/authors do not benefit from being the first mover.

By far the best way to solve a collective action problem is if an external force changes the incentive structure so that individual actors benefit from reforming. It is clear to me that only the granting agencies can apply this force, by rewarding scientists (using grant preferences) who follow good practices and who submit to good-practice journals.
http://filedrawer.wordpress.com/2012/04/17/its-the-incentives-structure-people-why-science-reform-must-come-from-the-granting-agencies/
Anonymous, 18 July 2012 at 15:33
A nice post, Chris, as always.

Here's my two cents' worth:

1. Agreed
2. Agreed, but as you say, we must be very careful to guard against false accusations. It would be far better, in my opinion, to let fraudulent data be published (and it might then be subject to failed replications etc.) than to accuse someone innocent.
3. Agreed
4. Agreed – but sometimes this might be tricky in cases involving patients. What if patients of this particular type are very rare? And/Or the patient deteriorates/recovers after the first testing session (and how would you prove that rather than report that your replication attempt had failed?)? It’s not really practical in every case to provide a replication, in-house, or otherwise.
5. YES! I have no problem with data exploration, but as said above we should replicate the “successful” results in a new set of participants, and justify arbitrary decisions (e.g. outlier removal) and provide supplemental data confirming that the results were qualitatively the same if different decisions were made.
However, I’m not sure about making raw data freely available to all. This could mean re-analysis of my data to answer a question for which it is not suited, and without due acknowledgement of where the data came from, and could perhaps be in competition with my own on-going work. By all means submit the raw data with the publication – perhaps reviewers should be encouraged (required?) to check the analyses/alternatives. Individual readers should be allowed to request raw data, but perhaps only for particular uses or with permission from the original author(s). Or perhaps we could make raw data available only after some delay so that the original authors get first crack at any further analyses they may want to conduct (perhaps use their published data as a control group for some on-going work?).
6. Agreed
7. Your proposed cap makes me nervous. A PI might decide that he/she would prefer to (strategically) publish papers on various aspects of their lab’s work over others. Perhaps PIs have a grant they’re about to submit and want to give the impression that their lab is expert in this area. Or perhaps they’d have to choose between publishing grander work from their senior postdoc who is further ahead in their project than a new PhD student. I think this suggestion has the potential to disproportionately hurt junior scientists (PhDs/postdocs).
8. Agreed
9. Agreed
10. Agreed
11. I see your point, but I don’t really have a problem with the current system. I think it’s clear and does the job most of the time (and my surname does not begin with an A or a Z!).
12. Reviewing does take time, and perhaps the current system isn’t perfect. But the speed reviewing you suggest might hurt the scientific process. I have (almost) always found reviewers’ suggestions to be helpful and IMPROVE my work. Speed-reviewing, as you describe it, could mean that I’d lose this. And junior researchers who are still finding their feet might lose out the most if we lost reviewers’ comments. I’d prefer to see reviewers acknowledged in some way for their time and contribution to the paper (some journals, e.g. Frontiers, already do something like this).
Unknown, 18 July 2012 at 16:24
Great read! Thanks.
My two cents:
1. Top journals should make their names not through high IFs, but rather through openness. I propose they transform into databases where each article is an entry that has the following contents: the paper itself, possibly reviewer comments, the raw data, replication study reports.
Furthermore, they should sponsor replication studies: replications are actually perfect material for masters' theses or even graduate students in their first year learning the experimentation skills. Why not award prizes or small funds to such people?

2. I have reservations against the speed reviewing as well. On the other hand, there should be some changes in the review system. First of all, not all reviewers are selected randomly. Nice articles (fashionable topics; written by high profile researchers; ...) seem to "attract" other reviewers than not-so-nice articles. That is a bias at the editor side.
2b. Review/evaluation of reviewers, as suggested by others, is a great idea. And such evaluations should become an integral part of track records, next to the output you deliver as an author.
Bashir, 19 July 2012 at 14:13
“one-man internal affairs bureau”

Even reading this line makes me shudder. Has his paper outlining his method come out yet? Even if he is 3/3 in catching real fraud it's not a tenable situation that he be the sole arbiter of which studies are investigated. I will admire his work when he is open about it.
Fred, 19 July 2012 at 17:08
Great post Chris!
I agree with most of the points you raised except for the speed review thing. I can't see a point & click system working. How about limiting the length of reviews to one page with clear sections to fill in?
I can't see the MCQ version working if reviewers don't justify their choice (even briefly).

I really enjoyed reading it!
Ulrich Schimmack, 20 July 2012 at 04:30
It is great to see the increased awareness about problems in psychological research and suggestions for improvements.

The problems are not new (Sterling, 1959) and one solution is also quite old, namely to increase statistical power (Cohen, 1962).

Whereas a priori power analysis can reduce the need for data fudging, post-hoc power analysis can be used to detect data fudging.

http://www.utm.utoronto.ca/~w3psyuli/PReprints/IC.pdf
Anonymous, 23 July 2012 at 15:29
How would these items, particularly 1 and 4, work for qualitative studies? Making these datasets anonymous can be difficult, and sometimes impossible, for example. I am worried these kinds of stringent measures would work towards further blocking this important side of psychological research, though it does merit consideration. Perhaps we should work towards fostering an atmosphere of responsibility rather than trying to police it. I'm not sure of the stats, but what is the ratio of known research frauds committed today to research articles published, compared to years past?