It’s been quite a
month in science.
On the bright side, we
probably discovered the Higgs boson (or at least something that smells
pretty Higgsy), and in the last few days the UK
Government and EU
Commission have made a strong commitment to supporting open-access
publishing. In two years, so they say, all published science in Britain will be
freely available to the public rather than being trapped behind corporate
paywalls. This is a tremendous move and I applaud David Willetts for his
political courage and long-term vision.
On the not-so-bright
side, we’ve seen a flurry of academic fraud cases. Barely a day seems to pass without yet another researcher caught
spinning yarns that, on reflection, did sound pretty far-fetched in the first
place. What’s that? Riding up rather than down an escalator makes you
more charitable? Dirty bus stops make you more racist? Academic fraudsters are
more likely to have ground-floor offices? Ok, I made that last one up (or
rather, Neuroskeptic
did) but if such findings sound like bullshit to you, well funnily enough
they actually are. Who says science isn’t self-correcting?
We owe a great debt to
Uri Simonsohn, the one-man internal affairs bureau, for judiciously uncovering at
least three cases of fraudulent practice in psychological research. So far
his investigations have led to two resignations and counting. Bravo. This is a
thankless task that will win him few friends, and for that alone I admire him.
And as if to remind us
that fraud is by no means unique to psychology, enter the towering Godzilla of
mega-fraud – Japanese anaesthesiologist, Yoshitaka Fujii, who
has achieved notoriety by becoming the most
fraudulently productive scientist ever known.
(As an aside, has
anyone ever noticed how the big frauds in science always seem to be perpetrated by
men? Are women more honest or do they just make savvier fraudsters?)
Along with all the
talk of fraud in psychology, we have had to tolerate the usual line-up of ‘psychology
isn’t science’ rants from those who ought to learn something before setting hoof to keyboard.
Fortunately we have Dave Nussbaum to sort these guys out, which he does
with a steady hand
and a sharp blade. Thank you, Dave!
With psychological
science facing challenges and shake-ups on so many different fronts, the time seems ripe for
some self-reflection. I used to believe we had a firm grasp on methodology and best practice. Lately I’ve
come to think otherwise.
So here’s a dirty
dozen of suggested fixes for psychology and cognitive neuroscience research
that I’ve been mulling over for some time. I want to stress that I deserve no
credit for these ideas, which have all been proposed by others.
1. Mandatory inclusion of raw data with manuscript submissions
No ifs. No buts. No hiding
behind the lack of ethics approval, which can be readily obtained, or the
vagaries of the Data Protection Act. Everyone knows data can be anonymised.
2. Random data inspections
We should conduct
fraud checks on a random fraction of submitted data, perhaps using the
methodology developed by Uri Simonsohn (once it is peer reviewed and judged
statistically sound – as I write this, the technique hasn’t yet been
published). Any objective test for fraud must have a very low false
discovery rate because the very worst thing would be for an
innocent scientist to be wrongly indicted. Fraudsters tend to repeat their
behaviour, so the likelihood of false positives in multiple independent data
sets from the same researcher should (hopefully) be infinitesimally small.
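To put rough numbers on that intuition, here's a back-of-the-envelope sketch in Python. The 1% per-dataset false-positive rate is a figure I've made up for illustration, and this is emphatically not Simonsohn's (still unpublished) method; it simply shows how quickly the odds of falsely flagging an innocent researcher shrink when several independent datasets are required to agree.

```python
# Illustrative arithmetic only: assumes a hypothetical fraud test with a
# fixed per-dataset false-positive rate and statistically independent
# datasets. It is NOT Simonsohn's (unpublished) technique.

def prob_innocent_flagged_every_time(false_positive_rate, n_datasets):
    """Chance that an innocent researcher trips the detector on every one
    of n independent datasets purely by bad luck."""
    return false_positive_rate ** n_datasets

for n in (1, 2, 3):
    p = prob_innocent_flagged_every_time(0.01, n)
    print(f"{n} independent dataset(s) flagged by chance: p = {p:.0e}")

# With a 1% false-positive rate per dataset, three independent flags
# would arise by chance only about one time in a million.
```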
3. Registration of research methodology prior to publication
Some time ago, Neuroskeptic
proposed that all publishable research should be pre-registered prior to being
conducted. That way, we would at least know from the absence of published
studies how big the file-drawer is. My first thoughts on reading this were: why
wouldn’t researchers just game the system, “pre” registering their research
after the experiments are conducted? And what about off-the-cuff experiments conjured up over a beer in the pub?
As Neuroskeptic points
out, the first problem could be solved by introducing a minimum 6-month delay
between pre-registration and data submission. Also, all prospective co-authors of a
pre-registration submission would need to co-sign a letter stating that the research
has not yet been conducted.
The second problem is
more complicated, but also tractable. My favourite solution is one posed by Jon Brock. Empirical publications could be divided
into two categories, Experiments and Observations. Experiments
would be the gold standard of hypothesis-driven research. They would be pre-registered
with methods (including sample size) and proposed analyses pre-reviewed and unchangeable
without further re-review. Observations would be publishable but have a lower weight. They could be submitted without
pre-registration, and to protect against false positives, each experiment from
which a conclusion is drawn would be required to include a direct internal
replication.
4. Greater emphasis on replication
It’s a tired cliché,
but if we built aircraft the way we do psychological research, every new plane
would start life exciting and interesting before ending in an equally exciting fireball. Replication
in psychology is dismally undervalued, and I can’t really figure out why this is when
everyone, even journal editors, admits how crucial it is. It’s as though
we’re trapped in some kind of groupthink and can’t get out. One solution,
proposed by Nosek, Spies and Motyl,
is the development of a metric called the Replication Value (RV). The RV would
tell us which effects are most worth replicating. To quote directly from their
paper, which I highly recommend:
Metrics to
identify what is worth replicating. Even if valuation of replication increased, it is not feasible – or
advisable – to replicate everything. The resources required would undermine
innovation. A solution to this is to develop metrics for identifying Replication
Value (RV) – what effects are more worthwhile to replicate than others? The
Open Science Collaboration (2012b) is developing an RV metric based on the
citation impact of a finding and the precision of the existing evidence of the
effect. It is more important to replicate findings with a high RV because they
are becoming highly influential and yet their truth value is still not
precisely determined. Other metrics might be developed as well. Such metrics
could provide guidance to researchers for research priorities, to reviewers for
gauging the “importance” of the replication attempt, and to editors who could,
for example, establish an RV threshold that their journal would consider as
sufficiently important to publish in its pages.
I think this is a
great idea. As part of the manuscript reviewing process, reviewers could assign an RV to
specific experiments. Then, on a rolling basis, the accepted studies that are
assigned the highest weightings would be collated and announced. Journals could
have special issues focusing on replication of leading findings, with specific labs
invited to perform direct replications and the results published regardless of
the outcome. This method could also bring in adversarial collaborations, in which
labs with opposing agendas work together in an attempt to reproduce each other’s results.
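Just to make the RV idea concrete, here is a toy calculation of my own devising: influence goes up with citation rate, and "worth replicating" goes up when the existing estimate is imprecise. To be clear, this is not the metric the Open Science Collaboration is actually developing, only a sketch of the general shape such a metric might take.

```python
import math

def toy_replication_value(citations_per_year, standard_error):
    """Toy RV score: citation impact weighted by how imprecisely the effect
    is currently estimated. Purely illustrative, not the OSC metric."""
    influence = math.log1p(citations_per_year)  # damp runaway citation counts
    imprecision = standard_error                # wider interval -> more worth replicating
    return influence * imprecision

# A heavily cited effect estimated from a small, noisy sample...
print(toy_replication_value(citations_per_year=120, standard_error=0.25))  # high RV
# ...versus a rarely cited effect that is already tightly estimated.
print(toy_replication_value(citations_per_year=3, standard_error=0.05))    # low RV
```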
5. Standardise acceptable analysis practices
Neuroimaging analyses
have too
many moving parts, and it is easy to delude ourselves that the approach which ends up ‘working’ (after countless reanalyses) is the one we originally intended. Psychological analyses have fewer degrees of freedom but this is still a
major problem. We need to formulate a consensus view on gold standard
practices for excluding outliers, testing and reporting covariates, and
inferential approaches in different situations. Where multiple legitimate options exist, supplementary
information should include analyses of them all, and raw data should be
available to readers (see point 1).
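As a sketch of what "analyse them all" could look like in practice, the snippet below (with invented data and cut-offs) runs the same two-group comparison under every defensible outlier rule and reports the lot, so readers can judge whether the conclusion hinges on the analyst's choice.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(0.3, 1.0, 40)   # hypothetical scores for two groups
group_b = rng.normal(0.0, 1.0, 40)   # (invented data)

def exclude_sd(x, k):
    """Drop observations more than k standard deviations from the mean."""
    return x[np.abs(x - x.mean()) <= k * x.std(ddof=1)]

# Every defensible outlier rule gets reported, not just the one that 'works'.
rules = {"no exclusion": None, "2 SD": 2, "2.5 SD": 2.5, "3 SD": 3}

for label, k in rules.items():
    a = group_a if k is None else exclude_sd(group_a, k)
    b = group_b if k is None else exclude_sd(group_b, k)
    t, p = stats.ttest_ind(a, b)
    print(f"{label:12s} t({len(a) + len(b) - 2}) = {t:.2f}, p = {p:.3f}")
```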
6. Institute standard practices for data peeking
Data peeking isn't necessarily bad, but if we do it then we need to correct for it. Uncorrected peeking runs riot in
psychology and neuroimaging because the pressure to publish and the dependence
of publication on significant results have made chasing p-values the norm. We can
see it in other areas of science too. Take the Higgs. Following initial
hints at 3-sigma last year, the physicists kept adding data until they reached
5-sigma. The fact that their alpha is so stringent in the first place provides
reassurance that they have genuinely discovered something. But if they peeked
and chased then it simply isn’t the 5-sigma discovery that was advertised. (As
a side note: how about we ditch Fisher-based stats altogether and go Bayesian? That way we can actually test that pesky null hypothesis.)
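For anyone who doubts how badly uncorrected peeking misbehaves, here is a quick simulation with made-up parameters: there is no true effect, yet a researcher who tests after every batch of ten participants per group and stops at the first p < .05 "discovers" one far more often than the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, batch, max_n, alpha = 5000, 10, 100, 0.05
false_positives = 0

for _ in range(n_sims):
    a, b = np.empty(0), np.empty(0)          # two groups; the null is true
    while len(a) < max_n:
        a = np.concatenate([a, rng.normal(0, 1, batch)])
        b = np.concatenate([b, rng.normal(0, 1, batch)])
        if stats.ttest_ind(a, b).pvalue < alpha:   # peek after every batch
            false_positives += 1                   # stop and 'publish'
            break

print(f"False-positive rate with uncorrected peeking: {false_positives / n_sims:.2f}")
# Typically lands somewhere around 0.15-0.20 rather than the nominal 0.05.
```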
7. Officially recognise quality of publications over quantity
Everyone agrees
that quality
of publications is paramount, but we still chase quantity and value
‘prolific’ researchers. So how about setting a cap on the number of publications each
researcher or lab can publish per year? That way we would truly have an incentive to make sure of results before publishing them. It would also encourage us
to publish single papers with multiple experiments and more definitive
conclusions.
8. Ditch impact factor and let us never speak of it again
As scientists who
purportedly know something about
numbers, we should be collectively ashamed of ourselves for being conned by journal impact factors
(IF). Nowhere is the ludicrous doublethink of the IF culture more apparent
than in the current REF, where the advice from
universities amounts to “IF of journals is not taken into account in assessing
quality of your REF submissions” while simultaneously advising us to “ensure that
your four submissions are from the highest impact journals”. Complete with
helpful departmental emails reminding us which journals are going up in IF (which is all of them as far as I can tell), the situation really is quite stupid
and embarrassing. Here’s a fact shown by Bjorn Brembs: IF
correlates better with retraction rate than with citation rate. We should replace
IF with article-specific merits such as post-publication ratings, article citation
count, or – shock horror – considered assessment of the article after reading the
damn thing.
9. Open access publication
Much has been said and
written in the last few days about open access, with the Government taking
important steps toward an open scientific future in the UK (I recommend following the blogs of Stephen Curry and Mike Taylor for the latest developments and
analysis). For my part, I think the
sooner we eliminate corporate publishers the better. I simply don’t see what
value they add when all of the reviewing and editing is done by us at zero cost.
10. Stop conflating research inputs with research outputs
Getting a research grant is great, but we need to stop
counting grants as outputs. They are inputs. We need to start assessing the quality
of science by balancing outputs against inputs, not by adding them together.
11. Rethink authorship
Academic authorship is
antiquated and not designed for collaborative teams. By rank-ordering authors
from first to last, we make it impossible for multiple co-authors to make a genuinely equal
contribution (Ah, I hear you cry, what about that little asterisk that flags
equal contributions? Well, sorry, but…um…nobody really takes much notice of those).
I think a better
approach would be to list authors alphabetically on all papers and simply
assign % contributions to different areas, such as experimental design,
analysis, data collection, interpretation of results, and manuscript
preparation. Some journals already do this in some form, but I would like to
see this completely replace the current form of authorship.
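By way of illustration only, a contribution record could be as simple as a small structured table attached to each paper. The names, categories, and percentages below are entirely hypothetical.

```python
# Hypothetical contribution record for an alphabetically listed author team.
contributions = {
    "Archer": {"design": 40, "data collection": 10, "analysis": 30,
               "interpretation": 40, "writing": 50},
    "Baker":  {"design": 40, "data collection": 70, "analysis": 50,
               "interpretation": 30, "writing": 20},
    "Carter": {"design": 20, "data collection": 20, "analysis": 20,
               "interpretation": 30, "writing": 30},
}

# Sanity check: each area of the project should sum to 100% across authors.
areas = next(iter(contributions.values())).keys()
for area in areas:
    total = sum(author[area] for author in contributions.values())
    assert total == 100, f"{area} sums to {total}%"
print("All contribution areas sum to 100%.")
```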
12. Revise the peer review system
Independent peer
review may be the best mechanism we currently have for triaging science, but it still
sucks. For one thing, it’s usually not independent. I often get asked to review
papers by scientists I know or have even worked with. I’ve even been asked to
review my own papers on occasion, and was once asked to review my own grant
application! (You’ll be glad to know I declined all such instances of
self-review). The review process is random and noisy, and based on such a pitifully
small sample of comments that the notion of it providing meaningful information is, statistically speaking, quite ridiculous.
I personally favour
the idea of cutting down on the number of detailed reviewers per manuscript and
instead calling on a larger number of ‘speed reviewers’, who would simply rate
the paper according to various criteria, without having to write any comments.
As a reviewer, I often find that I can form an opinion of an article relatively
quickly – it is writing the review that takes the most time.
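To make that concrete, here is a toy sketch of how a batch of speed-review ratings might be aggregated, with one or two in-depth reviewers then invited from among those whose ratings sit closest to the consensus (an option I return to in the comments below). The criteria, scores, and threshold are all invented.

```python
# Toy sketch: aggregate speed-review ratings (1-10) on a few criteria,
# then pick the in-depth reviewers whose scores sit closest to the mean.
# Reviewer names, criteria, and thresholds are invented for illustration.
speed_reviews = {
    "reviewer_1": {"soundness": 7, "novelty": 6, "replication_value": 8},
    "reviewer_2": {"soundness": 8, "novelty": 5, "replication_value": 7},
    "reviewer_3": {"soundness": 4, "novelty": 9, "replication_value": 6},
    "reviewer_4": {"soundness": 7, "novelty": 6, "replication_value": 7},
}

overall = {name: sum(r.values()) / len(r) for name, r in speed_reviews.items()}
mean_rating = sum(overall.values()) / len(overall)

if mean_rating >= 6.0:  # only promising papers proceed to in-depth review
    in_depth = sorted(overall, key=lambda n: abs(overall[n] - mean_rating))[:2]
    print(f"Mean rating {mean_rating:.1f}: invite {in_depth} for full review")
else:
    print(f"Mean rating {mean_rating:.1f}: reject without full review")
```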
Last week, Paul Knoepfler wrote a provocative
blog post proposing an innovation in peer review in which authors review
the reviewers. Could this help improve the quality of reviews? Unfortunately, I don’t think Paul’s system would work (see my
comment on his post here),
but perhaps some kind of independent
meta-review of reviewers could also be a good idea in a limited number of cases.
__
What do you think? Got
better ideas? Please leave any comments below.
** Update 18/7/12, 14:30: On the issue of the gender imbalance in academic fraud, Mark Baxter has kindly reminded me of this case involving Karen M. Ruggiero.
Great post! I especially like the division between experiments and observations. I believe strongly in exploring data, which leads to invaluable observations. Obviously, science would be completely lifeless without exploration! But such observations could certainly be evaluated using different criteria than those borrowed from hypothesis testing (though, of course, correcting for multiple comparisons is already a partial solution, if done properly). Also, any interesting observations should be subjected to further replication using the less flexible experiment paradigm (including pre-registration). This could also deal with the issue of parameter fitting in imaging. While it may be OK to explore lots of ways of looking at your data, the final approach should also work with new, unseen data, i.e. be replicated using exactly the same data acquisition and analysis procedure.
Yes I agree, exploratory work is crucial - just look at all the major discoveries that stemmed from serendipity. By splitting up the categories of publications like this we can have the best of both worlds...
DeleteA great list, Chris!
The big question to me is how do we implement these changes. One important starting place is to discuss and debate them and build consensus around which changes are necessary and how they are best executed. Posts like this one are an important step in that direction.
One of the interesting things that Simmons, Nelson, and Simonsohn bring up is that even if journals, universities, and funding agencies move slowly to adopt reforms, researchers themselves can announce in their own papers the steps they've taken (obviously you can't randomly inspect your own data, but you can determine your sample size, outlier cutoffs, and data analysis strategy in advance). The hope is to create a social norm -- an environment in which it is understood that these are the expectations for good research, leading more people to adopt them and empowering reviewers to ask whether these steps have been followed.
Lastly, I find your speed-reviewing technique intriguing. I have to admit I would have some reservations about simply introducing it. Still, it would be really interesting to do some pilot testing. For example, why not have normal review serve as a control group and compare the resulting acceptances, rejections, and ratings to a sample of speed reviews. If they come out the same then you've got a decent argument for considering a switch.
Thanks Dave. I was very much inspired by your piece, which readers can find here:
http://www.davenussbaum.com/crimes-and-misdemeanors-reforming-social-psychology/
I wonder if any journals would consider trialling a speed-reviewing system. In the first instance, they could keep their initial approach untouched and simply contact additional speed reviewers. It would be interesting to see if the final decision (rendered purely based on the detailed reviews) was predictable from a sample of speedy reviews.
I couldn't agree more that we need data before making any systematic changes to the peer review system. At the same time, my instinct is that we would gain more by having a greater number of less in-depth reviews than under the current system (which, let's face it, often results in fairly superficial reviews anyway)
Hi Chris - very nice and thoughtful post. I agree on all 12 points. Of course, the hard part is getting these changes implemented. Many of the ideas you proposed have been around since the 1960s, and we have had the technology to implement them for over a decade, and yet we are still a long way from reform. We face a classic collective action problem: While the system as a whole would benefit from these changes, individual actors/journals/authors do not benefit from being the first mover.
By far the best way to solve a collective action problem is if an external force changes the incentive structure so that individual actors benefit from reforming. It is clear to me that only the granting agencies can apply this force, by rewarding scientists (using grant preferences) who follow good practices and who submit to good-practice journals.
http://filedrawer.wordpress.com/2012/04/17/its-the-incentives-structure-people-why-science-reform-must-come-from-the-granting-agencies/
Hi Chris, I just read your post and tweeted it. Very well said, I agree that these pressures need to come from outside, and what better external force than funding agencies! As you say, we've already proven as a community that we're incapable of making these changes on our own.
A nice post, Chris, as always.
Here's my two cents' worth:
1. Agreed
2. Agreed, but as you say, we must be very careful to guard against false accusations. It would be far better, in my opinion, to let fraudulent data be published (and it might then be subject to failed replications etc.) than to accuse someone innocent.
3. Agreed
4. Agreed – but sometimes this might be tricky in cases involving patients. What if patients of this particular type are very rare? And/Or the patient deteriorates/recovers after the first testing session (and how would you prove that rather than report that your replication attempt had failed?)? It’s not really practical in every case to provide a replication, in-house, or otherwise.
5. YES! I have no problem with data exploration, but as said above we should replicate the “successful” results in a new set of participants, and justify arbitrary decisions (e.g. outlier removal) and provide supplemental data confirming that the results were qualitatively the same if different decisions were made.
However, I’m not sure about making raw data freely available to all. This could mean re-analysis of my data to answer a question for which it is not suited, and without due acknowledgement of where the data came from, and could perhaps be in competition with my own on-going work. By all means submit the raw data with the publication – perhaps reviewers should be encouraged (required?) to check the analyses/alternatives. Individual readers should be allowed to request raw data, but perhaps only for particular uses or with permission from the original author(s). Or perhaps we could make raw data available only after some delay so that the original authors get first crack at any further analyses they may want to conduct (perhaps use their published data as a control group for some on-going work?).
6. Agreed
7. Your proposed cap makes me nervous. A PI might decide that he/she would prefer to (strategically) publish papers on various aspects of their lab’s work over others. Perhaps PIs have a grant they’re about to submit and want to give the impression that their lab is expert in this area. Or perhaps they’d have to choose between publishing grander work from their senior postdoc who is further ahead in their project than a new PhD student. I think this suggestion has the potential to disproportionately hurt junior scientists (PhDs/postdocs).
8. Agreed
9. Agreed
10. Agreed
11. I see your point, but I don’t really have a problem with the current system. I think it’s clear and does the job most of the time (and my surname does not begin with an A or a Z!).
12. Reviewing does take time, and perhaps the current system isn’t perfect. But the speed reviewing you suggest might hurt the scientific process. I have (almost) always found reviewers’ suggestions to be helpful and IMPROVE my work. Speed-reviewing, as you describe it, could mean that I’d lose this. And junior researchers who are still finding their feet might lose out the most if we lost reviewers’ comments. I’d prefer to see reviewers acknowledged in some way for their time, and contribution to the paper (some Journals e.g. Frontiers already do something like this).
Wow, thanks for the detailed feedback. Re the points of disagreement/discussion:
2 - I agree; this is essentially Blackstone's principle in criminal law, and it is basically irrefutable in my book: http://en.wikipedia.org/wiki/Blackstone%27s_formulation
4 - Good point. Perhaps there could be exceptions, as you point out, where Observations could be published without internal replication (provided the original experiment provides strong enough evidence)
5 - This is just my personal view, but I think that if we're going to release data then we need to surrender ownership of it once published (I realise that is an extreme view that many will disagree with). I just feel that data should be in the public domain for all to see and use as they see fit. It's for reviewers to decide if the data was used appropriately, and naturally the source should be acknowledged whenever the data is used. To me this seems no different from citing other people's published papers. Using someone's data without acknowledging them would be tantamount to data plagiarism and would be academic misconduct.
7 - That's a fair point. The last thing we want to do is disadvantage young scientists. So perhaps the cap could be per staff member/PhD student rather than per lab. Without imposing something that limits the quantity of output, we're never going to be able to give quality the attention it deserves.
12 - Yes, reviewers can (and often do) help improve manuscripts. But they also get things wrong (often) and make mistakes which lead to erroneous rejections. By sampling a small number of error-prone reviews we guarantee a noisy selection mechanism. I'd like to see a combination of in-depth review and rapid-review. For instance, perhaps at first submission a larger number of speed reviewers could give ratings (including on replication value) and then if the average is high enough the paper is selected for in-depth review by one or two of those reviewers who - crucially - gave ratings that were closest to the mean.
Great read! Thanks.
My two cents:
1. Top journals should make their names not through high IFs, but rather through openness. I propose they transform into databases where each article is an entry that has the following contents: the paper itself, possibly reviewer comments, the raw data, and replication study reports.
Furthermore, they should sponsor replication studies: replications are actually perfect material for masters' theses or even for graduate students in their first year learning experimentation skills. Why not award prizes or small funds to such people?
2. I have reservations about the speed reviewing as well. On the other hand, there should be some changes in the review system. First of all, not all reviewers are selected randomly. Nice articles (fashionable topics; written by high-profile researchers; ...) seem to "attract" different reviewers than not-so-nice articles. That is a bias on the editor's side.
2b. Review/evaluation of reviewers, as suggested by others, is a great idea. And such evaluations should become an integral part of track records, next to the output you deliver as an author.
Hi Tim, thanks for commenting.
1 - couldn't agree more. Excellent idea to tie in with Masters research. As you say, we really need a dedicated research fund to sponsor direct replications. I'll keep this in mind, I'm sure it is something we can push!
2 - I expected speed-reviewing to be the most controversial suggestion. As scientists we've come to both love and hate in-depth review. We love it when a reviewer helps us improve a manuscript; and we hate it when they make fatal errors and kill our paper unfairly. What I'm suggesting is a kind of middle ground where we trade off some of this in-depth reviewing for crowd wisdom.
Re 2b - I really like the idea of reviewers being reviewed, but it needs to be done carefully and independently. Reviewers already do this as a favour and it would be easy for a mechanism of meta-assessment to deter them from doing anything at all. At the same time, such assessment is important!
one-man internal affairs bureau
Even reading this line makes me shudder. Has his paper outlining his method come out yet? Even if he is 3/3 in catching real fraud it's not a tenable situation that he be the sole arbiter of which studies are investigated. I will admire his work when he is open about it.
As far as I know Simonsohn's method hasn't been published yet. Ed Yong's article (linked above) mentions that the paper on the technique will soon be submitted, so I imagine it will be at least a couple of months before we see it in accepted form - and that's assuming it is accepted quickly.
I think your reservations are completely sensible. In Simonsohn's defence, the 'one-man IRB' quip is me (being facetious) rather than any kind of position he's given himself.
Will be interesting to see how it all pans out.
Great post Chris!
I agree with most of the points you raised except for the speed review thing. I can't see a point & click system working. How about limiting the length of reviews to one page with clear sections to fill in?
I can't see the MCQ version working if reviewers don't justify their choice (even briefly).
I really enjoyed reading it!
It is great to see the increased awareness about problems in psychological research and suggestions for improvements.
The problems are not new (Sterling, 1959) and one solution is also quite old, namely to increase statistical power (Cohen, 1962).
Whereas a priori power analysis can reduce the need for data fudging, post-hoc power analysis can be used to detect data fudging.
http://www.utm.utoronto.ca/~w3psyuli/PReprints/IC.pdf
How would these items, particularly 1 and 4, work for qualitative studies? Making these datasets anonymous can be difficult, and sometimes impossible, for example. I am worried these kinds of stringent measures would work towards further blocking this important side of psychological research, though it does merit consideration. Perhaps we should work towards fostering an atmosphere of responsibility rather than trying to police it. I'm not sure of the stats, but what is the ratio of known research frauds committed today to research articles published, compared to years past?