My first encounter with Moïra Mikolajczak was a couple of years ago. As I was writing a review on oxytocin and trust, I stumbled upon her work in Psychological Science. I was intrigued by her finding, that giving intranasal oxytocin to human subjects made them “trust” a computer more than it made them trust a human. This result contradicted a previous high profile publication claiming that oxytocin’s effects were specific to interactions between humans. I asked Moïra to share her data.
Apart from the data, Moïra also shared her recent adventures as a behavioral oxytocin researcher in humans. After what had probably been beginner’s luck, her lab could not replicate some of their initial findings. They had also failed to detect several of the effects expected from prominent behavioral oxytocin theories. She had just got back from a conference where she desperately tried to discuss her findings with several high-profile figures in the field. None were willing to share information about the number of unpublished null findings in their labs.
The conversation with Moïra made me angry. Apart from socializing and drinking, the point of going to conferences is having researchers talk to each other about their work. True, in some cases scientists might have fears of getting scooped by others. But this is not the case for null results. Sharing your failures with other researchers is especially important: it can save them the time, money and frustration of chasing non-existing effects.
The chat with Moïra also made me hopeful. I was fortunate to meet an oxytocin researcher who was committed to finding the truth and understood the importance of transparent, collaborative scientific conduct. It sparked a collaboration with Moïra and two of her colleagues at the University of Louvain, Anthony Lane and Olivier Luminet. Our project is published in this week’s special issue on oxytocin in the Journal of Neuroendocrinology.
Moïra’s lab conducted quite a few oxytocin studies over the past decade. The experiments were methodologically similar to most other works in the field with respect to sample size and oxytocin administration procedures. But only one type of study was publishable. Positive results were greeted with enthusiasm by editors and reviewers and found their homes in top journals. Negative (or “null”) results got rejected time after time and were sent to the file drawer.
In one case, a high impact journal published a suspiciously large behavioral effect of oxytocin found in a single blind study (i.e., the experimenter interacting with the subjects knew what the hypothesis was, and which subjects received oxytocin / placebo) conducted in Moïra’s lab. Her group failed to replicate their own findings twice, with larger samples and using a double blind protocol. The same exact journal rejected their (failed) replication paper.
Moïra saw a publication bias rapidly emerging in front of her own eyes and it became obvious that her publication profile no longer reflected what was really going on in the lab. She made a decision to clear her lab’s file drawer, and publish all of her lab’s studies in a single article. The numbers below summarize her lab’s behavioral oxytocin research.
Only a single task out of 25 produced a main effect of intranasal oxytocin: the task that failed to replicate twice (mentioned earlier). Five out of 25 experiments (20%) found significant interaction effects. It’s important to keep in mind that the probability of a type I error is quite high when exploring interactions, unless the p-values are corrected for multiple hypothesis testing. None of the interaction effects survived such correction.
Our newly published paper summarizes the results obtained in Moïra’s lab in three meta-analyses that test the overall effects of intranasal oxytocin on (1) affective, (2) behavioral and (3) cognitive variables in isolation. None of these meta-analyses found a reliable effect. Finally, we meta-analyzed the studies based on their relation to three prominent theories: (1) oxytocin is an “affiliation” hormone; (2) oxytocin enhances saliency of social stimuli; (3) oxytocin facilitates approach behaviors. We could not reject the null hypothesis in any of these cases.
To date, it is unclear how many unpublished oxytocin studies are lying in other labs’ file drawers. Keeping in mind that the average statistical power in the behavioral oxytocin literature in humans is extremely low (12%-16%), that seminal works have failed to replicate, and given the uncertainty surrounding the physiological effects of intranasal oxytocin administration (and whether the substance gets to brain regions involved in cognition and behavior), I find it hard to believe that Moïra’s lab is an outlier. It is possible that hundreds (if not thousands) of unpublished intranasal oxytocin studies are still lying in many labs’ file drawers.
Moïra’s lab has set exemplary standards of scientific transparency. I hope that other researchers will adhere to these standards and release their unpublished oxytocin data as soon as possible – and turn oxytocin research into a transparent scientific endeavor. This is crucial for estimating the true effects of oxytocin administration on various behavioral outcomes. Oxytocin has dramatic influences on social behaviors of animals, and despite all of the methodological difficulties some of the effects may reflect real phenomena. Opening the oxytocin file drawer will also save time, money and frustration for many oxytocin researchers.
Some dependent variables were estimated using more than a single behavioral paradigm, and some studies included more than a single task – a practice that is common in many labs, whose goal is to maximize the knowledge gained from each subject who goes through a pharmacological treatment.
A few months ago, a Science paper reported the results of an attempt to replicate 100 studies published in three top psychology journals between 2011-2014. The authors concluded:
“A large portion of replications produced weaker evidence for the original findings despite using materials provided by the original authors, review in advance for methodological fidelity, and high statistical power to detect the original effect sizes.”
“Indeed, the data are consistent with the opposite conclusion, namely, that the reproducibility of psychological science is quite high.”
The authors of the psych replication paper have also published a response.
Together with these two letters, a research article co-authored by yours truly reports a multi-lab study attempting to replicate 18 experiments published in two leading economics journals, American Economic Review (AER) and the Quarterly Journal of Economics (QJE) between 2011-2014.
Below are the results of the two projects juxtaposed. There are many possible metrics for measuring the replicability of a study. Here, I counted a “successful replication” as either a statistically significant (p<0.05) replication effect (green) or a statistically significant meta-analytic effect when combining the original study and the replication (blue).
As can be clearly seen from the results above, replication rates in the econ project were much higher than those of the psych project, although the number of studies was greater in the psych project.
In this post, I will try to articulate what I see as the main point of Gilbert et al.’s commentary and explore what the econ project can teach us about the state of the psych literature.
Causes of failure to replicate
A replication attempt can fail for two main reasons:
- (A) The original effect is not real. In other words, the original study is a false positive, or a “type I error”.
- (B) The original effect is real, but the replication could not detect it. In other words, the replication study was a false negative, or a “type II error”.
We should update our beliefs regarding the true state of the world (e.g., “there is/ isn’t a replication crisis”) differently in these two cases.
The chance of successfully replicating in case (A) depends only on the original study; it will be low when replicating a false positive, regardless of the power of the replication study. This however is not the case for (B), where success depends on the replication’s statistical power: the chance of successfully rejecting the null hypothesis given that the effect is real.
For a given effect size, increasing statistical power (and reducing the chance of case B’s) can be achieved by increasing the number of participants in the replication. As running more subjects costs money and takes time, researchers typically set the replication’s sample size to the minimum required for achieving sufficient statistical power (a standard power is between 80% and 90%). In practice, this is done using a mathematical formula, which is implemented by power calculators estimating the required sample size given the expected effect size.
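To make the power-calculator step concrete, here is a minimal sketch of the kind of formula such calculators implement. It assumes a simple two-group comparison of means with a standardized effect size `d` (Cohen's d) and uses the normal approximation; the function name and the example value of `d` are illustrative, not taken from either replication project, and exact t-test calculators will return slightly larger numbers.

```python
import math
from statistics import NormalDist


def required_n_per_group(d, power=0.9, alpha=0.05):
    """Approximate per-group sample size for a two-sided, two-sample
    comparison of means, given a standardized effect size d.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) / d) ** 2
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = z.inv_cdf(power)          # quantile matching the target power
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


if __name__ == "__main__":
    # Hypothetical "original" effect size, used only for illustration.
    d_original = 0.5
    for target in (0.8, 0.9):
        n = required_n_per_group(d_original, power=target)
        print(f"d = {d_original}, power = {target}: about {n} subjects per group")
```

Note how strongly the answer depends on `d`: the required sample grows with the inverse square of the effect size, so halving the expected effect roughly quadruples the number of subjects.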
But… what is the expected effect size in the replication?
It is difficult to determine exactly what effect size we should expect in the replication. If we knew, we wouldn’t have to run the study, right? As a proxy, researchers typically use the effect size of the original study. This is reasonable, as the replication attempts to be as close as possible to the original study.
Gilbert et al.’s critique
All replications might differ from the original studies in many ways, from geographical locations and weather conditions to the point in history when they were conducted. A fresh example from the econ replication project is a study from 2011, reporting that inducing happiness decreases temporal discounting. The original study induced happiness by having participants watch a video of a performance by Robin Williams, who has since died by suicide. The video manipulation failed to induce happiness in the replication study and the results did not replicate.
Since different subject pools are used across a study and its replication, there is plenty of reason to suspect that the pools might differ in important ways. Sometimes the behavioral task is also not perfectly replicated. As such, even if an effect were real, the effect size in a replication may be different than that in the original study – and relying on the original effect size for calculating statistical power would yield biased power calculations. This could potentially lead to more type II errors in the replication (case B).
Gilbert et al. give several specific examples (addressed in the reply letter) of substantial differences between the original studies and the replication attempts of the psych project, which they call “infidelities”. They claim that these “infidelities”, together with differences between the subject pools, cause a bias that often works in one direction: against the chance of finding an effect.
Following the critique, I ran a few power calculations. The average effect size of replication studies in the psych project was about half of the original, and in the econ project it was two thirds of it. Whatever the source of this shrinkage of effects in the replications may be, it suggests that relying on the effect size of the original study might lead to underestimating the necessary sample size in the replication. This could lead to a greater chance of “case B” replication failures.
I calculated the chance of failing to detect a true effect when miscalibrating the power calculation with an overly optimistic effect size estimate (table below). The outcome is statistical power of only 58% if the (true) replication effect size is only two thirds of the original study and 38% if it is half of it. These numbers closely match the replication rates of the econ and psych projects, respectively.
|Original effect size|Required sample for .9 power|Power achieved for a 2/3 effect size|Power achieved for a 1/2 effect size|
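The calculation behind these columns can be sketched as follows. This is not the exact procedure used for the table: it is a normal-approximation sketch in which the replication sample is sized for 90% power at the original effect size, and the true effect is then assumed to be only a fraction of it. The function name is hypothetical. Under this approximation the result does not depend on the original effect size itself, only on the shrinkage fraction.

```python
from statistics import NormalDist


def power_if_effect_shrinks(fraction, planned_power=0.9, alpha=0.05):
    """Power actually achieved when the sample was sized for `planned_power`
    at the original effect size, but the true effect is `fraction` of it
    (two-sided test, normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    # Expected test statistic the design was planned around.
    planned_shift = z_alpha + z.inv_cdf(planned_power)
    # A smaller true effect scales the expected test statistic proportionally.
    shift = fraction * planned_shift
    # Probability of landing in either rejection region.
    return z.cdf(shift - z_alpha) + z.cdf(-shift - z_alpha)


if __name__ == "__main__":
    for label, fraction in (("full", 1.0), ("two thirds", 2 / 3), ("one half", 1 / 2)):
        print(f"true effect = {label} of original: power = {power_if_effect_shrinks(fraction):.2f}")
```

The approximation lands close to the 58% and 38% quoted above; an exact t-test calculation will differ slightly, especially for small samples.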
Moving forward: is there a replication crisis?
If we ignore several problems in the literature that we already know of (such as p-hacking and low statistical power) and accept the proposition that changes in the protocol more often introduce noise (rather than reduce it), Gilbert et al.’s interpretation holds water and the number of “successfully replicated” studies might be under-estimated. Does this mean that the “reproducibility of psychological science is quite high”, as they claim?
Much of the discussion of replication results has focused on various different metrics of replication that are concerned with estimating the percentage of studies that “successfully” replicate according to various criteria. In the previous section, I have (hopefully) convinced you that if we can live with an average replication effect size that equals a half of the original study, replication rates of the psychology project were actually going as expected if all of their effects were real.
The psych project replicated studies published in three of the most prestigious and influential psych journals. Most of its authors are psychology PhDs, some are prominent professors with a long publication list in top journals. They have invested a lot of time and unprecedented amounts of money to make the replications as close as they could to the original studies. The project was completely transparent and all of the data was made publicly available. It seems obvious that its sincere goal was estimating the true state of the literature, rather than finding a “crisis”.
So, if we accept Gilbert et al.’s criticism, there are only two possible conclusions.
- The original studies were too difficult to replicate directly without too many “infidelities”. As replication is a crucial part of every scientific process, this is a major problem.
- Very small deviations from the original protocol have shrunk the (true) effects of studies published in three top psych journals by a half on average.
Now comes the key question: is this the type of robustness that social scientists can live with, or can we do much, much better?
If there’s not much room for improvement, then the reproducibility of psychological science is quite high. But keep in mind that psychological experiments are conducted in many different labs across the world and at many different points in history. The (ambitious) goal of psychological science is to find robust phenomena that inform general theories that tell us something about human behavior in the noisy world outside the lab. In order to make scientific progress, researchers must rely on the previous findings of their peers. If we can live with either of the two above conclusions, doesn’t it mean that the entire endeavor of psychological science is a waste of time and resources?
This makes the econ replication project important.
As in the psych project, replications were conducted at different sites, with different subject pools, different experimenters and at different points in history compared to the original studies. The a-priori power calculations that were used were the same as in the psych project. This means that the econ replications also relied on the original over-estimated effect sizes when calculating the required sample size. But the econ results were different. And they suggest that something else might be going on, beyond a statistical artifact caused by inevitable “infidelities” in the replication process.
There is a misconception that the word “crisis” in Chinese is composed of two characters, one representing danger and the other opportunity. I will adopt this misconception here. Contrasting the results of the psych and econ replication projects shows that we can do much better. The replication crisis does exist, and it gives us hope for a brighter future.
In the next few posts, I will illuminate the differences in research practices between economics and psychology, and discuss whether they might have contributed to the differences in replication rates.
Disclosure: the author is neither an economist nor a psychologist
 The reason could be “infidelities”, but also case (A) errors (original studies are false positives) and publication bias.
 Some of the original effects are likely false positives (case A), and therefore the power calculations are somewhat pessimistic.
OK, we get it. There is (still) no solid evidence that intranasally administered oxytocin gets into the brain. Many studies used flawed techniques to measure oxytocin in the blood. The seminal works that inspired human oxytocin research do not replicate well. Still, there are so many reports of relationships between oxytocin and social behaviors. It is impossible that not a single one of these effects is real. Isn’t it?
It is possible that some effects are real. But in a world where only 36% of psychological experiments replicate, and prominent newspapers are easily fooled into making catchy headlines out of flawed studies published in obscure journals, the possibility that none of them is real is one we should seriously consider. And here’s why.
In one of my favorite psychological studies, Shaul Shalvi and colleagues investigated the drivers of dishonest behavior. Subjects had a single opportunity to privately roll a 6-sided die and then report the outcome to the experimenter. Subjects had been instructed that they would get paid, in cash, based on their reports: for example, if one reported “5”, she or he would be paid $5. This gave subjects financial incentives to lie and report high numbers, regardless of the true outcomes.
Some subjects indeed lied, but the distribution of reported outcomes showed that only a small fraction of them did. Despite the financial incentive, and even without a chance of getting caught, subjects reported high outcomes only slightly more often than low outcomes.
But it only took a slight change in the instructions to turn many subjects into liars. In a follow-up study, subjects were instructed to roll the die three times – and only report the first outcome. The frequency of “high outcomes” was strikingly greater this time. A simple analysis revealed that subjects reported the highest of the three throws, instead of reporting the first outcome. Giving people a self-justification opportunity (they did see a high outcome after all, right?) made it psychologically easier to break the rules.
When scientists run a study that examines a false hypothesis and use a statistical test to evaluate its outcomes, what they do is statistically equivalent to rolling a 20-sided die (D&D, anyone?). The norm is that scientists must predict (before rolling the die) what the outcome would be (for example, “20”) and report a positive finding only if this was indeed what they had found. This practice is supposed to assure that only 5% of the false results will be reported.
Glancing through the scientific literature in most fields (oxytocin research is no different), one can easily tell that most of the reports are of positive discoveries. This reflects a well-known publication bias: scientists selectively report only positive findings – which are more likely to get published in top journals – and store their negative findings in their file drawers. This means that the number of positive reports does not depend only on the true state of the world, but also on the overall number of studies that are conducted. If researchers roll the 20-sided die many times, selectively report the “20”s, but store all of the “1”s, “2”s and the rest of the outcomes deep in their file drawers, we are assured to get many positive reports of false findings.
How bad is this problem? It depends on how many studies are conducted. If enough studies are carried out, every hypothesis will eventually be supported by some reports of experimental “evidence”. Ironically, the more trendy a theory becomes, the more likely researchers are to test it, and the more likely the field as a whole is to find false effects by mere chance. The chart below shows that the probability of finding at least one “statistically significant” effect (the red line) grows fast as a function of the number of studies conducted (the exact probability is 1 − (1 − α)^N, where α = .05 and N is the number of studies). The chances of finding at least two, three and four results follow similar patterns.
This problem is enhanced in oxytocin research. The experimental procedure typically includes either blood draws or pharmacological treatments, and in such cases, there’s an ethical justification to maximize the knowledge gained from each subject. Therefore, it is the norm to run several tasks, some of which measure more than a single behavioral outcome. More data is better, but that depends on how it is being used. In practice, testing multiple hypotheses using statistical tests that were designed for a single study greatly increases the probability of finding a false association. But wait, there’s more.
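For readers who want to reproduce curves like these, here is a minimal sketch. It assumes that every study tests a false hypothesis, that the studies are independent, and that each has a false-positive probability of α = .05, so the number of “significant” results follows a binomial distribution. The function name is illustrative.

```python
from math import comb


def prob_at_least(k, n_studies, alpha=0.05):
    """Probability of at least k false positives among n_studies independent
    tests of false hypotheses, each run at significance level alpha."""
    # Sum the binomial probabilities of seeing fewer than k false positives,
    # then take the complement.
    p_fewer = sum(
        comb(n_studies, i) * alpha**i * (1 - alpha) ** (n_studies - i)
        for i in range(k)
    )
    return 1 - p_fewer


if __name__ == "__main__":
    for n in (1, 10, 20, 50, 100):
        row = ", ".join(f">= {k}: {prob_at_least(k, n):.2f}" for k in (1, 2, 3, 4))
        print(f"N = {n:3d} studies -> {row}")
```

For k = 1 this reduces to the closed form 1 − (1 − α)^N given in the text.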
Hormonal systems are complex to study, and few simple cause and effect relationships exist in nature. Therefore, different subjects might respond differently to pharmacological treatments. Intranasal oxytocin administration might make males more trusting– but have an opposite effect on females. The spray might not work when interacting with someone that seems unreliable. Or perhaps one must be betrayed first before the nose-spray exhibits its trust-enhancing effects.
Back in 2011, a well-cited review (over 500 citations in 4 years!) pointed out that only half of oxytocin administration studies had reported main effects of the treatment. The rest only found effects in sub-populations of the subject pool, or under specific environmental conditions. The authors suggested that the field could benefit from exploration of these factors. Behind this well-intentioned proposition hides a potential disaster.
Mining data for associations without an explicit hypothesis, in search of true predictive power, is not straightforward. An entire discipline in computer science is dedicated to dealing with such problems. But in practice, oxytocin researchers responded to the review’s proposition by collecting many variables that could potentially moderate the effects of oxytocin (sex, age, health status, relationship status, personality, genetics, environmental conditions and more), post-hoc selected the moderators whose interactions with oxytocin treatment were “statistically significant” and reported them as positive results – using statistical tests that were designed to keep the false discovery rate of a single hypothesis test low.
In the absence of an explicit guiding theory, every variable was a potential suspect. But here’s the catch: each test of whether oxytocin works only under specific conditions is statistically equivalent to rolling the 20-sided die again. With the collection of more and more variables, it becomes hard not to find a false effect by mere chance.
Suppose one only collects the Big Five personality traits (which are considered statistically independent of each other) and gender, to test whether these factors moderate the effects of oxytocin on one target behavior. By doing so, he or she actually rolls the die 18(!) times: when testing for a main effect of oxytocin (one), testing for effects in each gender (two more), testing for personality-dependent effects (5 more) and finally testing for every possible gender-personality interaction (10 tests). As the chart above has shown, if one uses the standard procedure of statistical testing, which was designed to test a single hypothesis, the chance of a false discovery has already reached 60%. When running a battery of three tasks for every subject who underwent oxytocin treatment, we end up rolling the die 54(!) times and finding false effects becomes almost guaranteed – over 90%. The chance of finding at least three different effects is over 50%. Given these numbers, it is not surprising that interaction effect studies in social psychology are less likely to replicate (22% success) even compared to the (already low) replicability rates in the rest of the field.
With promising initial findings, extreme media hype and nearly guaranteed chances of finding significant results that end up in top academic journals, researchers had strong incentives to jump on the intranasal oxytocin bandwagon and speed up the accumulation of the literature. Ironically, this method is also cheaper than studying other hormones: because one cannot verify whether the substance really gets into the subjects’ brains, there was no justification to invest in either collecting saliva or blood samples or performing any type of expensive hormonal assay for the sake of a manipulation check.
Researchers also had several reasons that justified continuing to do what they were doing. There is an ethical rationale to make the best effort to publish results that were obtained from human subjects who underwent a pharmacological treatment; we still don’t know much about the time course and the exact administration procedures required to make the spray “work” well, so it was easy to justify why only some of the studies worked as predicted; running many tests and only reporting the positive results might have been the norm in other fields. Last but not least, most of the results made sense in hindsight, even if there was a great chance that they were just statistical artifacts.
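The counting in this example can be checked with a short sketch. It simply enumerates the tests described above and applies the 1 − (1 − α)^N formula, assuming the tests are independent and that there is no true effect anywhere. The variable and function names are made up for illustration.

```python
from itertools import product

# Moderators from the example: two gender subgroups and the Big Five traits.
genders = ["female", "male"]
traits = [
    "openness",
    "conscientiousness",
    "extraversion",
    "agreeableness",
    "neuroticism",
]

tests = ["main effect of oxytocin"]                              # 1 test
tests += [f"effect within {g} subjects" for g in genders]       # 2 tests
tests += [f"moderation by {t}" for t in traits]                 # 5 tests
tests += [f"{g} x {t}" for g, t in product(genders, traits)]    # 10 tests


def chance_of_false_discovery(n_tests, alpha=0.05):
    """Chance of at least one false positive among n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests


if __name__ == "__main__":
    one_task = len(tests)
    three_tasks = 3 * one_task
    print(f"one task:    {one_task} tests, "
          f"chance of a false discovery = {chance_of_false_discovery(one_task):.0%}")
    print(f"three tasks: {three_tasks} tests, "
          f"chance of a false discovery = {chance_of_false_discovery(three_tasks):.0%}")
```

In real data the tests are correlated rather than independent, so these percentages are best read as rough guides rather than exact rates.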
The future will tell whether my conspiracy theory reflects what was really going on in oxytocin research. Over the past few months I have come to believe that the likelihood has increased. A recent meta-analysis co-authored by Larry Young, the scientist who made some of the most important discoveries relating oxytocin with social behavior in animals – estimated that the average statistical power (the chances of finding true effects if these truly exist) in oxytocin research is between 12% and 16%. This means that even if all of the effects were real, the studies investigating them did not use enough subjects to meaningfully study them.
This leaves us in a difficult place: in order to reliably extract true signals from the literature, we should directly replicate all studies using much larger samples – on the order of hundreds, sometimes thousands of subjects. Given that we are unsure whether intranasal oxytocin gets into the brain in the first place and have no means to conduct a manipulation check, it is not obvious that pursuing this direction is worth scientists’ time, effort and money.
As I often find myself in conversations about oxytocin – at conferences, at cocktail parties or on flights – I’ve collected several frequently asked questions in this post. I will try to update it regularly. Please feel free to post new questions (and alternative answers) of your own.
Q: Is there an association between Oxytocin and trust in humans?
A: According to the scientific literature (as of late 2015), there is no strong evidence in favor of this hypothesis. It doesn’t seem like spraying oxytocin into one’s nose makes him or her more trusting, and there is no robust link between trust and either blood levels of oxytocin or oxytocin-related genetic variables. The absence of evidence, however, is not necessarily evidence of absence – it is possible that there are real effects that are too small to be detected by the current research methods, or that the effects depend on various unknown factors, such as gender, environment or other demographic variables.
Q: What about other findings linking oxytocin with things like romantic love, parenting or cooperation?
A: The oxytocin-trust studies are the seminal works that inspired most of oxytocin behavioral research and are among the few studies that underwent direct replication attempts by independent research groups. The failure to replicate these results, in my view, reduces the likelihood that other behavioral oxytocin findings (most of which never underwent replication attempts) reflect the true state of the world. It is possible that real discoveries are out there – but in the absence of independent replication attempts it is hard to say which ones. Also note that a recent meta-analysis concluded that most oxytocin studies are underpowered, and therefore the estimated false discovery rate is very high – around 80 (!!) percent. Moreover, promising well-cited results relating oxytocin to “mind reading” also do not seem as robust as one might hope.
Q: I would like to study the effects of oxytocin on behavior and got discouraged. What should I do?
A: There is an urgent need for direct replications of previously published findings. My personal opinion is that this is low-hanging fruit, given the interest of many fields (and the popular media) in the potential effects of oxytocin on social behavior (which are supported by animal research). Here is what you should do:
- Pick an oxytocin paper with an interesting result, preferably highly cited and from a top journal (Science, Nature, PNAS, Psychological Science).
- Conduct power analysis (you can use this calculator) based on the effect size of the original study. A standard power is 0.8, but the larger you go the better (even true effects are often over-estimated in the literature).
- Contact the authors of the original study and ask for their collaboration. Ask them to provide all experimental materials, in order to make sure that your replication is as similar as possible to the original study. Don’t give up if they are uncooperative, and remember to mention their level of cooperation in your final paper, for good or bad.
- Submit a registered replication report or pre-register your study.
- Conduct your experiment.
- Write up a short paper and publish it (regardless of your result) as a short commentary / research report. You can also upload your data to databases such as psychology file drawer. Beware that not getting a p-value smaller than 0.05 does not necessarily mean a failure to replicate. I recommend using the “small telescope” approach in order to test such a conclusion.
- Go back to step 1.
Q: Does oxytocin reach the brain following intranasal administration?
A: We are not sure. Even though intranasal oxytocin has been around for quite some time, there is no consensus among researchers. My personal belief is that some molecules of oxytocin do cross the blood-brain barrier after intranasal administration, but (and that’s a big but) we don’t know for sure if, how and when they reach target brain areas that plausibly control behavior and cognition. The current oxytocin literature relies on a single study from 2002 (which has not been replicated yet) with intranasal arginine vasopressin administration (a similar, but not identical, molecule). An equivalent oxytocin study that was recently conducted (using a tiny sample size) did not find elevated levels of oxytocin in the cerebrospinal fluid 45 minutes after administration (which is the time between administration and the behavioral task in most oxytocin studies).
Q: How have we come to so strongly believe that oxytocin = trust?
A: The seminal oxytocin-trust papers used novel behavioral paradigms and were published in high-profile journals. The papers are beautifully narrated and tell a surprising, yet simple and coherent story, which is based on animal research and gained them a lot of popular media attention. Media coverage included very strong claims by scientists about the link between oxytocin and trust, without mentioning that the results were preliminary or any other caveats. I’ll discuss this topic further in future posts.
Q: How should one measure peripheral Oxytocin?
A: For blood plasma, you can use RIA or ELISA – but make sure you conduct extraction as recommended in the kit’s manual. Remember to report in detail which method you have used: although their outputs correlate, they are not identical. Some scholars have used oxytocin measurements from urine or saliva, but several researchers have raised concerns regarding the bio-analytical validity of these measures (example 1, example 2).
IMPORTANT: correlating biological measures with behaviors that are allegedly influenced by oxytocin, genetic variables or other invalid oxytocin measures (such as unextracted plasma oxytocin) DOES NOT count as a bio-analytical validation.
Q: I have an a-priori hypothesis about a specific environmental factor that might interact with oxytocin to influence target behavior. What should I do?
A: Pre-register your hypothesis here and run your study. Make sure you are well powered statistically, and report your result whether you found an effect or not.
Q: How can I explore the interaction between oxytocin, environment, personality and behavior without inflating the rate of false discoveries?
A: Be careful. Using conventional hypothesis tests that are designed to test a single hypothesis, and then reporting only the “positive” results at the standard p-value of 0.05, will dramatically inflate the false discovery rate. Here is one possible approach for exploration without inflating the false-positive rates:
- Write down all of the environmental and personality factors that you would like to explore. Let’s call the number of environmental factors E and the number of personality factors P.
- Calculate N = (E+1) × (P+1): this is the number of hypotheses you are testing.
- Adjust your significance threshold using a correction for multiple hypothesis testing, such as the Bonferroni correction. Denote the new threshold by q.
- Conduct power analysis (you can use this calculator) based on the effect size you expect to find, and q (two-sided). A sample size that allows a power of 0.8 is standard, but the larger the better (effect sizes are often over-estimated) – go as large as you can.
- Run your study. Report your finding as “positive” only if your p-value is less than q. Publish your data with the necessary caveats otherwise.
- If you believe that some patterns in the data are worth further exploration – pre-register them as new a-priori hypotheses, and try to replicate these patterns in an independent sample of participants.
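The steps above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the correction is plain Bonferroni, and the sample size treats every test as a two-sided comparison of two equal groups under the normal approximation, which is optimistic for interaction effects. The function names and the example numbers are hypothetical.

```python
import math
from statistics import NormalDist


def exploration_plan(n_env, n_pers, expected_d, alpha=0.05, power=0.8):
    """Return (number of hypotheses N, corrected threshold q, subjects per group)."""
    # N = (E + 1) x (P + 1): main effect, each factor alone, every E x P pair.
    n_tests = (n_env + 1) * (n_pers + 1)
    # Bonferroni: split the overall alpha evenly across all N tests.
    q = alpha / n_tests
    z = NormalDist()
    z_q = z.inv_cdf(1 - q / 2)  # two-sided critical value at threshold q
    z_power = z.inv_cdf(power)
    n_per_group = math.ceil(2 * ((z_q + z_power) / expected_d) ** 2)
    return n_tests, q, n_per_group


def is_positive(p_value, q):
    """Report a finding as positive only if it clears the corrected threshold."""
    return p_value < q


if __name__ == "__main__":
    # Hypothetical design: 1 environmental factor, 5 personality factors,
    # and an expected standardized effect size of 0.5.
    n_tests, q, n_per_group = exploration_plan(n_env=1, n_pers=5, expected_d=0.5)
    print(f"hypotheses tested: {n_tests}")
    print(f"corrected threshold q: {q:.4f}")
    print(f"subjects per group for 0.8 power at q: {n_per_group}")
    print(f"p = 0.03 counts as positive? {is_positive(0.03, q)}")
```

A p-value of 0.03 would pass the conventional 0.05 cutoff but not the corrected threshold, which is exactly the kind of result the procedure is meant to filter out.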