Lorne Campbell wrote an interesting blog post in which he compared the impact factors of selected journals with their post-hoc power scores, as calculated by friend and colleague Uli Schimmack’s automated method of text analysis and statistical extraction. The analysis revealed no association, which led Lorne to conclude that “top journals do not seem to publish studies with relatively more [harvested] post-hoc power and thus results more likely to replicate compared to lower tier journals (at least according to the R-index).” In other words, so-called top journals do not publish results that are any more (or less) replicable, at least according to the harvested post-hoc power method.

I think the key to the above is “according to the harvested post-hoc power method”. Does this method produce a reliable and valid measure of replicability? My impression is that right now we simply don’t know. It seems like it might, and the logic seems solid, but I haven’t seen any evidence confirming (or disconfirming) this assertion. So should we accept this method as the arbiter of what can and cannot replicate; as a diviner of which journals produce the most or least replicable results; or as the judge of which psychology departments produce the most or least replicable papers? In my view, it is simply too early: without positive proof, such use of harvested post-hoc power is premature.

Whence comes my skepticism? It comes from two sources. First, no one has actually shown that this method predicts replicability; that is, no one has shown that harvested post-hoc power has predictive validity. Yes, I buy the logic of the method, and it makes sense in theory, but I have yet to see it predict anything in practice. The closest thing to predictive validity we have seen is when Uli made individual predictions for each of the 100 studies from the OSF reproducibility project. But the problem here is that instead of examining the correlation between his predictions and the actual replication status of each study, he calculated an aggregate replication rate of 50%. The fact that this aggregate was not far off from the actual replication rate of 36% does not actually tell us much. That is simply one data point (even if it consists of aggregated data), and we’d need 199 more data points if we follow Simine Vazire’s new rule of thumb for adequate power.
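The distinction here can be made concrete with a toy example (all numbers invented for illustration): two sets of predictions can produce the exact same aggregate rate while differing completely in how well they track individual studies. Matching an aggregate is one data point; predictive validity is about the per-study correlation.

```python
import numpy as np

# Hypothetical replication outcomes for ten studies (1 = replicated).
outcomes = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])  # 30% replicated overall

# One prediction set tracks outcomes study by study...
informative = np.array([0.9, 0.8, 0.7, 0.2, 0.1, 0.2, 0.1, 0.3, 0.2, 0.1])
# ...the other just predicts the same constant for every study.
uninformative = np.full(10, informative.mean())

# Both match the aggregate equally well:
agg_a = informative.mean()
agg_b = uninformative.mean()

# But only the first has any per-study predictive validity;
# the constant set has zero variance and no correlation with outcomes.
r_informative = np.corrcoef(informative, outcomes)[0, 1]
```

Here the constant predictor reproduces the aggregate rate perfectly yet predicts nothing about which individual studies replicate, which is why an aggregate comparison alone is weak evidence of validity.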

As an aside, when I went into the data from the OSF reproducibility project and calculated post-hoc power for each of the 100 studies (by transforming p-values) and then correlated these with replication status, I found a rather unimpressive association of r = 0.23, p = 0.02. So, post-hoc power—which is one step closer to true power than harvested post-hoc power—predicted a mere 5% of the variance in rates of replication. Although post-hoc power clearly predicts what we want, it doesn’t do it particularly well. Admittedly, one problem with the above approach is that post-hoc power is only useful in the aggregate—its relation to true power is very noisy, and thus much aggregation is needed for the signal to drown out the noise. This difficulty, however, should not preclude a thoughtful attempt to validate the method, something that has been lacking.

My second source of skepticism is the number of levels separating what we want from what is being measured. What we really want to measure is the rate of replication. This is impossible to measure without actually going out and replicating various studies multiple times, so we need some proxy. Statistical power—which can be defined as the ability to detect an effect if the effect actually exists—seems like a good proxy here. If an effect is real (and thus replicable), the more power you have, the more likely you are to detect its presence. Even though the relationship between power and replicability is not isomorphic, I am willing to accept that it is a good proxy. However, we are not being asked to accept power or true power as a proxy for replicability; we are being asked to go one step further by accepting post-hoc power (which is not perfectly related to true power, hence the need for aggregation) as a proxy for replicability. But then, in another step, we are being asked to go further again by accepting the opaque method of automatically harvesting post-hoc power as a proxy for replicability. I am simplifying, but this method involves automated text analysis that pulls in all statistics, related or not to the focal hypothesis, which are then aggregated and converted to estimate post-hoc power.
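The power-replicability link described above is easy to demonstrate by simulation. The sketch below (all parameters are illustrative choices, not values from any real study) shows that when an effect is real, a higher-powered design detects it far more often—which is exactly why power is a plausible proxy for replicability.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def detection_rate(effect_size, n, sims=2000, alpha=0.05):
    """Fraction of simulated one-sample t tests that reach p < alpha.

    For a real effect, this fraction is (an estimate of) statistical power:
    the probability of detecting the effect in any one study or replication.
    """
    hits = 0
    for _ in range(sims):
        sample = rng.normal(effect_size, 1.0, size=n)
        _, p = stats.ttest_1samp(sample, 0.0)
        if p < alpha:
            hits += 1
    return hits / sims

# Same true effect (d = 0.3), very different power:
low_power = detection_rate(0.3, n=20)    # small sample, low power
high_power = detection_rate(0.3, n=200)  # large sample, high power
```

The same real effect replicates reliably in the high-powered design and unreliably in the low-powered one, which is the sense in which true power tracks replicability.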

Now, what is the relationship between harvested post-hoc power and actual post-hoc power? More critically, what is the relationship between harvested post-hoc power and true power or replicability? In short, we don’t really know yet. Nonetheless, we are being asked to accept harvested post-hoc power as a proxy for post-hoc power, which is a proxy for true power, which is a proxy for replicability! There are lots of steps in between what is actually being measured and what we want to measure or infer.

I would love to have a measure of replicability without bothering to replicate papers. I would also love a ranking of journals based on replicability—or a ranking of departments by their rates of replicability, for that matter. I also see great hope in what Uli has created and think that some version of it might indeed help us in this regard. However, just because a tool is called the replication index does not mean it indexes replicability. I am not willing to accept this method right now. Until I see a lot more detail about the method—including tests comparing its results to those of expert coders—and until I see the method predict actual replicability, I think using it to rank anything is premature.