Statistical Inference is Policy
Ask not how we can *do* the best statistics, ask how we can *get* the best statistics.
Choices
Ask an econometrician maybe 20 years ago what distinguishes econometrics from the rest of statistics and they’d probably mention something about how econometricians have borrowed insights from economics more broadly and thus respect the impact of choice on data. We work with things like endogenous choice models and auction models and cutoff manipulation where we acknowledge that the relationships we’re interested in are not just confounded by back-door associations but that our data points actively choose their own values based often on a knowledge of their own system and a set of goals they are trying to pursue.
Of course, for any economist there is little limit to the arenas where we ask ourselves to consider “ah, but why would they choose to behave that way?” and perhaps “how should the environment be designed while respecting that people will make their own choices?”
Certainly, when dealing with data about humans and other sentient beings, the models we base our inferences on should take account of the behavioral patterns and sentience of our subjects of study. But researchers are sentient people too. Our methods of inference necessarily depend on our behavioral patterns and choices. We make choices about what kinds of research to do and how to do it, and once we have a result we make choices about how to interpret and present those results.
Just like an economist would say it's silly to design policy without taking into account how people will change their behavior in response to it, our policies and methods of inference should take into account how we researchers behave.
The rules we set for how we perform research, and how we can make inferences from that research, are policy. Perhaps it's not policy set by some governmental agency, but it's a set of rules that shape our behavior.
I’m certainly not the first person to think about this,1 but a lot of the time when we’re thinking about researcher incentives we’re thinking about things like funding structures, publication strategies, and tenure. I’m thinking purely about the statistics. What are the behavioral, rather than mathematical or logical, implications of our statistical policies?
Oh and Also
As a brief aside, you may have seen my most recent post about GiveDirectly and Egger et al. (2022). This post was a part of a Substack Giving Tuesday campaign to raise money for three villages in Rwanda via GiveDirectly, an organization I have long donated all the subscription proceeds from this Substack to.
In the first three days of the campaign, GiveDirectly had raised over $900,000 from 1,925 Substack readers - I’m not sure how many came from Data on Average rather than the many other participating Substacks but let’s assume it was at least a few! Thank you!
This puts their funding drive at very close to their $1.2M goal for the month. If you can, please consider donating to GiveDirectly now to send cash directly to families in Rwanda, to use it as they see best fit.
Back to the post.
Researchers are People
What do researchers want, and how can they get it?
It’s easy to take the cynical route here and say that they want publications and professional success, but I don’t think that’s entirely accurate. I think researchers in general do want to get the right answer, or at least it’s a prominent motivation.
We have other motivations too. The cynical desire to get an eye-catching result and publish well is one of them. So is a desire to confirm our own beliefs, and a desire to not work harder than we have to. We are in a profession where none of those things are supposed to be part of our motivation, and wanting to avoid leaning too hard into those goals is a part of our motivation too.
Those are our goals. We make choices in pursuit of those goals, but we face restrictions in our ability to do so. Some of those restrictions include limits on our funding or time to do research. Others include limitations in our own abilities - this latter part is especially true when it comes to statistics. There are plenty of researchers who are, say, great biologists or economists or whatever, but not really statisticians. When forced to do their own statistics, it's their own statistical limitations that will come into play.
So we make our choices about how to pursue research and how to make inferences from it. That’s the behavior we can think about researchers as engaging in. They do it in pursuit of some mix of professional success, actual truth-seeking behavior, desire to reduce effort, and other psychological goals having to do with truth-seeking (like confirmation bias), and in a context where funding, time, and ability are all to some degree limited or at least not infinite.
Those restrictions are things that can be changed given policy surrounding research. Policy changes incentives. For an extreme example, consider a researcher working in a country where research simply won’t be released if it contradicts the government’s opinions. A truth-seeking researcher may choose to focus on topics that the government doesn’t care about, knowing that they won’t be able to truthfully report results about controversial topics. Changes in publication policy can change research behavior in that way.
Less directly, rigorous and difficult statistical papers and proofs are often boiled down into a set of best practices for researchers. In the context of statistics, these simplifications mean that we often start with “this is the correct thing to do when X and Y are true” and end up with “this is the correct thing to do.” So how do we, say, adjust for confounding variables? Economists (and some other fields) tend to use regression, which is the correct thing to do *when linearity holds* (I’m simplifying, there are other conditions too), while other social sciences tend to use matching, which is the correct thing to do *when dimensionality is low* (again, simplifying).
Simply asking researchers to remember the italicized parts at all times is the logically and mathematically appropriate thing. But we know that is perhaps asking too much. So we end up with policies like “use regression” in economics and “use matching” outside of it. Even if these are pretty imperfect pieces of advice (obviously), that advice is still pretty good as policy if linearity is more likely to hold in economics and dimensionality tends to be low outside of it. Whether or not that’s true is another question, but you can see how improper statistical advice could still be good statistical policy, or at least why the conditions for those two things are different.
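To make the contrast concrete, here is a minimal sketch in Python. The simulated data, variable names, and the crude one-to-one matching scheme are my own illustration, not from any particular paper; by construction both conditions hold here, so both “policies” recover the true effect.

```python
# A toy sketch of the two adjustment "policies": regression, correct when the
# outcome is linear in the confounder, and matching, correct when the
# confounder is low-dimensional. Both conditions hold here by construction,
# so both land near the true effect of 2.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2000
z = rng.normal(size=n)                           # a single confounder
x = (z + rng.normal(size=n) > 0).astype(float)   # treatment chosen partly based on z
y = 2.0 * x + 3.0 * z + rng.normal(size=n)       # true effect of x on y is 2

# "Use regression": control for z linearly
ols = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()
print("regression estimate:", round(ols.params[1], 2))

# "Use matching": pair each treated unit with the nearest control on z
treated = np.where(x == 1)[0]
control = np.where(x == 0)[0]
nearest = control[np.abs(z[treated][:, None] - z[control][None, :]).argmin(axis=1)]
print("matching estimate:  ", round((y[treated] - y[nearest]).mean(), 2))
```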
Let’s keep this in mind as we consider an example of real-world statistical policy.
Case in Point: Hypothesis Testing
What is hypothesis testing as statistical policy? Again, we’re not defining it as you’d see it in a textbook (several of the steps below are actively incorrect if you’re going by the textbook), we’re defining it as an applied set of rules for quantitative research. Loosely, it is this:
Get some data.
Perform some calculations on that data (for example, fitting a regression).
Calculate a test statistic on that calculation, comparing a parameter to 0 (like a t-stat).
Calculate a p-value from that test statistic.
If that p-value is below (usually) .05, consider the parameter to be non-zero and interpret it as the non-zero value you estimated.
If that p-value is above (usually) .05, consider the parameter to basically be zero, and either interpret it as a zero, or, if you really needed that parameter to be non-zero for the result to be interesting, perhaps abandon the analysis.
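Written out literally as code, the policy looks something like the sketch below. The simulated data, the true slope of 0.3, and the sample size are all arbitrary numbers of mine; this is a sketch of the practice as described above, not a recommendation.

```python
# The list above, written out step by step on simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = 0.3 * x + rng.normal(size=n)            # step 1: get some data

fit = sm.OLS(y, sm.add_constant(x)).fit()   # step 2: perform some calculations
t_stat = fit.tvalues[1]                     # step 3: test the slope against 0
p_value = fit.pvalues[1]                    # step 4: get a p-value

if p_value < 0.05:                          # step 5: call it non-zero
    print(f"Effect of {fit.params[1]:.2f} (t = {t_stat:.2f}, p = {p_value:.3f})")
else:                                       # step 6: call it zero, or give up
    print(f"No effect (p = {p_value:.3f})")
```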
No specific agency chose these policies; they developed somewhat naturally, through years of actual practice and decision-making. In fact, no specific agency has the ability to choose these policies, as they are rules held together by social norms, as Elinor Ostrom might think of them. As an example, the journals published by the American Economic Association stopped publishing significance stars in 2016. This had no effect on p-hacking or publication bias. I suspect the reason it had no effect is because we all mentally insert those stars anyway, since that’s the actual policy, no matter what the AEA says.
There are some weird steps in here, right? Like, why test against 0? Why .05? Those seem like arbitrary values, and they are! These numbers don’t pop out of the statistics naturally. Statistics generally has little to do with specific values like that.
But what does have arbitrary cutoff values? Policy! There’s no particular reason why a single individual needs to earn less than specifically $2,608 to qualify for food aid in Washington State or why the retirement age in Spain is specifically 66 years and 8 months. But policies that make decisions love cutoffs. Maybe it didn’t have to be that number but it has to be some number. We can already see how hypothesis testing in practice behaves a bit more like a decision-making policy than it does a statistical practice. Speculatively, we can wonder whether this is why hypothesis testing became so popular and remained so dominant - maybe we wanted a policy more than we wanted something logical.
What choices does this policy encourage? Because it requires us to pass a hurdle related to precision, it should discourage choices that are likely to reduce precision (like using small samples or using highly collinear models). Because it tests against 0, it should discourage study of small effects. Because it uses a predefined significance cutoff, it should discourage attempts to sell results that are just noise as meaningful (since it requires changing the analysis rather than just changing the goalposts).2
Granted, like any policy, especially a simple one, hypothesis testing can be gamed. Small effects estimated imprecisely can absolutely get significant results and get published. We can manipulate our analyses to pass the cutoff - that’s p-hacking. We can all think of examples of small or imprecise results being sold as meaningful.
But the test of a good policy isn’t whether it is accurate, it’s whether it’s accurate enough to use, and perhaps more accurate than alternatives. Do we sometimes publish noisy results on small effects? Sure! But does the policy make fewer of those studies get published while letting more of the precise-results-on-large-effects studies go through? I think it probably does!3
Does this policy make us publish fewer noisy results on small effects than if we used a different policy? Depends what your alternate policy is! In many parts of the private sector and business analytics you see little attention paid to statistical precision, and the this-effect-is-just-noise problem is way worse there. We can instead imagine what academic research would look like if everyone was Bayesian or something, but contrasting hypothesis-testing-in-practice against Bayesian-practice-in-theory or Bayesian-as-practiced-by-really-good-statisticians is not a good policy comparison.
Perhaps the most salient feature of hypothesis testing is that it gives an up-or-down designation. We then often choose not to publish insignificant results. This means the set of published studies is full only of significant results. Behaviorally, since there is randomness in sampling, this encourages an approach of running lots of imprecise studies rather than a few precise ones. THIS is the behavioral issue that is really concerning with hypothesis testing, since it means not just that some noisy false positives are slipping through, but that they have the opportunity to swamp the more precise studies. If not for that behavioral response, then the up-or-down designation might be the right sieve policy, especially if what we’re concerned with is that readers picking a random article are more likely (but not guaranteed) to get something accurate. But again, how does this compare to the alternative? Is there a statistical policy in a world where studies are either published or not published that discourages a scattershot approach? I’m not sure Bayes really fixes this!
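Here is a rough simulation of that scattershot worry. The effect size, sample sizes, and study counts are all invented for illustration, and the two strategies spend the same total sample budget; the point is just that the same publication sieve lets the many-small-studies strategy put more, and more exaggerated, estimates into the “literature” than the few-large-studies strategy.

```python
# Two strategies with the same total budget: 100 studies of 30 per arm,
# or 5 studies of 600 per arm, filtered through the p < .05 sieve.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_effect = 0.2  # a smallish true effect, in SD units (invented)

def published_estimates(n_studies, n_per_arm):
    """Run the studies, keep only the ones that clear p < .05 (the sieve)."""
    published = []
    for _ in range(n_studies):
        treat = rng.normal(true_effect, 1, n_per_arm)
        ctrl = rng.normal(0, 1, n_per_arm)
        _, p = stats.ttest_ind(treat, ctrl)
        if p < 0.05:
            published.append(treat.mean() - ctrl.mean())
    return np.array(published)

scattershot = published_estimates(n_studies=100, n_per_arm=30)
precise = published_estimates(n_studies=5, n_per_arm=600)

print(f"scattershot: {len(scattershot)} published, mean estimate {scattershot.mean():.2f}")
print(f"precise:     {len(precise)} published, mean estimate {precise.mean():.2f}")
```

The scattershot arm publishes more results, and the ones it publishes overstate the true effect, since only the lucky draws clear the hurdle.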
Mini-Case: Covariate Selection
In the causal-inference world, there is a divide between those who use causal diagrams and those who don’t. You all know I’m on team diagram. I think they’re helpful! However, there’s a pervasive sense on that team that diagrams will keep applied researchers from using flimsy or inaccurate identification schemes. I think this ignores how covariate selection is determined by policy, not by textbook.
The current policy surrounding causal identification is to say something like “To get the effect of X on Y we need to assume that thing A and thing B are unrelated conditional on covariates C. I think they are (and ideally: here’s why).”
This allows all sorts of sloppy work, like not justifying why you think two things are unrelated, or making assumptions more because you need them to be true than because you actually believe them, or using things like instrumental variables without justifying that they’re really exogenous, or just picking all the variables you have the most convenient access to and saying “well that’s probably enough.”
So what’s the alternate policy here? Unlike in the hypothesis testing case, here it’s clearer: the alternate policy is to build a causal diagram of your model, then use a causal identification algorithm to have it tell you what you need to control for.
Now, causal diagrams can help avoid some errors, like including control variables that are related to the question of interest but shouldn’t actually be controlled for. But the idea that they force researchers to avoid sloppy work seems way off base to me! If you want to be sloppy with a causal diagram you absolutely can - just draw a sloppy diagram! Want to conveniently assume two things are unrelated? Just don’t draw a line between them on your diagram, or leave a variable off entirely. Want to assume an instrument is valid? Just draw it as valid. Only draw your diagram with back doors that you are capable of resolving, and conveniently leave the rest off. As a formal set of rules, causal diagrams disallow all this stuff. But as a policy they don’t have a great way of discouraging it, at least not any more than the non-diagram policy does.
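To see how mechanical the identification step is, here is a toy sketch. The little backdoor-path search below is my own simplification (it ignores collider logic, which happens to be fine for these two tiny graphs); real tools like dagitty or DoWhy do this properly. The point is that the algorithm only knows about the arrows you chose to draw.

```python
# Identification only reflects the diagram you hand it: include the
# confounding arrow and it tells you to adjust; leave it off and it doesn't.

def backdoor_paths(edges, treatment, outcome):
    """Paths from treatment to outcome that start with an arrow INTO the
    treatment - the paths a back-door adjustment set has to block."""
    paths = []
    def walk(node, path):
        if node == outcome:
            paths.append(path)
            return
        for a, b in edges:
            nxt = b if a == node else a if b == node else None
            if nxt is not None and nxt not in path:
                walk(nxt, path + [nxt])
    for a, b in edges:            # only start along edges pointing into the treatment
        if b == treatment:
            walk(a, [treatment, a])
    return paths

honest = [("U", "X"), ("U", "Y"), ("X", "Y")]   # U confounds X -> Y
sloppy = [("U", "X"), ("X", "Y")]               # same world, U -> Y edge left off

print(backdoor_paths(honest, "X", "Y"))   # [['X', 'U', 'Y']] -> "control for U"
print(backdoor_paths(sloppy, "X", "Y"))   # []                -> "nothing to control for"
```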
Sure, perhaps having to draw a diagram might call attention to the iffy assumptions you’ve made, but in my experience we’re already decent as readers at picking up those sorts of issues. We might also wonder whether drawing an authoritative-looking graph might make it easier to ignore that those assumptions are iffy! Assuming this sloppiness can’t carry over once again compares current-covariate-selection-in-practice against causal-diagrams-in-theory. That’s not how policy works!
Ideal Policy and Barriers
Statistical methods are many things. Two of those things are: (a) a subset (or cousin) of mathematics that explains the proper way to use observation to infer things about the real world, and (b) a set of practices that people actually use to infer things about the real world from data. Most of our actual inference comes from people doing (b), not (a). (a) may not be possible to do on a grand scale.
So the question is, what is the optimal way to structure (b) so as to get us the best inferences about the real world? What is the optimal policy, as economists like to put it? It would be the set of statistical practices that best encourages existing researchers to choose accurate inferences from their work. A logically-accurate set of statistical practices doesn’t mean anything if people don’t use it properly, and a flawed or even gameable set of practices might lead to better outcomes than any realistic alternative.
I can’t claim to actually know what the optimal policy is for any of these cases, but I think this is a clearer way of thinking about the statistical rules and methods we follow than asking what is actually the correct thing to do. Sorry to be an annoying economist about it (I just can’t help it!) but choices are made under constraints. If we want actual statistical practice to improve, then the advice we give (and the standards and policies we try to influence, if we can) must respect those constraints.
As a tangent, I may stand alone in really liking significance stars, on principle. I think the p-value by itself is actually pretty informative, as long as the null hypothesis is meaningful. It tells you how likely data at least as extreme as yours would be, if the null were true! Picking a cutoff for this value doesn’t really make sense. Lower is “better”, higher is “worse”, regardless of a specific cutoff. But having three or four cutoffs, which significance stars allow you to do, gets you a bit closer to just thinking directly about the p-value, rather than a single cutoff. I know it’s not accurate but I think the whole “marginally significant / more significant / less significant” thing makes more sense and leads to better inference than asking whether the p-value clears a single .05 hurdle or not as a binary thing. Would looking directly at the p-value and skipping cutoffs entirely be even better? Yes! But is that an option? I dunno, it seems like it isn’t!
If it didn’t then we wouldn’t have publication bias in the first place! Publishing noisy estimates wouldn’t be an issue for the literature as a whole if we published all the noisy estimates (at least not as long as there was a meta-analysis waiting at the end so we didn’t just read one noisy estimate and then conclude victory and stop reading).

