What Should School Teach Teenagers About Statistics?
The folly of too much, the agony of not enough.
Dangerous Territory
Unlike most of my posts, and in general most of my internet presence, this post is almost entirely opinion-based. I won’t back anything here up with studies or mathematics, partially because the relevant questions would require evidence, and the data available on these topics is sparse where it even exists. In the absence of that, I will rely on good ol’ experience. I have spent a lot of time teaching people about statistics and the applied use of statistics, and when it comes to teenagers in the 16-18 range specifically, I’ve spent a fair amount of time tutoring statistics and mathematics, and also browsing the curricular requirements and talking to teachers who teach the statistics curriculum in states like California where it is common or mandatory. I will not be directly citing these as I go, but they do inform my thinking. This is an entirely reasonable amount of background for an opinion piece for most outlets, to the extent that it probably shouldn’t require a disclaimer like this, but at least in my particular case here you go. In previous generations they’d have called this “neurotic” but in the age of the internet it’s more a combination of properly situating my point of view with a hopeless plea to not read tiresome comments.
There Should Be a Class
One of the Sisyphean tasks of the modern age is the constant need to deal with the fact that you know something that other people do not, and no matter how long you live, even as some people learn that thing (or reject its teachings), new people are being born all the time that do not know it. This is a fact of life that is, clearly, deeply painful to many people.
This pain manifests itself in the common refrain that “there should be a class” about whatever topic it is - mandatory, one assumes - that would teach people personal finance, civics, the specific niche of history we’re particularly interested in, etc. etc., so that the adults of the world would not be making the incredibly stupid mistakes that are so deeply affecting us at all times.
This is a pretty optimistic view, the idea that having a class on a topic would automatically resolve, or even meaningfully reduce, errors on that topic among adults who took the class decades ago. After all, we have classes on mathematics, on literacy, on history, and people make mistakes about that stuff all the time. The question of what a class should teach is very distinct from the question of what we’d like people to know.
The tendency to resolve questions of what people should know with a class gets a chance to brush up against reality, though, when it comes to statistics. All kinds of people wish for a class in statistics, any time we see a statistic being discussed incorrectly on social media. There should be a class in statistics, we say, to keep people from continually interpreting all these statistics wrong.
And these days, in the United States at least, there is a class in statistics. Statistics is being added to an increasing number of states’ public-school curricula. This is one case where “there should be a class”… worked! There is a class now. Desire manifest! The Secret works.
I’m very curious to see the impact of these classes. I do expect them to make some things that were obscure more common-knowledge. Did you know that a decent share of US high school students now learn about correlation, linear regression, means, and medians? Well, they do!1
Those are some of the things that are in that class. But… what should be in that class? If the goal is to add statistics to a common curriculum so as to improve the way that adults interact with the world, as seems to be our actual goal, what should that class actually teach?
The Usual Suspects, and Some Unusual Ones
Probability
A common first-principles suggestion for inclusion in a high school statistics class is probability. This makes a whole lot of intuitive sense, since statistics is built entirely on top of probability. So whatever you think needs to be in the class, you’ve got to start there, right? Combinatorics, die-roll calculations, and so on.
My verdict: No. I actually think that learning raw probability at this level is of limited value. A benefit is that it’s something you really can teach at this level… but only in a sort of rote way of limited usefulness. What’s the probability of getting an 8 from adding up the rolls of two six-sided dice? I can teach that to a high schooler, but really, what can they do with that?
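For the record, the dice question itself can be answered by brute-force enumeration. Here is a minimal Python sketch; the function name `prob_two_dice_sum` is invented for illustration, and the code assumes two fair dice:

```python
from fractions import Fraction
from itertools import product


def prob_two_dice_sum(target, sides=6):
    """Exact probability that two fair dice sum to `target`."""
    # Enumerate every equally likely (first die, second die) pair.
    outcomes = list(product(range(1, sides + 1), repeat=2))
    hits = sum(1 for a, b in outcomes if a + b == target)
    return Fraction(hits, len(outcomes))


p_eight = prob_two_dice_sum(8)  # Fraction(5, 36), about 13.9%
```

Five of the 36 equally likely pairs add up to 8, which is exactly the sort of rote answer the paragraph above has in mind.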
A common complaint about other kinds of high school math, like algebra, trigonometry, or even calculus, is that they exist purely as workbook problems with limited real-world application. This is way overstated, though, and these methods, taught at the high school level, can get you basically all the way to real world applications that actually arise. Or at least algebra and trigonometry can, in business or building applications. Real-world applications of high school calculus certainly exist, but they tend to be the kind of things that come up way less often or don’t really need the calculation, like figuring out compound growth.
Real-world applications of probabilistic calculations, though, almost always rely on more advanced stuff than you can really teach a high schooler. That’s because the calculations you can actually teach a broad high school audience rely on independence, and since when is anything independent in the real world outside of a casino? Probabilistic calculations at this level do more harm than good, and lead to conclusions like “it’s basically impossible that this highly-correlated event occurred multiple times in a row - if I take the independent probability and raise it to the Nth power it’s tiny!”
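To see how badly the "raise it to the Nth power" shortcut can go wrong, here is a rough Python sketch. Everything in it is an assumption made up for illustration: events share a hidden "bad conditions" state that is active 10% of the time, and the three probabilities below are chosen so that each event still has a 10% chance on its own.

```python
def naive_joint(p, n):
    """Chance of n events in a row if they were truly independent."""
    return p ** n


def shared_state_marginal(s, p_hi, p_lo):
    """Chance of a single event when a shared hidden state is active with
    probability s (event chance p_hi) and inactive otherwise (p_lo)."""
    return s * p_hi + (1 - s) * p_lo


def shared_state_joint(s, p_hi, p_lo, n):
    """Chance of n events in a row when they all share the same hidden state
    and are independent only once that state is known."""
    return s * p_hi ** n + (1 - s) * p_lo ** n


# Hypothetical numbers, picked so the single-event chance is 10%.
S, P_HI, P_LO = 0.1, 0.9, 1 / 90

single = shared_state_marginal(S, P_HI, P_LO)    # about 0.10
naive = naive_joint(single, 3)                   # about 0.001
correlated = shared_state_joint(S, P_HI, P_LO, 3)  # about 0.073
```

Under these made-up numbers, three in a row is roughly 70 times more likely than the independence calculation suggests, even though each individual event looks identical in both stories.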
This is a conclusion I came to through teaching, of all things, behavioral economics. I teach probabilistic calculations in that class (and the probabilistic reasoning errors that accompany them). Coming up with examples of “correct” answers that our judgment can deviate from is basically impossible without relying on coins, dice, cards, etc. Otherwise, our “incorrect” judgment may well be accurate, and the probabilistic statement only “correct” because it is contrived.
Probability calculations themselves are of limited use, and if our goal is to create a high school class that teaches people how to use statistics in the real world, they are not a high-priority inclusion. But surely we need to include them anyway, since all the other stuff is built on top of them?
I doubt that, actually. If you’ve ever taught econometrics, how often do you fall back on basic probabilistic calculations or even reasoning? Is it necessary to get across those higher-level insights? I haven’t found it to be necessary, or even helpful, even as those joint probability distributions are in fact doing all the work behind the scenes.
There is one area where I do think basic probability should definitely make its way into the class, and that’s an understanding of how likely probabilities are, exactly. What does something that happens 10% of the time feel like? 50%? Hearing “a 20% chance of rain tomorrow” or “1% chance of a catastrophic earthquake in the next 10 years” and being able to roughly put that in perspective is something teachable and useful at the high school level.
Properties of Distributions
By “properties of distributions” I mean two things.
First, I mean a coverage of some commonly-seen probability distributions, like the normal distribution, the Poisson, and a power-law distribution (I might even say only those three, maybe even dropping Poisson if we want to get spicy). What are they, where do they tend to pop up, and what kinds of things can we expect to happen if we are following one: extremely high values are rare for normally-distributed things, but common for power-law distributed things, extreme events become much more likely if the mean of a normal shifts, comparing normals with wide vs. narrow spreads, and so on.
Second, I mean basic summary statistics that we might use to try to describe those distributions. In particular: means, medians, and other percentiles, what they mean, and when to use each, as well as a few useful percentile-based statistics like the interquartile range (IQR) or a 90/10 ratio. Perhaps standard deviations as well, but I’m not sure - I’m wary about these only because I’d have a hard time making them understandable in a useful way to high school students. The high school statistics materials I’ve seen that do try to do standard deviations haven’t quite nailed it yet, I think. Perhaps someone else will figure out how to thread that particular needle. But is it really necessary? Some normals are wide, others are narrow - that’s useful, but does putting an exact number on “wideness” give a high school student that much more information?
My verdict: Yes. I certainly don’t think we need to get into the mathematical weeds of these distributions, but the basic idea that different variables are distributed, say, with or without skew, is highly useful for understanding not just the data we see around us, but realities of things like wealth distributions. And for that reason, it’s fairly understandable. It’s easy to see that, say, Bradley Cooper is wildly more successful than any of us reading this will ever be, but nobody is as tall as Bradley Cooper is successful.
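The tail behavior described in this section can be put into rough numbers. The Python sketch below is only an illustration: the thresholds, the half-standard-deviation shift, and the Pareto parameters (`x_min`, `alpha`) are all arbitrary assumptions, and a Pareto distribution is just one example of a power law.

```python
from statistics import NormalDist


def normal_tail(threshold, mean=0.0, sd=1.0):
    """P(X > threshold) for a normal distribution."""
    return 1.0 - NormalDist(mean, sd).cdf(threshold)


def pareto_tail(threshold, x_min=1.0, alpha=2.0):
    """P(X > threshold) for a Pareto (power-law) distribution."""
    if threshold <= x_min:
        return 1.0
    return (x_min / threshold) ** alpha


# Shifting a normal's mean by half a standard deviation makes
# "3 or higher" several times more likely.
shift_ratio = normal_tail(3.0, mean=0.5) / normal_tail(3.0, mean=0.0)

# Doubling the threshold: how much rarer does the extreme event get?
normal_drop = normal_tail(4.0) / normal_tail(2.0)  # well under 1%
pareto_drop = pareto_tail(4.0) / pareto_tail(2.0)  # exactly 25% when alpha = 2
```

With these assumed parameters, doubling the bar makes the normal event hundreds of times rarer but the power-law event only four times rarer, which is the "nobody is as tall as Bradley Cooper is successful" point in numeric form.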
Learning about means and other summary statistics is useful, too, simply because we actually see these in our everyday lives. The news is full of them, as will be many of our occupations, from President to McDonald’s manager. Understanding what these summary statistics are for will pay off quickly (and I think this is why these commonly are included in high school statistics curricula - this isn’t just me talking here).
Even better would be if we could teach an understanding of how noise and change affect these averages. What does it mean if average sales at the retail location you worked at dropped 5% last month? Is that a lot? Is that typical? Would it be different if we were dealing with a median or a percentile?
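One way to make "is a 5% drop a lot?" concrete is to compare it against how much the number usually moves. The Python sketch below uses an entirely invented sales history (`monthly_sales`) and hypothetical function names; it is meant as a classroom-style illustration, not a forecasting method.

```python
def month_over_month_changes(series):
    """Fractional change from each month to the next."""
    return [later / earlier - 1.0 for earlier, later in zip(series, series[1:])]


def share_at_least_as_large(changes, size):
    """Share of past changes whose magnitude is at least `size`."""
    return sum(1 for c in changes if abs(c) >= size) / len(changes)


# Hypothetical average monthly sales for one store (arbitrary units).
monthly_sales = [100, 104, 97, 103, 95, 101, 108, 99, 105, 98, 102, 96]

changes = month_over_month_changes(monthly_sales)
typical_share = share_at_least_as_large(changes, 0.05)  # 9 of the 11 changes
```

In this made-up history, swings of 5% or more happen in most months, so a 5% drop on its own would say very little. A store whose sales normally move by 1% would tell a different story with the same headline number.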
Data Literacy
Data literacy is less about doing statistics than it is about, say, reading news articles that include data, or a set of political claims, and understanding what they say, as well as being able to critique them.
This covers a wide range of stuff, and in one sense is more a way that all the other stuff on this page could be applied and tested rather than an actual topic in itself.
My verdict: Maybe. This is a case where I’d love people to learn data literacy, but I’m a little skeptical how teachable it is. Should we use examples from the news and real life when teaching all kinds of statistical topics at the high school level? Absolutely! But beyond learning those other topics, is “data literacy” as its own subject useful?
I’m uncertain. In my experience, when statistics are widely mis-cited, or misleading, the way we can tell that they’re wrong is highly bespoke. Understanding how it’s possible that, say, someone says “the labor market is improving even though the unemployment rate is rising” (because lots of people are re-joining the labor force) isn’t so much a statistical literacy question as it is a labor economics question. Lots of poor statistics we read about aren’t even mistakes on the part of the reader either, they’re just pure fabrications (the More or Less podcast has a great cavalcade of these every week!).
I think most other cases where we see bad data literacy are similar, where it’s not so much actually a problem of bad data literacy as it is an issue of the underlying topic or measurement being opaque or difficult to understand. This is the kind of thing that’s really hard to teach in a general way, and impossible to cover in any reasonable breadth in a specific way (you’d have to teach about every widely used statistic and all of its caveats!).
The best argument I could make for this would be to have this really be a topic about “just be skeptical about all that data you see, and apply all the other stuff we know, like selection, means and medians, and so on,” but again, that’s just applying the other stuff in the class that’s there already. And I don’t think we’ve quite figured out how to teach skepticism about this stuff without it just turning into “now I have the tools to find an excuse to disbelieve everything, especially the stuff I don’t like anyway, and give the stuff I do like a pass.” Media literacy classes offer us a warning here.
Identification
Since I’m a causal inference guy, I want to be clear: what I mean by identification in this section is not causal identification (although that would be a subset). I don’t think we need to teach high school students matching estimators or RDD or whatever. Nor is it statistical identification (too much math!). By “identification” I mean the much broader concept of: “we did a calculation in order to answer a question. Does this calculation actually answer that question?”
As an example of what I mean, and how this concept applies in a non-causal way:2 let’s say that you’re trying to decide whether to open up a Taco Bell franchise location. You look at the data and notice that Taco Bell’s total revenue has gone up for the past few years.3 Seems like Taco Bell is doing well. Does this tell you anything about whether you should open a Taco Bell? No! You’re not going to own all of Taco Bell, you’re going to own a single store. You should instead look at revenue per location. Perhaps revenues are going up because they opened a bunch of new locations, and each individual store is actually doing worse!
The calculation that identifies how much a Taco Bell location is likely to sell is sales per-store, not total Taco Bell sales.
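The Taco Bell point is easy to see with a few numbers. The figures in this Python sketch are completely invented (just like the scenario itself); the only purpose is to show total revenue rising while revenue per store falls.

```python
# Hypothetical chain-wide figures: (year, total revenue in $ millions, store count).
chain_history = [
    (2021, 1000.0, 500),
    (2022, 1100.0, 600),
    (2023, 1200.0, 700),
]


def revenue_per_store(history):
    """Revenue per location for each year, in $ millions."""
    return {year: total / stores for year, total, stores in history}


totals = [total for _, total, _ in chain_history]
per_store = revenue_per_store(chain_history)

total_is_rising = all(a < b for a, b in zip(totals, totals[1:]))
per_store_values = [per_store[year] for year, _, _ in chain_history]
per_store_is_falling = all(a > b for a, b in zip(per_store_values, per_store_values[1:]))
```

Here the chain grows from $1,000 million to $1,200 million while the average store slides from $2.0 million to roughly $1.7 million, so the headline number and the number a prospective franchisee cares about point in opposite directions.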
My verdict: Yes. I think that this kind of identification is both teachable and highly useful in the real world, even in cases where you don’t have your own hands on data and are just seeing someone talk about data, or just seeing someone make a claim based on their own observations and no data at all.
How is identification teachable? I think most errors of identification tend to be obvious after they’re pointed out, so giving students practice carefully considering whether a given measure actually relates to a question, a set of questions to ask themselves when evaluating a data-backed claim, and even a checklist of common ways that identification goes wrong would actually go a long way and be of everyday use.
That checklist would include: construct validity (does the data we have actually represent the concept we’re interested in, for example in a study on “happiness levels”), selection bias (as economists use the term; basically, the presence of confounders that make a correlation not represent a causal effect of interest, for example whenever we hear about something rich people do being correlated with good health outcomes), appropriately scaled comparisons (figuring out whether a 50% increase is big or small depends on the base rate and what variable we’re talking about, and a claim of “outcome Y just fell, this must be the fault of policy X I don’t like” should probably check whether Y also fell in areas that don’t have policy X), and appropriate transformation (is the data calculated in the right format or observation level to answer the question we have, such as in the Taco Bell example above, or in choosing an absolute vs. relative increase in a rate). Some of these terms would probably be renamed into less-wonky terminology.
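The "50% increase" item on that checklist is the easiest one to demonstrate with arithmetic. In this Python sketch the two pairs of rates are invented, and the helper names are hypothetical:

```python
def relative_change(before, after):
    """Change as a fraction of the starting rate (0.5 means +50%)."""
    return (after - before) / before


def absolute_change_points(before, after):
    """Change in percentage points, for rates given as fractions."""
    return (after - before) * 100.0


# Two hypothetical rates that both "rose 50%".
rare_before, rare_after = 0.02, 0.03      # 2% -> 3%
common_before, common_after = 0.40, 0.60  # 40% -> 60%

rare_rel = relative_change(rare_before, rare_after)
common_rel = relative_change(common_before, common_after)
rare_abs = absolute_change_points(rare_before, rare_after)        # about 1 point
common_abs = absolute_change_points(common_before, common_after)  # about 20 points
```

Both changes can honestly be reported as a 50% increase, yet one affects about 1 more person in 100 and the other about 20 more. Which framing answers the question depends on what the question is.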
Statistical Estimators
By “statistical estimators” I mean things like correlation coefficients, and in particular linear regression, which is actually present in many high school statistics curricula (although of course they don’t go into it as deeply as a college course would).
My verdict: No. I think there’s relatively little value in presenting high school students with this stuff. Linear regression is a powerful tool… for performing research… if you know what you’re doing with it. I barely trust the output of a linear regression performed by an undergrad (heck, I distrust the output of a linear regression performed by many a tenured professor). There’s simply no way we can give enough context to a high school student to get them to properly use linear regression in the real world. And if we could, what would they be using it for? Need for a linear regression, or even interpreting a linear regression, hardly comes up in the world. I don’t think learning about concepts like minimizing least squares tells you much either. Even when you see a best-fit trendline in, say, a newspaper, knowing about linear regression does little to help you understand it or be able to critique it.
How about other calculations, like correlation coefficients, or ANOVAs? I think the same problem largely applies. I have a very hard time thinking of a general-purpose application for these tools that isn’t more likely to be wrong than right without a lot more statistics education that most students in a high school class will never get.
Will stuff like linear regression give a leg up to the students going on to take statistics in college, and in the end produce better statistics-using professionals at the end of their higher-education journey? Sure, maybe. But that’s offset by a likely negative effect on the other students in the class (the majority). So, I disagree with the wider high-school statistics curriculum here; I’d take linear regression out.
Hypothesis Testing
Hypothesis testing is another common feature of high school statistics classes, with coverage of p-values and t-statistics. This makes sense, as it’s a common feature of applied statistics, and you certainly wouldn’t be able to, say, read most statistics-based research papers without an understanding of hypothesis testing.
My verdict: No. I like the idea of students learning about noise and sampling distributions, I really do. But hypothesis testing is an awful way to do it, if only because hypothesis testing is deeply misunderstood just about everywhere. The average person gets it wrong, sure, but also it’s pretty common for actual researchers or even statistical textbook authors to get hypothesis testing wrong, both on the finer details and also on the broader implications for what we should infer from getting a significant or insignificant result.
I’ve tried as hard as I possibly can to teach hypothesis testing accurately to college seniors in my econometrics class. But the wrong interpretations of hypothesis testing (hypothesis testing reveals the truth, significance/insignificance = correct/incorrect, insignificant effects are zero, and so on and so on) are so darn tempting and intuitive that students come back to them again and again no matter how many points I dock or times I remind them. For high schoolers (and, let’s be honest, for many of the math teachers roped in to teach the statistics module), I think teaching hypothesis testing is likely to make students worse at doing statistical inference and understanding data in the world, not better.
Now, that said, I do think it’s valuable for students to think about noise and precision of a result. “Group A is 2 units higher than Group B” means a lot more when both groups have 1000 people in them than when they each have 1. But we can teach about variation without having to introduce the error-inducing framework of hypothesis testing. We don’t need high school students to be able to make yes-or-no pronouncements of this stuff. Why would we?
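The group-size point can be shown by simulation, with no test statistics in sight. The Python sketch below draws two groups from the very same distribution and counts how often their averages still differ by 2 or more. The spread of 10 units, the number of repetitions, and the seed are all assumptions chosen for illustration.

```python
import random


def share_of_big_gaps(n_per_group, gap=2.0, sd=10.0, reps=300, seed=0):
    """Share of simulated comparisons in which two groups drawn from the SAME
    normal distribution have averages at least `gap` apart."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    big = 0
    for _ in range(reps):
        mean_a = sum(rng.gauss(0.0, sd) for _ in range(n_per_group)) / n_per_group
        mean_b = sum(rng.gauss(0.0, sd) for _ in range(n_per_group)) / n_per_group
        if abs(mean_a - mean_b) >= gap:
            big += 1
    return big / reps


share_with_one_person = share_of_big_gaps(1)
share_with_thousand = share_of_big_gaps(1000)
```

Under these assumptions, a 2-unit gap shows up by pure chance in the large majority of one-person comparisons and essentially never in the thousand-person ones. That is the whole lesson, delivered as "how often would noise alone do this?" rather than as a yes-or-no verdict.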
Although…
I admit one exception to my negative view on statistical estimators and hypothesis tests. I think there is one particular estimator, and an accompanying test, that is actually of real-world use to high school students, and is doable at a high school level, and can be properly taught in the constraints of a high school statistics class.
That’s the two-sample test of means. Real-world applications abound, it’s not that hard to do close-enough-to-right, and it’s a pretty-useful application of hypothesis testing and standard deviation calculations.
The problem is, that’s really the only case I’d want any of that stuff included. Is it possible to teach a two-sample test of means properly without having to build up to it by covering hypothesis testing and standard deviation calculations in grand detail? I’m not sure. So I’m of two minds about this one. I’d like students to learn it, if only there were a way to teach it well without the other stuff I generally think shouldn’t be in the class! Maybe someone else can figure it out (or already has). Maybe teach it as a comparison of two overlapping normal distributions? That’s a pretty visual application!
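For reference, here is roughly what the calculation behind a two-sample test of means looks like in Python. This is a sketch under stated assumptions: it uses the unequal-variance (Welch) form of the statistic, the two samples are invented, and the p-value uses a normal approximation that is crude for groups this small (a proper version would use a t distribution).

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_t(sample_a, sample_b):
    """Difference in sample means divided by its estimated standard error."""
    se = sqrt(variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / se


def approx_two_sided_p(t_stat):
    """Two-sided p-value from a normal approximation (rough for small samples)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))


# Hypothetical measurements for two groups of eight.
group_a = [12, 15, 14, 10, 13, 16, 11, 14]
group_b = [9, 11, 10, 8, 12, 10, 9, 11]

t_stat = welch_t(group_a, group_b)
p_approx = approx_two_sided_p(t_stat)
```

The statistic is just "how many standard errors apart are the two averages," which is the same noise-versus-gap comparison as before with a formula attached. Whether that formula can be taught without dragging in the rest of the hypothesis-testing apparatus is exactly the open question.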
In The End, It’s Still a Class
If it feels like I’ve cut too much out, let’s take a look at what’s left in my recommended list: an intuitive understanding of probability values, properties of distributions, basic summary statistics, maybe a two-sample test of means, and identification. We’ll probably learn a few things like “how to read a scatterplot and a line graph” along the way. Take a 15-week semester. Giving each of those subjects an average of three weeks, at high school pace, is definitely feasible (with distributions and summary statistics getting above-average time), even for students with zero prior statistics exposure, but it isn’t exactly roomy. Especially if that’s really a 10-week quarter and two weeks apiece, or a four-week module inside a bigger mathematics class. If the idea is a class that provides ideas that stick in people’s brains and they can actually use later, you really can’t crowd the syllabus. You really want regression in there? Or combinatorics? Well, what are you cutting?
1. I suspect this is one culprit behind why every discussion of, say, median wealth on Twitter devolves into commenters complaining that the average is sensitive to outliers. They’re half-remembering a lesson from high school! This is what I said about what-is-taught not exactly matching what-is-learned. Such is life.
2. You could perhaps reframe this example as a causality problem but IMO it would be a bit of a stretch.
3. I’m making this up, I have no idea. Is Taco Bell even on a franchise model?