Naked Statistics

Charles Wheelan, 2013

By Silvana Acosta in Books Statistics

January 8, 2022

“Charles Wheelan strips away the superfluous outer garments and exposes the underlying beauty of the subject”. Hal Varian, Chief Economist at Google.

“He makes statistics interesting and fun. His book strips the subject of its complexity to expose the sexy stuff underneath”. The Economist.

You can go to Highlights at the end for a few two-line bullet points on the main messages of the book.

Framing

Lies, Damned Lies, and Statistics. “Even in the best of circumstances, statistical analysis rarely unveils \(\color{purple}{\textbf{'the truth'}}\). We are usually building a circumstantial case based on imperfect data…At the most basic level, we may disagree on the question that is being answered. Sports enthusiasts will be arguing for all eternity over ‘the best baseball player ever’ because there is no objective definition of ‘best’”.

The first three chapters are about what statistics is and how it is used, with many real-life examples from sports, business, and communication. Simple, neat, and to the point. He goes from explaining the difference between the average income in America and the income of the average American, and how growth in the tail or top 1% can raise the former while leaving the latter unchanged, to explaining how percentages can exaggerate by making growth look explosive when the starting point is very low, and more.
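
To make both points concrete, here is a minimal sketch with made-up numbers (the ten incomes and the helper `pct_change` are my own illustration, not taken from the book). It shows the mean jumping when only the top earner gets richer while the median stays put, and the same absolute gain looking explosive from a low base.

```python
from statistics import mean, median

# Hypothetical incomes (in thousands of dollars) for a tiny "economy" of 10 people.
incomes_before = [30, 35, 40, 45, 50, 55, 60, 70, 90, 500]
# Only the top earner gets richer; everyone else is unchanged.
incomes_after = incomes_before[:-1] + [1500]

mean_before, mean_after = mean(incomes_before), mean(incomes_after)
median_before, median_after = median(incomes_before), median(incomes_after)

print(f"mean:   {mean_before:.1f} -> {mean_after:.1f}")      # the "average income"
print(f"median: {median_before:.1f} -> {median_after:.1f}")  # the "average person"


def pct_change(old, new):
    """Percentage change from old to new."""
    return 100 * (new - old) / old


# The same absolute gain of 2 units, measured from a low and from a high base.
print(pct_change(1, 3))      # low starting point
print(pct_change(100, 102))  # high starting point
```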

Management

Wheelan remarks that we use statistics aiming to present a meaningful picture of what we care about and that we usually hope to act on those numbers. The thing is: “You can’t manage what you can’t measure. True. But you had better be darn sure that what you are \(\color{purple}{\textbf{measuring}}\) is really what you are trying to manage”. His example is school quality: test scores are not a great measure of it, one reason being that the students entering each school have different backgrounds.

On top of this, ideally, “statistics measure the outcomes that matter; \(\color{purple}{\textbf{incentives}}\) give us a reason to improve those outcomes. Or, in some cases, just to make the statistics look better. That’s the bad news”. For instance, when you evaluate school administrators based on the graduation rate: they can focus their efforts just on boosting the number of students who graduate. This is dangerous.

Another example he gives is that some school rankings use financial resources per student as a statistic, which says nothing about how well the money is spent. Rankings also give institutions an incentive to encourage a large number of students to apply, because that makes them look more selective. The thing is that this is a waste of resources: students with no chance of being accepted apply, and schools have to process their applications. A statistic can create perverse incentives. And what is it about rankings, anyway? That “people love \(\color{purple}{\textbf{easy answers}}\). What is the best place? Number 1.”

Correlations

I liked his example of why correlation does not imply causation. There might be a positive correlation between a student’s test scores and the number of televisions that his family owns. That does not mean that watching lots of television is good for academic achievement, nor that parents can boost their children’s test scores by buying more televisions. It is probably just that highly educated parents (the \(\color{purple}{\textbf{confounding}}\) variable) can afford more televisions and tend to have children who test better.

Still, calculating correlations is useful in many scenarios. His example is Netflix \(\color{purple}{\textbf{recommendations}}\). “Netflix compares my ratings with those of other customers to identify those whose ratings are highly correlated with mine. These customers tend to like the films that I like. Once that is established, Netflix can recommend films that like-minded customers have rated highly but that I have not yet seen. That is the ‘big picture’. The actual methodology is much more complex…You tend to like what I like, and to dislike what I dislike, so what did you think of the new George Clooney film? That is the essence of correlation”.
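
As a toy version of that ‘big picture’, the sketch below computes the Pearson correlation between my ratings and those of two other customers. The ratings, the names `ana` and `ben`, and the helper `pearson` are all made up for illustration; as the quote says, the real methodology is much more complex.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical 1-5 star ratings of the same five films.
me = [5, 4, 1, 2, 5]
others = {
    "ana": [5, 5, 1, 1, 4],  # tends to like what I like
    "ben": [1, 2, 5, 4, 2],  # tends to like what I dislike
}

similarity = {name: pearson(me, ratings) for name, ratings in others.items()}
best_match = max(similarity, key=similarity.get)

for name, r in similarity.items():
    print(f"{name}: r = {r:+.2f}")
print("ask for recommendations from:", best_match)
```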

Probability

How to make it play in your favor? I enjoyed his example of the \(\color{purple}{\textbf{marketing}}\) campaign for Schlitz beer. If the typical drinker cannot tell a Schlitz from a Budweiser, a blind taste test is a coin flip. On average, half will pick Schlitz and the other half the Bud. That alone would not make for an interesting campaign. But conducting the taste test exclusively among people who normally choose Bud (which is what Schlitz did) is great: “Half of all Bud drinkers like Schlitz better!”. They got something out of what is actually a coin-flip decision.
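
How safe is such a campaign? If each taster really is a coin flip, the number who pick Schlitz follows a binomial distribution, and we can compute the chance of an embarrassing result. The sketch below assumes 100 tasters and treats "at least 40 pick Schlitz" as a good-enough outcome; both numbers are my assumptions for illustration.

```python
from math import comb


def prob_at_least(k, n, p=0.5):
    """P(X >= k) when X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_tasters = 100  # assumed panel size
threshold = 40   # assumed minimum number of Schlitz picks for a usable ad

p_good_outcome = prob_at_least(threshold, n_tasters)
print(f"P(at least {threshold} of {n_tasters} pick Schlitz) = {p_good_outcome:.3f}")
```

Under these assumptions the campaign is very unlikely to backfire, even though no individual taster can tell the beers apart.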

Relatedly, he also covers why you should not play the \(\color{purple}{\textbf{lottery}}\), explaining the idea of randomness and the meaning of probability. “This is one of the crucial lessons of probability. Good decisions - as measured by the underlying probabilities - can turn out badly. And bad decisions - like spending $1 on the Illinois lottery - can still turn out well, at least in the short run. But probability triumphs in the end.”

\(\color{purple}{\textbf{Insurance}}\) is like a lottery. Companies make money if they price their premiums correctly (relative to their expected cost of covering your potential disaster). “You should recognize that insurance will not save you money in the long run. What it will do is prevent some unacceptably high loss…Buying insurance is a ‘bad bet’ from a statistical standpoint since you will pay to the insurance company, on average, more than what you get back…Ironically, someone as rich as Buffett can save money by not purchasing car insurance”.

The underlying risks in \(\color{purple}{\textbf{financial}}\) markets are a different matter. They are not as predictable as a coin flip or a blind beer taste test, as Wheelan points out. He has a fun and interesting take on VaR models. “The false precision embedded in the models created a false sense of security. The VaR was like a faulty speedometer, which is arguably worse than no speedometer at all. If you place too much faith in the broken speedometer, you will be oblivious to other signs that your speed is unsafe. In contrast, if there is no speedometer at all, you have no choice but to look around for clues as to how fast you are really going.” This can easily also be the case with many other statistical models running around different businesses.

Probability doesn’t make mistakes; people using probability make \(\color{purple}{\textbf{mistakes}}\). Wheelan keeps going with his take on VaR and CVaR, Greenspan, and where quants on Wall Street got it wrong. He then summarizes “some of the most common probability-related errors, misunderstandings and ethical dilemmas”. There were three that I particularly enjoyed and summarize below.

Mistakes

One is a \(\color{purple}{\textbf{coin}}\)-flipping exercise with his students. He asks them to stand up and flip a coin, and anyone who gets heads has to sit. Often there is a student still standing after 6 tosses, and he asks things like “What are the best training exercises for flipping so many tails in a row? Is there a special diet that helped you pull off this impressive accomplishment?”. They laugh. “When we see an anomalous event like that out of context, however, we assume that something besides randomness must be responsible”.
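
A quick simulation shows why a "champion" tail-flipper is unremarkable. The class size of 50 is my assumption (the text does not give one); the sketch compares the simulated frequency with the exact probability that at least one student flips six tails in a row.

```python
import random


def someone_still_standing(n_students, n_tosses, rng):
    """True if at least one student flips tails on every single toss."""
    return any(
        all(rng.random() < 0.5 for _ in range(n_tosses))
        for _ in range(n_students)
    )


rng = random.Random(0)  # fixed seed so the run is reproducible
n_students, n_tosses, trials = 50, 6, 10_000  # assumed class size and repetitions

hits = sum(someone_still_standing(n_students, n_tosses, rng) for _ in range(trials))
simulated = hits / trials

# Exact answer: 1 minus the chance that every student fails to get all tails.
exact = 1 - (1 - 0.5**n_tosses) ** n_students

print(f"simulated: {simulated:.3f}   exact: {exact:.3f}")
```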

Another one is the \(\color{purple}{\textbf{prosecutor's fallacy}}\), which teaches us not to ignore the context of the statistical evidence. Imagine the police find a DNA sample at the murder scene that matches a sample taken from the defendant. He had no relationship with the victim and was nowhere near the scene, but he was convicted of a crime years ago, when his sample was taken and included in the database. Wheelan hopes you won’t convict. “The chances of finding a coincidental one in a million DNA match are relatively high if you run the sample through a database with samples from a million people”.
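
The arithmetic behind that last quote is short enough to check. The sketch below takes the quote's numbers at face value (a one-in-a-million coincidental match rate and a database of a million people) and additionally assumes the comparisons are independent, which is my simplification.

```python
match_rate = 1 / 1_000_000  # chance a random innocent person matches by coincidence
database_size = 1_000_000   # number of people the sample is compared against

# Expected number of purely coincidental matches in the whole database.
expected_matches = match_rate * database_size

# Chance of at least one coincidental match, assuming independent comparisons.
p_at_least_one = 1 - (1 - match_rate) ** database_size

print(f"expected coincidental matches: {expected_matches:.1f}")
print(f"P(at least one coincidental match): {p_at_least_one:.3f}")
```

A "one in a million" match sounds damning for a single suspect, but across the whole database a coincidental hit is more likely than not.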

Last one is about statistical discrimination and \(\color{purple}{\textbf{profiling}}\). “When is it okay to act on the basis of what probability tells us is likely to happen, and when is it not okay?”. Take preventing crimes. “How should we react when our probability-based models tell us that methamphetamine smugglers from Mexico are most likely to be Hispanic men aged 18-30 and driving red pick up trucks between 9:00 p.m. and midnight when we also know that the vast majority of Hispanic men who fit that profile are not smuggling?”.

Biases

Garbage in, Garbage out. He covers publication bias, recall bias, survivorship bias, healthy user bias, and of course, selection bias, with funny historical real-life examples. I particularly enjoyed:

\(\color{purple}{\textbf{Selection Bias}}\). An influential weekly news magazine ran a poll to figure out whether Landon, a Republican, or Roosevelt, a Democrat, was going to win the election. They included 10 million prospective voters, a huge sample, which decreases the margin of error, but “as polls with bad samples get larger, the pile of garbage just gets bigger and smellier”. The magazine’s subscribers were wealthier than the average American, lived in households with telephones, and were more likely to vote Republican, so the poll predicted that Landon would win.
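
Here is a small simulation of the "bigger and smellier" point. The electorate is entirely made up: I assume 30% of voters are subscribers who favor candidate A 65% of the time, while everyone else favors A 35% of the time. The sketch compares a huge subscribers-only sample with a random sample of just 1,000.

```python
import random

rng = random.Random(1936)  # fixed seed so the run is reproducible


def make_voter():
    """Return (is_subscriber, votes_for_a) under the assumed shares."""
    subscriber = rng.random() < 0.30
    votes_for_a = rng.random() < (0.65 if subscriber else 0.35)
    return subscriber, votes_for_a


population = [make_voter() for _ in range(200_000)]
true_share = sum(vote for _, vote in population) / len(population)

# Huge but biased sample: every subscriber, and nobody else.
biased = [vote for subscriber, vote in population if subscriber]
biased_share = sum(biased) / len(biased)

# Small but random sample drawn from the whole population.
random_sample = rng.sample(population, 1000)
random_share = sum(vote for _, vote in random_sample) / len(random_sample)

print(f"true share for A:           {true_share:.3f}")
print(f"biased poll (n={len(biased)}):   {biased_share:.3f}")
print(f"random poll (n=1000):       {random_share:.3f}")
```

The biased poll is far larger and still lands far from the truth, while the small random poll gets close.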

\(\color{purple}{\textbf{Publication Bias}}\). A core aspect of statistics is that unusual things happen every once in a while. “If you conduct 100 studies, one of them is likely to turn up results that are pure nonsense - like a statistical association between playing video games and a lower incidence of colon cancer. Here is the problem: The 99 studies that found no link will not get published, because they are not very interesting.”
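
To see how easily pure chance produces a "finding", the sketch below simulates 100 studies in which there is truly no effect. The setup is my own assumption, not the book's: each study compares two groups of 50 drawn from the same normal distribution and calls the result significant at the conventional 5% level using a simple z-test.

```python
import random
from math import sqrt

rng = random.Random(42)  # fixed seed so the run is reproducible


def null_study(n=50):
    """Compare two groups drawn from the SAME distribution.

    Any 'significant' difference is therefore a false positive.
    """
    group_a = [rng.gauss(0, 1) for _ in range(n)]
    group_b = [rng.gauss(0, 1) for _ in range(n)]
    diff = sum(group_a) / n - sum(group_b) / n
    z = diff / sqrt(2 / n)  # z-test with known standard deviation of 1
    return abs(z) > 1.96    # two-sided test at the 5% level


n_studies = 100
false_positives = sum(null_study() for _ in range(n_studies))

# Chance that at least one of the 100 null studies looks significant.
p_at_least_one = 1 - 0.95**n_studies

print(f"'significant' studies out of {n_studies}: {false_positives}")
print(f"P(at least one false positive): {p_at_least_one:.3f}")
```

If only the significant studies get published, the literature ends up showing an effect that does not exist.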

\(\color{purple}{\textbf{Recall Bias}}\). He discusses a study which found that “the diagnosis of breast cancer had not just changed a woman’s present and future; it had altered her past. Women with breast cancer had (unconsciously) decided that a higher-fat diet was a likely predisposition for their disease and (unconsciously) recalled a high-fat diet.” A reason to prefer longitudinal over cross-sectional data.

\(\color{purple}{\textbf{Survivorship Bias}}\). A high-school principal reports that the test scores of a particular cohort of students have risen steadily for four years. It could be that a lot of students are learning nothing and drop out, and these are the ones who would score lower. Nobody is actually getting smarter. Something to think about when you see companies with good employee survey results but also with huge turnover.

\(\color{purple}{\textbf{Healthy User Bias}}\). Imagine health officials promulgating a theory that wearing purple pajamas as a kid stimulates brain development! Years later, longitudinal research confirms a large positive association with success in life, such as 98% of Harvard students having worn them as kids. “Of course, the purple pajamas do not matter; but having the kind of parents who put their children in purple pajamas does matter. Even when we try to control for factors like parental education, we are still going to be left with unobservable differences between those parents who obsess about purple pajamas and those who don’t.”

Inference

He covers the Central Limit Theorem (CLT), hypothesis testing, and devotes a full chapter to a real-life example with polling. His CLT chapter, though based on a fictional example, was entertaining and clear. He tries to figure out whether a crashed bus was going to a sausage festival or to a marathon, based on a sample of the passengers’ heights and weights and on what we know about these for the whole population of the two types of people. The passage where he drops onto the bus like James Bond with the tools needed to weigh and measure, as some sort of statistical hero reporting back to the team, was funny.

\(\color{purple}{\textbf{Central Limit Theorem}}\). He gives a short and good summary at the end of the chapter. Pretty much quoting, for the CLT to apply, the sample sizes need to be relatively large. We also need a relatively large sample if we are going to assume that the standard deviation of the sample is roughly the same as the standard deviation of the population from which it is drawn. The ‘big picture’ here is powerful:

(1) If you draw large, random samples from any population, the means of those samples will be distributed normally around the population mean (regardless of what the distribution of the underlying population looks like). Most sample means will lie reasonably close to the population mean; the standard error (the standard deviation of the sample mean) is what defines ‘reasonably close’.

(2) So, the CLT tells us the probability that a sample mean will lie within a certain distance of the population mean. It says it is relatively unlikely that a sample mean will lie more than two standard errors from the population mean (5%), and extremely unlikely that it will lie three or more standard errors from the population mean (0.3%).

Finally, then note that the less likely it is that an outcome has been observed by chance, the more confident we can be in saying “some other factor is in play”.
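
The two points above are easy to check by simulation. The sketch below is my own illustration: it assumes a deliberately skewed (exponential) population, draws 5,000 random samples of size 100, and measures how many sample means land within two and three standard errors of the population mean.

```python
import random
from math import sqrt
from statistics import fmean

rng = random.Random(7)  # fixed seed so the run is reproducible

# Assumed population: exponential draws, chosen only because the
# distribution is clearly skewed rather than bell-shaped.
population = [rng.expovariate(1.0) for _ in range(100_000)]
mu = fmean(population)
sigma = sqrt(fmean([(x - mu) ** 2 for x in population]))

n = 100                           # size of each sample
standard_error = sigma / sqrt(n)  # standard deviation of the sample mean

# Draw many random samples (with replacement) and record each sample mean.
sample_means = [fmean(rng.choices(population, k=n)) for _ in range(5_000)]


def share_within(k):
    """Share of sample means within k standard errors of the population mean."""
    close = sum(abs(m - mu) <= k * standard_error for m in sample_means)
    return close / len(sample_means)


within_2, within_3 = share_within(2), share_within(3)
print(f"within 2 standard errors: {within_2:.3f}")
print(f"within 3 standard errors: {within_3:.3f}")
```

Even though the population itself looks nothing like a bell curve, roughly 95% of the sample means fall within two standard errors.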

\(\color{purple}{\textbf{Hypothesis Testing}}\). I liked that even though he covers some basic statistics and their distributions around the end of the chapter, he focuses mainly on explaining what hypothesis testing is, what we can and cannot conclude, and the meaning of significance, p-values, power, error types, etc., all in a nutshell, building on his previous fun and simple examples, like the crashed bus.

\(\textbf{Idea}\). “It may seem counterintuitive, but researchers often create a null hypothesis in hopes of being able to reject it…Statistics harnesses the same basic idea [courtroom], but ‘guilty beyond a reasonable doubt’ is defined quantitatively instead. Researchers typically ask, if the null hypothesis is true, how likely is it that we would observe this pattern of data by chance?”.

\(\textbf{Example}\). The Atlanta standardized-testing cheating scandal. “The official who analyzed the data described the probability of the Atlanta pattern [students happened to erase a massive number of wrong answers and replace them with correct answers] occurring without cheating as roughly equal to the chance of having 70,000 people show up for a football game at the Georgia Dome who all happen to be over seven feet tall. Could it happen? Yes. Is it likely? Not much”.

\(\textbf{Conclusion}\). “Officials still could not convict anybody of wrongdoing, just as my professor could not (and should not) have had me thrown out because my final exam grade in statistics was out of sync with my midterm grade. Officials could not prove that cheating was going on. They could, however, reject the null hypothesis that the results were legitimate. And they could do so with a ‘high degree of confidence’”.

\(\color{purple}{\textbf{Polling}}\). He has a full chapter on it to illustrate inference. I liked his analogy to explain why representative samples do the trick and work well to know about the whole population: “If you taste a spoonful of soup, stir the pot, and then taste again, the two spoonfuls are going to taste similar”. And by similar, we mean that there is a standard error, or variation from sample to sample (poll to poll).

\(\textbf{Confidence}\). How can you get more confident about a poll result? I liked how he explained it: “Suppose you tell a friend that you are ‘pretty sure’ that Jefferson was the 3rd or 4th president. How can you become more confident of your historical knowledge? By being less specific. You are ‘absolutely positive’ that Jefferson was one of the first 5 presidents”. You know from the CLT that roughly 68% of sample proportions will lie within one standard error, but 95% lie within two.

\(\textbf{Sample}\). He wraps his polling chapter mentioning that “According to Frank Newport, editor in chief of the Gallup Organization, a poll of 1000 people can offer meaningful and accurate insights into the attitudes of the entire country. Statistically speaking, he’s right. But to get to those meaningful and accurate results, we have to conduct a proper poll and then interpret those results correctly, both of which are much easier said than done. Bad polling results do not typically stem from bad math when calculating the standard errors. Bad polling results typically stem from a biased sample, or bad questions, or both. The mantra ‘garbage in, garbage out’ applies in spades when it comes to sampling public opinion.”
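
The "less specific, more confident" trade-off and the claim about 1,000 respondents can both be seen in a few lines. The sketch below uses the usual standard error for a sample proportion and assumes a 50/50 split, which is the worst case; it also assumes a simple random sample, which, as the quote stresses, is the hard part in practice.

```python
from math import sqrt


def standard_error(p, n):
    """Standard error of a sample proportion p from a simple random sample of n."""
    return sqrt(p * (1 - p) / n)


n = 1000     # poll size mentioned in the quote
p_hat = 0.5  # assumed observed share; 0.5 gives the largest standard error

se = standard_error(p_hat, n)

# Narrow claim, less confidence: within one standard error (about 68%).
interval_68 = (p_hat - se, p_hat + se)
# Wider claim, more confidence: within two standard errors (about 95%).
interval_95 = (p_hat - 2 * se, p_hat + 2 * se)

print(f"standard error: {se:.4f}")
print(f"~68% interval: {interval_68[0]:.3f} to {interval_68[1]:.3f}")
print(f"~95% interval: {interval_95[0]:.3f} to {interval_95[1]:.3f}")
```

With 1,000 respondents the two-standard-error margin is about three percentage points either way, which is why such a small poll can speak for a whole country when the sample is sound.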

He invites us to think at the very least of the following issues when polling: 1) Is this an accurate sample of the population whose opinions we are trying to measure?; 2) Have the questions been posed in a way that elicits accurate information on the topic of interest?; 3) Are respondents telling the truth?.

Regression

He covers many examples of confounding variables, reverse causality, how rejecting the null hypothesis that a beta is zero does not imply that we have confirmed an effect, etc. And of course, also what it means to minimize the sum of squared residuals, with a graphical interpretation, a bit of math, and all. But what caught my interest was his take on linear regression around the beginning of the chapter, which I really liked.

He focuses on what it takes for regression to be \(\color{purple}{\textbf{done properly}}\). “Given adequate data and access to a personal computer, a six-year old could use a basic statistics program to generate regression results. Personal computing has made the mechanics of regression analysis almost effortless. The problem is that the mechanics of regression analysis are not the hard part; the hard part is determining which variables ought to be considered in the analysis and how that can best be done. Regression analysis is like one of those fancy power tools. It is relatively easy to use, but hard to use well - and potentially dangerous when used improperly.”

\(\textbf{Point}\). I liked his emphasis on betas being \(\color{purple}{\textbf{estimates}}\) and on what significance means. “As with polling and other forms of inference, we can calculate a standard error for the regression coefficient. [It] is a standard measure of the likely dispersion we would observe in the coefficient if we were to conduct the regression analysis on repeated samples drawn from the same population.” Basically, the estimated relationship between the two variables (the beta) is expected to vary between samples but not widely, assuming these samples are large and properly drawn from the same population.

\(\textbf{Relevance}\). Another good take is that sometimes a coefficient can be statistically but not \(\color{purple}{\textbf{socially}}\) significant. His example is a finding that whiter teeth are associated with $86 in additional annual income. He notes that: 1) it is not a life-changing amount; 2) it’s probably less than what it would cost to whiten a person’s teeth every year (so not a recommended investment for an individual or policy maker); 3) having perfect teeth may be associated with other personality traits that explain the higher income: “earnings may be caused by the kind of people who care about their teeth, not the teeth themselves”.

\(\textbf{Partial}\). I also liked that he covered the interpretation of betas as partial coefficients and how controlling works. “We can use regression analysis to separate out the independent effect of each of the potential explanatory factors described above. For example, we can isolate the association between race and weight, holding constant other socioeconomic factors like educational background and poverty. Among people who are high school graduates and eligible for food stamps, what is the statistical association between weight and being black?”

And his way of thinking about \(\color{purple}{\textbf{controlling}}\) was clear and simple. “To get your mind around how we can isolate the effect on weight of a single variable, say education, imagine the following situation. Assume that all participants [of a study] are convened in one place…Now assume that the men and women are separated. And then assume that both the men and the women are further divided by height. There will be a room of six-foot tall men. Next door, there will be a room of 6-foot 1-inch men, and so on for both genders.

If we have enough participants in our study, we can further subdivide each of those rooms by income. Eventually, we will have lots of rooms, each of which contains individuals who are identical in all respects except for education and weight, which are the two variables we care about…There will still be some variation in weight in each room; people who are the same sex and height and have the same income will still weigh different amounts - though presumably there will be much less variation in weight in each room than there is for the overall sample.

Our goal now is to see how much of the remaining variation in weight in each room can be explained by education. In other words, what is the best linear relationship between education and weight in each room? The final challenge, however, is that we do not want different coefficients in each ‘room’. The whole point of this exercise is to calculate a single coefficient that best expresses the relationship between education and weight for the entire sample, while holding other factors constant.

What we would like to calculate is the single coefficient for education that we can use in every room to minimize the sum of the squared residuals for all the rooms combined. What coefficient for education minimizes the square of the unexplained weight for every individual across all the rooms? That becomes our regression coefficient because it is the best explanation of the linear relationship between education and weight for this sample when we hold sex, height and income constant.

As an aside, you can see why large data sets are so useful. They allow us to control for many factors while still having many observations in each ‘room’. Obviously a computer can do all of this in a split second without herding thousands of people into different rooms.”
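
The rooms metaphor translates almost literally into code. The sketch below is my own illustration on simulated data, with every number assumed: weight depends on sex, height, and education, and education is deliberately correlated with height so that ignoring the rooms gives a misleading answer. Rooms are defined by sex and height band only (no income), and each room keeps its own average while sharing a single education coefficient.

```python
import random
from collections import defaultdict

rng = random.Random(3)  # fixed seed so the run is reproducible

TRUE_EDU_EFFECT = -1.5  # assumed pounds per extra year of education

people = []
for _ in range(5000):
    sex = rng.choice(["F", "M"])
    height = rng.choice([64, 66, 68, 70, 72])  # inches, in coarse bands
    # Taller people get more schooling here, purely to create confounding.
    edu = rng.choice([10, 12, 14, 16]) + (2 if height >= 70 else 0)
    weight = (100 + (25 if sex == "M" else 0) + 4 * (height - 64)
              + TRUE_EDU_EFFECT * edu + rng.gauss(0, 10))
    people.append((sex, height, edu, weight))


def slope(xs, ys):
    """Least-squares slope of ys on xs (simple one-variable regression)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))


# Naive estimate: everyone in one big hall, no controls.
naive = slope([p[2] for p in people], [p[3] for p in people])

# Send people to rooms of identical sex and height.
rooms = defaultdict(list)
for sex, height, edu, weight in people:
    rooms[(sex, height)].append((edu, weight))

# One coefficient for all rooms: measure education and weight relative to
# each room's own average, then pool the squared residuals across rooms.
num = den = 0.0
for members in rooms.values():
    mean_edu = sum(e for e, _ in members) / len(members)
    mean_weight = sum(w for _, w in members) / len(members)
    for e, w in members:
        num += (e - mean_edu) * (w - mean_weight)
        den += (e - mean_edu) ** 2
controlled = num / den

print(f"naive slope:      {naive:+.2f}")
print(f"controlled slope: {controlled:+.2f}  (assumed truth {TRUE_EDU_EFFECT:+.2f})")
```

The pooled within-room slope is what a regression with a separate intercept for every room would report; a typical regression instead enters sex and height as ordinary variables, which is a less flexible version of the same idea.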

Finally, he has a whole chapter dedicated to common \(\color{purple}{\textbf{regression mistakes}}\), where he covers: using regression to analyze a nonlinear relationship, mistaking correlation for causation, reverse causality, omitted variable bias, highly correlated explanatory variables (multicollinearity), extrapolating beyond data, and data mining (too many variables).

And the last chapter of the book is a brief but good summary of the main approaches in \(\color{purple}{\textbf{program evaluation}}\). He covers natural experiments (when, why, and with examples), nonequivalent controls (that is, when you have treatment and control groups that were not randomized, and how to deal with the resulting biases), and difference in differences, also giving some examples of research and studies. He puts into simple words all the issues that can arise, without using potential outcome notation or any math.

Highlights

Statistics can be plainly lying, or not measuring what you want, or measuring it wrongly, or creating bad incentives for people.
Correlations are useful even when they are not due to causality. Knowing that something or someone moves along with something or someone else is helpful for recs.
You can make a coin-flip decision look as if it has a favorable outcome. Just focus on the segment of the population that favors your point.
Something that looks extraordinary, such as 10 heads in a row, can still be the result of randomness. No other explanation is needed.
Don’t get cocky about an observed fact that you know has a small chance of happening, if you actually have a huge sample at hand.
Bad models are worse than no models. It’s like faith in a broken speedometer. On top: your trust makes you not even look around.
When you read studies, keep in mind that unusual things happen. That could be the case with the single study finding that you are reading!
Wearing purple pajamas as a kid (say a rec by doctors) doesn’t take you to Harvard. But having the kind of parents who follow recs might!
If you are less ambitious in your claim, you can be more confident about what you’re saying. Be more vague and you can be more certain.
Bad polling results don’t stem from bad math but from a biased sample, bad questions, or both. One case of “garbage in, garbage out”.
A kid can get regression results. Mechanics aren’t hard. Knowing which variables ought to be considered and how that can best be done, is!
Controlling and partial coefficients can be thought of as splitting people into rooms and minimizing the residuals (the unexplained part) across all rooms.