Introduction
Today, arguably more than ever before, the world is governed by the science of probability and statistics. “Big data” is now the norm in scientific research, with terabytes of data streaming into research centers from satellites and experimental facilities, analyzed by supercomputers. “Data mining” is now an essential part of mathematical finance and business management. Numerous public opinion polls, expertly analyzed, guide the political arena. Covid-19 infection rates, immunization levels and R0 (reproduction number) values are a staple of nightly newscasts.
Yet the public at large remains mostly ignorant of the basic principles of probability and statistics. More generally, the public is largely unaware that arguments based on probability and statistics are fraught with numerous potential fallacies and errors, and that unless considerable care is taken, such arguments are almost certainly invalid. As a result, many in the public are increasingly vulnerable to impressive-sounding propaganda that by any standard is false and misleading.
Sadly, even technically trained people often make serious errors in this area. Some common mistakes include:
- Failing to rigorously define the underlying probability space.
- Failing to justify why individual events have equal probability or are independent.
- Failing to correctly reckon conditional probabilities.
- Computing probabilities post facto (after the fact) and claiming remarkable results.
- Selecting only the data and/or tests that confirm a hypothesis and ignoring the rest.
- Overfitting a model, i.e., testing numerous variations of a model on a limited-size dataset and only utilizing the best variation.
- Failing to employ an appropriate statistical measure.
- Failing to properly cite sample size error bounds.
We will present examples of several of these errors below.
One lesson of probability and statistics, when rigorously applied, is that seemingly improbable “coincidences” can and do happen, to a much greater extent than most people realize. For instance, a common classroom exercise is to ask how likely it is that, in a class of, say, 30 students, two or more students have the same birthday (assuming 365 equally likely birthdays in the year). Most students presume this is rather unlikely, but the correct probability is 1 − (1 − 1/365) × (1 − 2/365) × ⋯ × (1 − 29/365) = 0.706316…. In general, a duplicate birthday is more likely than not whenever the class has 23 or more students. For numerous other examples of how seemingly “improbable” events can happen, see [Hand2014, Mlodinow2009] and [Pinker2021].
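For readers who would like to check this, here is a minimal Python sketch (not taken from the cited references) that evaluates the formula above for any class size:

```python
# A minimal sketch of the birthday calculation described above.

def prob_shared_birthday(n, days=365):
    """Probability that at least two of n people share a birthday,
    assuming all `days` birthdays are equally likely."""
    p_all_distinct = 1.0
    for k in range(1, n):
        p_all_distinct *= 1.0 - k / days
    return 1.0 - p_all_distinct

print(f"n = 30: {prob_shared_birthday(30):.6f}")   # 0.706316
print(f"n = 23: {prob_shared_birthday(23):.6f}")   # 0.507297, the first class size above 1/2
```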
Lies, damned lies and statistics
One remarkable example of the misuse of probability and statistics in the public arena arose in the aftermath of the 2020 U.S. presidential election. As Steven Pinker explains in his 2021 book Rationality: What It Is, Why It Seems Scarce, Why It Matters:
Another howler in calculating [probability] conjunctions had a cameo in the bizarre attempt by Donald Trump and his supporters to overturn the results of the 2020 presidential election based on baseless claims of voter fraud. In a motion filed with the US Supreme Court, the Texas attorney general Ken Paxton wrote: “The probability of former Vice President Biden winning the popular vote in the four Defendant States — Georgia, Michigan, Pennsylvania, and Wisconsin — independently given President Trump’s early lead in those States as of 3 a.m. on November 4, 2020, is less than one in a quadrillion, or 1 in 1,000,000,000,000,000. For former Vice President Biden to win these four States collectively, the odds of that happening decrease to less than one in a quadrillion to the fourth power.”
Paxton’s jaw-dropping math assumed that the votes being tallied over the course of the counting were statistically independent, like repeated rolls of a die. But urbanites tend to vote differently from suburbanites, who in turn vote differently from country folk, and in-person voters differ from those who mail in their ballots (particularly in 2020, when Trump discouraged his supporters from voting by mail). Within each sector, the votes are not independent, and the base rates differ from sector to sector. Since the results from each precinct are announced as they become available, and the mail-in ballots are counted later still, then as the different tranches are added up, the running tally favoring each candidate can rise or fall, and the final result cannot be extrapolated from the interim ones. The flapdoodle was raised to the fourth power when Paxton multiplied the bogus probabilities from the four states, whose votes are not independent either: whatever sways voters in the Great Lake State [Michigan] is also likely to sway them in America’s Dairyland [Wisconsin].
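To see how far off the independence assumption leads, consider a toy Python simulation. The tranche sizes and support rates below are purely hypothetical illustrations, not actual 2020 vote data; the point is only that when later-counted ballots come from a systematically different group, an early lead is routinely overturned:

```python
# A toy simulation of vote counting in tranches. The tranche sizes and support
# rates below are hypothetical illustrations, not actual 2020 vote data.
import random

rng = random.Random(2020)
TRIALS = 500
reversals = 0

for _ in range(TRIALS):
    # In-person ballots, counted first: candidate B draws about 47% of 6,000.
    b_early = sum(rng.random() < 0.47 for _ in range(6000))
    # Mail-in ballots, counted later: candidate B draws about 62% of 4,000.
    b_late = sum(rng.random() < 0.62 for _ in range(4000))

    a_leads_early = (6000 - b_early) > b_early                    # the "3 a.m. lead"
    b_wins_final = (b_early + b_late) > 10000 - (b_early + b_late)
    if a_leads_early and b_wins_final:
        reversals += 1

print(f"Early lead for A overturned in {reversals} of {TRIALS} simulated counts")
```

With these made-up rates the reversal happens in essentially every trial, whereas treating each ballot as an independent coin flip conditioned on the early lead would declare such a reversal astronomically improbable.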
Another egregious example was mentioned in Leonard Mlodinow’s book The Drunkard’s Walk: How Randomness Rules Our Lives, commenting on the widely publicized 1995 trial of former U.S. football star O.J. Simpson:
The prosecution made a decision to focus the opening of its case on O.J.’s propensity toward violence against Nicole. Prosecutors spent the first ten days of the trial entering evidence of his history of abusing her and claimed that this alone was a good reason to suspect him of her murder. As they put it, “a slap is a prelude to homicide.” The defense attorneys used this strategy as a launchpad for their accusations of duplicity, arguing that the prosecution had spent two weeks trying to mislead the jury and that the evidence that O.J. had battered Nicole on previous occasions meant nothing. Here is the reasoning of Dershowitz [a member of Simpson’s defense team]: 4 million women are battered annually by husbands and boyfriends in the United States, yet in 1992, according to the FBI Uniform Crime Reports, a total of 1,432, or 1 in 2,500, were killed by their husbands or boyfriends. Therefore, the defense retorted, few men who slap or beat their domestic partners go on to murder them.
True? Yes. Convincing? Yes. Relevant? No. The relevant number is not the probability that a man who batters his wife will go on to kill her (1 in 2,500) but rather the probability that a battered wife who was murdered was murdered by her abuser. According to the Uniform Crime Reports for the United States and Its Possessions in 1993, the probability Dershowitz (or the prosecution) should have reported was this one: of all the battered women murdered in the United States in 1993, some 90% were killed by their abuser. That statistic was not mentioned at the trial.
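The arithmetic behind this point can be sketched in a few lines of Python. Note that the number of battered women murdered by someone other than their abuser is not given in the passage above, so the value used below is a purely hypothetical placeholder, included only to show how different the two conditional probabilities are:

```python
# A minimal sketch of Mlodinow's point. Only the 4 million and 1,432 figures
# appear in the passage above; the number killed by someone other than the
# abuser is a hypothetical input used solely for illustration.
battered = 4_000_000          # women battered annually (figure quoted above)
killed_by_partner = 1_432     # 1992 FBI figure quoted above (about 1 in 2,500)
killed_by_other = 150         # hypothetical illustration only

# Dershowitz's (irrelevant) number: P(murdered by partner | battered)
p_irrelevant = killed_by_partner / battered

# The relevant number: P(killed by the abuser | battered woman was murdered)
p_relevant = killed_by_partner / (killed_by_partner + killed_by_other)

print(f"P(murdered by partner | battered)           = {p_irrelevant:.5f}")  # 0.00036
print(f"P(killed by abuser | battered and murdered) = {p_relevant:.2f}")    # 0.91 with these inputs
```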
Statistical errors in medicine
The field of medical research and practice is waking up to the fact that in many cases, reckonings of conditional probabilities based on medical statistics are either invalid or at least misleading in how they are typically presented to patients. Consider the following example, which is taken from a paper by David Colquhoun:
Imagine that we wish to screen persons for potential dementia. Let’s assume that 1% of the population has dementia, and that we have a test for dementia that is 95% specific (i.e., 95% of persons without the condition will correctly test negative, so the false positive rate is 5%) and 80% sensitive (i.e., 80% of persons who do have the condition will correctly test positive). Now if we screen 10,000 persons, 100 presumably will have the condition and 9,900 will not. Of the 100 who have the condition, 80% or 80 will be detected and 20 will be missed. Of the 9,900 who do not, 95% or 9,405 will be cleared, but 5% or 495 will incorrectly test positive. So out of the original population of 10,000, 575 will test positive, but 495 of these 575, or 86%, are false positives.
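The arithmetic is easy to verify; here is a minimal Python sketch of the same calculation:

```python
# A minimal sketch reproducing the arithmetic of the screening example above.

def false_positive_fraction(population, prevalence, sensitivity, specificity):
    """Fraction of positive test results that are false positives."""
    with_condition = population * prevalence
    without_condition = population - with_condition
    true_positives = with_condition * sensitivity
    false_positives = without_condition * (1.0 - specificity)
    return false_positives / (true_positives + false_positives)

# 10,000 people screened, 1% prevalence, 80% sensitivity, 95% specificity:
print(f"{false_positive_fraction(10_000, 0.01, 0.80, 0.95):.0%}")   # 86%
```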
Needless to say, a false positive rate of 86% is distressingly high, yet this is entirely typical of many applications of statistics in the medical literature. Crucial life-and-death decisions are being made based on such reckonings, so they must be correctly analyzed and reported to patients. Leonard Mlodinow, in his book mentioned above [Mlodinow2009], gave a personal example of an erroneous reckoning of this type:
I went to my doctor on a hunch and took an HIV test. It came back positive. … I later learned that he had derived my 1-in-1000 chance of being healthy from the following statistic: the HIV test produced a positive result when the blood was not infected with the AIDS virus in only 1 in 1,000 blood samples. That might sound like the same message he passed on, but it wasn’t. My doctor had confused the chances that I would test positive if I was not HIV-positive with the chances that I would not be HIV-positive if I tested positive. …
Suppose we consider an initial population of 10,000. We can estimate, employing statistics from the Centers for Disease Control and Prevention, that in 1989 about 1 of those 10,000 heterosexual non-IV-drug-abusing white male Americans who got tested was infected with HIV. Assuming that the false-negative rate is near 0, that means that about 1 person out of 10,000 will test positive due to the presence of the infection. In addition, since the rate of false positives is, as my doctor had quoted, 1 in 1,000, there will be about 10 others who are not infected with HIV but will test positive anyway. The other 9,989 of the 10,000 men in the sample space will test negative.
Now let’s prune the sample space to include only those who tested positive. We end up with 10 people who are false positives and 1 true positive. In other words, only 1 in 11 people who test positive are really infected with HIV. My doctor told me that the probability that the test was wrong — and I was in fact healthy — was 1 in 1,000. He should have said “Don’t worry, the chances are better than 10 out of 11 that you are not infected.” In my case the screening test was apparently fooled by certain markers that were present in my blood even though the virus this test was screening for was not present.
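The same style of calculation as in the screening example above confirms the “1 in 11” figure; a minimal sketch:

```python
# A quick check of the "1 in 11" figure from the HIV example above; the
# false-negative rate is taken to be zero, as assumed in the passage.
population = 10_000
prevalence = 1 / 10_000            # roughly 1 infected man per 10,000 in this group
false_positive_rate = 1 / 1_000    # the figure the doctor quoted

true_positives = population * prevalence                                # about 1
false_positives = population * (1 - prevalence) * false_positive_rate  # about 10

p_infected_given_positive = true_positives / (true_positives + false_positives)
print(f"P(infected | positive test) = {p_infected_given_positive:.3f}")  # about 1/11 = 0.091
```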
Statistical errors in finance
The field of finance is also coming to grips with the fact that it is rife with misuse of probability and statistics. Indeed, such errors are now thought to be a leading reason why investment strategies and funds that look great on paper often fall flat when actually fielded.
A leading reason for such failures is backtest overfitting, namely the deplorable practice, conscious or not, of using historical market data to develop an investment model, fund or strategy, where too many variations are tried relative to the amount of data available. Models, funds and strategies suffering from this type of statistical overfitting typically latch onto random patterns present in the limited in-sample dataset on which they are based, and thus often perform erratically when presented with new, truly out-of-sample data. The sobering consequence is that a significant portion of the models, funds and strategies employed in the investment world, including many of those marketed to individual investors, may be merely statistical mirages. See this 2014 article and this 2021 preprint for further details.
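As a crude illustration of the phenomenon (a synthetic sketch, not a model of any real strategy or dataset), consider selecting the best of several hundred purely random “strategies” backtested on one year of simulated daily returns:

```python
# A toy illustration of backtest overfitting: generate many purely random
# "strategies" over one year of daily data, pick the one with the best
# in-sample Sharpe ratio, then look at fresh out-of-sample data. All numbers
# here are synthetic; nothing is drawn from real market data.
import math
import random

rng = random.Random(1)

def sharpe(returns):
    """Annualized Sharpe ratio of a list of daily returns (zero risk-free rate)."""
    mean = sum(returns) / len(returns)
    var = sum((r - mean) ** 2 for r in returns) / (len(returns) - 1)
    return mean / math.sqrt(var) * math.sqrt(252)

N_STRATEGIES, N_DAYS = 500, 252
in_sample = [[rng.gauss(0.0, 0.01) for _ in range(N_DAYS)]
             for _ in range(N_STRATEGIES)]              # pure noise, no real edge
best = max(range(N_STRATEGIES), key=lambda i: sharpe(in_sample[i]))
print(f"Best in-sample Sharpe ratio  : {sharpe(in_sample[best]):.2f}")  # looks impressive

out_of_sample = [rng.gauss(0.0, 0.01) for _ in range(N_DAYS)]           # fresh noise
print(f"Same 'strategy' out of sample: {sharpe(out_of_sample):.2f}")    # typically near zero
```

Because none of these strategies has a genuine edge, the selected strategy’s out-of-sample returns are just another noise series, and its stellar in-sample Sharpe ratio turns out to be a statistical mirage.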
Other areas of finance that are rife with statistical errors include:
- Technical analysis. Although widespread in the field of finance, “technical analysis” is every bit as pseudoscientific as astrology. Does anyone really believe that low-tech analysis of “trends,” “waves,” “breakout patterns,” “triangle patterns,” “shoulders” and “Fibonacci ratios” (none of which withstand rigorous statistical scrutiny) can possibly compete with the mathematically and statistically sophisticated, big-data-crunching computer programs, operated by successful hedge funds and other large organizations, that troll financial markets for every conceivable trading opportunity? Think again. The bottom line is that technical analysis does not work in the market. See this Mathematical Investor article for additional details.
- Day trading. Another unpleasant truth is that day trading, namely the widespread practice of frequent buying and selling of securities by amateur investors throughout the trading day, does not work either. Study after study has shown that the large majority of day traders lose money, many with spectacular losses; only a tiny fraction regularly earn profits. For example, a 2017 U.C. Berkeley-Peking University study found that even the most experienced day traders lose money, and nearly 75% of day-trading activity is by traders with a history of losses. See this Mathematical Investor article for additional details.
- Market forecasters. The statistical record of market forecasters is, in a word, dismal. According to Hickey’s analysis of market forecasts since 2000, for instance, the average gap between the median forecast and the actual change in the S&P 500 index was 4.31 percentage points, an error of 44%. In 2008 the median forecast was for a rise of 11.1%. The actual performance? A fall of 38.5%, i.e., a whopping error of 49.6 percentage points. Similarly, Nir Kaissar lamented that such forecasts have been least useful when they mattered most. Jeff Sommer, a financial writer for the New York Times, recently summarized the dismal record of 2020 stock market forecasters as follows: “[A]s far as predicting the future goes, Wall Street’s record is remarkable for its ineptitude.” A recent study of 68 market forecasters by the present author and colleagues found accuracy results no better than chance. See this Mathematical Investor article for additional details and discussion.
The reproducibility crisis in science
As mentioned above, even technically trained persons can sometimes be fooled by invalid arguments based on probability and statistics, or can use statistically questionable methods in their own research. One manifestation of this is the growing crisis of reproducibility in various fields of science. Here are just a few recent cases that have attracted widespread publicity:
- In 2012, Amgen researchers reported that they were able to reproduce fewer than 10 of 53 cancer studies.
- In 2013, in the wake of numerous instances of highly touted pharmaceutical products failing or disappointing in practice, researchers in the field began promoting the AllTrials movement, which would require participating firms and researchers to post the results of all trials, successful or not.
- In March 2014, physicists announced with great fanfare that they had detected evidence of primordial gravitational waves from the “inflation” epoch shortly after the big bang. However, other researchers subsequently questioned this conclusion, arguing that the twisting patterns in the data could be explained more easily by dust in the Milky Way.
- In 2015, in a study by the Reproducibility Project, only 39 of 100 psychology studies could be replicated, even after taking extensive steps such as consulting with the original authors.
- Also in 2015, a study by the U.S. Federal Reserve was able to reproduce only 29 of 67 economics studies.
- In an updated 2018 study by the Reproducibility Project, only 14 out of 28 classic and contemporary psychology experimental studies were successfully replicated.
- In 2018, the Reproducibility Project was able to replicate only five of ten key studies in cancer research, with three inconclusive and two negative; eight more studies are in the works but incomplete.
P-hacking
Many of these difficulties with reproducibility derive from the regrettably widespread practice by scientific researchers of p-hacking, namely the practice (conscious or not) of: (a) selecting experimental data that confirms a hypothesis to the desired level of significance, and ignoring other data that do not; or (b) testing numerous hypotheses until one is found that meets the desired level of significance, and ignoring others that do not. As an example, backtest overfitting in finance, mentioned above, can be thought of as the finance field’s version of p-hacking — analyzing many variations of a model, but only fielding the one that scores best on a historical dataset.
The p-test, which was introduced by the British statistician Sir Ronald Fisher in the 1920s, assesses whether the results of an experiment are more extreme than what one would expect given the null hypothesis. However, Fisher never intended the p-value to be a single figure of merit. Indeed, the p-test, used alone, has significant drawbacks. To begin with, the typically used threshold of p = 0.05 is not a particularly compelling level of evidence. In any event, it is highly questionable to reject a result whose p-value is 0.051 while accepting as significant one whose p-value is 0.049.
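A back-of-the-envelope calculation shows how quickly this kind of selective testing manufactures “significant” findings. The sketch below assumes the tests are independent, which is a simplification:

```python
# A minimal sketch of why running many tests and reporting only the
# "significant" one inflates false positives: if m independent tests of true
# null hypotheses are each run at level alpha, the chance that at least one
# comes out "significant" purely by chance is 1 - (1 - alpha)**m.
alpha = 0.05
for m in (1, 5, 20, 100):
    p_spurious = 1 - (1 - alpha) ** m
    print(f"{m:>3} tests: P(at least one spurious 'discovery') = {p_spurious:.2f}")
# 1 test: 0.05,  5 tests: 0.23,  20 tests: 0.64,  100 tests: 0.99
```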
The prevalence of the classic p = 0.05 threshold has led many in the field to wonder which research studies, across a broad range of fields, have been subjected to the sort of post-hoc data manipulation mentioned above. Such suspicions are justified given the results of a study by Jelte Wicherts of the University of Amsterdam, who found that researchers whose results were close to the p = 0.05 level of significance were less willing to share their original data than were researchers whose results had stronger significance levels (see also this summary from Psychology Today).
Along this line, it is clear that a sole focus on p-values can muddle scientific thinking, confusing significance with the size of the effect. For example, a 2013 study of more than 19,000 married persons found that those who had met their spouses online were less likely to divorce (p < 0.002) and reported higher marital satisfaction (p < 0.001) than those who met in other ways. Impressive? Yes, but the divorce rate for online couples was 5.96%, only slightly down from 7.67% for the larger population, and the marital satisfaction score for these couples was 5.64 out of 7, only slightly better than 5.48 for the larger population (see also this Nature article).
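To see how sample size, rather than the size of the effect, drives such p-values, here is a rough two-proportion z-test sketch; the split of respondents into the two groups is an assumption made for illustration, not a figure taken from the study:

```python
# A rough illustration of how large samples make modest effects "highly
# significant." The split of roughly 19,000 respondents into the two groups
# below is assumed for illustration; it is not taken from the study.
import math

def two_proportion_p_value(p1, n1, p2, n2):
    """Two-sided p-value for a z-test of the difference between two proportions."""
    p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return math.erfc(abs(z) / math.sqrt(2))

# Divorce rates of 5.96% vs 7.67%, with hypothetical group sizes:
print(f"p = {two_proportion_p_value(0.0596, 6000, 0.0767, 13000):.1e}")
# The difference is under two percentage points, yet the p-value is tiny,
# which says nothing about whether the effect is practically important.
```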
Physician, heal thyself
The level of statistical illiteracy in the modern world is truly disheartening. As a result of this widespread ignorance, millions of people are easily misled by propaganda that from a technical point of view is demonstrably false and misleading.
What can be done? In the short term, sadly, not much. In the long term, it is clear that basic literacy in probability and statistics needs to be a primary goal of a good education. One bright spot is that many more high schools are now offering courses in statistics and data science. At least 30 high schools in California have offered data science classes for juniors and seniors, in some cases as an alternative to Algebra 2. Carole Sailer, a mathematics teacher at North Hollywood High School in California, explains:
Data science taps into students’ natural reasoning abilities and helps them understand the world. … It doesn’t matter what they want to be — a nurse, a police officer — data science exposes students to state-of-the-art technology and helps them develop their powers of reasoning. It really does inspire kids.
But before those of us in the scientific world (broadly speaking) point too many fingers at the ignorance of the general public in probability and statistics, it is clear from the above examples that scientists have much cleaning up to do on their own in this arena. Fortunately, many research fields are making an effort to tighten up their standards, such as requiring independent expert review of the methods used for data collection and statistical analysis.
Although there is a dearth of published material on the basics of probability and statistics targeted to the general reader, two excellent recent treatments are Steven Pinker’s 2021 book Rationality: What It Is, Why It Seems Scarce, Why It Matters (especially Chapter 4) and Leonard Mlodinow’s 2009 book The Drunkard’s Walk: How Randomness Rules Our Lives, both quoted from above. Another useful reference is The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives, by Stephen Thomas Ziliak and Deirdre N. McCloskey, who outline numerous ways, particularly in medicine and economics, in which statistical methods are misapplied.
Along this line, the American Statistical Association (ASA) has issued a Statement on statistical significance and p-values. The ASA did not recommend that p-values be banned outright, but it strongly encouraged that the p-test be used in conjunction with other methods and not solely relied on as a measure of statistical significance, and certainly not interpreted as the probability that the hypothesis under study is true. The ASA statement concludes,
Good statistical practice, as an essential component of good scientific practice, emphasizes principles of good study design and conduct, a variety of numerical and graphical summaries of data, understanding of the phenomenon under study, interpretation of results in context, complete reporting and proper logical and quantitative understanding of what data summaries mean. No single index should substitute for scientific reasoning.
[This article also appeared on the Math Scholar blog.]