Using Passive Prediction Questions#

In the past few chapters, we explored the role of Exploratory Questions in helping data scientists better understand the problems they seek to solve and to prioritize subsequent efforts. While useful, however, answering Exploratory Questions is rarely sufficient to solve a stakeholder’s problem in and of itself. To really solve problems, often data scientists must turn to answering Passive Prediction Questions — the focus of this chapter — and/or Causal Questions (a topic we will return to in future chapters).

As discussed in the introduction of this book, Passive Prediction Questions are questions about the future or potential outcomes of individual entities (customers, patients, stores, etc.). “How likely is Patient X to experience a heart attack in the next two years?” for example, or “How likely is it that Mortgage Holder Y will fail to make their mortgage payment next month?”

Unlike Exploratory Questions, data scientists don’t generally come up with “an answer” to Passive Prediction Questions; rather, data scientists answer Passive Prediction Questions by developing statistical models that take the attributes of an entity as inputs and spit out a unique answer for each entity. A company interested in spam detection, for example, might hire a data scientist to develop a model that takes the content of an email as input and, for every email a person receives, answers the question “If the recipient of this email looked at it, would they consider it spam?” The exact statistical machinery one uses will vary across applications, but answering these questions is the realm where terminology like “supervised machine learning” is most likely to be used.
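
To make this concrete, here is a minimal sketch of such a model, using scikit-learn and a handful of invented emails and labels purely for illustration (this is not a production spam filter). The point is simply that the model takes each email’s content as input and returns a separate answer, for every email it is shown, to the question “would the recipient consider this spam?”

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Emails whose recipients have already told us whether they were spam.
emails = [
    "WIN a FREE cruise, click now!",
    "Agenda for tomorrow's team meeting",
    "Lowest prices on medications, limited offer",
    "Can you review my draft before Friday?",
]
is_spam = [1, 0, 1, 0]

# Turn each email's text into numeric features and fit a simple model.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
model = LogisticRegression().fit(X, is_spam)

# The model gives a unique answer for every new email it is shown:
# the estimated probability the recipient would call it spam.
new_email = vectorizer.transform(["Claim your FREE prize today"])
print(model.predict_proba(new_email)[:, 1])
```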

Since Passive Prediction Questions don’t usually have “an answer,” a data scientist faced with a Passive Prediction challenge will often start by considering the feasibility of developing a model to give individual-level answers to a Passive Prediction Question. “Given data on new customer behavior on my website,” for example, “can I predict how much a customer is likely to spend over the next year?” At the end of the day, however, what a stakeholder cares about is not whether one can predict future spending in general; they care about the actual answer to the question “Given this new customer’s behavior on my website, are they likely to spend a lot over the next year?” for each individual customer. For that reason — and because understanding exactly what question a model is answering is key to its effective use — this individual-level question will be our focus here.

Flavors of Passive Prediction Questions#

Passive Prediction Questions come in two flavors, corresponding to the two primary business purposes they serve: identifying individuals who merit extra attention and automating tasks currently done by people.

The first flavor of Passive Prediction Questions is questions about what is likely to occur in the future for a specific individual. Answering these questions is useful for identifying individuals for additional care or attention. For example, a hospital might want to know “How likely is Patient A to experience complications after surgery?” so they can decide whether the patient should receive extra nursing attention during recovery, or a factory owner might ask “How likely is this machine to break down in the next month?” to help them determine when to take the machine offline for maintenance. This is the more intuitive flavor of Passive Prediction Question, as it accords nicely with the normal meaning of the term “predict.”

The second flavor of Passive Prediction Questions is questions about what would happen in different circumstances. Answering this type of question is the key to automation, as the “different circumstance” in question is often one in which a job is being done by an actual person. For example, a statistical model that can answer the question “if a radiologist were to look at this mammogram, would they conclude it showed evidence of cancer?” is effectively a model that automates the review of mammograms. A model that can answer the question “if you showed this email to its intended recipient, would they say it is spam?” is an algorithm that automates spam detection.

OK, but… Doesn’t This Feel a Little Contrived?#

You would be forgiven for asking whether I’ve gone a little too far in trying to force the simple task of automation into the “all data science tools are tools for answering questions” framework of the book. Yes, I see how you could call an algorithm that automates the review of mammograms a tool for answering the question “if a radiologist were to look at this mammogram, would they conclude it showed evidence of cancer?” But that seems awfully convoluted. Why can’t we just call it an algorithm that looks for cancer in mammograms?

The answer is that while it is certainly less succinct, as we discussed in the introduction, because of how these models are developed it is more accurate to say that our mammogram reading algorithm is trying to answer the question “if a radiologist were to look at this mammogram, would they conclude it showed evidence of cancer?” than to say that it’s “looking for cancer.” The way that most statistical models are developed (“trained”) to answer Passive Prediction Questions is using a large dataset of “training examples” — that is, data in which the outcome we care about is included in the dataset. For predicting individual-level future outcomes, this will be historical data in which outcomes — like surgery complications or factory machine failures — have already occurred. But for automation, this will be data in which a person has already completed the task and their conclusions or actions have been recorded. To train a mammogram reading algorithm, in other words, you first need a database of mammograms that radiologists have already labelled as indicating cancer or not.

And this is where the distinction between “predicting what a radiologist would say” and “detecting cancer” becomes important: because this kind of statistical model was trained to emulate the behavior of radiologists in labelled mammograms, any systematic biases held by the human radiologists who reviewed the mammograms in the training data will be recapitulated in the model. Did the radiologists struggle to detect cancer in dense breast tissue? So too will the algorithm trained on their labels.

The tendency for algorithms to replicate human biases present in training data, unfortunately, extends to gender and racial biases. In 2018, for example, Reuters reported that Amazon was forced to scrap an effort to use machine learning to automatically review resumes because it turned out the algorithm — trained on historic hiring data — was biased against women. The exact source of the bias is unclear — for obvious reasons Amazon is not eager to report on the failure — but my suspicion is that things went wrong in one of two ways.

My first guess is that the algorithm was given data on all past Amazon applicant resumes, along with data on which applicants had actually been hired. The algorithm was then effectively trained to answer the question “If an Amazon hiring manager looked at this resume, how likely is it that the applicant would be hired?” In that case, the algorithm was recapitulating the gender bias of previous hiring managers.

My second guess is that the algorithm was given the resumes of current employees along with those employees’ performance reviews. In that case, the algorithm was effectively being asked to answer the question “Given this resume, how likely is it that this person would be rated highly by their managers once employed?” In other words, the algorithm was recapitulating gender biases in employee reviews.

In either case, these are examples of an alignment problem: the people developing these models wanted the algorithm to pick the applicants who would be the most productive employees, but the model they actually developed was trying to identify applicants who looked like the people the existing system had previously favored. Had the previous hiring system been effective and unbiased, this wouldn’t have been a problem; but because the previous system included a gender bias, so too did the resulting algorithm. And because the engineers developing these tools did not think carefully enough about the question the model was actually being taught to answer, the problem was not identified until it was too late.

Differentiating Between Exploratory and Passive Prediction Questions#

Even if you have not felt a little confused about the distinction between Exploratory and Passive Prediction Questions previously, there’s a good chance you find yourself struggling with it here, and for understandable reasons.

The first thing to emphasize is that the distinction between Exploratory and Passive Prediction Questions is a distinction in one’s goal, not the statistical machinery one might use to achieve that goal.

With Passive Prediction Questions, our interest is in the values that get spit out of a model for each entity in the data. When answering a Passive Prediction Question, the only thing we care about is the quality of those predictions, and so we evaluate a model’s success by how good its predictions are (using metrics like AIC, AUC, R-Squared, Accuracy, Precision, Recall, etc.). Thus, when using a logistic regression to answer a Passive Prediction Question, we don’t actually care about what factors are being used to make our predictions, just that they improve the predictions. Our interest is only in the quality of our predicted values, and a good model is one that explains a substantial portion of the variation in our outcome.
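
As a minimal sketch of what that evaluation looks like in practice (using synthetic data in place of a real problem), we might fit a model, hold out some observations, and judge the model entirely by how well its held-out predictions score:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real prediction problem.
X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
pred = model.predict(X_test)
pred_prob = model.predict_proba(X_test)[:, 1]

# What we care about is the quality of these held-out predictions,
# not which coefficients produced them.
print("Accuracy: ", accuracy_score(y_test, pred))
print("Precision:", precision_score(y_test, pred))
print("Recall:   ", recall_score(y_test, pred))
print("AUC:      ", roc_auc_score(y_test, pred_prob))
```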

With Exploratory Questions, our interest is in improving our understanding of the problem space, not in making precise predictions for each entity in our data. Thus, in the example of logistic regression, our interest is in the factors on the “right-hand side” of our logistic regression and how they help us understand what shapes outcomes, not the exact accuracy of our predictions. A good model, in other words, doesn’t actually have to explain a large share of variation at the level of individual entities, but it does have to help us understand our problem space.

A model that looked at the relationship between individuals’ salaries and their age, education, and where they live might tell us a lot about the importance of a college degree to earnings (which we could see by the large and statistically significant coefficient on having a college degree), even if it only explains a small amount of overall variation in salaries (e.g., the R-Squared might only be 0.2).
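
Here is a minimal sketch of that Exploratory use of the same regression machinery, with simulated salary data standing in for a real survey. The thing we read off is the coefficient on having a college degree (and its statistical significance), not the accuracy of the individual salary predictions:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data standing in for a real salary survey.
rng = np.random.default_rng(0)
n = 5_000
college = rng.binomial(1, 0.4, size=n)
age = rng.integers(22, 65, size=n)
salary = 30_000 + 15_000 * college + 400 * age + rng.normal(scale=25_000, size=n)
df = pd.DataFrame({"salary": salary, "college": college, "age": age})

results = smf.ols("salary ~ age + college", data=df).fit()

# Our interest is in the coefficient on college and its precision...
print(results.params["college"], results.pvalues["college"])

# ...and the model can be informative even though its R-squared is modest.
print(results.rsquared)
```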

This distinction also has important implications when working with more opaque supervised machine learning techniques, like deep learning, random forests, or SVMs. These techniques are often referred to as “black boxes” because exactly how different input factors relate to the predictions the model makes is effectively impossible to understand (in other words, it’s as though the input data goes into a black box we can’t see into, and predictions magically pop out the other side). These models can be very useful for answering Passive Prediction Questions, as they can accommodate very unusual, non-linear relationships between input factors and predicted values, but because those relationships are opaque to us as data scientists, they don’t really help us understand the problem space.
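
As a small illustration of that trade-off (again with synthetic data), a random forest may predict very well, but the closest thing it offers to an explanation is a vector of relative feature importances rather than the kind of interpretable coefficient a regression gives us:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Strong held-out predictions...
print(roc_auc_score(y_test, forest.predict_proba(X_test)[:, 1]))

# ...but no coefficients to interpret, only relative importance scores.
print(forest.feature_importances_[:5])
```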

The “Passive” in Passive Prediction#

The term “passive” in “Passive Prediction Questions” is meant to emphasize the distinction between Passive Prediction Questions and Causal Questions. Both Passive Prediction Questions and Causal Questions can be thought of as trying to “predict” some future outcome, but they differ in the contexts in which their predictions are valid. A full accounting of the distinction between Passive Prediction Questions and Causal Questions will have to wait until we have covered Causal Questions in detail; for the moment, we can get a sense of things by introducing a very casual definition of what it means for some cause X to affect some outcome Y.

In causal parlance, when we say that some factor X causes outcome Y (and that X is not merely correlated with Y), what we usually mean is that if we were to go out and actively change X, Y would change as a result. This isn’t a fully rigorous definition, but it drives home that causation is about what happens when we actively manipulate X.

To see this distinction illustrated, let’s return to the example of a hospital interested in predicting which patients are likely to experience complications after surgery. Using past patient data, you are able to develop a model that very accurately answers the question “Given their pre-surgery vitals, how likely is a patient to experience complications after surgery?” Hooray! The hospital uses this model to determine which patients should get extra nursing visits and extra attention during recovery. You’ve done a great job answering a Passive Prediction Question by discovering a pattern in the world — a set of correlations between measurable variables — that you can take advantage of.

Now in the course of developing this model, suppose you discover that one of the strongest predictors of complications after surgery is patient blood pressure — patients with high blood pressure are substantially more likely to experience complications than those with normal blood pressure. This leads you to wonder whether treating patients with high blood pressure with pressure-reducing medications prior to surgery might reduce complications. In other words, you now want to know the effect of going into the world and manipulating patient blood pressure — a Causal Question.

In the first case, you really don’t care if blood pressure is causing the surgical complications, by which we mean you don’t care if reducing blood pressure would reduce complications, or whether high blood pressure is just an easily observable symptom of an underlying condition that is the root cause of surgical complications (like leading a stressful life, or having relationship problems at home). In either case, the correlation is sufficient for your purposes of identifying patients you need to keep tabs on.

But if you want to know what would happen if you directly manipulated blood pressure, knowing that blood pressure and complications are correlated is not sufficient. After all, if living alone results in high blood pressure and difficulty recovering from surgery, then treating patient blood pressure may have no effect at all!
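
A small simulation makes the point concrete (the numbers below are invented purely for illustration). In this simulated world, an unobserved factor drives both blood pressure and complications, so blood pressure predicts complications even though lowering it would accomplish nothing:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# An unobserved root cause (e.g., chronic stress from living alone).
stress = rng.normal(size=n)

# Blood pressure rises with stress but, in this simulated world,
# has no causal effect on complications of its own.
blood_pressure = 120 + 15 * stress + rng.normal(scale=5, size=n)

# Complications are caused by stress alone.
p_complication = 1 / (1 + np.exp(-(2 * stress - 2)))
complications = rng.binomial(1, p_complication)

# Blood pressure is clearly correlated with complications, so it is
# useful for answering the Passive Prediction Question...
print(np.corrcoef(blood_pressure, complications)[0, 1])

# ...but an intervention that lowered blood pressure while leaving
# stress untouched would leave the complication rate unchanged,
# because complications never depended on blood pressure directly.
```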

When answering Passive Prediction Questions, we are searching for correlations we can leverage to make accurate predictions, not causal relationships we can directly manipulate to shape outcomes. Indeed, those who specialize in answering Passive Prediction Questions (like computer scientists who specialize in supervised machine learning) don’t really care that “correlation does not (necessarily) imply causation”: for their purposes, a reliable correlation is all they need.

[Figure: xkcd “Correlation” comic]

How do Large Language Models (LLMs) Fit Into This?#

Given their emergence as one of the most high-profile examples of something people call “AI” these days, it’s worth directly addressing how LLMs like ChatGPT, Llama, Bard, etc. fit into this framework.

One of the most powerful ways to understand LLMs is to think of them as tools for answering the question “If I came across this text [whatever text is in the LLM’s prompt, pre-prompts, or other inputs in the model’s context window] while wandering around the internet,[1] what word am I most likely to encounter next?” LLMs then ask this question over and over, adding each newly selected word to the context window one at a time and then feeding the updated “conversation” back into the model as input to help it predict the next word. Indeed, this repetitiveness is why this style of text generation is called autoregressive: the model keeps adding a word to the “conversation” you are having (its “context window”), then feeding the updated conversation back into itself as an updated input.
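
In pseudocode, that loop looks something like the sketch below; `predict_next_word` is a hypothetical placeholder standing in for the (enormous) statistical model itself, not a real library call:

```python
# A schematic sketch of autoregressive text generation.
def generate(prompt_words, predict_next_word, max_new_words=50):
    context = list(prompt_words)  # the model's "context window"
    for _ in range(max_new_words):
        # "Given everything in the context so far, what word is most
        # likely to come next?"
        next_word = predict_next_word(context)
        if next_word == "<end-of-text>":
            break
        # Add the chosen word to the conversation and feed the updated
        # conversation back in on the next pass.
        context.append(next_word)
    return " ".join(context)
```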

To be clear, this is a little reductionist. First, LLMs are able to abstract away from literal text to something like the meaning of words — they recognize that “pet” and “dog” have similar meanings in the sentences “I took my dog for a walk” and “I took my pet for a walk” — but even these abstractions are the result of looking for common patterns across existing text. And second, LLMs are also “finetuned” by having humans rate responses. But these nuances don’t change the fact that fundamentally LLMs are tools for recapitulating text and ideas that already exist in the world, with a strong bias towards what humans have tended to write most on the internet — a truth that both explains some of their power and helps to explain their fundamental limitations (e.g., their lack of any inherent fidelity to the truth, the fact they reflect society’s gender and racial biases, etc.).