Using Passive Prediction Questions#

In the past few chapters, we explored the role of Exploratory Questions in helping data scientists better understand the problems they seek to solve and prioritize their subsequent efforts. While useful, however, answering Exploratory Questions alone is rarely sufficient to solve a stakeholder’s problem. To really solve problems, data scientists usually need to answer Passive Prediction Questions — the focus of this chapter — and/or Causal Questions (a topic we will return to in future chapters).

Passive Prediction Questions are questions about the future or otherwise unknown outcomes of individual entities (customers, patients, stores, etc.). “How likely is Patient X to experience a heart attack in the next two years?” for example, or “How likely is it that Mortgage Holder Y will fail to make their mortgage payment next month?”

Passive Prediction Questions are usually deployed for one of two business purposes:

  1. identifying individual entities of particular interest (high-risk patients, high-value clients, factory machinery in need of preventative maintenance, etc.), and

  2. automating classification or labeling tasks currently performed by people (reading mammograms, reviewing job applicant resumes, identifying internet posts that violate terms of use).

Unlike Exploratory Questions, data scientists don’t generally come up with “an answer” to Passive Prediction Questions; rather, data scientists answer Passive Prediction Questions by developing statistical models that take the attributes of an entity as inputs and spit out a unique answer for each entity. A company interested in spam detection, for example, might hire a data scientist to develop a model that takes the content of an email as input and, for every email a person receives, answers the question, “If the recipient of this email looked at it, would they consider it spam?” The exact statistical machinery one uses will vary across applications, but answering these questions is the realm where terminology like “supervised machine learning” or “statistical prediction” is most likely to be used.
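To make this concrete, here is a toy sketch of such a model — a hand-rolled, naive-Bayes-style spam scorer with tiny invented training sets. It takes the content of one email as input and returns a per-email answer (a rough spam probability); a real system would train on thousands of labeled emails and use more serious machinery:

```python
import math
from collections import Counter

# tiny invented training sets — real models would use thousands of labeled emails
spam = ["win money now", "free money offer", "claim your free prize now"]
ham = ["meeting at noon tomorrow", "please review the quarterly report", "lunch tomorrow"]

def word_counts(messages):
    counts = Counter()
    for message in messages:
        counts.update(message.lower().split())
    return counts

spam_counts, ham_counts = word_counts(spam), word_counts(ham)
spam_total, ham_total = sum(spam_counts.values()), sum(ham_counts.values())
vocab_size = len(set(spam_counts) | set(ham_counts))

def spam_probability(email):
    """Answer, for one email: 'would the recipient consider this spam?'"""
    log_odds = 0.0  # start from equal prior odds of spam vs. not-spam
    for word in email.lower().split():
        # Laplace smoothing so unseen words don't zero out the score
        p_spam = (spam_counts[word] + 1) / (spam_total + vocab_size)
        p_ham = (ham_counts[word] + 1) / (ham_total + vocab_size)
        log_odds += math.log(p_spam / p_ham)
    return 1 / (1 + math.exp(-log_odds))  # convert log-odds to a probability

print(spam_probability("free money"))        # high — spam-like words
print(spam_probability("quarterly report"))  # low — ordinary work email
```

Note that the model has no single “answer”: it produces a different answer for every email you feed it, which is exactly the point.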

Since Passive Prediction Questions don’t usually have a single answer, a data scientist faced with a Passive Prediction challenge will often start by considering the feasibility of developing a model to give individual-level answers to a Passive Prediction Question. “Given data on new customer behavior on my website,” for example, “can I predict how much a customer is likely to spend over the next year?” Assuming feasibility, however, at the end of the day, what a stakeholder cares about is not whether one can predict future spending; they care about the actual predictions a data scientist can give for each entity — “Given customer 389237’s behavior on my website, are they likely to spend a lot over the next year?”

Types of Passive Prediction Questions#

Passive Prediction Questions come in two types, corresponding to the two primary business purposes detailed above.

The first type of Passive Prediction Question pertains to what will likely happen in the future for a specific individual. Answering these questions is useful for identifying individuals for additional care or attention. For example, a hospital might want to know, “How likely is Patient A to experience complications after surgery?” so they can decide whether the patient should receive extra nursing attention during recovery, or a factory owner might ask, “How likely is this machine to break down in the next month?” to help them determine when to take the machine offline for maintenance. This is the more intuitive kind of Passive Prediction Question, as it accords nicely with the colloquial meaning of the term “prediction.”

The second type of Passive Prediction Question pertains to what would happen in different circumstances. Answering this type of question is the key to automation, as the “different circumstance” in question is often one in which a job is being done by an actual person. For example, a statistical model that can answer the question “if a radiologist were to look at this mammogram, would they conclude it showed evidence of cancer?” is a model that automates the review of mammograms. A model that can answer the question “if the intended recipient of this email were to see it, would they say it is spam?” is an algorithm that automates spam detection.

Differentiating Between Exploratory and Passive Prediction Questions#

If you have not felt at least a little confused about the distinction between Exploratory and Passive Prediction Questions before now, there’s a good chance you will find yourself struggling with it here, and for understandable reasons.

The first thing to emphasize is that the distinction between Exploratory and Passive Prediction Questions is a distinction in one’s goal, not the statistical machinery one might use to achieve that goal.

With Exploratory Questions, our interest is in improving our understanding of patterns in the world to help us understand our problem space, not in making precise predictions for each entity in our data. If we were to use a regression to answer an Exploratory Question, for example, the “answer” to our Exploratory Question would be found in the variables on the right-hand side of our regression, their coefficients, and their standard errors. That’s because those coefficients are what help us understand the factors that contribute to the outcomes we care about. A good model, in other words, doesn’t actually have to explain a large share of variation at the level of individual entities, but it does have to help us understand how different factors contribute to the outcome we wish to understand.

For example, suppose you’re interested in the impact of a college education on earnings. We might try to understand the role of a college education using a model that regresses individuals’ salaries on age, education, and where individuals live. If the coefficient on having a college degree were large and statistically significant, that would tell us a lot about the overall relationship between a college degree and earnings. And this would be true even if the model only explained a small amount of the overall variation in salaries (e.g., the R-squared might only be 0.2). The model, in other words, is able to tell us a lot about average differences in earnings between college graduates and non-college graduates, even though it is not particularly good at telling us the likely salary of any individual person in the data.
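This pattern is easy to reproduce with a small simulation (pure Python, all numbers invented): a $15,000 degree effect buried in much larger unrelated salary variation produces a large, highly significant coefficient alongside a modest R-squared of roughly 0.2:

```python
import math
import random

random.seed(42)

n = 1000
college = [random.random() < 0.5 for _ in range(n)]
# salary = baseline + a large, real degree effect + lots of unrelated variation
salary = [40_000 + 15_000 * has_degree + random.gauss(0, 15_000)
          for has_degree in college]

# simple OLS of salary on the 0/1 college indicator
x = [1.0 if c else 0.0 for c in college]
mean_x, mean_y = sum(x) / n, sum(salary) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, salary)) / sxx
intercept = mean_y - slope * mean_x

residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, salary)]
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((yi - mean_y) ** 2 for yi in salary)
r_squared = 1 - ss_res / ss_tot
std_err = math.sqrt((ss_res / (n - 2)) / sxx)

print(f"degree coefficient: {slope:,.0f} "
      f"(t ≈ {slope / std_err:.1f}), R-squared: {r_squared:.2f}")
```

The coefficient lands near the true $15,000 effect with a t-statistic well above conventional significance thresholds, yet the model leaves most individual-level salary variation unexplained.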

With Passive Prediction Questions, this logic is reversed. With Passive Prediction Questions, we don’t care about how well the model helps us understand patterns in the world; we only care about whether it can make good predictions of some outcome we care about for individual entities in the data. That’s why we care about metrics like AIC, AUC, R-Squared, Accuracy, Precision, Recall, etc. when deciding whether a model does a good job answering a Passive Prediction Question, not the size of the standard errors on the coefficients.
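All of those metrics are computed directly from the model’s individual predictions, with no reference to coefficients at all. As a toy illustration with invented labels (1 = spam, 0 = not spam):

```python
# invented example: actual outcomes vs. a model's predictions (1 = spam)
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 0, 1, 0, 1]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(pairs)   # share of all predictions that were right
precision = tp / (tp + fp)          # of emails flagged as spam, how many were?
recall = tp / (tp + fn)             # of the actual spam, how much did we catch?
print(accuracy, precision, recall)  # 0.75 0.75 0.75
```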

This is also the reason that data scientists are often comfortable using “black box” models when answering Passive Prediction Questions. Black box models are statistical models — like neural networks or random forests — where the way variation in explanatory variables contributes to predictions is opaque to the user. That often doesn’t matter when answering Passive Prediction Questions — since all we care about are the predicted values the model generates — but precisely because the patterns these models rely on to make their predictions can’t be easily understood, they are of little value for answering Exploratory Questions.[1]

The “Passive” in Passive Prediction#

The term “passive” in “Passive Prediction Questions” is meant to emphasize the distinction between Passive Prediction Questions and Causal Questions. Both Passive Prediction Questions and Causal Questions can be thought of as trying to “predict” some future outcome, but they differ in the contexts in which their predictions are valid. A full accounting of the distinction between Passive Prediction Questions and Causal Questions will have to wait until we cover Causal Questions in detail; for the moment, we can get a sense of things by introducing a very casual definition of what it means for some cause X to affect some outcome Y.

In casual parlance, when we say that some factor X causes outcome Y (and that X is not merely correlated with Y), what we usually mean is that if we were to go out and actively change X, Y would change as a result. This isn’t a fully rigorous definition, but it drives home that causation is about what happens when we actively manipulate X.

To see this distinction illustrated, let’s return to the example of a hospital interested in predicting which patients are likely to experience complications after surgery. Using past patient data, you are able to develop a model that very accurately answers the question “Given their pre-surgery vitals, how likely is a patient to experience complications after surgery?” Hooray! The hospital uses this model to determine which patients should get extra nursing visits and extra attention during recovery. You’ve done a great job answering a Passive Prediction Question by discovering a pattern in the world — a set of correlations between measurable variables — you can take advantage of.

Now in the course of developing this model, suppose you discover that one of the strongest predictors of complications after surgery is patient blood pressure — patients with high blood pressure are substantially more likely to experience complications than those with normal blood pressure. This leads you to wonder whether treating patients with high blood pressure with pressure-reducing medications prior to surgery might reduce complications. In other words, you now want to know the effect of going into the world and manipulating patient blood pressure — a Causal Question.

In the first case, you really don’t care if blood pressure is causing the surgical complications, by which we mean you don’t care if reducing blood pressure would reduce complications, or whether high blood pressure is just an easily observable symptom of an underlying condition that is the root cause of surgical complications (like leading a stressful life, or having relationship problems at home). In either case, the correlation is sufficient for your purposes of identifying patients you need to keep tabs on.

But if you want to know what would happen if you directly manipulated blood pressure, knowing that blood pressure and complications are correlated is not sufficient. After all, if living alone results in high blood pressure and difficulty recovering from surgery, then treating patient blood pressure may have no effect at all!
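A small simulation (with invented probabilities) makes the point. In this toy world, living alone drives both high blood pressure and complications, while blood pressure itself does nothing. Blood pressure then strongly “predicts” complications in observational data, yet forcing everyone’s blood pressure down changes the complication rate not at all:

```python
import random

random.seed(0)

def simulate(treat_blood_pressure, n=100_000):
    """Toy world: living alone drives BOTH high blood pressure and
    surgical complications; blood pressure itself has no effect."""
    patients = []
    for _ in range(n):
        lives_alone = random.random() < 0.3  # the hidden root cause
        high_bp = (random.random() < (0.8 if lives_alone else 0.2)
                   and not treat_blood_pressure)  # medication lowers BP...
        # ...but complications depend only on the underlying condition
        complication = random.random() < (0.4 if lives_alone else 0.1)
        patients.append((high_bp, complication))
    return patients

def complication_rate(patients):
    return sum(c for _, c in patients) / len(patients)

untreated = simulate(treat_blood_pressure=False)

# in observational data, blood pressure strongly "predicts" complications...
print(complication_rate([p for p in untreated if p[0]]))      # ~0.29 (high BP)
print(complication_rate([p for p in untreated if not p[0]]))  # ~0.13 (normal BP)

# ...yet intervening on blood pressure leaves the overall rate unchanged
print(complication_rate(untreated))                            # ~0.19
print(complication_rate(simulate(treat_blood_pressure=True)))  # ~0.19
```

The correlation is perfectly useful for deciding which patients to watch closely, and perfectly useless as a guide to what manipulating blood pressure would accomplish.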

When answering Passive Prediction Questions, we are searching for correlations we can leverage to make accurate predictions, not causal relationships we can directly manipulate to shape outcomes. Indeed, those who specialize in answering Passive Prediction Questions (like computer scientists who specialize in supervised machine learning) don’t really care that “correlation does not (necessarily) imply causation.”

xkcd correlation comic