In our previous readings, we learned how answering different types of questions can help us better understand the world around us. By answering Exploratory Questions, we can better understand the contours of our problem — where our problem is most acute, whether there are groups who have figured out how to get around the problem on their own, etc. — and by doing so help us prioritize our subsequent efforts. And by answering Passive Predictive Questions, we can help identify individual entities — patients, customers, products, etc. — to whom we may wish to pay extra attention or recommend certain products or services, or we can automate tasks by predicting how a person or more complicated process would have classified an entity.
In both cases, however, answering these questions only helps us better understand the world around us. But to the extent to which, as data scientists, we want to intervene to directly address problems, we are rarely interested in just knowing about the world around us. We want to act on the world, and wouldn’t it be great if data science could provide us with a set of tools designed to help us predict the consequences of our actions?
Enter Causal Questions. Causal Questions ask what effect we can expect from intervening in (that is, actively manipulating) the world around us in some way. For example, if we pay to show an ad to a specific customer, what will the effect of that choice be on the likelihood they buy something on our website? Or if we choose to give a new drug to a patient, what will the effect of that choice be on their disease?
Because of their potential to help us understand the future consequences of our actions, it should come as no surprise that the ability to answer Causal Questions is of profound interest to everyone from companies to doctors and policymakers. At the same time, however, it may also come as no surprise that answering Causal Questions is an inherently challenging undertaking.
In this reading, we will discuss in detail what it means to answer a Causal Question — and why answering Causal Questions is inescapably difficult. Then in our next reading, we will turn to the role that Causal Questions play in the life of a practicing data scientist, and why expertise in causal inference (the discipline of answering Causal Questions) is one of the most valuable skills a data scientist can develop.
What We Mean By “Cause”
To understand what it means to answer a Causal Question, and why answering Causal Questions is intrinsically hard, we must start by taking a step back to answer the question: “what do we mean when we say some action X causes a change in some outcome Y?”
Seriously, what do we mean when we say “X causes Y?” Try and come up with a definition!
While this question may seem simple, it has been the subject of serious academic debate for hundreds of years among philosophers no less famous than David Hume. Indeed, even today there is still debate over how best to answer it.
In this course, we will make use of the Counterfactual Model of Causality (sometimes called the Neyman-Rubin causal model). In plain English, it posits that for “doing X to cause Y”, it must be the case that if we do X, then Y will occur, and if we did not do X, then Y would not occur. This is by far the most used definition of causality today, and yet remarkably, it only emerged in the 20th Century and was only really fleshed out in the 1970s. Yeah… that recently.
At first blush, this definition may seem simple. But its simplicity belies a profoundly difficult practical problem. See, this definition relies on comparing the value of our outcome Y in two states of the world: the world where we do X, and the world where we don’t do X. But as we only get to live in one universe, we can never perfectly know what the value of our outcome Y would be in both a world where we do X and one where we don’t do X for a given entity at a given moment in time. As such, we can never directly measure the causal effect of X on Y for a given entity (say, a given patient or customer) at a given moment in time — a problem known as the Fundamental Problem of Causal Inference (causal inference being what people call the practice of answering Causal Questions).
To illustrate, suppose we were interested in the effect of taking a new drug (our X) on cancer survival (our Y) for a given patient (a woman named Shikha who arrived at the hospital on June 18th 2022). We can give her the drug and evaluate whether she is still alive a year later, but that alone can’t tell us whether the new drug caused her survival according to our counterfactual model of causality — after all, if she survives maybe she would have survived even without the drug! To actually know the effect of the drug on Shikha by direct measurement, we would have to be able to measure her survival both in the world where we gave her the drug and the world where we did not and compare outcomes.
Since we can never see both states of the world — the world where we undertake the action whose effect we want to understand and the world where we don’t — almost everything we do when trying to answer Causal Questions amounts to trying to find something we can measure that we think is a good approximation of the state of the world we can’t actually see.
A quick note on vocabulary: by convention, we refer to the action whose effect we want to understand as a “treatment,” and the state of the world where an entity receives the treatment as the “treated condition.” Similarly, we refer to the state of the world where an entity does not receive the treatment as the “control condition.” We use this language even when we aren’t talking about medical experiments or even experiments at all. We also refer to the state of the world we cannot observe as the “counterfactual” of the world we can observe — so the world where Shikha does not get the cancer drug is the counterfactual condition to the world where Shikha does get the drug.
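To make the Fundamental Problem of Causal Inference concrete, here is a minimal sketch using simulated patients. Everything in it is hypothetical: the survival probabilities, the names `y0` and `y1` (the outcome in the control and treated conditions), and the helper `observe` are illustrative assumptions, not part of any real study. The point is only that a simulation can write down both states of the world, while observed data can never contain both for the same patient.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical simulated patients. Only in a simulation can we write down
# BOTH outcomes; real data never contains both for one patient.
patients = []
for patient_id in range(6):
    y0 = int(random.random() < 0.4)        # survival in the control condition
    # Assumption for illustration: the drug never harms, it can only help.
    y1 = int(y0 or random.random() < 0.3)  # survival in the treated condition
    patients.append({"id": patient_id, "y0": y0, "y1": y1})

# The individual causal effect compares the two states of the world.
true_effects = [p["y1"] - p["y0"] for p in patients]


def observe(patient, treated):
    """Return what a real study could record: one outcome, never both."""
    outcome = patient["y1"] if treated else patient["y0"]
    return {"id": patient["id"], "treated": treated, "observed_y": outcome}


# Each patient ends up in exactly one condition, so the other outcome
# (the counterfactual) is missing from the observed records.
observed = [observe(p, treated=(p["id"] % 2 == 0)) for p in patients]

for record in observed:
    print(record)
```

Notice that `true_effects` can only be computed because the simulation invented both outcomes. Nothing in `observed` is enough to recover it for any single patient.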
It’s at this point most people start throwing out “but what about…”s, and that’s good! You should be, because that’s exactly the kind of thinking you have to do when trying to answer Causal Questions. For example, “what about if we measured the size of Shikha’s tumor before she took the drug and compared it to the size of her tumor after? If the tumor got smaller as soon as she started the drug, then surely the drug caused the tumor to shrink!”
Maybe! Implicitly, what you have done is assert that the size of Shikha’s tumor before we administered the drug is a good approximation for what the size of Shikha’s tumor would have been had we not given her the drug.
But this type of comparison will always fall short of the platonic ideal given by our definition of causality. Yes, Shikha’s tumor may have stayed the same size if we had not given her the drug (in which case the size of the tumor before she took the drug would be a good approximation), but it is also possible that regardless of whether we’d given her the drug, her cancer would have shrunk on its own.[1]
According to the Counterfactual Model of Causality, we could only ever know if taking the drug caused a decrease in tumor size if we could both administer the drug and observe the tumor and also observe a parallel world in which the same person at the same moment in time was not given the drug for comparison. And since we can never see this parallel world — the counterfactual to the world we observe — the best we can do is come up with different, imperfect tricks for approximating what might have happened in this parallel world, like comparing the tumor size before and after we administer the drug, imperfect though that may be.
So does that mean we’re doomed? Yes and no. Yes, it does mean that we’re doomed to never be able to take the exact measurements that make it possible to directly answer a Causal Question. But no, that doesn’t mean we can’t do anything — in the coming weeks, we will learn about different strategies for approximating counterfactual conditions, and in each case we will learn about what assumptions must be true for our strategy to provide a valid answer to our Causal Question. By making the assumptions that underlie each empirical strategy explicit, we will then be able to evaluate the plausibility of these assumptions.
In the case of Shikha, for example, we know that our comparison of tumor size before taking the drug to tumor size after taking the drug is only valid if her tumor would not have gotten smaller without the drug. This is something we can’t measure directly, but we can look to other patients with similar tumors, or the history of her tumor size, to evaluate how often we see tumors get smaller at the rate observed after she took the drug. If it’s very rare for these types of tumors to ever get smaller, then we can have more confidence that a decrease in tumor size was the result of the drug.
We are also sometimes in a position to be more proactive in our effort to answer Causal Questions. Rather than trying to make inferences from the world around us using what is termed “observational data” (data that was generated through a process we did not directly control, a process we only “observe”), we can sometimes generate our own data through randomized experiments.
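The role of that assumption can be illustrated with a small simulation. This is a sketch under invented numbers: the constants `TRUE_DRUG_EFFECT` and `NATURAL_CHANGE`, the tumor sizes, and the function `simulate_patient` are all hypothetical and chosen only to show how a before/after comparison behaves when tumors would have shrunk somewhat on their own.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Assumed values, for illustration only (millimetres).
TRUE_DRUG_EFFECT = -2.0  # the drug shrinks a tumor by 2 mm
NATURAL_CHANGE = -1.0    # tumors also shrink 1 mm on average on their own


def simulate_patient():
    """Return (size before, size after without drug, size after with drug)."""
    before = random.gauss(30.0, 5.0)
    natural = random.gauss(NATURAL_CHANGE, 1.0)
    after_without_drug = before + natural  # the counterfactual we never see
    after_with_drug = after_without_drug + TRUE_DRUG_EFFECT
    return before, after_without_drug, after_with_drug


patients = [simulate_patient() for _ in range(10_000)]

# What a before/after comparison measures: size after the drug minus size before.
before_after_estimate = statistics.mean(
    after_with - before for before, _, after_with in patients
)

# What the counterfactual definition asks for: with the drug versus without it.
true_effect = statistics.mean(
    after_with - after_without for _, after_without, after_with in patients
)

bias = before_after_estimate - true_effect

print(f"before/after estimate: {before_after_estimate:.2f} mm")
print(f"true effect:           {true_effect:.2f} mm")
print(f"bias (natural change): {bias:.2f} mm")
```

In this setup the before/after estimate overstates the drug's effect by roughly the natural change. If `NATURAL_CHANGE` were set to zero, which corresponds to the assumption holding, the two quantities would agree on average.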
Randomized experiments, perhaps the most familiar tool for answering Causal Questions, are also just another way of approximating the unobservable counterfactual condition. In a randomized experiment (also known as a “Randomized Control Trial (RCT)” or an “A/B Test,” depending on who you’re hanging out with), participants are assigned to either receive the treatment (the treatment group) or not (the control group) based on the flip of a coin, a roll of a die, or more commonly a random number generator on a computer. Provided we have enough participants, the Law of Large Numbers then promises that, on average, the people assigned to the control group will (probably) be “just like” the people assigned to the treatment group in every possible way (save being treated). Subject to a few other assumptions we’ll discuss in great detail later, that means that the outcomes of the control group, being just like the treatment group on average, will be a good approximation of what would have happened to the treatment group in a world where they did not receive the treatment.
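The logic of random assignment can be sketched with simulated participants. All of the numbers here are assumptions made up for illustration: `TRUE_EFFECT`, the "health" variable that drives outcomes, and the self-selection rule used as a contrast. The sketch compares a coin-flip assignment with an assignment in which healthier people end up treated, as can happen in observational data.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

N = 20_000
TRUE_EFFECT = 5.0  # assumed treatment effect, identical for everyone

# Each simulated person has a baseline health level that drives their
# outcome without treatment (y0); treatment adds TRUE_EFFECT (y1).
people = []
for _ in range(N):
    health = random.gauss(50.0, 10.0)
    y0 = health + random.gauss(0.0, 5.0)
    y1 = y0 + TRUE_EFFECT
    people.append((health, y0, y1))


def difference_in_means(assignments):
    """Mean observed outcome of the treated minus that of the controls."""
    treated = [y1 for (_, _, y1), t in zip(people, assignments) if t]
    control = [y0 for (_, y0, _), t in zip(people, assignments) if not t]
    return statistics.mean(treated) - statistics.mean(control)


# Randomized experiment: a coin flip decides who is treated.
coin_flips = [random.random() < 0.5 for _ in people]
randomized_estimate = difference_in_means(coin_flips)

# Contrast: healthier people end up treated, so the groups differ
# in more than just the treatment.
self_selected = [health > 50.0 for health, _, _ in people]
self_selected_estimate = difference_in_means(self_selected)

print(f"true effect:            {TRUE_EFFECT:.2f}")
print(f"randomized estimate:    {randomized_estimate:.2f}")
print(f"self-selected estimate: {self_selected_estimate:.2f}")
```

With coin-flip assignment the control group's outcomes stand in well for the treated group's unobserved counterfactual, so the estimate lands near the assumed effect. Under self-selection the comparison mixes the effect of the treatment with the pre-existing health gap between the groups.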
Randomized experiments are not a silver bullet, however. The validity of experimental comparisons still rests on a number of assumptions, many of which cannot be directly tested. For example, we can never be entirely sure that when we randomly assigned people to control and treatment groups, the process was truly random, or that we ended up with people who were similar in both groups (the Law of Large Numbers only promises that getting similar groups becomes more likely as the size of the groups increases, not that it will happen with certainty!). Moreover, conducting a randomized experiment requires working in a context where the researcher can control everything, and that can sometimes generate results that may not generalize to the big messy world where you actually want to act.
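The caveat about group similarity can be seen in a short simulation. This is a sketch with invented values: the baseline "health" distribution, the sample sizes, and the function names `baseline_gap` and `typical_gap` are hypothetical. It repeats a random split many times at each sample size and records how different the two groups' average baseline turns out to be.

```python
import random
import statistics

random.seed(7)  # fixed seed so the sketch is reproducible


def baseline_gap(n_participants):
    """Absolute gap in mean baseline health between two randomly formed groups."""
    baseline = [random.gauss(50.0, 10.0) for _ in range(n_participants)]
    random.shuffle(baseline)  # random assignment: first half treated, rest control
    half = n_participants // 2
    treatment, control = baseline[:half], baseline[half:]
    return abs(statistics.mean(treatment) - statistics.mean(control))


def typical_gap(n_participants, n_experiments=200):
    """Average gap across many repeated hypothetical experiments."""
    return statistics.mean(
        [baseline_gap(n_participants) for _ in range(n_experiments)]
    )


gaps = {n: typical_gap(n) for n in (10, 100, 1000)}

for n, gap in gaps.items():
    print(f"{n:>5} participants: typical baseline gap = {gap:.2f}")
```

The typical gap shrinks as the number of participants grows, but it never reaches exactly zero, and any single small experiment can still produce noticeably unbalanced groups.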
So where does that leave us?
For many data scientists, this will feel profoundly dissatisfying. Many people come to data science because of the promise that it will provide direct answers to questions about the world using statistics. But because of the Fundamental Problem of Causal Inference, this will never be possible when answering Causal Questions. Rather, the job of a data scientist answering Causal Questions is a lot like the job of a detective trying to solve a crime — your task is to determine what probably happened at a crime scene. You can gather clues, collect forensic evidence, and interview suspects, all in an effort to come up with the most likely explanation for a crime. But no matter how hard you try, you can’t go back in time to witness the crime itself, so you will never be able to be entirely sure if you are right or not.
But just as we investigate and prosecute crimes despite our inability to ever be 100% certain an arrested suspect is guilty, so too must businesses and governments make decisions using the best available evidence, even when that evidence is imperfect. It is our job, as data scientists, to help provide our stakeholders with the best available evidence, and also to help them understand the strength of the evidence we are able to provide.
In this reading, we learned, in an intuitive sense, why answering Causal Questions is inherently hard. But this explanation, while accurate, is a little too informal to be rigorous. In the readings that follow, we will be introduced to the Potential Outcomes Framework, the formal statistical framework that underlies the Neyman-Rubin Counterfactual Model of Causality. This framework will help us reason more systematically about how and when methods like randomized experiments, linear regression, matching, and differences-in-differences can help us answer Causal Questions.
But first, in the interest of not losing perspective on the forest for the trees, a discussion of how Causal Questions are used in practice.
[1] The fact that diseases naturally change over time on their own is known as a disease’s “natural history.”