Using Causal Questions#
In our past readings, we’ve learned about the value of both Exploratory and Passive Prediction Questions for solving problems.
Exploratory Questions help us better understand the contours of our problem — where it is most acute, whether some groups have figured out how to get around it on their own, etc. This, in turn, helps us identify where to prioritize our subsequent efforts.
Passive Prediction Questions have two main purposes. First, they help us to identify individual entities — patients, customers, products, etc. — to whom we may wish to pay extra attention or recommend certain products. Second, models built to address Passive Prediction Questions can also be used to automate tasks by predicting how a person would have classified an entity or behaved in a given setting.
In both cases, however, there is a drawback: answering these questions only helps us understand the world around us, not what impact our actions on that world will have. But as data scientists, we will often want to act to address the problems that motivate us. Wouldn't it be great if there were also a set of tools designed to help us predict the consequences of our own actions?
Enter Causal Questions. Causal Questions ask what effect we can expect from our actions. For example, “What effect will changing the interface of our website have on online sales?” or “What effect will prescribing a drug have on a patient?”
Because they help us understand the consequences of actions we might take, it should come as no surprise that the ability to answer Causal Questions is of profound interest to everyone from companies to doctors and policymakers. At the same time, however, it will also come as no surprise that answering Causal Questions is an inherently challenging undertaking.
In this reading, we will discuss both where Causal Questions arise in practice and a workflow for answering them, but we will gloss over the nuances of how exactly we answer Causal Questions. In our next reading, we will turn from workflows to theory and discuss in detail what it actually means to measure the effect of an action \(X\) (e.g., administering a new drug to a patient or showing an ad to a user) on an outcome \(Y\) (patient survival, customer spending, etc.). This section may feel a little abstract and woo-woo at times, but please hang in there. Answering Causal Questions is as much about critical thinking as it is about statistics, and understanding the concepts introduced here will be critical to your success in this domain.
When Do Causal Questions Come Up?#
Causal Questions arise when stakeholders want to do something — buy a Super Bowl ad, change how the recommendation engine in their app works, authorize a new prescription drug — but fear the action they are considering may be costly and may not actually work. In these situations, stakeholders will often turn to a data scientist in the hope that the data scientist can provide greater certainty about the likely consequences of different courses of action before the stakeholder is forced to act at scale. This, in turn, helps to reduce the risk the stakeholder has to bear when making their decision — something all stakeholders appreciate.

Usually, the action the stakeholder is considering will not have been chosen at random. Rather, a stakeholder will generally pose a Causal Question because they have some reason to suspect a given course of action may be beneficial. Indeed, Causal Questions often arise in response to patterns discovered while answering Exploratory or Passive Prediction Questions.
Where Causal Questions Come From — An Example#
Suppose the Chief of Surgery at a major hospital is interested in reducing surgical complications. The hospital begins by asking, “What factors predict surgical complications?” (a Passive Prediction Question) and developing a model that allows it to identify patients who are likely to experience complications during recovery.
While developing this model, the hospital discovers that blood pressure is one of the strongest predictors of surgical complications — patients with high blood pressure are substantially more likely to experience complications than those with normal blood pressure.
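To make the setup concrete, here is a minimal sketch of what a model like this might look like. Everything here (the features, the coefficients, and the simulated data) is invented for illustration, but it captures the shape of the hospital's Passive Prediction exercise:

```python
# A hypothetical sketch of the hospital's Passive Prediction model:
# predict surgical complications from pre-operative measurements.
# All features, coefficients, and data are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000

# Invented patient features: age (years) and systolic blood pressure (mmHg).
age = rng.normal(60, 10, n)
blood_pressure = rng.normal(130, 15, n)

# Invented outcome: complication risk rises with both features.
logit = -12 + 0.03 * age + 0.06 * blood_pressure
complication = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = np.column_stack([age, blood_pressure])
model = LogisticRegression(max_iter=1000).fit(X, complication)

# Blood pressure comes out as a strong predictor of complications...
print(dict(zip(["age", "blood_pressure"], model.coef_[0])))
# ...but "strong predictor" is not the same as "treating blood pressure
# will reduce complications," as the next paragraphs explain.
```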
This discovery leads the Chief to wonder whether she could reduce surgical complications by treating patients who have high blood pressure with pressure-reducing medications before surgery. In other words, the Chief wants to know, "What effect would treating patients' high blood pressure have on surgical complication rates?"
Rather than just giving all patients with high blood pressure new drugs (and delaying their surgeries while the drugs take effect), the Chief wants you to provide a rigorous answer to her question. After all, high blood pressure may cause surgical complications, in which case the blood pressure medication may indeed reduce complications. But it might also be that high blood pressure is merely a symptom of some third factor that causes both high blood pressure and surgical complications. For example, some lower-income patients may lead stressful lives, which can raise blood pressure, and may also have difficulty taking time off to recover after surgery, which can lead to complications. In that case, high blood pressure would still be useful for identifying patients likely to experience surgical complications, but treating it wouldn't reduce complications, since those patients would still be unable to take time off after surgery.
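To see why the Chief is right to be cautious, here is a minimal simulation sketch of that second scenario. The variable names and all of the numbers are invented: a hidden factor ("stress") drives both high blood pressure and complications, so blood pressure strongly predicts complications even though intervening on it accomplishes nothing.

```python
# Hypothetical simulation: a confounder ("stress") causes both high blood
# pressure and surgical complications; blood pressure itself has no
# causal effect on complications in this simulated world.
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

stress = rng.binomial(1, 0.3, n)                      # hidden confounder
high_bp = rng.binomial(1, 0.2 + 0.5 * stress)         # stress raises blood pressure
complication = rng.binomial(1, 0.05 + 0.25 * stress)  # stress alone drives complications

# Blood pressure is a strong *predictor* of complications...
print(complication[high_bp == 1].mean())  # ~0.20
print(complication[high_bp == 0].mean())  # ~0.08

# ...but a drug that normalized everyone's blood pressure would change
# nothing, because complications here depend only on stress.
complication_if_treated = rng.binomial(1, 0.05 + 0.25 * stress)
print(complication_if_treated.mean(), complication.mean())  # both ~0.125
```

In this simulated world, a Passive Prediction model that flags high-blood-pressure patients works perfectly well, while the intervention the Chief is considering does nothing at all. Only answering the Causal Question can distinguish between the two possibilities.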
And so, a Causal Question is born!
The Causal Question Workflow#
Before we dive into the technical details of answering Causal Questions, it’s worth providing a high-level overview of how data scientists approach answering them.
Identify Relevant Previous Studies#
Once a Causal Question has been posed, the first step is to identify any research that has already been done that may help answer it. It's hard to overstate how often data scientists overlook this step, even though it's a no-brainer once you think of it: there's no reason to spend days or weeks designing a study to answer a question if someone else has already put in the time and money to do it for you!
If your stakeholder works in public policy or medicine, the first place to look for previous studies is in academic medical or policy journals. But even if you aren't working on a medical or public policy question, don't assume you won't be able to find an answer in academic or pseudo-academic publications — lots of data scientists present research done at private companies at "industry" conferences like the MIT Conference on Digital Experimentation (CODE@MIT) or the NetMob Cellphone MetaData Analysis Conference!
And if you are at a company, ask around! Someone at your own company may have investigated a similar question before, and talking to them could save you a lot of effort.
Evaluate Previous Studies#
If you do find relevant studies, then for each one you will have to ask yourself two questions:
1. Did the study's authors do a good job of answering the Causal Question in the context they were studying?
2. Do I believe that the context in which the study was conducted is similar enough to my own context that their conclusions are relevant to me?
The first question is about the internal validity of the study, and we'll talk at length about how to evaluate internal validity in the context of causal inference in the coming weeks. The second question is about the external validity (i.e., the generalizability) of the study to your context. There are lots of extremely well-conducted studies in the world that seek to answer the same question you are asking, but if they investigated the effect of a new drug in young patients and your hospital only treats very old patients, you may not be comfortable assuming their results are good predictors of what might happen in your hospital.
Plan a New Study#
If you are unable to find any studies that satisfactorily answer your Causal Question (either on their own or in combination), then it may be time to run a study of your own!
When most people think about answering Causal Questions, their minds immediately jump to randomized experiments. Randomized experiments are often the best strategy for answering Causal Questions, but they are not always the right choice.
Studies designed to answer Causal Questions can be divided into roughly two types: experimental studies and observational studies.
In an experimental study, a researcher controls everything that happens, including who enrolls in the study and who is assigned to the treatment group versus the control group. Examples of experimental studies include nearly all clinical trials, A/B tests in which the version of a website or app a user sees is randomly determined, and field experiments in which, say, voters are randomly assigned to receive different types of mailers from political campaigns to measure their effect on voter turnout.
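As a concrete (and entirely made-up) illustration, here is a sketch of the logic of an A/B test. Because a coin flip decides who sees the new version, the treatment and control groups are statistically identical apart from the treatment itself, so a simple difference in average outcomes estimates the causal effect:

```python
# A hypothetical A/B test: random assignment makes the difference in
# group means an unbiased estimate of the causal effect.
# All numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

treated = rng.binomial(1, 0.5, n)    # coin-flip assignment to the new interface
baseline = rng.normal(20, 5, n)      # spending each user would have done anyway
spending = baseline + 2.0 * treated  # assume the new interface adds $2 per user

effect = spending[treated == 1].mean() - spending[treated == 0].mean()
print(f"Estimated effect: ${effect:.2f}")  # close to the true $2.00
```

The same difference in means computed on observational data, where users choose for themselves which interface to use, would mix the causal effect together with any pre-existing differences between the groups.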
In an observational study, by contrast, researchers use data from a context in which they did not control who was treated and who was not. Examples include public opinion surveys, data on user behavior and demographics, and census data.
(We say studies can be divided into roughly two types because some studies fall into a category sometimes called "quasi-experimental." In these studies, researchers did not control who was treated and who was not, but they have some reason to think that something in the world — like a chance storm or a draft lottery — effectively randomized who was treated. These types of studies tend to be more relevant for academics than applied data scientists, however, and evaluating them is incredibly difficult, so we will largely ignore them in this text.)
While it is sometimes believed that only experimental studies can generate valid answers to Causal Questions, this is unequivocally untrue, as is the slightly more generous version of this claim — that experimental studies always constitute the best evidence for answering Causal Questions. As we will explore in great detail in the coming readings, the validity of conclusions drawn from both experimental and observational studies rests on whether a number of fundamentally untestable assumptions hold. As a result, both types of studies are capable of providing meaningful answers to Causal Questions, and both are capable of being deeply misleading.
Moreover, while experimental studies often (but not always) have greater internal validity (they are better able to ensure that they have measured the true causal effect in the setting they studied), this often comes at the expense of lower external validity: ensuring that researchers control who is and is not treated usually requires that the study take place in a highly monitored, often artificial and unrealistic setting. Observational studies, by contrast, are often based on data collected in the real world and, as a result, may yield answers that tell us more about what is likely to happen in our own real-world application, even if they have somewhat lower internal validity.
Wrapping Up and Next Steps#
Hopefully, this reading has given you a better sense of how Causal Questions are used to solve stakeholder problems and of when and where they come up in the life of a practicing data scientist. In the readings that follow, we will turn first to the details of the Potential Outcomes Model, a rigorous statistical framework for formalizing the Counterfactual Model of Causality. This framework not only offers a presentation of counterfactual causality that will appeal to those who draw intuition from mathematical formalism, but also provides machinery we can use to evaluate how much confidence we should have in answers generated by different methods of answering Causal Questions — including both experimental and observational studies.