Causal Questions, Experiments, and External Validity#

Randomized experiments — be they the Randomized Controlled Trials (RCTs) of medical research or the A/B tests used by e-commerce companies — are often considered the gold standard for answering Causal Questions. When it comes to internal validity, randomized experiments do tend to outperform other research designs. But it is also important to understand that experiments often achieve high internal validity at the expense of external validity. In this reading, we will discuss some of the most common external validity issues that arise in randomized experiments. In particular, we will discuss two classes of challenges to external validity that we often see when designing and implementing experiments: challenges that arise over time and challenges that arise with scaling.

External Validity and Time Effects#

A major challenge with experiments is that there is often a desire for them to generate quick answers, especially in the tech industry. Most experiments have a short lifespan: even when a company plans to deploy an intervention for a very long time, it usually evaluates that intervention with a relatively short experiment. To a certain extent, this makes sense — the longer an experiment runs, the longer a potentially valuable change is postponed. But the desire to use experiments to generate quick answers can backfire, giving rise to results that don’t hold up in deployment.

Novelty and Primacy Effects#

Within our first class of challenges (temporal), there are two main types of effects to be aware of: novelty effects and primacy effects.

Novelty effects occur when people engage with a new product or feature due to its, well… novelty, but eventually lose interest. This may occur with a feature on a website, or with a physical product (like the Apple Vision Pro, that new exercise bike you bought, your gym membership, etc.). It is a very common reason that experiments may overstate the appeal of new products.

At the other end of the spectrum are primacy effects. Primacy effects occur when users take time to adjust to a new product.[1] When these effects are present, experiments will tend to underestimate user interest in a new product. Consider, for example, every time you’ve visited a beloved website, discovered it had a new design, and felt furious for a week before deciding it was fine after all.
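To make these two patterns concrete, here is a minimal simulation sketch in Python. Everything in it is an illustrative assumption rather than data from any real product: we posit a per-day treatment effect that decays toward zero (novelty) or starts negative and climbs toward a positive long-run value (primacy), then compare what a two-week experiment would estimate against the average effect over a year of deployment.

```python
import numpy as np

rng = np.random.default_rng(42)


def daily_effect(day, kind):
    """Hypothetical per-day treatment effect (all numbers are illustrative)."""
    if kind == "novelty":
        # Enthusiasm starts high (+2.0) and fades toward zero.
        return 2.0 * np.exp(-day / 10)
    # Primacy: users start annoyed (-1.0) and adjust toward a long-run +1.0.
    return 1.0 - 2.0 * np.exp(-day / 10)


def run_experiment(n_days, kind, n_users=2_000, noise_sd=5.0):
    """Mean treatment-minus-control difference over an n_days experiment."""
    daily_diffs = []
    for day in range(n_days):
        treated = daily_effect(day, kind) + rng.normal(0, noise_sd, n_users)
        control = rng.normal(0, noise_sd, n_users)
        daily_diffs.append(treated.mean() - control.mean())
    return float(np.mean(daily_diffs))


for kind in ["novelty", "primacy"]:
    short_run = run_experiment(14, kind)   # a typical two-week experiment
    long_run = run_experiment(365, kind)   # a full year of deployment
    print(f"{kind:>8}: 14-day estimate = {short_run:+.2f}, "
          f"1-year average = {long_run:+.2f}")
```

Under these assumed dynamics, the two-week estimate overstates the long-run effect in the novelty scenario and understates it in the primacy scenario, even though each measurement is internally valid.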

Primacy effects can be particularly pernicious because when a product is rolled out at scale, people may talk about it, which may help users learn to use it effectively. But that’s a phenomenon that is unlikely to happen during a small experiment, which brings us nicely to scaling effects.

External Validity and Scaling#

Achieving high external validity when answering Causal Questions is difficult for precisely the same reason answering a Causal Question is difficult: a stakeholder is thinking about taking an action, but the action is too dangerous to “just try” at full scale. As a result, when you are asked to answer a Causal Question, you will likely be asked to answer it through a study that is very small relative to the eventual deployment your stakeholder is considering. For example, YouTube isn’t going to let you test a change to how creators are paid for views by rolling it out to the whole website for a week, or even by letting you roll it out to its biggest channels. You will be limited to experimenting on a relatively small subset of users or creators. You will probably be able to measure the effect of the change among the people in the experiment, but whether that measured effect matches what you would see if the change were applied globally depends on how the experiment scales.

Ecosystem Changes (aka General Equilibrium Effects)#

One of the biggest challenges around scaling is that experiments may induce changes in the behavior of the individual users in an experiment, but they are unlikely to cause changes in the broader ecosystem of users who might respond to a full-scale change in a product. For example, an experiment that changes how product listings appear for a few hundred customers on Amazon won’t change what types of products sellers list on Amazon. But, if Amazon changed how product listings appear for all users on Amazon, you can be certain sellers would respond by changing what they sell and how they sell it. Similarly, a change in the TikTok algorithm rolled out to 1% of users won’t impact the kind of videos creators make, but global changes in the algorithm will absolutely change how creators design their content.

Economists refer to these kinds of second-order changes in behavior as “general equilibrium effects” (as distinct from “partial equilibrium effects,” the direct effects of a change in a system). Once you start to look for them, you see them everywhere.

The magnitude of general equilibrium effects usually varies with the size of the initial stimulus, which is why small experiments tend to underestimate the general equilibrium effects that emerge during full-scale rollouts. But ecosystem changes also take time, so experiments that run for only days or weeks will tend to miss them as well — another example of how the distinction between “time effects” and “scale effects” can be a little blurry.
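The scale dependence can also be illustrated with a toy model. The sketch below assumes, purely for illustration, that the per-user effect of a change has two parts: a direct effect and a spillover from sellers who adapt in proportion to the fraction of users who see the change. The `seller_response` parameter is a made-up stand-in for a general equilibrium effect, not an estimate from any real marketplace.

```python
import numpy as np

rng = np.random.default_rng(0)


def measured_effect(rollout_fraction, n_treated=10_000,
                    direct=1.0, seller_response=4.0, noise_sd=3.0):
    """Per-user effect measured when `rollout_fraction` of users are treated.

    The seller_response term is a crude stand-in for general
    equilibrium effects: sellers are assumed to adapt in proportion
    to how many users see the change.
    """
    true_effect = direct + seller_response * rollout_fraction
    outcomes = true_effect + rng.normal(0, noise_sd, n_treated)
    return float(outcomes.mean())


for frac in [0.01, 0.10, 1.00]:
    print(f"rollout = {frac:4.0%}: "
          f"measured per-user effect = {measured_effect(frac):.2f}")
```

With these illustrative numbers, a 1% experiment recovers roughly the direct effect alone (about 1.0), while a full rollout delivers a per-user effect several times larger (about 5.0): the experiment measures exactly what it claims to measure, just not what the stakeholder will deploy.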

The problems of experiments, external validity, and general equilibrium effects are not unique to e-commerce or tech platforms. To illustrate, consider the case of a public health experiment in Rajasthan, India.[2] India’s public health care system operates under a universal healthcare model: every household is close to a free (or nearly free) government facility, and on average, households are within two kilometers of the nearest clinic. However, absenteeism is a major problem in Rajasthan’s government medical facilities, leading many residents to depend on expensive private practitioners instead. So, in cooperation with health officials and a nongovernmental organization (NGO), a team of researchers designed an experiment in which the NGO set up timeclocks to monitor the attendance of nurse midwives at rural government clinics. The government then used the attendance data to dock the pay of nurses who failed to show up at their clinics. During the first six months of the study, the experiment proved a huge success, with attendance nearly doubling.

But in a follow-up conducted 16 months later, the authors discovered that “general equilibrium effects” had come into play, and nurses dissatisfied with fines had managed to mobilize against the system. In the authors’ words, “After the first 6 months, however, the local health administration deliberately undermined the incentive system. The result was that, 16 months after program inception, there was no difference between the absence rates in treatment and comparison centers; both were extremely high (over 60%).”[3]

In this case, of course, we know about the general equilibrium effect because the experiment was relatively large — it covered 72 clinics (33 “treated”) serving 135 villages — and was left running for a year and a half. But the main point — that these effects are not at all unique to big tech and are likely to be overlooked by most experiments — still stands.

Wrapping Up#

Experiments are incredible. We have all benefited from lessons learned through medical experiments, and no other research design is as effective at convincing the general public of an intervention’s effectiveness. But they are not the be-all and end-all of answering causal questions. Even when experiments achieve high internal validity through effective randomization and careful management of spillover effects, their results may not generalize. Learning to recognize both the power and the perils of experiments is an important part of being a thoughtful data scientist, and hopefully the basic typology of common external validity concerns presented here will help you on that journey.