Beyond The Experiment#

The Role of Causal Inference with Observational Data in Industry

There are few (if any) people who would question the importance of A/B testing in the tech sector. It is used constantly to help companies refine their products, better target their ads, and achieve a cycle of incremental innovation.

While A/B testing is definitely a skill a good data scientist should have in their toolbox, it’s important to understand that it is not the be-all and end-all of causal inference in industry. In this reading, we’ll discuss why it’s important to learn how to think critically about causality when working with observational data.

Note

Remember, “observational data” is data that we gather by passively observing the world – or that somebody else collected by passively observing the world – rather than data that we get by directly manipulating subjects in an experiment.

A/B Testing Isn’t Always Feasible#

There are many circumstances in which you cannot (or should not) use A/B testing. Here are just a handful of situations where you can’t run an A/B test:

  • You Don’t Control The Relevant User Behavior: Suppose you want to know the effect of following a specific person on Instagram or Bluesky on an outcome like user satisfaction. The “ideal experiment” would be to randomly assign some users to follow the person in question and block others from doing so. But forcing people to follow specific accounts is a good way to alienate users. So, you have to find other ways to compare followers and non-followers of a given account for ways they may be different in terms of user satisfaction or behavior. The same problem arises with new features — you can randomly assign users to have access to a new feature, but in many situations, you can’t force them to use it.

  • It’s a One-Shot Deal: Suppose you’re trying to decide whether to open a second store. Since this event is singular, it doesn’t lend itself to experimentation. There’s no way a small business can experimentally open a bunch of “second stores” to see what happens to sales. When dealing with these kinds of big strategic decisions, sometimes the best you can do is look to your company’s past experiences / the experiences of other companies that you think are comparable using the best available tools for observational causal inference.

  • Randomization is Bad Business: It’s one thing to randomize someone’s landing page color scheme; it’s another thing entirely to randomly show different people different prices for the same products. While that’s an accepted practice in some industries (airline travel), it risks alienating customers in other markets (like grocery stores). Similarly, sometimes you want a big rollout – like the launch of a new title on Netflix or the premier of a new movie – and an A/B test would interfere with the excitement of a new launch. (Not an abstract example: here’s Netflix talking about this constraint).

  • Time Horizons are Too Long: A/B testing is popular in part because most tech companies can get data on users so quickly that it’s easy to run experiments quickly. But not all decisions companies make have this same data velocity. A company thinking about opening more stores across the country can’t put its plans on hold, pick a few locations, set up new shops, and wait for sales data. As we read in Kohavi, Tang and Xu, A/B tests work best when we have an Overall Evaluation Criterion (OEC) that is “measurable in the short term (the duration of an experiment) yet believed to causally drive long-term strategic objectives.” When that is not available / the shortest experiment available that would allow you to measure a meaningful OEC would take too long, you need other options!

Going Beyond Testing for the Future#

Another limitation of A/B tests is that we usually use them for small-scale trials to motivate larger future rollouts. Sometimes, however, we are interested in measuring the effect of a single, large intervention.

In many areas of business, it’s increasingly common to hire outside companies to do very complicated projects, like administering government programs or providing a service to employees. In these situations, many companies are moving from paying vendors for inputs (hours worked, money spent) to paying them for results.

This has given rise to the practice of “at-risk contracting.” In at-risk contracting, vendors are paid based on how successfully they achieve results. For example, a hospital might hire a company to reduce in-hospital infections and pay them based on reductions in infections, or a company may hire a vendor to minimize the downtime of their internal network and pay them based on downtime. By tying compensation to the outcome that the hiring party actually cares about, the vendors are well incentivized to do precisely what the hiring company wants them to do.

But it’s critical with at-risk contracting that both parties are confident that there’s a reliable way to measure the effect of the vendor’s efforts. Otherwise people may get under or overpaid.

In practice, it’s unfortunately still the case that many companies just measure outcomes before the vendor starts and compare them to outcomes when the vendor finishes. But if the outcome in question is subject to natural variation – such as seasonality – or general trends that are outside the vendors control, then we would expect the outcome to vary whether the vendor does anything or not. Thus the need for better causal designs, like differences-in-differences.

We Don’t Want To Have A Limited Toolbox#

Professionally, if you only know how to do A/B testing, you risk limiting your opportunities for upward advancement. Big companies are getting better and better at providing analysts with extremely user-friendly A/B testing tools, which means that A/B tests are becoming easier and easier. Twitter, for example, had an internal tool called Duck, Duck, Goose that makes running an A/B test a relatively low-difficulty task (is this still true for X? No idea).

And, while running a good A/B test is never easy (remember, the internal and external validity of our causal estimates will always be dependent on a wide range of assumptions), it is something that more and more people are learning to do well and which companies are learning to make easier. Thus, if all you know how to do are A/B tests, you may find it increasingly hard to differentiate yourself in the data science market.

Again, this isn’t to say you shouldn’t learn to do A/B testing, just don’t only learn A/B testing.

Not enough to be able to do it; you must be able to explain it#

One final reason it’s important to understand how to apply the logic of causal inference to observational data is that even if you only work with A/B data, it is almost inevitable that you will interact with other people in your company who want to use observational data. In these situations, it’s not enough that you understand the considerations that go into making causal inferences from observational data; you also need to be able to explain those to colleagues.