# Answering Descriptive Questions: Internal Validity

You have almost certainly encountered some aspects of internal validity in statistics and machine learning courses (is the model overfit to the data? are your standard errors correctly calculated? are you omitting important variables from your regression?), but most statistics courses fail to discuss internal validity in more holistic terms. In this section, I want to take a step back and discuss what we are really trying to accomplish when we answer Descriptive Questions, and how that shapes the way we evaluate "internal validity."

Whether one uses simple summary statistics (means and medians), plots, or more sophisticated algorithms from the domains of statistical inference and unsupervised machine learning, answering Descriptive Questions always boils down to the same challenge:

**Generating answers that (1) are understandable, (2) faithfully represent patterns in the data, and (3) are relevant to the problem one is seeking to solve.**
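The second component is the one most easily shown in code. As a minimal sketch (the data is Anscombe's quartet, a standard teaching dataset, and the helper name `summarize` is purely illustrative), the following computes the correlation and regression slope for four small datasets. The two summaries come out nearly identical for all four, even though a scatter plot of each would look very different, which suggests that a technically correct summary is not necessarily a faithful one.

```python
from math import sqrt

# Anscombe's quartet: four small datasets constructed to share nearly
# identical summary statistics despite having very different shapes.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_SHARED, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (
        [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
    ),
}


def mean(values):
    return sum(values) / len(values)


def summarize(x, y):
    """Return (Pearson correlation, least-squares slope of y on x)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy), sxy / sxx


summaries = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}
for name, (r, slope) in summaries.items():
    print(f"Dataset {name}: correlation = {r:.2f}, slope = {slope:.2f}")

# Something the two summary numbers hide: how many distinct x values
# dataset IV actually contains.
distinct_x_in_iv = len(set(QUARTET["IV"][0]))
print(f"Distinct x values in dataset IV: {distinct_x_in_iv}")
```

In the fourth dataset every observation but one shares the same x value, so the fitted slope is driven by a single point. Nothing in the correlation or the slope reveals that; only looking at the data does.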

What exactly do these three components mean? Let's take each in turn.


<!-- 
## Netflix Movie Clusters

![Netflix movie suggestion categories](images/netflix_clusters.jpeg)

Netflix doesn’t just need to figure out what movie you would enjoy watching most; it needs you to understand why the movie is something you were likely to enjoy so you buy into it. -->

<!-- This issue isn't limited to simple summary statistics, though. Unsupervised machine learning algorithms have the same problem: ask a clustering algorithm to divide a dataset into three clusters, and it will, even if the differences between the groups are *very* small. And ask a Principal Component Analysis algorithm to find a vector that minimizes the sum of squared distances between all points and the vector, and it will, even if that vector isn't really measuring any kind of central tendency in the data.

Answering Descriptive Questions will not always entail gathering and merging new sources of data; in some cases, answering Descriptive Questions is about making sense of existing data sets by identifying hidden (latent) patterns. This can be accomplished by a range of tools, but this practice is most commonly associated with *unsupervised machine learning* and statistical inference.

Answering Descriptive Questions always boils down to the same challenge: taking datasets that aren't comprehensible in their raw form and identifying **meaningful** patterns in the data that can be summarized in a manner that humans can actually wrap their heads around. And why is this hard? Because by definition, answering Descriptive Questions requires making the incomprehensible comprehensible by identifying *meaningful* patterns. And while we have many tools for calculating specific types of data summaries (means, medians, clustering algorithms, principal component analyses, etc.), none of these can evaluate whether the summaries they generate are *faithful and meaningful* to the question you are seeking to answer. It is therefore up to you, the data scientist, to decide what summaries of the data provide a faithful and meaningful representation of the underlying data. -->
<!-- 
## The Danger of the Helpful Computer

What do we mean by "faithful" summaries? To illustrate, suppose that you have been hired by the US state of Florida to help with a financial problem they're having. They buy a lot of electricity that's generated with natural gas, but as a result, their electricity costs keep bouncing up and down with natural gas prices. As this makes it hard for them to do financial planning, they've asked you to find a financial asset the state can buy that will smooth out these fluctuations. More specifically, they want an asset that will pay out more when natural gas prices are high (so they can use the money to offset their increased electricity costs), and less when natural gas prices are low.

They've given you data on four potential assets, so you analyze each one by computing the correlation between the asset's payout and natural gas prices and by fitting a linear regression of asset payouts on natural gas prices. You find that all four assets have essentially identical relationships with natural gas prices: a correlation of about 0.8 and a regression coefficient of about 0.5, suggesting that when natural gas prices rise by a dollar, asset payouts increase by 0.50 dollars. Perfect, right? All four assets would work equally well, and all four could help limit budget fluctuations for Florida!

Well... no. If we dig a little deeper, we see that these summary statistics are not telling us all the meaningful information in the data; our summary statistics are technically *correct*, but they aren't faithfully representing everything that matters given the problem we want to solve.

We can see this by plotting the data[^anscombesquartet]:

[^anscombesquartet]: Anscombe's quartet. (2022, October 21). [Image from Wikipedia Commons.](https://en.wikipedia.org/wiki/Anscombe%27s_quartet).

![plot of all four regression fits with scatter points](images/anscombes_quartet_naturalgas.png)

Clearly, the relationships between these different assets and natural gas prices are *not* all the same! Buying the asset in the top left would likely do a good job of smoothing out the state's budget, but the asset in the bottom right would be useless in nearly all years, since in most years its payoff doesn't change at all!

This is obviously a simple example and one where a simple plot is sufficient to allow us to see the problem. But this problem is inherent to answering *any* exploratory question—whether we're calculating simple statistics or using sophisticated unsupervised machine learning techniques; when we summarize data, it is our job as data scientists to ensure that our summaries are representing the *relevant* patterns in the data in a faithful and meaningful manner. And because what is relevant depends on the problem we are trying to solve, it's something we as data scientists have to evaluate, not something an algorithm can do for us. -->
<!-- 
--------

## Unsupervised Machine Learning

If you are familiar with machine learning techniques already, you are likely most familiar with *supervised machine learning*, in which we tell our algorithms what we want to do by providing lots of examples of the behavior we want them to emulate in a training dataset. For example, if we want a supervised machine learning algorithm to identify pictures that contain dogs, we might give them lots of pictures with and without dogs that include labels for whether there's a dog in each picture. From these examples, the algorithm then learns to emulate the demonstrated behavior.

The choice of three here is arbitrary, and in practice, one would generally run their clustering algorithm for clusters of 2, 3, 4, etc., and compare the results for patterns that seemed meaningful based on the researchers' medical knowledge.

### Choosing What To Ask and Present

OK... by now you've probably noticed that I've been saying you want to make sure you faithfully represent the "important" and "critical" properties of your data, but I haven't defined those terms. And here's why: there are no objective definitions of these terms. What is *important* depends both (a) on the context, and (b) on the value system of you (the data scientist) and your stakeholder.

#### Context

#### Values

In our previous reading, we talked about how some questions *explicitly* invoke our value systems (prescriptive questions, like "should murderers be eligible for parole?" or "is it fair that Americans with more money can donate unlimited amounts of money to Political Action Committees in the United States?"), while others are ostensibly just questions about objective reality. But now we have to blur that line ever so slightly, because even when dealing with questions about objective reality (descriptive questions), our values come into play. How?

Suppose you are a policymaker choosing between two possible policies for reducing $CO_2$ emissions in the United States. You are told:

- Policy A would reduce $CO_2$ emissions by 95%, have only a minimal impact on unemployment and business profits, and would require a 100 million dollar tax.
- Policy B would reduce $CO_2$ emissions by only 90%, would have a moderate impact on unemployment and business profits, and would require a 200 million dollar tax.

Which would you choose?

Now suppose I also told you that the 100 million dollars in taxes from Policy A would come entirely from taxing people who live below the poverty line, while the 200 million dollar tax for Policy B would be collected from all Americans in proportion to their income. Does that change how you see the issue?

**People tend to make decisions based on the information that is available to them,** and so what questions get asked (and what data is thus presented) can have a *huge* impact on how decisions are made. And as a data scientist, you will often be in control of what questions are being asked, and so it is incumbent upon you to ensure that your stakeholders are presented with all the data that *you* feel is important for them to know.

This is actually one of the big reasons that the lack of diversity in data science is such a problem: it's not that White men are intrinsically misogynistic or racist, but that our life experiences influence what we think is important, and thus what we ask of our data (both authors of this book are White men).

Consider this infamous (though thankfully low stakes) illustrative example of a major tech failure (seriously, [go watch the video](https://www.youtube.com/watch?v=t4DT3tQqgRM)): the camera that only sees White people. We can't know exactly what went wrong, but I think it's safe to say that if there were more Black developers working at HP, surely *one of them* would have stopped to ask the question "does this work as well for Black faces as White faces?" But no one did, and so this product shipped. But OK, HP isn't a very *good* tech company. A better tech company like Google would never make that mistake. Oh wait... [Google's Photos product tagged photos of dark-skinned people as "Gorillas". And how do we know? Yup, because they released that product too.](https://www.theverge.com/2018/1/12/16882408/google-racist-gorillas-photo-recognition-algorithm-ai)

OK, fine, you say—but those are just low-stakes settings. Surely that wouldn't happen when it *counts*. [Cue: facial recognition's differential error rate for men and women, and for people with light versus dark skin tones.](https://sitn.hms.harvard.edu/flash/2020/racial-discrimination-in-face-recognition-technology/) And that technology is being used by police, border agents, and so much more.

To be clear, the problem is not *just* that these companies created discriminatory algorithms—as we'll discuss later in this book, almost any machine learning tool trained on public data will end up reflecting all the racist and misogynistic biases of our society. The problem is that *they shipped the racist products!* No one in these companies thought to stop and ask the question: "hey, before we roll this out, should we check to see how this behaves with people who don't look like our predominantly light-skinned workforce?"

(Yes, these are machine learning examples, not Descriptive Questions per se. Machine learning examples get a lot more press, so it's easier to point to these kinds of news stories. But while the principle that the questions we ask reflect our values is especially important in Exploratory analyses, it also has broad salience, as illustrated here.)

So remember: when deciding what to look at and report in your analyses, people will make decisions based on the patterns *you* have decided are important, so if you don't stop to ask a question about, say, gender or racial bias, odds are it won't be something your stakeholders consider.

This example is awful and the plot doesn't have labels. Need better! -->

<!-- 
simpler representations that we can wrap our heads around to help us understand the world around us. Dropping a giant dataset with the CO2 emissions and latitude and longitude of every power plant in the US alongside a business survey of energy consumption by different companies broken down by building on the desk of the CEO of the environmental non-profit in the example above would clearly not aid them in the slightest to understand the world any better, just as a thumb drive with hundreds of gigs of patient data isn't helpful to a medical researcher in its raw form. The sheer *amount* of data embodied by those datasets is just too great.

A map showing CO2 emissions by region, or a summary of the features that are most common within each patient cluster, by contrast, *is* something humans can wrap their heads around. And the *reason* it is understandable is due in large part to the fact that these are simple representations of the patterns in the data that matter for the question we seek to answer with all the noise that isn't critical removed. In information theoretic terms, we've taken a huge amount of noisy information and reduced it to just the signal we care about, which can be communicated with far less information.

But there's a challenge that's inherent to this type of process of simplification which you may already be noticing: answering Descriptive Questions requires identifying what is important *and throwing everything else away*. And it is for this reason that answering Descriptive Questions well is dependent on the judgment of the data scientist. -->

<!-- clustering algorithms / naming the clusters / Netflix-->

<!-- infections by, say, removing fabric chairs that are hard to disinfect, the location of infections is likely important. If, by contrast, you were brought in by someone studying how the ways antibiotics were prescribed impacted infections, you would instead want to focus on the treatment histories of patients. And if you were hired by the hospital itself which just wanted to reduce infections by any means possible, you'd want to study both to know where future efforts might be best targeted. -->