Answering Exploratory Questions: Internal Validity

While you have almost certainly encountered some aspects of internal validity in statistics and machine learning courses (Is the model overfit to the data? Are your standard errors correctly calculated? Are you omitting important variables from your regression?), most statistics courses fail to discuss internal validity in more holistic terms. In this section, I wish to take a step back to discuss what we are really trying to accomplish when we answer Exploratory Questions, and how that shapes the way we evaluate “internal validity.”

Whether one uses simple summary statistics (means and medians), plots, or more sophisticated algorithms from the domains of statistical inference and unsupervised machine learning, answering Exploratory Questions always boils down to the same challenge:

Generating answers that (1) are understandable, (2) faithfully represent patterns in the data, and (3) are relevant to the problem one is seeking to solve.

What exactly is meant by these three components? Let’s take each in turn.