Answering Descriptive Questions: Internal Validity

You have almost certainly encountered some aspects of internal validity in statistics and machine learning courses (Is the model overfit to the data? Are your standard errors correctly calculated? Are you omitting important variables from your regression?), but most statistics courses fail to discuss internal validity in more holistic terms. In this section, I want to take a step back and discuss what we are really trying to accomplish when we answer Descriptive Questions, and how that shapes the way we evaluate “internal validity.”

Whether one uses simple summary statistics (means and medians), plots, or more sophisticated algorithms from the domains of statistical inference and unsupervised machine learning, answering Descriptive Questions always boils down to the same challenge:

Generating answers that (1) are understandable, (2) faithfully represent patterns in the data, and (3) are relevant to the problem one is seeking to solve.

What exactly do these three components mean? Let’s take each in turn.