# Passive Prediction Questions and Internal Validity

As with Exploratory Questions, when answering Passive Prediction Questions there are two major types of concerns: those related to internal validity, and those related to external validity.

## Internal Validity

Of all the places where data science is fragmented, none is more evident than in how data scientists evaluate how well a model represents their data, especially when the focus is prediction.

The first data science perspective on evaluating the internal validity of a model comes from the field of statistics. Statisticians have approached evaluating model fit with, unsurprisingly, methods based on the idea of random sampling and the properties of statistical distributions: they make assumptions about the distributions underlying the data and use those to derive theoretically motivated metrics. That's the origin of statistics like the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), as well as the emphasis on the validity of the standard errors assigned to the variables on the right-hand side of a regression.
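
To make this concrete, here is a minimal sketch of where those metrics show up in practice. The library choice (Python's `statsmodels`) and the simulated data are illustrative assumptions on my part, not something prescribed by the text:

```python
# A minimal sketch: fit an OLS regression and read off the
# theoretically motivated fit metrics statisticians favor.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))                    # two simulated predictors
y = 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=200)

model = sm.OLS(y, sm.add_constant(X)).fit()      # add_constant adds the intercept
print(model.aic)   # Akaike Information Criterion
print(model.bic)   # Bayesian Information Criterion
print(model.bse)   # standard errors on the right-hand-side coefficients
```

Note that AIC and BIC are computed from the fitted model alone (using distributional assumptions about the residuals), with no held-out data involved, which is exactly what distinguishes this tradition from the one described next.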

When computer scientists were first developing their own machine learning techniques… I'm editorializing a little here, but I think it's safe to say that initially they either didn't know a lot about these metrics, or they thought they could do a better job inventing their own. So they developed the "train-test split" approach to model evaluation: split the data into two parts, train the model on one part, then test how well the model predicts the (known) outcomes in the other (the test dataset).
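
Here is a minimal sketch of that idea, assuming scikit-learn's `train_test_split` as the implementation (a choice of convenience on my part, not the only way to split the data):

```python
# A minimal sketch of the train-test split: evaluate the model only on
# data it never saw during fitting.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=200)

# Hold out 25% of the observations as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)          # fit on training data only
print(mean_squared_error(y_test, model.predict(X_test)))  # score on held-out data
```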

Of course, over time these two fields have largely converged in adopting one another's methods, and some, like cross-validation, live comfortably in the middle. But if you're ever wondering why, when you get to a machine learning class, it seems like everything you learned in stats has been abandoned (or why, in a stats class, you have the opposite experience), it's largely an artifact of the parallel development of model evaluation methods in computer science and statistics departments.
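
For completeness, here is a minimal sketch of k-fold cross-validation, the method that lives in the middle: rather than a single split, every observation takes a turn in both the training and the test role. Again, the library and data here are my illustrative assumptions:

```python
# A minimal sketch of 5-fold cross-validation: fit on 4/5 of the data,
# score on the held-out 1/5, rotate through all five folds.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=200)

scores = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error"
)
print(-scores.mean())  # average held-out MSE across the five folds
```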

But the ins and outs of fitting machine learning and statistical models are not the comparative advantage of this text, and so our time is better spent turning to External Validity.