(1) Understandable#
Generating answers that are (1) understandable, (2) faithfully represent patterns in the data, and (3) are relevant given the problem one is seeking to solve.
Answering Exploratory Questions effectively is all about making the patterns in our incomprehensively complicated world legible to people. To do so, we take large datasets that are too complicated to understand in their raw form, identify salient structure and patterns in this data, and summarize them in a way that allows us to communicate those patterns to other people. The method by which we make structures in the data understandable will vary across applications — summary statistics, regression coefficients, scatter plots, or other visualizations — but the goal of identifying and communicating information about salient patterns is always the same.
Professionals from different disciplines often use different terminology to describe this process of summarization. Some like to refer to it as “separating the signal (the thing that’s important) from the noise (all the other variation that doesn’t matter),” others talk about “dimensionality reduction” (basically linear algebra speak for summarization), while still others may talk about “modeling the underlying data generating process that gave rise to the observed data.” Regardless of the terminology one uses, however, these all boil down to the same thing: filtering and discarding the variation the data scientist deems to be irrelevant to make it easier to see and understand the variation deemed important.
The importance of researcher discretion in deciding what variation to discard as noise and what variation to foreground as “important” is one of the defining challenges of answering Exploratory Questions. Other types of questions — like Passive Prediction Questions — often involve using more mathematically sophisticated modeling tools, and consequently are viewed as more challenging. In my experience, however, learning to understand the stakeholder’s problem context and the variation in a data set well enough to exercise this discretion effectively is actually one of the things young data scientists struggle with most. It requires both good domain knowledge to understand what is meaningful (as we will discuss below), and also for the data scientist to spend a lot of time exploring the data thoughtfully and from different perspectives. This is a hard skill to learn,[1] but with intentionality, patience, and practice, it is a talent that once learned will helps set you apart from the average pytorch-jockey.
Summarizations created to answer Exploratory Questions can differ radically in their ambition. At one end of the spectrum are simple summary statistics, like means, median, and standard deviations. These seek to provide a simple characterization of a single feature of a single variable. Slightly more ambitious are various basic data visualizations — like histograms (which are substantially richer than the aforementioned summary statistics) or scatter plots and heatmaps (which provide substantial granularity and communicate information about the relationship between different variables). And the most ambitious efforts make use of multivariate regressions and unsupervised machine learning algorithms to make inferences about the Data Generating Process (DGP) — the actual physical or social processes that gave rise to the data you observe, and which (hopefully) can be represented in a relatively parsimonious manner, much as the relatively simple laws of physics give rise to the orbits of the planets and the complexity of life.
To illustrate what I mean by trying to deduce something about the data-generating process, suppose you are a medical researcher interested in a poorly understood disease like Chronic Fatigue Syndrome (CFS). It is generally agreed that CFS is more of a label for a constellation of symptoms than an understood physical ailment, and you have a hypothesis that the symptoms of CFS aren’t actually caused by a single biological dysfunction, but rather that multiple distinct biological dysfunctions give rise to similar symptoms that we have mistakenly grouped under this same umbrella term. In other words, you think that the data-generating process that gives rise to patients diagnosed with Chronic Fatigue Syndrome consists of two distinct diseases.
You’re fortunate enough to have detailed patient data on people diagnosed with the condition, but it’s impossible for you to just look at these gigabytes of thousands of patient records and “see” any meaningful patterns. You need a way to filter out irrelevant data to identify the “signal” of these two conditions. To aid you in this question, you decide to ask “If you were to group patients into two groups so that the patients in each cluster looked as similar as possible, but patients in different clusters looked as dissimilar as possible, how would you group these patients?”
This, you may recognize, is precisely the question clustering algorithms (a kind of unsupervised machine learning algorithm) are designed to answer! So you apply your clustering algorithm to the patient data and get back a partition of the patients into two distinct groups. This, in and of itself, doesn’t constitute a particularly understandable summarization of your data, but it provides a starting point for trying to investigate diagnostically and biologically relevant differences that exist between these populations. If one cluster included more patients reporting fatigue when doing any exercise, while another cluster reported they felt better when they exercised, but felt a high level of baseline fatigue that didn’t respond to sleep, that might suggest that the data-generating process for these patients was actually driven by two different biological processes. And it gives you a great starting point to prioritize your subsequent investigations into what might explain these differences!