Solving the Right Problem#
In Douglas Adams’ comedic sci-fi classic Hitchhiker’s Guide to the Galaxy, a race of hyperintelligent pandimensional beings set out to build a massive supercomputer the size of a city to solve the mysteries of the cosmos once and for all. When they turned on the computer, named Deep Thought, they announced:
“The task we have designed you to perform is this. We want you to tell us… the Answer!”
“The Answer?” said Deep Thought.
“The Answer to what?”
“Life!” urged one designer.
“The Universe!” said another.
“Everything!” they said in chorus.
Deep Thought paused, then answered, “Life, the Universe, and Everything. There is an answer. But,” Deep Thought added, “I’ll have to think about it.”
Seven and a half million years later, when Deep Thought had finally finished its calculations, the descendants of those designers assembled to learn the result of their ancestors’ work.
“Er …good morning, O Deep Thought,” said Loonquawl [one descendant] nervously, “do you have … er, that is …”
“An answer for you?” interrupted Deep Thought majestically. “Yes. I have.”
The two [descendants] shivered with expectancy. Their waiting had not been in vain.
“There really is one?” breathed Phouchg [the other descendant].
“There really is one,” confirmed Deep Thought.
“To Everything? To the great Question of Life, the Universe and Everything?”
“Yes. […] Though I don’t think,” added Deep Thought, “that you’re going to like it.”
[…]
“All right,” said the computer, and settled into silence again.
The two fidgeted.
The tension was unbearable.
“Forty-two,” said Deep Thought, with infinite majesty and calm.
“Forty-two!” yelled Loonquawl. “Is that all you’ve got to show for seven and a half million years’ work?”
“I checked it very thoroughly,” said the computer, “and that quite definitely is the answer. I think the problem, to be quite honest with you, is that you’ve never actually known what the question is.”
“But it was the Great Question! The Ultimate Question of Life, the Universe and Everything,” howled Loonquawl.
“Yes,” said Deep Thought with the air of one who suffers fools gladly, “but what actually is it?”
A slow stupefied silence crept over the men as they stared at the computer and then at each other.
“Well, you know, it’s just Everything … everything …” offered Phouchg weakly.
“Exactly!” said Deep Thought. “So once you do know what the question actually is, you’ll know what the answer means.”[1]
In addition to establishing the premise for one of the greatest comedic science fiction novels in human history,[2] I feel this passage perfectly exemplifies the three things that cause most data science projects to fail.
The first is that we are often so enamored with technology that we have absolute faith that if we just throw our problems at it, it will solve them for us. But it won’t. Without our thoughtful, critical guidance, it can’t. It never has, and it never will.
The second is the reason this blind faith doesn’t work: technology couldn’t care less whether we’ve actually figured out what we hope it will do for us, or whether what we’re asking for will actually solve a real problem. It will do what it has been asked to do — no more and no less. Garbage in, garbage out.
But the third reason data science projects fail is the most subtle: although Deep Thought clearly recognizes the idiocy of the request it has been given by Loonquawl and Phouchg, it doesn’t do anything about it. And that, unfortunately, is the cardinal sin committed by most young data scientists — they fail to recognize that helping the stakeholder properly specify their problem is a core part of the job.
At this point, you may be thinking “well, isn’t that their problem? They’re the ones who asked me to do the wrong thing!” And… yes, in some sense it is. But it’s also yours. Once you leave the classroom, you will no longer be evaluated on the complexity of your model or the aesthetics of your visualizations—you’ll be evaluated on whether you’ve made your stakeholder’s life better. And even if it was the stakeholder who originally misspecified their need, if you fail to correct that error and deliver a result that doesn’t help them, that failure is all that will be remembered.
In the next chapter, we will discuss the concept of “stakeholder management”: specific ways you can work with your stakeholder to refine and improve your mutual understanding of the problem you are seeking to solve before you—like the Deep Thought computer in Hitchhiker’s Guide to the Galaxy—spend weeks dutifully grinding away at a misspecified problem, only to deliver a result to your stakeholder that turns out not to be as helpful as expected.
Don’t have an obvious stakeholder with whom you can have this type of conversation? Well, stick with me — many of the suggested questions and conversational strategies detailed below are ones I’ve often used in conversation with myself in my academic research, and I assure you they work almost as well when talking to the voices in your head as with another person.
How then should you — the young data scientist — go about ensuring your efforts are well spent?
How will you know if you’ve solved the problem?#
What does success look like?
Abstract the Problem#
So how do you help your stakeholder better understand their problem?
If I could offer only one piece of advice on how to approach a sticky problem, it would be this: rephrase the problem in a more general manner that abstracts away from the specifics. It’s difficult to overstate how often a “unique” sticky problem becomes very straightforward once you realize it’s a special case of a more general type of problem, or once you realize that your stakeholder has (often unknowingly) introduced constraints to the problem that aren’t actually constraints.
Perhaps my favorite example of this comes from a talk given by Vincent Warmerdam at PyData 2019.
The World Food Program (WFP) is a global leader in food aid provision. As Vincent tells the story — which he reports having heard at an Operations Research Conference — the WFP was struggling with an extremely difficult data science problem: how best to get food from the places it was being grown/stored to the people who needed it most. Essentially, the WFP would receive reports of needs from communities facing food insecurity. One community might report a need for bread and beef, while another might request lentils and meat. The WFP would compile these requests and then set about trying to determine the most efficient way to meet these needs.
This type of logistics problem is notoriously difficult (it’s essentially a version of the Traveling Salesman Problem, which is NP-Hard, if that means anything to you), the kind of problem companies like FedEx and UPS buy supercomputers to address. But this particular problem was made even more challenging by all the different types of food the WFP was trying to provide to different communities.
What the WFP realized was that they didn’t actually need to provide bread to the village asking for bread. See, humans don’t need bread to avoid starvation — they need a certain number of calories, a certain amount of protein, and a handful of other nutrients.[3] So when a village asks for bread, rice, or wheat, you can instead think of them as asking for carbohydrates. And a village asking for beef or beans is actually asking for protein and iron. So by simply abstracting the task from “How best can we meet all these food requests?” to “How best can we meet the nutritional needs indicated by these requests?” the WFP was able to dramatically reduce the number of constraints imposed on the logistical optimization problem it needed to solve, making its task far simpler.
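To make the value of this reframing concrete, here is a minimal sketch of what the reframed problem might look like as a small linear program. All of the numbers, foods, and nutrient requirements below are hypothetical, and this toy is not the WFP’s actual model; the point is simply that once requests are expressed as nutrient requirements, the constraints are counted in nutrients rather than in distinct foods.

```python
# A toy sketch (hypothetical numbers) of reframing "ship the foods requested"
# as "meet the nutrients required" for a single community.
from scipy.optimize import linprog

# Decision variables: tonnes of wheat and beans shipped to one community.
cost = [250, 400]  # hypothetical cost per tonne delivered: [wheat, beans]

# Nutrients supplied per tonne (hypothetical): [calories (millions), protein (kg)]
nutrients_per_tonne = [
    [3.4, 120],  # wheat
    [3.3, 220],  # beans
]

# The community's nutritional requirements (hypothetical).
required = [10.0, 500]  # 10 million calories, 500 kg of protein

# linprog minimizes cost subject to A_ub @ x <= b_ub, so we negate the
# "nutrients delivered >= nutrients required" constraints.
A_ub = [[-nutrients_per_tonne[f][n] for f in range(2)] for n in range(2)]
b_ub = [-r for r in required]

result = linprog(c=cost, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(result.x)  # tonnes of [wheat, beans] that meet the nutrients at least cost
```

With only a couple of nutrient constraints instead of a long list of specific food requests, the solver is free to meet the same underlying needs with whichever foods are cheapest to deliver.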
Make Your Questions Specific and Actionable#
In developing your questions, it is important to make them specific and actionable. A specific and actionable question makes it very clear what you need to do next. For example, suppose an international aid organization told you they were worried that urbanization in Africa, Asia, and Latin America was impacting efforts to reduce infant mortality. Some examples of specific, actionable questions are: “Is infant mortality higher among recent migrants to urban centers, controlling for income?” or “Are the causes of infant mortality among recent migrants to urban centers different from those living in rural areas?” Reading those questions, you can probably immediately think of what data you’d need to collect and what regressions you’d want to run to answer them.
Vague questions, by contrast, would be “Is urbanization impacting efforts to reduce infant mortality?” or “Does urbanization affect infant mortality?” Note that when you read these, they don’t obviously imply a way forward.
Perhaps the best way to figure out if your question is answerable is to write down what an answer to your question would look like. Seriously – try it. Can you write down, on a piece of paper, the graph, regression table, or machine learning diagnostic statistics (complete with labels on your axes, names for variables, etc.) that would constitute an answer to your question? If not, it’s probably too vague.
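To illustrate, here is a hedged sketch of what “writing down the answer” might look like in code for the first specific question above. The file name and column names (infant_mortality, recent_migrant, income) are hypothetical placeholders; the point is that a specific, actionable question translates almost directly into a single regression whose output you could sketch on paper before collecting any data.

```python
# A sketch of the regression implied by "Is infant mortality higher among
# recent migrants to urban centers, controlling for income?"
# The file and column names below are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical household-level survey data, one row per household.
df = pd.read_csv("household_survey.csv")

# If you can write this model down before touching any data, your question
# is probably specific enough to be actionable.
model = smf.ols("infant_mortality ~ recent_migrant + income", data=df).fit()

# The coefficient on recent_migrant (and its confidence interval) is the
# "answer" you would sketch on paper.
print(model.summary())
```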
Iterate#
And here’s the last but perhaps most important step: iterate. Bring your work back to your stakeholder as often as possible.
Many stakeholders find the idea of data science mysterious and abstract and will struggle to understand what is and is not feasible. Bringing them intermediate results makes the whole process more concrete, and it will help them provide you with better feedback.
The way this book is organized suggests a natural flow from problem articulation, to answering Exploratory Questions to prioritize efforts, to answering Passive-Prediction Questions to target individuals for extra attention or to automate tasks, and finally to answering Causal Questions to better understand the effects of that extra attention or automation. In reality, however, a good data scientist is always coming back to the stakeholder, updating their plan, and jumping back in the sequence when new questions arise.