Solving Problems with Data#
Few fields have shown as much promise to address the world’s problems as data science. Today, data science is improving our understanding of and adaptation to climate change. It is being used in medicine to speed drug discovery, improve the quality of X-rays and MRIs, and ensure that patients receive appropriate medical care. It is used in courtrooms to fight for fair elections and electoral maps and by data journalists to document and communicate the injustices prevalent in our criminal justice system and issues in policing.
Data science also enables new technologies that may improve our lives. Autonomous drones are delivering blood and medical supplies to rural health clinics from Rwanda to North Carolina, and driver-aid features continue to make progress in reducing the over 30,000 traffic deaths and millions of injuries that occur in the US alone every year. And nearly every facet of business — from the way businesses source materials and manage inventory to the way product offerings respond to customer behavior — has been reshaped by data science.
At the same time, it is also increasingly clear that today’s challenges will not be met by “throwing data science at the problem” or “just adding AI.” According to a 2020 MIT/BCG survey, only 11% of businesses that had piloted or employed Artificial Intelligence (AI) had reaped a sizeable return on their AI investments. Businesses and regulators are also coming to appreciate the potential of data science tools to reinforce racial and gender inequities. Algorithms at Amazon have been found to discriminate against female job applicants. Medical algorithms have been found to prioritize White patients over Black patients for kidney transplants and preventative care. In the criminal justice system, algorithms have been found to incorrectly identify Black defendants than White defendants as being a “danger to society” when providing risk assessments to judges deciding on pre-trial release, bail and sentencing. And even Meta’s own research has shown its algorithms drive political polarization and division among users, and push users into extremist groups.[1]
How, then, should a burgeoning data scientist approach this discipline, full of such promise and peril? And why have so many data science endeavors failed to deliver on their promise?
The Three Big Ideas#
This book is my answer to these questions, and it is organized around three big ideas:
Data science is about solving problems. All too often, data scientists get lost in the technical details of models and lose sight of the bigger picture. Data science is not about maximizing accuracy or AUC scores — it’s about using data and quantitative methods to solve problems, and at the end of the day the only “metric” that matters is whether your work has solved the problem you set out to address.
Data scientists solve problems by answering questions, and the question you are asking determines what tool is appropriate. At their core, all data science tools are tools for answering questions, whether you realize it or not. Learning to recognize how data scientists use questions to solve problems — and exactly what questions are being answered by the tools you use every day — is key to navigating the ambiguity of real-world problem solving.
Reasoning rigorously about uncertainty and errors is what differentiates good data scientists from great data scientists. Data science isn’t just about minimizing classification errors and uncertainty — it’s also about deciding how unavoidable errors should be distributed, how to weigh the risks and trade-offs inherent in probabilistic decision-making rigorously and in a manner that takes into account the problem you are trying to solve, and to take uncertainty into account when acting on data.
From the summary of these three ideas, it should be immediately obvious that this book is different from many other data science books you may have read. Where most data science books are designed to teach specific data science techniques or methods, the aim of this book is to provide readers with a framework for thinking about their goals and how to achieve them using data science methods. It is, in a sense, about everything you need to know beyond the technicalities of model fitting. It is about what comes before and after you fit your model: it will help you decide what questions to answer, what data to collect, and what models to consider using before you actually fit your model, and it will help you learn to evaluate whether a result is likely to generalize, whether a model is safe to deploy, and how to communicate those facts to a stakeholder after model fitting.
The importance of these skills is often underestimated by data science students, and for understandable reasons. Data science curricula usually begin with coding, statistics, and model evaluation techniques. As a result, the hardest part of data science classes is often mastering the technical details of model implementation. Moreover, the limited time available to instructors and the need to support full classes of students means data science exercises almost always have to come with clear directions and problem scaffolding to ensure students meet their learning goals.
But real-world problems don’t come with directions. Indeed, a problem that is clearly defined and for which a solution is obvious isn’t a problem anyone will pay you very much to solve. No, classroom exercises are carefully structured to foster learning and to make it possible for instructors to grade and provide feedback at scale. But real problems — the kind you will encounter in industry, government, or research — are hard to even articulate clearly, never mind solve. And that is why, as we will see, what really sets exceptional professional data scientists apart is not their ability to get a high AUC — it’s their ability to navigate and thrive in the face of ambiguous problems and goals.
This Is A Book About The Forest, Not The Trees#
The goal of this book, to frame things a little differently, is to help young data scientists maintain perspective. Students spend so much time learning individual techniques that they are unable to see the forest for trees. But this is not a book about individual techniques — it’s about learning to think strategically about how to use those techniques to solve real problems.
To maintain its focus on the “forest,” this book takes as given that you have already been introduced to statistical inference and machine learning and know how to faithfully fit a model in a robust manner. That means that topics like hypothesis testing, cross-validation, how to use train-test splits, and how to evaluate a model’s AUC will be treated as assumed knowledge.[2] This is in no way meant to suggest these topics aren’t important — we will reference them constantly — just that I will not attempt to teach them here, both to maintain focus on the goals of this book, and also because there already exist many other resources that introduce these topics better than I could.