Solving Problems with Data
Few fields have shown as much promise to address the world’s problems as data science. Today, data science is improving our understanding of and adaptation to climate change. It is being used in medicine to speed drug discovery, improve the quality of X-rays and MRIs, and ensure that patients receive appropriate medical care. It is used in courtrooms to fight for fair elections and electoral maps and by data journalists to document and communicate the injustices prevalent in our criminal justice system and issues in policing.
Data science also enables new technologies that may improve our lives. Autonomous drones are delivering blood and medical supplies to rural health clinics from Rwanda to North Carolina, and driver-aid features continue to make progress in reducing the over 30,000 traffic deaths and millions of injuries that occur in the US alone every year. And nearly every facet of business — from the way businesses source materials and manage inventory to the way product offerings respond to customer behavior — has been reshaped by data science.
At the same time, businesses and regulators are also coming to appreciate the potential of data science tools to reinforce racial and gender inequities. Algorithms at Amazon have been found to discriminate against female job applicants. Medical algorithms have been found to prioritize White patients over Black patients for kidney transplants and preventative care. In the criminal justice system, algorithms have been found to be more likely to incorrectly identify Black defendants than White defendants as being a “danger to society” when providing risk assessments to judges deciding on pre-trial release, bail, and sentencing. And even Meta’s own research has shown its algorithms drive political polarization and division among users, and push users into extremist groups.[1]
How, then, should a budding data scientist approach this discipline, full of such promise and peril? Why have so many data science endeavors failed to deliver on their promise? And why do we need yet another data science book?
This Book
This book is different from many other data science books you may have read. Where most data science books are designed to teach specific data science techniques or methods, the aim of this book is to provide you with a framework for thinking about your goals and how to achieve them using data science. It is, in a sense, about everything you need to know beyond the technicalities of model fitting: everything that comes before and after you fit your model. It will help you work with stakeholders to clearly articulate the problem they want to address, formulate questions whose answers will help address your stakeholders’ problem, choose an appropriate tool based on the question you seek to answer, and, critically, evaluate and refine your model based on your stakeholders’ needs.
The importance of these skills is often underestimated by data science students, and for understandable reasons. Data science curricula usually begin with coding, statistics, and model evaluation techniques. As a result, the hardest part of data science classes is often mastering the technical details of model implementation. Moreover, the limited time available to instructors and the need to support full classes of students mean data science exercises almost always have to come with clear directions and problem scaffolding to ensure students meet their learning goals.
But real-world problems don’t come with directions. Indeed, a problem that is clearly defined and for which a solution is obvious isn’t a problem anyone will pay you very much to solve. No, classroom exercises are carefully structured to foster learning and to make it possible for instructors to grade and provide feedback at scale. But real problems — the kind you will encounter in industry, government, or research — are hard to even articulate clearly, never mind solve. And that is why, as we will see, what really sets exceptional professional data scientists apart is not their ability to get a high AUC — it’s their ability to navigate and thrive in the face of ambiguous problems and goals.
The Three Big Ideas
Data science is about solving problems. All too often, data scientists get lost in the technical details of models and lose sight of the bigger picture. Data science is not about maximizing accuracy — it’s about using data and quantitative methods to solve problems, and at the end of the day the only “metric” that matters is whether your work has solved the problem you set out to address.
Data scientists solve problems by answering questions, and the question you are asking determines what tool is appropriate. At their core, all data science tools are tools for answering questions, whether you realize it or not. Learning to recognize how data scientists use questions to solve problems — and exactly what questions are being answered by the tools you use every day — is key to navigating the ambiguity of real-world problem-solving.
Reasoning rigorously about uncertainty and errors is what differentiates good data scientists from great data scientists. Data science isn’t just about minimizing classification errors and uncertainty; it’s also about deciding how unavoidable errors should be distributed, weighing the risks and trade-offs inherent in probabilistic decision-making rigorously and in a manner that takes into account the problem you are trying to solve, and taking uncertainty into account when acting on data.
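To make the third idea a little more concrete, here is a minimal sketch of how the "right" decision threshold for a classifier can depend on how costly each kind of error is. Everything in it is an assumption made for illustration: the scores, labels, candidate thresholds, and cost values are invented, and the helper names `error_counts` and `best_threshold` are hypothetical rather than part of any library.

```python
# Hypothetical predicted probabilities and true labels (1 = positive case).
# These values are invented purely for illustration.
scores = [0.05, 0.20, 0.35, 0.45, 0.55, 0.60, 0.80, 0.95]
labels = [0, 0, 1, 0, 1, 0, 1, 1]


def error_counts(threshold):
    """Return (false_positives, false_negatives) when predicting
    positive for every score at or above the threshold."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp, fn


def best_threshold(cost_fp, cost_fn, candidates):
    """Pick the candidate threshold with the lowest total assumed cost."""
    def total_cost(t):
        fp, fn = error_counts(t)
        return cost_fp * fp + cost_fn * fn
    return min(candidates, key=total_cost)


candidates = [0.1, 0.3, 0.5, 0.7, 0.9]

# If missing a positive case is five times worse than a false alarm,
# a lower threshold is preferred; if false alarms are five times worse,
# a higher threshold is preferred.
cautious = best_threshold(cost_fp=1, cost_fn=5, candidates=candidates)
strict = best_threshold(cost_fp=5, cost_fn=1, candidates=candidates)
print(cautious, strict)
```

The same model and the same data lead to different decisions depending on the assumed costs, which is why those costs have to come from the problem you are trying to solve rather than from the model itself.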
This Is A Book About The Forest, Not The Trees
The goal of this book, to frame things a little differently, is to help young data scientists maintain perspective. Students spend so much time learning individual techniques that they are unable to see the forest for the trees. But this is not a book about individual techniques; it’s about learning to think strategically about how to use those techniques to solve real problems.
To maintain its focus on the “forest,” this book takes as given that you have already been introduced to statistical inference and machine learning and know how to faithfully fit a model in a robust manner. That means that topics like hypothesis testing, cross-validation, how to use train-test splits, and how to evaluate a model’s AUC will be treated as assumed knowledge.[2] This is in no way meant to suggest these topics aren’t important — we will reference them constantly — just that I will not attempt to teach them here, both to maintain focus on the goals of this book, and also because there already exist many other resources that introduce these topics better than I could.
Introduction Structure
The remainder of this introductory chapter contains an overview of the three big ideas of the book. All concepts discussed here will also be covered in greater detail in future readings, but before we dive into them, it’s helpful to get a sense of the overall approach we will be taking!