Solving Problems with Data
Few fields have shown as much promise to address the world’s problems as data science. Today, data science is improving our understanding of and adaptation to climate change. It is being used in medicine to speed drug discovery, improve the quality of X-rays and MRIs, and ensure that patients receive appropriate medical care. It is used in courtrooms to fight for fair elections and electoral maps and by data journalists to document and communicate the injustices prevalent in our criminal justice system and issues in policing.
Data science also enables new technologies that may improve our lives. Autonomous drones are delivering blood and medical supplies to rural health clinics from Rwanda to North Carolina, and driver-aid features continue to make progress in reducing the over 30,000 traffic deaths and millions of injuries that occur in the US alone every year. And nearly every facet of business — from the way businesses source materials and manage inventory to the way product offerings respond to customer behavior — has been reshaped by data science.
At the same time, businesses and regulators are also coming to appreciate the potential of data science tools to reinforce racial and gender inequities. Algorithms at Amazon have been found to discriminate against female job applicants. Medical algorithms have been found to prioritize White patients over Black patients for kidney transplants and preventative care. In the criminal justice system, algorithms have been found to be more likely to incorrectly identify Black defendants than White defendants as being a “danger to society” when providing risk assessments to judges deciding on pre-trial release, bail, and sentencing. And even Meta’s own research has shown its algorithms drive political polarization and division among users, and push users into extremist groups.[1]
Moreover, despite huge inflows of investment into AI companies and the near-ubiquitous discussion of Generative AI in all industries, the idea that one need only throw “data science” at a problem to generate profits has not held up empirically. A 2025 MIT report finds that “[d]espite $30–40 billion in enterprise investment into GenAI, this report uncovers a surprising result in that 95% of organizations are getting zero return,” echoing the results of a 2020 MIT/BCG survey showing that, at the time, only 11% of businesses that had piloted or employed AI had reaped a sizeable return on their AI investments.
How, then, should a burgeoning data scientist approach this discipline, full of such promise and peril? Why have so many data science endeavors failed to deliver on their promise? And why do we need yet another data science book?
This Book
This book is different from many other data science books you may have read. Where most data science books are designed to teach specific data science techniques or methods, the aim of this book is to provide you with a framework for thinking about your goals and how to achieve them using data science. It is, in a sense, about everything you need to know beyond the technicalities of model fitting: everything that comes before and after you fit your model. It will help you work with stakeholders to clearly articulate the problem they want to address, formulate questions whose answers will help address your stakeholder’s problem, choose an appropriate tool based on the question you seek to answer, and, critically, evaluate and refine your model based on your stakeholder’s needs.
The importance of these skills is often underestimated by data science students, and for understandable reasons. Data science curricula usually begin with coding, statistics, and model evaluation techniques. As a result, the hardest part of data science classes is often mastering the technical details of model implementation. Moreover, the limited time available to instructors and the need to support full classes of students mean data science exercises almost always have to come with clear directions and problem scaffolding to ensure students meet their learning goals.
But real-world problems don’t come with directions. Indeed, a problem that is clearly defined and for which a solution is obvious isn’t a problem anyone will pay you very much to solve. No, classroom exercises are carefully structured to foster learning and to make it possible for instructors to grade and provide feedback at scale. But real problems — the kind you will encounter in industry, government, or research — are hard to even articulate clearly, never mind solve. And that is why, as we will see, what really sets exceptional professional data scientists apart is not their ability to get a high AUC — it’s their ability to navigate and thrive in the face of ambiguous problems and goals.
Four Big Ideas
To help students make the leap from carefully curated classroom exercises to solving messy, real-world problems, this book is organized around four big ideas:
1. Data science is about solving problems.
2. Data scientists solve problems by answering questions.
3. The questions data scientists answer can be divided into three categories: descriptive, passive predictive, and causal.
4. Reasoning rigorously about uncertainty and errors is what differentiates good data scientists from great data scientists.
Part I of this book explores the first of these ideas by discussing the importance of identifying and properly articulating the problem you wish to solve, as well as how to refine your understanding of that problem by working with your stakeholder. Part II explores Ideas 2 and 3 by illustrating how data scientists solve problems by answering questions about the world, and by introducing a taxonomy of the questions data scientists are called upon to answer. Finally, Part III builds on Part II by discussing the issues and sources of uncertainty inherent to each type of question data scientists encounter. The emphasis of Part III is on issues that tend not to be emphasized in introductory statistics or machine learning courses. Part III will therefore skip over topics like overfitting and model diagnostics, and instead focus on concepts like external validity, Goodhart’s and Campbell’s Laws, adversarial users, adverse selection in deployment, statistical decision-making, customizing loss functions to suit the substantive context, and the role of ethics in loss functions.
Assumed Background
This book takes as given that readers have already taken introductory statistical inference and machine learning courses and know how to fit a model faithfully and robustly. That means that topics like hypothesis testing, cross-validation, how to use train-test splits, and how to evaluate a model’s AUC will be treated as assumed knowledge.[2]
This is not a book about individual techniques — it’s about learning to think strategically about how to use those techniques to solve real problems. Assuming familiarity with these topics allows us to take a wider, more holistic view of the goals of a data scientist and to avoid losing sight of the forest for the trees (to say nothing of the fact that many books have already been written that provide exceptional treatments of these topics to which I have little to add).