Welcome to Data Science for Humans

The beginning of a textbook by Josh Clinton and Nick Eubank.

Few fields have shown as much promise to address the world’s problems as data science. Today, data science is being used to develop climate models to improve our understanding of global climate change and mitigate its effects. It is being used in medicine to speed drug discovery, improve the quality of our x-rays and MRIs, and ensure that patients receive appropriate medical care. Data science is used in courtrooms to fight for fair elections and electoral maps, and by data journalists to document and communicate to readers the injustices prevalent in our criminal justice system and issues in policing. Autonomous drones are delivering blood and medical supplies to rural health clinics from Rwanda to North Carolina. Driver aid features and autonomous cars continue to make progress in reducing the over 30,000 traffic deaths and millions of injuries that occur in the US alone every year. And nearly no facet of business has gone untouched by the recent revolution in data analytics, from song and movie recommendation engines on Netflix, Spotify, and the Apple App Store, to personalized, targeted advertisements that help businesses make the most of their advertising revenue, to the supply chain and logistics systems that have completely changed how and where goods are produced around the world.

At the same time, however, recent years have also made clear that today’s global challenges will not be met by simply “throwing data science at the problem” and hoping things will work out. Even in business, where many assume that Artificial Intelligence is a sure ticket to profits, a major recent study found that only 11% of businesses that had piloted or employed Artificial Intelligence had reaped a sizeable return on their AI investments. In recent years we’ve also seen near-endless examples of data science tools reinforcing racial and gender inequities in our society: algorithms discriminating against female job candidates at Amazon, prioritizing White patients over Black patients for kidney transplants and preventative care, and being more likely to incorrectly identify Black defendants than White defendants as a “danger to society” when providing risk assessments to judges deciding on pre-trial release, bail, and sentencing. And even Facebook’s own research has shown that its algorithms drive political polarization and division among users, and push users into extremist groups.1

How, then, should a burgeoning data scientist approach this field full of such promise but also so many pitfalls? In this book, we will present a framework for approaching and solving problems with data science in a way that is both effective and responsible.

What Do You Mean “For Humans?”

On the one hand, our title is ridiculous. Of course, our book is written for humans given that we are the only species capable of doing Data Science (on this planet?). Despite our curiosity as to what it may contain, Data Science for Cats would not make much sense.

But our title is intentional: it highlights that when doing Data Science it is important to distinguish how we use data from how computers use information. Computers are exceptional pattern-finders and maximizers. Given a dataset, computers are able to find complex relationships through brute-force computing that far exceeds human capabilities. The ability to process and consider so many relationships also allows computers to solve complex optimization problems (maximizing or minimizing some quantity) in difficult situations (e.g., when calculus cannot be used to solve for the optimal solution).

But computers are limited. They are completely dependent on the data that is given to them—i.e., what data is collected and missing, how important quantities are measured—and while they can find patterns and optimize using the data they are given, they lack the ability to evaluate the meaning and importance of what they may find.

Humans, on the other hand, are defined by their curiosity. Questions drive us and are often central to what we do. What is going on in the world around us, and why? We can be faulted for reaching too quickly for answers to the what and why (perhaps because our brains seem wired to generalize and make assertions about the world even when our ability to evaluate those claims is limited), but human curiosity is a defining and important feature that Data Science complements.

In addition to being question-driven (by which we mean having inquiries and analyses follow from pre-defined questions, rather than using pattern-recognition processes to find relationships that are then used to generate questions), two other traits are critical for our approach. The second trait is fallibility. We make mistakes. Lots of them. And the consequences can be tremendously impactful not only for ourselves, but also for others.

In answering questions it is important to always think about the uncertainty that we may have about our answers. How confident should we be in our answer? What assumptions did we have to make in getting our answer, and how might our results change if those assumptions do not hold? Developing our question-solving skills in Data Science with a forward-looking appreciation for quantifying our uncertainty as best we can is essential for placing our results in context. Our data (and methods) are rarely perfect, and it is critical that we help quantify, as best we can, how robust and sensitive our answers may be.

A final human trait is that we are social creatures. It is not enough to answer a question for ourselves. We also want to communicate our findings to others. This has implications both for how we conduct Data Science and for how we report its results.

When conducting Data Science we want to be mindful of the importance of replication and clarity, and to do our analyses in ways that minimize errors. Science, by definition, is based on replication: others should be able to exactly replicate our processes and get the same results. Art, on the other hand, is more personality-driven; what matters is the product rather than the process. We are doing Data Science, not Data Art, and a guiding principle is to make our analyses as easy to follow as possible and to minimize the potential for user-induced error through defensive programming and careful file management.

It is not enough to obtain (a replicable) result, however. We must also communicate that result to others, including those who may not be aware of the precise details and complexities of the analyses, in ways that are not deceptive or misleading. In Data Science we are sometimes, if not often, trying to answer a question posed by someone else (e.g., a CEO), and it is incumbent on us to communicate our results as clearly as possible in ways that minimize the potential for misinterpretation. Data Science is increasingly being used to describe important conditions, and those characterizations are often consumed by non-experts (think, for example, of analyses of the incidence of COVID-19 during the pandemic or the results of public opinion polls), so it is essential that we think about our audience when communicating those results. Visualization is often essential here.

Pulling these thoughts together, then, a human-centered Data Science takes account of three aspects of the human condition: (1) being question-driven, (2) accounting for our uncertainty and fallibility, and (3) ensuring our analyses and findings can be understood by others.

How to Use This Book

This book is designed to be relatively modular. While it can be read from front to back—and doing so should provide a person with zero data science knowledge or experience with a robust set of skills for doing real data science work in the world—we recognize that people come to data science with a range of different backgrounds, and we wish to honor that diversity.

In the next two chapters of Part 1, we provide an overview of the field of data science as it exists and an explanation of how we got to where we are today, followed by a more detailed overview of the problem-solving framework that underlies our suggested approach to data science.

In Part 2, we turn to data manipulation in R. Data science is a fundamentally applied field, and one where application nearly always entails programming. In light of that, Part 2 of this book provides a zero-assumed-background introduction to data science programming in R, complete with real-world examples and exercises that will quickly empower readers to use the skills they are learning to answer questions they care about.

In Part 3, we then turn to how we can use these skills to solve problems using our problems-first framework. This Part begins with a discussion of the importance of clearly identifying one’s problem (with many examples of where very smart people’s efforts have gone astray when they tried to skip this step), before discussing how to move from problems to answerable questions, and then the issues that arise in attempting to answer different types of questions.

Finally, in Part 4 we include several stand-alone chapters on topics we think are important to data scientists, such as data science ethics, project workflow management, and interpretable models.

Throughout, our book strives to embody several important principles:

  • Our primary goal in this book is to provide you, the reader, with concrete, applicable skills you can immediately use to solve the problems you care about in the world.

  • The job of a data scientist isn’t to “just fit statistical models”; it’s to solve problems. And while doing so effectively certainly requires being able to understand and fit different machine learning and statistical models, it also requires being able to think critically about their application and to interpret the results of our models not as “truth” but with appropriate epistemological humility.

In other words, the emphasis of this book is not on learning the nuances of a handful of algorithms, or learning abstract programming principles; rather, our focus in this text is on developing the skills necessary to think critically about the application of Data Science tools in the real world. To that end, in addition to empowering you to use a range of data science tools yourself to answer the questions you care about, we will spend as much time analyzing case studies of data science tools being used, both successfully and unsuccessfully, as learning new methods. It is our hope that, by the end of this book, we will have provided you with a unified perspective on the emerging field of Data Science, and empowered you to think critically about its promises and perils.

If that sounds good to you, read on!


1

Recent reporting by the Wall Street Journal has shown that Facebook’s own research has confirmed what many outside experts have long argued: the way its recommendation engines prioritize content that results in “user engagement” (clicks, shares, comments) ends up promoting partisan, polarizing, sensationalist, or extreme content. The company’s own research has also shown that group recommendations are contributing to extremism. According to one internal presentation, “64% of all extremist group joins are due to our recommendation tools” like Groups You Should Join and other discovery tools.