How to Read (Academic Edition)#

In this class, you will be asked to do significantly more reading than many of you — especially those of you from engineering backgrounds — are used to. Moreover, many of our readings may feel different from much of the reading you’ve done in the past, so I want to take a moment to discuss how you should approach reading in this — and indeed in any — reading-intensive course.

Knowing that many students in this class speak English as a second language (and so may find reading more time-consuming), and that many of you don’t have experience with reading-intensive courses, I have worked very hard to ensure that the readings I assign communicate key concepts as efficiently as possible. But as we will see, learning to answer causal questions well requires wrestling with the complexities that arise when apparently simple math meets the real world, and that unavoidably requires thoughtful exposition.

Read Actively#

The first major piece of advice I can offer is that you should always be active when reading for this course. That may mean taking notes on a separate piece of paper as you go, or highlighting passages that stand out and adding comments in the margins. But reading of the type you will encounter in this course should never be a passive process.

This is especially true any time you encounter mathematical notation. A key skill in answering causal questions is the ability to map concepts represented in mathematical notation onto facets of real-world examples. With that in mind, any time you are reading something written in mathematical notation, it is good practice to think of a specific example and see if you can relate each term you encounter to that real-world example.

Be Patient with Examples From Different Domains#

Data science is an extremely diverse field. This course does its best to embrace that diversity through the use of examples from a wide range of substantive domains. This, at times, causes frustration among students, as many examples will necessarily come from domains that don’t feel relevant to your particular interests. Try to resist this frustration — while it may seem like a problem, it’s actually emblematic of a huge opportunity for intellectual arbitrage (the porting of insights that have been richly developed in one domain to a different domain where they are unfamiliar)!

This will be particularly relevant when we get to the study of causal inference (the study of how to answer Causal Questions), as there are lots of concepts that have been well-developed in the social sciences that people are only now starting to apply in industry, meaning many of the best texts and examples will be public policy or social science oriented.

And while the downside of that is that there aren’t as many great books written about causal inference in industry as you may wish, the upside is that there are lots of opportunities for young data scientists to innovate by applying these concepts in new ways!

So please bear with these examples, and practice trying to apply the concepts you read to an industry example that matters to you.

Do NOT Summarize with LLMs#

I fully recognize that there is a strong temptation when faced with a long reading to stuff it into a Large Language Model and ask for a summary. Don’t.

There are a few reasons for this. The first is that while an LLM can provide you with a broad summary of what you’re reading, it will necessarily have to exclude all the nuance in the original reading. If you just want to figure out if a reading is generally relevant to your interests, that’s fine; but you’re here to learn a subject — one that can be infuriatingly subtle — and cutting out all those nuances will limit what you can learn and prevent you from being able to test your understanding of concepts by wrestling to understand how each new sentence relates to what you’ve read previously. So just as the goal of the readings isn’t to allow you to answer our reading reflection questions (those are just there to draw your attention to especially salient points), nor is it to understand the material at the level of a summary.

The second big reason is that the process of summarizing material yourself is critical to consolidating your learning. This insight comes, in part, from research on whether students taking notes on computers learn more effectively than students taking notes on paper. This research has found that students taking notes on computers can write much more quickly than students taking notes by hand, but that counter-intuitively this seems to result in worse learning outcomes (based on subsequent learning assessments). Why? It appears that students taking notes on computers are effectively able to transcribe everything happening in class, while students taking notes on paper have to think about the material in real time in order to summarize it enough that they can keep up taking notes.

Letting LLMs summarize material for you seems likely to cause a similar issue — by allowing an algorithm to organize the information in a more concise manner, it deprives you of the opportunity to engage with the material to create your own summary, a process that forces you to actively think about the connections between concepts. The importance of this type of active learning is one of the biggest bedrock findings of research on learning in recent years — students who have material given to them (e.g., through a passive lecture) often think they understand material, but it is the students who learn material actively — through class or group exercises, problem sets, or other activities — who perform better on learning assessments.

Finally, learning to focus on a reading for a prolonged period of time is an important skill, and one we practice less and less in the modern age. Sometimes letting our attention flit from thing to thing is fine, but being able to focus for prolonged periods when required is an important skill to cultivate! (If you’re interested, you can find a really interesting discussion of this idea here.)