Data Science Memos#

As you move between organizations in your career, you will be shocked by the number of different ways you will be asked to write up and communicate your analyses. Indeed, odds are you’ve already discovered how much expectations can differ between professors at the same university.

Regardless of where you go, however, there will always be a need for you to document and circulate your analyses in writing. Presentations are a common tool for sharing results as well, but even in organizations where you are asked to present your findings, you will likely also be asked to provide a written record of your work that can be circulated more widely or kept as a more archival record of a project for the future.

The variation in expectations for how reports should be formatted and what they should include is more than a small annoyance for students. Forcing students to adjust their writing style from class to class prevents them from focusing their attention on what matters most: mastering the art of clear communication.

With that in mind, the MIDS program has decided to adopt a single format for written reports across classes. In particular, all MIDS classes will ask that your written reports take the form of 4-6 page Data Science Memos, or DSMs.

To be clear, adopting this format does not mean we feel it is necessarily the best way to write up your analyses in all situations. Nor are we adopting this format because it is the format you are most likely to encounter after graduation. Depending on your substantive career interests, there is a good chance you will end up in an industry that asks you to write your reports in a different manner.

Rather, we have chosen DSMs because (a) they are a very common written format in data science, (b) we think they emphasize the correct priorities in written communication, and (c) we feel that coordinating around a single format will let you focus on mastering one style of written communication, which will serve you better than being familiar with, but not great at, a number of different report types.

The Data Science Memo Format#

A DSM is a 4-6 page document designed to communicate a specific takeaway to a stakeholder (colleagues on your team, your boss, your client, etc.).

The Executive Summary / TL;DR Section#

The memo should start with a one- or two-paragraph Executive Summary (some companies refer to this section as the TL;DR rather than an executive summary). The Executive Summary should contain everything a reader needs to know if they were to read only the summary. It should fully summarize the problem that motivates the project, what the project intends to do, how what it intends to do is meant to help address the problem, and the key takeaway of the analysis.

In other words, the executive summary is not just an introduction meant to draw the reader in. Some students new to this format have the habit of hiding their results or takeaways until the end of the memo as a way to maintain suspense and keep the reader engaged. That is a great tendency in fiction, but in professional settings where your goal is to communicate a specific message to your stakeholder, hiding that message till the end of the memo just increases the odds your stakeholder never gets the message you’re trying to communicate because they got distracted and never finished the memo.

Decisions To Be Made#

After the summary, most stakeholders appreciate being told the critical decisions about next steps that need to be made based on the memo content. This can also be paired with a concise summary of the information you believe is critical for them to know when making their decisions. This helps stakeholders understand what they should focus on as they read, and (as discussed below) helps ensure you get across your most important points to your stakeholder even if they don’t finish your memo.

Memo Body#

After the executive summary comes the body of the report. DSMs tend to be media-rich documents with lots of carefully chosen figures. Indeed, some people advocate thinking of a data science memo as a series of figures with text only serving the role of “connective tissue” between figures, although whether that makes sense in any specific context depends a lot on the takeaway one is trying to communicate and whether it lends itself to visual or textual communication.

Among data scientists, DSMs are often written as Jupyter Notebooks or R Markdown documents. These memos are then circulated as a pair of files — a much longer Jupyter Notebook/R Markdown (complete with code for reproducibility) and a PDF document generated by exporting the notebook while suppressing code elements.[1]
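For Jupyter users, one way to produce the code-free PDF is nbconvert's `--no-input` option. The sketch below is only an illustration, not a required MIDS workflow: the file name `memo.ipynb` and the helper names `nbconvert_command` and `export_memo` are hypothetical, and it assumes Jupyter is installed (PDF export through nbconvert also needs a working LaTeX installation).

```python
import subprocess


def nbconvert_command(notebook: str, output_format: str = "pdf") -> list[str]:
    """Build a `jupyter nbconvert` command that hides code cell inputs.

    `notebook` is a path to an .ipynb file (hypothetical name in the example).
    """
    return [
        "jupyter",
        "nbconvert",
        "--to",
        output_format,
        "--no-input",  # exclude code inputs; figures and other outputs are kept
        notebook,
    ]


def export_memo(notebook: str, output_format: str = "pdf") -> None:
    """Run the export. Requires Jupyter, plus LaTeX when exporting to PDF."""
    subprocess.run(nbconvert_command(notebook, output_format), check=True)


# Show the command that would be run for a hypothetical notebook.
print(" ".join(nbconvert_command("memo.ipynb")))
```

Calling `export_memo("memo.ipynb")` would then write the PDF next to the notebook, and the full notebook can be circulated alongside it for reproducibility. If LaTeX is not available, passing `output_format="html"` is a reasonable fallback.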

That should give you a sense of how a DSM is organized, but as for what goes into a DSM in substantive terms, let’s pause for a moment to discuss the principles of stakeholder communication.

Learning to Write To Stakeholders#

As students (especially data science students), the way most of us were taught to write reports is to:

  • start with an introduction that helps explain the broader context in which the report is being positioned,

  • describe the data that we intend to use,

  • describe how we have wrangled and cleaned that data,

  • describe how we plan to model that data,

  • report the results of that modelling, and

  • discuss limitations and tack on a boilerplate conclusion.

This structure makes a lot of sense in the context of a class because the order of presentation mirrors the objectives of the assignment: demonstrate that you understand the substantive topics that are being taught, demonstrate that you are being thoughtful about the data that you are collecting and that you have internalized the emphasis your instructors have placed on the importance of data cleaning, and that you understand the principles of data modeling. Results come last because in the context of a class, your results don’t actually matter. No professor is going to make a significant business decision based on a student report, and no government is going to set policy based on what you say.

This structure — which almost entirely front-loads material that is not particularly interesting to the instructor — is also viable because it is basically the job of the reader (your instructor or teaching assistant) to read everything you wrote.

But when it comes to writing to a stakeholder — your boss, your colleagues, etc. — this logic does not apply.

A Stakeholder’s Mindset#

As you learn to write to stakeholders, it is important to begin by putting yourself in their shoes. What is it that they want to know? And how can you best communicate those facts to them?

Here are a few facts that are true of most stakeholders:

  • They have more to do than time to do it in: nothing is more scarce in an organization than the time of decision makers. The moment it stops being obvious to your stakeholder why the material they are reading is directly relevant to them, they will stop reading and turn to one of the other hundred critical issues vying for their attention.

  • Their life is full of problems. They want to know things that will help them solve those problems: Whatever your DSM is trying to communicate to your stakeholder, your stakeholder is only likely to pay attention if you can clearly articulate for them how it will help address a problem that they have. It’s not their job to figure out why different things may be useful (even if you think it should be) — it’s your job to tell them why the thing you are telling them is useful.

  • They aren’t interested in checking your work: unlike your professors, stakeholders will not generally be interested in checking your work. They only care about knowing things that will help solve their problems. That’s not to say they aren’t interested in understanding some of the logic underlying your reasoning; that’s something they should be able to follow regardless of their technical background. But they will almost never want to hear about the nitty-gritty details of data cleaning, feature encoding, model validation, etc. (unless your stakeholder is a fellow data scientist and the takeaway you wish to communicate is about the best way to do these things in a relevant context). This is first because they may not have the appropriate expertise to evaluate those nitty-gritty details, and second because the stakeholder (usually) doesn’t need those details to understand your takeaway.

A good practice for writing to stakeholders is to imagine that distractions are bombarding your stakeholder while they are trying to read your memo. These distractions are constantly trying to get them to put down your memo instead of finishing. Your job as a writer is to (a) minimize the likelihood that they put your memo down, and (b) maximize the likelihood that if they do put it down, they’ve already learned the things you most want them to know.

Minimizing the likelihood that your stakeholder will put down your memo before they finish amounts to ensuring that they are constantly aware of why they should care about what’s in front of them. More than anything, that means you should be constantly motivating everything you are doing by relating it back to a problem that your stakeholder already cares about.

At every transition — between sections, between topics, and even between paragraphs — it should be explicitly clear to the reader that what follows is relevant to their problem so they have an affirmative reason to maintain their focus.

Maximizing the likelihood that your stakeholder will learn the things you most want to communicate — even if they don’t finish your memo — requires putting the big idea of your memo up front. Then you can double back and fill in the details you’d like the stakeholder to know if they get far enough. This style of writing is often referred to as an inverted pyramid structure (big takeaways at the top, smaller details at the bottom).

Journalism is the quintessential example of this style of writing. Most readers don’t finish news articles, so articles are nearly always organized with the most critical information up front, after which they double back to fill in additional details for anyone still reading. There are a number of different ways this can be accomplished. One of the first article formats journalists are introduced to is to start with “who? what? when? where? how?”, then add important details, then add context.

But the structure I think you’re most likely to see if you start looking for it is this: within the first two or three paragraphs of a well-written news story, there is a single paragraph designed to summarize everything the journalist wants you to know about the story (the “nut graph”). The following paragraphs then fill in the details of the story in descending order of importance.

Unlike a news article, though, you should finish your memo by bringing everything full circle with a recap of the big picture. That way you remind any stakeholder who has reached the end of the memo of the motivation for the project as a whole: the problem you’re trying to solve, how you have tried to solve it, and the decisions that need to be made. This structure, where you do an inverted pyramid and then return to the big-picture motivation of the project at the end, is sometimes referred to as an “hourglass” structure (although since hourglasses are often symmetric, I prefer to think of it as putting my inverted pyramid on a nice little stand. :) ).

But I can’t tell you/my stakeholder everything I did in 4-6 pages#

That’s right! Having only 4-6 pages to communicate your takeaway is hard (especially when those 4-6 pages include multiple figures). But that’s the point: one of the hardest things to learn to do as a communicator is to decide what is important and what is not, and having a page limit forces you to develop that skill. There’s a reason Blaise Pascal famously said, “[i]f I had more time, I would have written a shorter letter.”[1]

[1] The exact quote, which appeared in “Lettres Provinciales” in 1657, was “[j]e n’ai fait celle-ci plus longue que parce que je n’ai pas eu le loisir de la faire plus courte.” Or, translated literally, “I have made this longer than usual because I have not had time to make it shorter.”

Common Mistakes in Student Writing#

Here are a few common ways students make mistakes in writing.

How You Spent Your Time != What You Should Write About#

Many students like to write reports that feel like a “letter home from camp”: a list of all the things they did (sometimes in the order in which they did them). This, I think, is another artifact of learning to write to your professors. When writing to an instructor, the more work you can show you did, the more you expect your effort to be rewarded.

But the goal of a good DSM is not to show you worked really hard — it’s to convey a specific takeaway. And things that don’t help accomplish that goal don’t belong in your memo. In the words of Arthur Quiller-Couch:

If you here require a practical rule of me, I will present you with this: ‘Whenever you feel an impulse to perpetrate a piece of exceptionally fine writing, obey it — wholeheartedly — and delete it before sending your manuscript to press. Murder your darlings.’

Similarly, just because you did Step 1 before Step 2 before Step 3 doesn’t mean “explain 1, then 2, then 3” is the best way to communicate what you want to communicate. Indeed, we often zero in on what is most important as we work, which means the thing you did last may be what you want to put first in your writing.

A closely related habit is to allocate space in your DSM in proportion to how you spent your time. As data scientists, we spend lots of our lives cleaning data and trying things that don’t pan out. There is a very natural tendency to talk about these things in your DSM. Suppress that urge. Unless they explicitly say otherwise, your stakeholders don’t want to check your work (and in many cases, may not have the expertise required to do so — that’s why they hired you after all!). Discussion of details

Limitations#

There is a tendency for students to use a “Limitations” section to, well, just try to cover their butts by throwing out anything they can think of about the paper that is imperfect. That’s ok in the classroom, but it’s not useful in the real world.

The point of a Limitations section isn’t to demonstrate your ability to identify any imperfections in the study; the point of a Limitations section is to give your stakeholder a sense of how much confidence they should have in the results presented in the report from your professional perspective. Just because you had to make an assumption does not mean that the assumption constitutes a “limitation” of the study unless you have reason to think that the assumption is unlikely to be true (or is sufficiently untrue as to impact the results). Please only include things in your Limitations section that you think really are substantive limitations!

MIDS-Specific Expectations#

Within MIDS, there are a few modifications to this format to deal with the fact that we are not your real stakeholders and you are writing these for classes.

Specify Your Stakeholder#

At the top of your memo, you should add a two or three-sentence summary of your target audience. This should include a specific job title for your target reader, and the technical background you think they have. Your instructor may tell you that this stakeholder has to have certain characteristics for certain assignments.

Appendices#

Because we are professors, there will be times that we do need you to “show your work” in your DSMs in a way that is not authentic to what you’d do in a professional setting.

To accommodate that, faculty may ask that you include appendices in your DSMs that detail things you wouldn’t normally include when writing to a stakeholder. These will be requested by faculty on an assignment-by-assignment basis, and should not impact how you write your main memo.

DSM Rubric#

To help harmonize expectations around DSMs, all MIDS courses will employ a common rubric for evaluating memos. Some classes may add to this rubric to meet the learning objectives of a specific class, but — to the extent possible by fallible human beings — we will endeavor to apply these rubric items consistently across classes.

To aid with grading scalability, all rubric items in this common rubric may take on one of only three discrete values: 1, 0.5, or 0.

Does the memo clearly state the problem that motivates the analysis?

  • 1: The memo clearly and explicitly states the problem (or all problems, if more than one) motivating the work presented in the memo. The problem is stated in concrete terms, and is a problem that the declared stakeholder would view as a problem over which they have potential agency.

  • 0.5: Any of:

    • The work in the memo appears to be motivated by a specific problem, but the problem (or problems) is not clearly or explicitly stated.

    • The problem is explicitly stated, but at such a level of generality that it is unclear whether the stakeholder would feel it is a problem they are in a position to solve. (“Large levels of \(CO_2\) emissions are causing global warming” is an explicitly stated problem, but not one that any one stakeholder is likely to feel they have agency to solve, especially based on the work in one memo.)

  • 0: The problem motivating the memo is unclear — not only is it not explicitly stated, but a clear motivating problem is also not obvious to the reader.

Does the memo clearly state what it intends to do and how that plan will help address the motivating problem?

  • 1: The memo clearly and explicitly lays out how the plan of action/analysis for the memo will help address the stakeholder’s problem. The logic is clear, and does not omit any large steps (i.e., it does not fall victim to the Underpants Gnome problem). Moreover, the steps seem reasonable given the stakeholder to whom the memo is addressed; the memo does not, for example, suggest that a middle manager will use the included analysis to change US antitrust policy.

  • 0.5: Any of:

    • One can imagine how the memo’s contents could be used to address the problem, but the logic of how that might be done is not laid out clearly or explicitly.

    • The memo lays out a way that its contents could be used to help address the problem, but the logic of the reasoning strains credulity.

  • 0: How the memo content is meant to help solve the problem is unclear.

Does the memo have a clear takeaway for the reader?

Does the memo start with a short Executive Summary/TL;DR section that states the motivating problem, what the analysis intends to accomplish, how that plan will help address the motivating problem, and the main takeaway of the analysis?

Are the figures in the memo clear and well labelled?