Team Data Science Project: Problem and Questions
For this assignment, you and your group must decide on a problem that interests your team. This problem will be the focus of the various assignments your team will complete this semester, including:
A statement of your problem and three Exploratory Questions you think, if answered, would help you better understand your problem.
A report — written to an imaginary but specified stakeholder — in which you answer your Exploratory Questions. Of the three Exploratory Questions you seek to answer, at least one must be answered through your own analysis (i.e., you must load data into Python/R and generate your answers). You may answer up to two by citing reputable sources (if someone else has already found the answer, why re-invent the wheel?!).
A report in which you answer a Causal Question you have concluded would help address your problem (which you may have revised or re-articulated as a result of the answers to your Exploratory Questions).
If you are in IDS 705, be aware that you will also be required to do a Machine Learning project with your same team — you may coordinate your efforts in the two classes so long as all deliverables are unique for each class (Duke frowns on “double counting” work in multiple courses).
Assignment 1: Choosing A Problem
Part one of this assignment is to pick a problem and three Exploratory Questions you seek to answer to improve your understanding of the problem. This is not a long assignment, just a check-in along the way.
The Problem you choose to address can come from any domain and can be an issue of global importance or personal interest (provided you can get the rest of your team and Partner Team to also agree to work on it). Examples of the types of problems and Exploratory Questions that you might wish to answer to help improve your understanding of the problem include:
Problem: Too many people are killed in car accidents.
Exploratory Questions:
What share of car-related fatalities is due to car-pedestrian, single-car, or multiple-car accidents?
What share of car-related fatalities occurs on freeways as opposed to in cities?
What share of car-related fatalities involves a driver under the influence of drugs or alcohol?
Problem: Many states are adding bureaucratic hurdles to getting social services, but the effect of these hurdles is unclear, both on reducing fraud and on deterring entitled recipients from getting aid.
Exploratory Questions:
Which states have changed their rules around social service provision (helpful if we want to do a pre-post analysis or a difference-in-differences analysis)?
Are there a lot of people who are entitled to social services who don’t receive them? (Do we know?)
Which of the social service programs that have had bureaucratic hurdles imposed serve the largest populations?
Problem: Police shootings involving people with mental health issues are much too common, and it’s not clear the police are appropriately trained to respond to people experiencing mental health crises.
While the problems in these examples are all “big” problems—in the sense of being societally important questions—your problem need not be of this nature. Past teams have done projects trying to figure out how to optimally train in tennis (by looking at whether playing more tennis improves or hinders subsequent tournament performance), how to improve AirBnB host profits (by looking at whether “super host” status improves AirBnB host revenues above and beyond the effect of just having the features that make one eligible to be a super-host), and how to minimize cell-phone user churn.
Moreover, while you will need to find work related to your problem, I would encourage you to not focus too much on data at this initial stage. Let’s be honest: in classes you usually pick a dataset then pick a question, or at best pick them at the same time. But that’s a contrivance allowed by classes — you’d never pick a problem to address in the real world on the basis of data availability. So please pick a problem that interests you and we can work on data issues together.
If you end up unable to find relevant data, I will have some “off-the-shelf” projects, for which I know data is available, that you may turn to for the Causal Question report in the second half of the semester.
Due Date
Your problem statement and three questions must be submitted by January 28th (yes, that’s SOON! These are questions and a problem statement, not answers).