Assignment 1: Data Proposals
First, you will pick a topic area of interest to you and find a dataset that can provide insights into that topic.
Each team member proposes one dataset and writes a short description of it (around half a page) covering the following:
- Dataset name
- Short description of the dataset (5-15 sentences)
- Why is this dataset interesting? (3-5 sentences)
- How to access or collect the dataset? (1-3 sentences)
- What potential questions do you want to explore? (at least three questions)
Each team then combines the datasets from all of its members into a single list and submits it as the team report.
Recommended Data Sources
To get up and running quickly with this assignment, we recommend exploring one of the following provided datasets:
- World Bank Indicators, 1960–2017. The World Bank has tracked global human development since 1960 through indicators such as climate change, economy, education, environment, gender equality, health, and science and technology. The linked repository contains indicators that have been formatted to facilitate use with Tableau and other data visualization tools. However, you’re also welcome to browse and use the original data by indicator or by country. Click on an indicator category or country to download the CSV file.
- Chicago Crimes, 2001 to present: https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present-Dashboard/5cd6-ry5g? (click Export to download a CSV file). This dataset reflects reported incidents of crime (with the exception of murders, where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days. Data is extracted from the Chicago Police Department’s CLEAR (Citizen Law Enforcement Analysis and Reporting) system.
- Daily Weather in the U.S., 2017. This dataset contains daily U.S. weather measurements in 2017, provided by the NOAA Daily Global Historical Climatology Network. This data has been transformed: some weather stations with only sparse measurements have been filtered out. See the accompanying weather.txt (http://vis.csail.mit.edu/classes/6.859/A2/weather.txt) for descriptions of each column.
- Social mobility in the U.S. Raj Chetty’s group at Harvard studies the factors that contribute to (or hinder) upward mobility in the United States (i.e., will our children earn more than we will?). Their work has been extensively featured in The New York Times. This page lists data from all of their papers, broken down by geographic level or by topic. We recommend downloading data in the CSV/Excel format, and encourage you to consider joining multiple datasets from the same paper (under the same heading on the page) for a sufficiently rich exploratory process.
- The Yelp Open Dataset provides information about businesses, user reviews, and more from Yelp’s database. The data is split into separate files (business, checkin, photos, review, tip, and user), and is available in either JSON or SQL format. You might use this to investigate the distributions of scores on Yelp, look at how many reviews users typically leave, or look for regional trends about restaurants. Note that this is a large, structured dataset and you don’t need to look at all of the data to answer interesting questions. In order to download the data you will need to enter your email and agree to Yelp’s Dataset License.
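Several of the suggestions above involve combining tables, for example joining multiple datasets from the same social mobility paper. The following is a minimal Python sketch of an inner join on a shared key column, using only the standard library. The column names (county_id, upward_mobility, median_income) and the sample values are made up for illustration and do not come from any of the datasets listed; the sketch also assumes the key is unique in the second table.

```python
import csv
import io


def read_csv_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def inner_join(left_rows, right_rows, key):
    """Inner-join two lists of row dicts on a shared key column.

    Assumes each key value appears at most once in right_rows.
    """
    right_by_key = {row[key]: row for row in right_rows}
    joined = []
    for row in left_rows:
        match = right_by_key.get(row[key])
        if match is not None:
            # Merge the two rows; on a column-name clash the right value wins.
            joined.append({**row, **match})
    return joined


# Hypothetical sample data standing in for two downloaded CSV files.
MOBILITY_CSV = "county_id,upward_mobility\n001,0.41\n002,0.37\n003,0.45\n"
INCOME_CSV = "county_id,median_income\n001,52000\n003,61000\n004,48000\n"

mobility = read_csv_rows(MOBILITY_CSV)
income = read_csv_rows(INCOME_CSV)
combined = inner_join(mobility, income, "county_id")

for row in combined:
    print(row["county_id"], row["upward_mobility"], row["median_income"])
```

Only rows whose key appears in both tables survive the join, so it is worth comparing row counts before and after to see how much data was dropped. For real files, replace the in-memory strings with the contents of the downloaded CSVs.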
Additional Data Sources
If you want to investigate datasets other than those recommended above, here are some possible sources to consider. You are also free to use data from a source different from those included here. If you have any questions about whether your dataset is appropriate, please ask the course staff ASAP!
The report should be named "A1_[Your team name].pdf". Each team submits its list of four datasets (one found by each member) to this Dropbox folder. The assignment is due on January 19, 2022, at 23:59.
Points for this assignment