Assignment 2: Exploratory Data Analysis
In this assignment, you will identify a dataset of interest and perform an exploratory analysis to better understand the shape & structure of the data, investigate initial questions, and develop preliminary insights & hypotheses.
Your final submission will take the form of a report consisting of captioned visualizations that convey key insights gained during your analysis. The report should consist of four parts.
Part 1: Data Selection
Choose the one dataset from the previous assignment that your team found most interesting to investigate. You may copy and paste the dataset description from your previous assignment. Explain why your team chose this dataset (5-10 sentences).
Part 2: Data Preparation
Describe how you transformed and cleaned the data for exploratory data analysis. The explanation should demonstrate that you made an effort to assess data quality and performed some data transformation (0.5-1 page).
- How did you transform the data from its original source into a format suitable for visualization?
- How did you clean the dataset? Are there any outliers? How did you detect them?
- How did you perform sanity checks on the data? Are there any missing values, typos, or incorrect mappings?
- Is there additional data that could be useful for your research questions?
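The checks in the list above can also be scripted before the data is loaded into a visualization tool. The following is a minimal sketch, not a required method: it uses only the Python standard library, the sample data and column names (`title`, `year`, `gross_musd`) are hypothetical, and the 1.5 × IQR rule is just one common convention for flagging outliers.

```python
import csv
import io
import statistics

# Hypothetical sample standing in for a real dataset; note the blank cells
# and the suspiciously large gross value for "Film E".
SAMPLE_CSV = """title,year,gross_musd
Film A,2001,12.5
Film B,2003,
Film C,2005,14.0
Film D,2007,13.2
Film E,2009,950.0
Film F,,11.8
Film G,2013,12.9
Film H,2015,12.1
Film I,2017,13.6
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one per record."""
    return list(csv.DictReader(io.StringIO(text)))


def count_missing(rows):
    """Count empty or whitespace-only cells in each column."""
    counts = {col: 0 for col in rows[0]}
    for row in rows:
        for col, val in row.items():
            if val is None or val.strip() == "":
                counts[col] += 1
    return counts


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


rows = load_rows(SAMPLE_CSV)
missing = count_missing(rows)

# Skip blank cells when converting to numbers; they are reported separately.
gross = [float(r["gross_musd"]) for r in rows if r["gross_musd"].strip()]
outliers = iqr_outliers(gross)

print("missing cells per column:", missing)
print("possible outliers:", outliers)
```

A flagged value is not necessarily an error. It may be a data-entry mistake, a unit mismatch, or a genuine extreme, so each one should be checked against the original source before it is removed or kept.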
Part 3: Research Questions
After selecting a topic and dataset — but prior to analysis — you should write down an initial set of at least three different questions you would like to investigate.
Part 4: Exploratory Data Analysis
Next, you will perform an exploratory analysis of your dataset using a visualization tool of your choice. You should consider two different phases of exploration.
In the first phase, you should seek to gain an overview of the shape & structure of your dataset. What variables does the dataset contain? How are they distributed? Are there any notable data quality issues? Are there any surprising relationships among the variables? Be sure to also perform “sanity checks” for patterns you expect to see!
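As a rough illustration of this first phase, a variable's distribution can be summarized numerically before it is charted. The sketch below assumes a hypothetical list of film runtimes in minutes and a bin width of 10; both are placeholders, and a histogram drawn in your visualization tool serves the same purpose.

```python
import statistics
from collections import Counter

# Hypothetical runtimes in minutes; replace with a column from your dataset.
runtimes = [88, 92, 95, 101, 104, 104, 110, 118, 121, 135, 142, 95]


def summarize(values):
    """Return basic descriptive statistics for a numeric variable."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "median": statistics.median(values),
    }


def histogram(values, bin_width):
    """Count values per equal-width bin, keyed by each bin's lower edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


summary = summarize(runtimes)
bins = histogram(runtimes, 10)

print(summary)
for lower_edge, n in bins.items():
    # A crude text histogram: one '#' per record in the bin.
    print(f"{lower_edge:>4}-{lower_edge + 9:<4} {'#' * n}")
```

Comparing the summary with what you expect (for example, no negative runtimes and a plausible maximum) is one simple form of sanity check.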
In the second phase, you should investigate your initial questions, as well as any new questions that arise during your exploration. For each question, start by creating a visualization that might provide a useful answer. Then refine the visualization (e.g., by adding additional variables, changing sorting or axis scales, transforming your data by filtering or subsetting it, etc.) to develop better perspectives, explore unexpected observations, or sanity check your assumptions. You should repeat this process for each of your questions, but feel free to revise your questions or branch off to explore new questions.
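The refinement loop described above (filter or subset, aggregate, re-sort, compare) can be sketched in a few lines. The records, field names, and the year-2000 cutoff below are all hypothetical; the point is only that a ranking can change once the data is subset, which is the kind of observation worth following up with a new visualization.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records; replace with rows from your own dataset.
records = [
    {"genre": "Drama", "year": 1998, "rating": 7.0},
    {"genre": "Drama", "year": 2005, "rating": 8.0},
    {"genre": "Comedy", "year": 2001, "rating": 6.0},
    {"genre": "Comedy", "year": 2010, "rating": 6.5},
    {"genre": "Action", "year": 1995, "rating": 5.0},
    {"genre": "Action", "year": 2012, "rating": 7.0},
]


def mean_by(rows, key, value, keep=lambda row: True):
    """Average `value` per `key` over rows passing `keep`, highest first."""
    groups = defaultdict(list)
    for row in rows:
        if keep(row):
            groups[row[key]].append(row[value])
    averages = [(group, mean(vals)) for group, vals in groups.items()]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)


# First pass: all records.
overall = mean_by(records, "genre", "rating")

# Refinement: subset to recent films and see whether the ranking holds.
recent = mean_by(records, "genre", "rating", keep=lambda r: r["year"] >= 2000)

print("all years:", overall)
print("2000 onward:", recent)
```

In this toy data, Comedy ranks above Action overall but below it after filtering, so the subset view raises a follow-up question that the first view did not.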
How to export figures from Tableau: this link
Submission
WHAT - Your final submission should take the form of a PDF report that consists of 10 or more captioned visualizations detailing your most important insights. Your “insights” can include important surprises or issues (such as data quality problems affecting your analysis) as well as responses to your analysis questions. See an example of the report on analyzing data about motion pictures.
Each visualization image should be a screenshot exported from a visualization tool, accompanied by a title and descriptive caption (2-4 sentences long) describing the insight(s) learned from that view. Provide sufficient detail for each caption such that anyone could read through your report and understand what you’ve learned. You are free, but not required, to annotate your images to draw attention to specific features of the data. You may perform highlighting within the visualization tool itself, or draw annotations on the exported image.
The end of your report should include a brief summary of the main lessons learned.
WHERE - The report should be named "A2_[Your team name].pdf". Submit the file in this Dropbox folder.
WHEN - Assignment 2 is due before February 4, 2022, 23:59.
Grading
We will use the following rubric to grade your assignment. Note that rubric cells may not map exactly to specific point scores.
| Component | Excellent | Satisfactory | Poor |
|---|---|---|---|
| Breadth of Exploration | More than 3 questions were initially asked, and target substantially different portions/aspects of the data. | At least 3 questions were initially asked of the data, but there is some overlap between questions. | Fewer than 3 initial questions were posed of the data. |
| Depth of Exploration | A sufficient number of follow-up questions were asked to yield insights that helped to more deeply explore the initial questions. | Some follow-up questions were asked, but they did not take the analysis much deeper than the initial questions. | No follow-up questions were asked after answering the initial questions. |
| Data Quality | Data quality was thoroughly assessed with extensive profiling of fields and records. | Simple checks were conducted on only a handful of fields or records. | Little or no evidence that data quality was assessed. |
| Visualizations | More than 10 visualizations were produced, and a variety of marks and encodings were explored. All design decisions were both expressive and effective. | At least 10 visualizations were produced. The visual encodings chosen were largely effective and expressive, but some errors remain. | Several ineffective or inexpressive design choices are made. Fewer than 10 visualizations have been produced. |
| Data Transformation | More advanced transformations were used to extend the dataset in interesting or useful ways. | Simple transforms (e.g., sorting, filtering) were primarily used. | The raw dataset was used directly, with little to no additional transformation. |
| Captions | Captions richly describe the visualizations and contextualize the insight within the analysis. | Captions do a good job describing the visualizations, but could better connect prior or subsequent steps of the analysis. | Captions are missing, overly brief, or shallow in their analysis of visualizations. |
| Creativity & Originality | You exceeded the parameters of the assignment, with original insights or a particularly engaging design. | You met all the parameters of the assignment. | You met most of the parameters of the assignment. |
Points for this assignment
30 points
Acknowledgements: This assignment closely resembles an assignment from the Interactive Data Visualization course taught by Arvind Satyanarayan at MIT.