Lab #03: Data Wrangling

due Tuesday, September 14 11:59 PM

Goals

NOTICE

When submitting your lab in gradescope, connect each question with exactly all pages the question appears on. WARNING: failure to do so will result in loss of points.

Getting started

Find lab03-username at https://github.com/sta199-f21. Copy the SSH URL. Close any open projects in RStudio and go to file > new project > version control and paste the SSH URL to clone your repository.

Open the lab03.Rmd template and update the YAML header with your name and today’s date. Then, knit the document and make sure the resulting PDF file has the correct date. Stage, commit, and push your changes.

Write your answers in the lab03.Rmd template file. Your assignment should have at least three meaningful commits and all code chunks should have informative names.

The Greatest Crossover Event in the History of STA199 Labs

We will also load the tidyverse package as usual.

library(tidyverse)

This data was originally collected for a FiveThirtyEight piece. The version of the avengers data we will work with here can be uploaded from the avengers.csv file provided with this lab. This dataset covers the entire Marvel Cinematic Universe(MCU), so some of the names will be familiar if you are a fan of the films, while other might be from the comics. Don’t worry if you aren’t a Marvel fan though, no background knowledge is needed to succeed on this lab!

More information on the variables is available at this link this link.

  1. You are interested in creating a dataset with the most “classic” Avengers. Using filter, create a new dataset with only Avengers that were created 1) in 1970 or earlier and 2) were not given probationary status. You should call the new dataset classic_avengers (hint: there should be 27 rows in this dataset.).

Now would be a good time to knit, commit, and push.

  1. Who are the newest Classic Avengers? Create a new variable called years_served that represents the number of years avengers served as of 2021. (Hint: you can use either the year variable or years_since_joining variables to do this.) Then arrange the dataset in ascending order of years_served, select the name and years_served variables and print the first three rows. Report in words the names of the three newest Classic Avengers and how long they have served. Please do this using a single pipeline.

  2. Has the percentage of female Avengers changed over time? Find the percentage of female Avengers in the classic_avengers dataset (hint: could create a new numeric variable e.g. 0 for male, 1 for female). Next find the percentage of female Avengers in the original, complete dataset. Has the addition of new avengers since 1970 changed the proportion of female avengers? Please describe in words what you observe.

Now might be a good time for another knit, commit, and push.

  1. Arrange the overall avengers dataset in descending order of appearances and return only the columns name, appearances, death1, and return1. Slice the first five Avengers here. What do you notice about these Avengers in terms of deaths and returns? Spoiler alert

  2. Using the avengers dataset, use two dplyr commands to examine the mean and median number of appearances for Avengers who have died at least once compared to those who have not died at least once. What do you learn about Marvel movies from your results? Also, does this tell you anything about the mean and median?

One more exercise, but first, knit, commit, and push!

  1. Are newer or older Avengers more popular? Create a scatterplot of the mean appearances by year of introduction for Avengers. For this question, please do not include observations for whether year is equal to 1900, as this appears to be a placeholder for missing values. Add geom_smooth() with the argument se = FALSE. Please include informative axis labels and an informative title. Can you observe any trends in the data? Which years tend to have the Avengers with the most appearances? Briefly explain whether you need to use any caution in interpreting any portions of this graph.

Once you are fully satisfied with your lab, Knit to PDF to create a PDF document.

Follow the instructions in previous labs to submit your PDF to Gradescope.

When submitting your lab to gradescope, connect each question with exactly all pages the question appears on. WARNING: failure to do so will result in loss of points.

Grading

Overall: 50 pts.