Main Idea: working with multiple data frames
- To answer questions with data, we often need to use related data from many different datasets.
- We can combine data from different sources using a well-chosen join function.
library(tidyverse)
Instead of working with a single dataset, usually you will have to work with many different related datasets. To answer research questions using related datasets, we need to join datasets together.
There are many possible types of joins. All have the format something_join(x, y)
.
inner_join()
: join all rows from x
where there are matching values in y
. Return all combinations in case of multiple matchesleft_join()
: include all rows from x
right_join()
: include all rows from y
full_join()
: include all rows in x
or y
semi_join()
: return all rows from x
with match in y
anti_join()
: return all rows from x
without a match in y
x <- tibble(value = c(1, 2, 3),
xcol = c("x1", "x2", "x3"))
y <- tibble(value = c(1, 2, 4),
ycol = c("y1", "y2", "y4"))
x
## # A tibble: 3 × 2
## value xcol
## <dbl> <chr>
## 1 1 x1
## 2 2 x2
## 3 3 x3
y
## # A tibble: 3 × 2
## value ycol
## <dbl> <chr>
## 1 1 y1
## 2 2 y2
## 3 4 y4
We will demonstrate each of the joins on these small, toy datasets.
x
## # A tibble: 3 × 2
## value xcol
## <dbl> <chr>
## 1 1 x1
## 2 2 x2
## 3 3 x3
y
## # A tibble: 3 × 2
## value ycol
## <dbl> <chr>
## 1 1 y1
## 2 2 y2
## 3 4 y4
inner_join(x, y)
## Joining, by = "value"
## # A tibble: 2 × 3
## value xcol ycol
## <dbl> <chr> <chr>
## 1 1 x1 y1
## 2 2 x2 y2
x
## # A tibble: 3 × 2
## value xcol
## <dbl> <chr>
## 1 1 x1
## 2 2 x2
## 3 3 x3
y
## # A tibble: 3 × 2
## value ycol
## <dbl> <chr>
## 1 1 y1
## 2 2 y2
## 3 4 y4
left_join(x, y)
## Joining, by = "value"
## # A tibble: 3 × 3
## value xcol ycol
## <dbl> <chr> <chr>
## 1 1 x1 y1
## 2 2 x2 y2
## 3 3 x3 <NA>
x
## # A tibble: 3 × 2
## value xcol
## <dbl> <chr>
## 1 1 x1
## 2 2 x2
## 3 3 x3
y
## # A tibble: 3 × 2
## value ycol
## <dbl> <chr>
## 1 1 y1
## 2 2 y2
## 3 4 y4
right_join(x, y)
## Joining, by = "value"
## # A tibble: 3 × 3
## value xcol ycol
## <dbl> <chr> <chr>
## 1 1 x1 y1
## 2 2 x2 y2
## 3 4 <NA> y4
x
## # A tibble: 3 × 2
## value xcol
## <dbl> <chr>
## 1 1 x1
## 2 2 x2
## 3 3 x3
y
## # A tibble: 3 × 2
## value ycol
## <dbl> <chr>
## 1 1 y1
## 2 2 y2
## 3 4 y4
full_join(x, y)
## Joining, by = "value"
## # A tibble: 4 × 3
## value xcol ycol
## <dbl> <chr> <chr>
## 1 1 x1 y1
## 2 2 x2 y2
## 3 3 x3 <NA>
## 4 4 <NA> y4
x
## # A tibble: 3 × 2
## value xcol
## <dbl> <chr>
## 1 1 x1
## 2 2 x2
## 3 3 x3
y
## # A tibble: 3 × 2
## value ycol
## <dbl> <chr>
## 1 1 y1
## 2 2 y2
## 3 4 y4
semi_join(x, y)
## Joining, by = "value"
## # A tibble: 2 × 2
## value xcol
## <dbl> <chr>
## 1 1 x1
## 2 2 x2
x
## # A tibble: 3 × 2
## value xcol
## <dbl> <chr>
## 1 1 x1
## 2 2 x2
## 3 3 x3
y
## # A tibble: 3 × 2
## value ycol
## <dbl> <chr>
## 1 1 y1
## 2 2 y2
## 3 4 y4
anti_join(x, y)
## Joining, by = "value"
## # A tibble: 1 × 2
## value xcol
## <dbl> <chr>
## 1 3 x3
How do the join functions above know to join x
and y
by value
? Examine the names to find out.
names(x)
## [1] "value" "xcol"
names(y)
## [1] "value" "ycol"
We will again work with data from the nycflights13
package. We are going to work with a sample of 100 cases from three separate datasets in this pacakges.
flights2 <- read_csv("data/flights2.csv")
airports2 <- read_csv("data/airports2.csv")
planes2 <- read_csv("data/planes2.csv")
# $X1 = NULL
Examine the documentation for the datasets airports
, flights
, and planes
.
Question: How are these datasets related? Suppose you wanted to make a map of the route of every flight. What variables would you need from which datasets?
Join flights to airports. Note these two datasets have no variables in common so we will have to specify the variable to join by using by =
. Check out the documentation for more information.
flights2 %>%
left_join(airports2, by = c("dest" = "faa"))
## # A tibble: 100 × 28
## X1.x year month day dep_time sched_dep_time dep_delay arr_time
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2013 2 9 NA 1220 NA NA
## 2 2 2013 12 30 1434 1419 15 1736
## 3 3 2013 5 21 1855 1845 10 2153
## 4 4 2013 2 8 NA 1835 NA NA
## 5 5 2013 2 10 1809 1548 141 2044
## 6 6 2013 7 13 2040 1936 64 2305
## 7 7 2013 3 21 1722 1640 42 1928
## 8 8 2013 10 10 725 725 0 952
## 9 9 2013 8 23 629 630 -1 744
## 10 10 2013 6 18 1323 1327 -4 1557
## # … with 90 more rows, and 20 more variables: sched_arr_time <dbl>,
## # arr_delay <dbl>, carrier <chr>, flight <dbl>, tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>, X1.y <dbl>, name <chr>, lat <dbl>, lon <dbl>, alt <dbl>,
## # tz <dbl>, dst <chr>, tzone <chr>
Create a new dataset dest_delays
with the median arrival delay for each destination. Note this question does not require you to use joins.
Create a new dataset by joining dest_delays
and airports
. Only include observations that have both delay and airport information. Note dest_delays
and flights
have no variables in common so you will need to specify the variables to join using by
as in the example above.
Question: Are all of the observations in dest_delays
included in the new dataset you created by joining dest_delays
and airports
? Use an appropriate join function to investigate this issue and determine what is going on here.
Use an anti_join
to help diagnose this issue. Recall anti_join
returns all rows from x
without a match in y
.
tailnum
variable in the flights
dataset. The year the plane was manufactured is given in the year
variable in the planes
dataset.Step #1: Start by finding the average arrival delay for each plane and store the resulting dataset in plane_delays
.
Step #2: Join plane_delays
to the planes
data using an appropriate join and then use mutate to create an age variable. Note this data is from 2013. So let’s look at each plane’s age as of 2013.
Step #3: Finally, create an effective visualization of the data.