pivot_wider() and kable()
library(tidyverse)
library(knitr)
sta199 <- read_csv("data/sta199-fa21-year-major.csv")
For this Application Exercise, we will look at the year in school and majors for students taking STA 199 in Fall 2021, i.e., you! The data set includes the following variables:
section: STA 199 section
year: Year in school
major_category: Major / academic interest.
An element is a member of a set.
A set is a collection of elements. A set may be empty.
A function is a consistent rule that maps each element of one (input) set to exactly one element of a second (output) set.
The probability of an event tells us how likely an event is to occur, and it can take values from 0 to 1, inclusive. It can be viewed as
An event or outcome is the basic unit to which probability is applied, e.g., the result of an observation or experiment.
A sample space is the set of all possible outcomes. The outcomes in the sample space are disjoint, or mutually exclusive, meaning they can’t occur simultaneously.
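As a small illustration of these definitions, here is a minimal sketch using a made-up example (one roll of a fair six-sided die, which is not part of the course data). It assumes every outcome in the sample space is equally likely, so the probability of an event is the number of outcomes in the event divided by the number of outcomes in the sample space.

```r
# Hypothetical example: one roll of a fair six-sided die
sample_space <- 1:6          # the set of all possible outcomes
event <- c(2, 4, 6)          # the event "roll an even number"

# Every element of the event must be a member of the sample space
all(event %in% sample_space)

# Assuming equally likely outcomes:
# P(event) = (# outcomes in event) / (# outcomes in sample space)
p_even <- length(event) / length(sample_space)
p_even
```

The result is a value between 0 and 1, as the definition of probability requires.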
Let’s take a look at the majors. Note that we have categorized majors so that each student can only be in one major category.
# add code
## add code
## add code
# add code
# add code
Now let’s make a table looking at the relationship between year and major.
sta199 %>%
  count(year, major_category)
We’ll reformat the data into a contingency table, a table frequently used to study the association between two categorical variables. In this contingency table, each row will represent a year, each column will represent a major category, and each cell will be the number of students who have a particular combination of year and major category.
To make the contingency table, we will use a new function in tidyr called pivot_wider(). It will take the data frame produced by count() that is currently in a “long” format and reshape it to be in a “wide” format.
We will also use the kable() function in the knitr package to neatly format our new table.
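To see the reshape on something small first, here is a minimal sketch on made-up counts. The data frame `toy_long` and its values are hypothetical and are not the sta199 data. It starts from one row per (year, major) combination, as count() would produce, and spreads the majors into columns.

```r
library(tidyverse)
library(knitr)

# Hypothetical long-format counts: one row per (year, major) combination
toy_long <- tribble(
  ~year,        ~major, ~n,
  "First-year", "A",    3,
  "First-year", "B",    1,
  "Sophomore",  "A",    2
)

# Reshape to wide format: one row per year, one column per major
toy_wide <- toy_long %>%
  pivot_wider(
    id_cols = year,       # each year becomes one row
    names_from = major,   # each major becomes a column
    values_from = n,      # cell values come from the counts
    values_fill = 0       # combinations that never occur get 0
  )

kable(toy_wide)           # neatly display the wide table
```

There is no Sophomore row with major B in `toy_long`, so that cell is filled with 0 because of `values_fill = 0`.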
sta199 %>%
  count(year, major_category) %>%
  pivot_wider(id_cols = year, # how we identify unique obs
              names_from = major_category, # how we will name the columns
              values_from = n, # values used for each cell
              values_fill = 0) %>% # how to fill cells with 0 observations
  kable() # neatly display the results
year | compsci only | econ only | other | pubpol only | stat + other major | stats only | undecided |
---|---|---|---|---|---|---|---|
First-year | 8 | 6 | 39 | 22 | 26 | 7 | 5 |
Junior | 7 | 3 | 12 | 4 | 1 | 0 | 0 |
Senior | 2 | 0 | 5 | 1 | 1 | 0 | 0 |
Sophomore | 23 | 6 | 42 | 11 | 8 | 3 | 5 |
For each of the following exercises:
Calculate the probability using the contingency table above.
Then write code to check your answer using the sta199 data frame and dplyr functions.
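One possible pattern for the code check is sketched below on made-up data. The data frame `toy` and its values are hypothetical stand-ins for sta199, and this is only one of several reasonable ways to do it: count the rows in each category, then divide each count by the total.

```r
library(tidyverse)

# Hypothetical data with one row per student (not the sta199 data)
toy <- tibble(
  year = c("First-year", "First-year", "Sophomore", "Sophomore", "Junior"),
  major_category = c("other", "stats only", "other", "econ only", "other")
)

# Probability that a randomly selected student is in each year:
# count students per year, then divide by the total number of students
year_probs <- toy %>%
  count(year) %>%
  mutate(prob = n / sum(n))

year_probs
```

Because each student falls in exactly one year, the values in the `prob` column sum to 1.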
## add code
## add code
## add code
## add code
pivot_wider and pivot_longer