AE 11: Probability

Bulletin

Exam 01 due tonight at 11:59pm
Quiz 04 out and due Thursday
Homework 1 grades released today
Be sure you’re pushing your application exercise commits (easy points!)

Learning goals

Introduce probabilities and how we can use them to understand categorical data
Create a contingency table using pivot_wider() and kable()
Use a contingency table to explore the relationship between two categorical variables.

Introduction

library(tidyverse)
library(knitr)

sta199 <- read_csv("data/sta199-fa21-year-major.csv")

For this Application Exercise, we will look at the year in school and majors for students taking STA 199 in Fall 2021, i.e., you! The data set includes the following variables:

section: STA 199 section
year: Year in school
major_category: Major / academic interest.
- For the purposes of this AE, we’ll call this the student’s “major”.

Definitions

An element is a member of a set.
A set is a collection of elements. A set may be empty.
A function is a consistent rule that maps each element of one (input) set to exactly one element of a second (output) set.
The probability of an event tells us how likely an event is to occur, and it can take values from 0 to 1, inclusive. It can be viewed as
- the proportion of times the event would occur if it could be observed an infinite number of times (frequentist paradigm)
- our degree of belief that an event will happen (Bayesian paradigm)
An event or outcome is the basic unit to which probability is applied, e.g. the result of an observation or experiment.
- Example: A is the event a student in STA 199 is a sophomore.
A sample space is the set of all possible outcomes. The outcomes in a sample space are disjoint (mutually exclusive), meaning no two of them can occur simultaneously.
- Example: The sample space for year is {First-year, Sophomore, Junior, Senior}
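As a rough sketch of how these definitions translate into code, the block below lists the sample space for year and computes the probability of each outcome as a proportion. The counts are the row totals of the year-by-major table shown in Exercise 2, typed in by hand so the block runs without the course data file. The names year_counts, year_probs, and p_senior are illustrative, not part of the AE.

```r
library(dplyr)
library(tibble)

# Students per year: row totals of the year-by-major table in Exercise 2
# (entered by hand here; in the AE you would start from the sta199 data frame)
year_counts <- tribble(
  ~year,        ~n,
  "First-year", 113,
  "Junior",      27,
  "Senior",       9,
  "Sophomore",   98
)

# Sample space: the set of distinct outcomes for year
sample_space <- year_counts %>% pull(year)

# Probability of each outcome = its share of all students
year_probs <- year_counts %>%
  mutate(prob = n / sum(n))

# Probability of the event "a randomly selected student is a senior"
p_senior <- year_probs %>%
  filter(year == "Senior") %>%
  pull(prob)
```

Because the outcomes are disjoint and cover the whole sample space, the prob column should sum to 1.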

Exercise 1

Let’s take a look at the majors. Note that we have categorized majors so that each student can only be in one major category.

What is the sample space for major? You can use code to identify the sample space.

# add code

Let’s make a table that includes the majors, the number of students in each, and the associated probabilities.

## add code

What is the probability a randomly selected STA 199 student is a “pubpol only” major?

## add code

What is the probability a randomly selected STA 199 student is studying statistics?

# add code

What is the probability a randomly selected STA 199 student is not a “pubpol only” major?

# add code

Exercise 2

Now let’s make a table looking at the relationship between year and major.

sta199 %>%
  count(year, major_category)


year <chr>	major_category <chr>	n <int>
First-year	compsci only	8
First-year	econ only	6
First-year	other	39
First-year	pubpol only	22
First-year	stat + other major	26
First-year	stats only	7
First-year	undecided	5
Junior	compsci only	7
Junior	econ only	3
Junior	other	12

We’ll reformat the data into a contingency table, a table frequently used to study the association between two categorical variables. In this contingency table, each row will represent a year, each column will represent a major, and each cell will be the number of students who have a particular combination of year and major.

To make the contingency table, we will use a new function in tidyr called pivot_wider(). It will take the data frame produced by count(), which is currently in a “long” format, and reshape it into a “wide” format.

We will also use the kable() function in the knitr package to neatly format our new table.

sta199 %>%
  count(year, major_category) %>%
  pivot_wider(id_cols = year, # column that identifies each row of the wide table
              names_from = major_category, # values that become the new column names
              values_from = n, # values used for each cell
              values_fill = 0) %>% # fill year-major combinations with no students with 0
  kable() # neatly display the results

year	compsci only	econ only	other	pubpol only	stat + other major	stats only	undecided
First-year	8	6	39	22	26	7	5
Junior	7	3	12	4	1	0	0
Senior	2	0	5	1	1	0	0
Sophomore	23	6	42	11	8	3	5
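To see what values_fill does in isolation, here is a minimal sketch on a three-row long table. The counts are copied from the Junior and Senior rows above. It assumes, as the need for values_fill suggests, that count() returns no row for a combination with zero students, so the Senior and “econ only” pair is simply left out of the input. The names long_counts and wide_counts are illustrative.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Long format: one row per (year, major) combination that actually occurs.
# There is no ("Senior", "econ only") row.
long_counts <- tribble(
  ~year,    ~major_category, ~n,
  "Junior", "econ only",      3,
  "Junior", "other",         12,
  "Senior", "other",          5
)

# Wide format: one row per year, one column per major.
# values_fill = 0 replaces the missing Senior / "econ only" cell with 0
# instead of leaving it as NA.
wide_counts <- long_counts %>%
  pivot_wider(id_cols = year,
              names_from = major_category,
              values_from = n,
              values_fill = 0)

wide_counts
```

Without values_fill, the missing cell would be NA, which is misleading in a table of counts.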

How many students in STA 199 are first-years in the “econ only” major category?
How many students in STA 199 are in the “other” major category?

Exercise 3

For each of the following exercises:

Calculate the probability using the contingency table above.
Then write code to check your answer using the sta199 data frame and dplyr functions.

What is the probability a randomly selected STA 199 student is a sophomore?

## add code

What is the probability that a randomly selected STA 199 student is a “compsci only” major?

## add code

What is the probability that a randomly selected STA 199 student is a sophomore or a “compsci only” major?

## add code

What is the probability that a randomly selected STA 199 student is a sophomore and a “compsci only” major?

## add code
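One way to sanity-check an “or” probability is the addition rule, P(A or B) = P(A) + P(B) - P(A and B), which subtracts the overlap so that it is not counted twice. The sketch below applies it to a different pair of events (junior, “other” major) so the exercises above stay open. It rebuilds the counts from the contingency table in Exercise 2 by hand rather than reading the course data file, and all object names are illustrative.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Counts copied from the contingency table in Exercise 2, reshaped back to long format
year_major <- tribble(
  ~year,        ~`compsci only`, ~`econ only`, ~other, ~`pubpol only`, ~`stat + other major`, ~`stats only`, ~undecided,
  "First-year",  8, 6, 39, 22, 26, 7, 5,
  "Junior",      7, 3, 12,  4,  1, 0, 0,
  "Senior",      2, 0,  5,  1,  1, 0, 0,
  "Sophomore",  23, 6, 42, 11,  8, 3, 5
) %>%
  pivot_longer(cols = -year, names_to = "major_category", values_to = "n")

total <- sum(year_major$n)

# Helper: probability of the cells selected by a logical vector
prob_of <- function(keep) sum(year_major$n[keep]) / total

is_junior <- year_major$year == "Junior"
is_other  <- year_major$major_category == "other"

p_junior <- prob_of(is_junior)
p_other  <- prob_of(is_other)
p_both   <- prob_of(is_junior & is_other)

# Addition rule versus counting the "or" cells directly
p_either_rule   <- p_junior + p_other - p_both
p_either_direct <- prob_of(is_junior | is_other)

all.equal(p_either_rule, p_either_direct)
```

The two results agree because students who are both juniors and “other” majors are counted once in each of P(A) and P(B), and the rule removes one of those copies.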

Resources

Notes on pivot_wider and pivot_longer
- Click here for slides
- Click here for video