library(tidyverse)
library(knitr)
sta199 <- read_csv("data/sta199-fa21-year-major.csv")
gss <- read_csv("data/gss2018.csv")

Bulletin

Learning goals

Definitions

Let A and B be events.

Part 1: STA 199 years & majors

For this portion of the AE, we will continue using the data including the year in school and majors for students taking STA 199 in Fall 2021, i.e., you! The data set includes the following variables:

Let’s start with the contingency table from the last class:

sta199 %>% 
  count(year, major_category) %>%
  pivot_wider(id_cols = year, # column that identifies each row of the table
              names_from = major_category, #how we will name the columns
              values_from = n, #values used for each cell
              values_fill = 0) %>% #how to fill cells with 0 observations 
  kable() # neatly display the results
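|year       | compsci only| econ only| other| pubpol only| stat + other major| stats only| undecided|
|:----------|------------:|---------:|-----:|-----------:|------------------:|----------:|---------:|
|First-year |            8|         6|    39|          22|                 26|          7|         5|
|Junior     |            7|         3|    12|           4|                  1|          0|         0|
|Senior     |            2|         0|     5|           1|                  1|          0|         0|
|Sophomore  |           23|         6|    42|          11|                  8|          3|         5|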

Try to answer the questions below using the contingency table and using code to answer in a reproducible way.

Part A: What is the probability a randomly selected STA 199 student is studying a subject in the “other” major category?

# add code 

Part B: What is the probability a randomly selected STA 199 student is a first-year?

# add code 
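
One possible way to check Parts A and B, sketched in base R so that it runs without the course data file. The counts are typed in from the contingency table above, and the object names (counts, n_total, p_other, p_first_year) are illustrative, not part of the AE. With the sta199 data frame loaded, the same proportions could instead be computed with count() followed by mutate().

```r
# Counts copied from the contingency table above
# (rows = year, columns = major category).
counts <- matrix(
  c( 8, 6, 39, 22, 26, 7, 5,
     7, 3, 12,  4,  1, 0, 0,
     2, 0,  5,  1,  1, 0, 0,
    23, 6, 42, 11,  8, 3, 5),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    year = c("First-year", "Junior", "Senior", "Sophomore"),
    major_category = c("compsci only", "econ only", "other", "pubpol only",
                       "stat + other major", "stats only", "undecided")
  )
)

n_total <- sum(counts)  # total number of students in the table

# A marginal probability is a row or column total divided by the grand total.
p_other      <- sum(counts[, "other"]) / n_total       # Part A
p_first_year <- sum(counts["First-year", ]) / n_total  # Part B

round(c(other = p_other, first_year = p_first_year), 3)
```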

Part C: What is the probability a randomly selected STA 199 student is a first-year and is studying a subject in the “other” major category?

## add code 

Part D: What is the probability a randomly selected STA 199 student is a first-year given they are studying a subject in the “other” major category?

## add code 

Part E: What is the probability a randomly selected STA 199 student is studying a subject in the “other” major category given they are a first-year?

# add code
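
Parts C through E differ only in the denominator, which a small base R sketch can make explicit. The 2 x 2 table below is a collapse of the contingency table above (an assumption made for brevity): 39 is the first-year/“other” cell, and the remaining counts follow from the first-year row total (113), the “other” column total (98), and the grand total (247). The object names are illustrative.

```r
# 2 x 2 collapse of the contingency table:
# 74 = 113 - 39, 59 = 98 - 39, 75 = 247 - 39 - 74 - 59.
tab <- matrix(
  c(39, 74,
    59, 75),
  nrow = 2, byrow = TRUE,
  dimnames = list(year  = c("First-year", "Not first-year"),
                  major = c("other", "not other"))
)

joint       <- prop.table(tab)              # cell / grand total:  P(year and major)
given_major <- prop.table(tab, margin = 2)  # cell / column total: P(year | major)
given_year  <- prop.table(tab, margin = 1)  # cell / row total:    P(major | year)

p_fy_and_other   <- joint["First-year", "other"]        # Part C
p_fy_given_other <- given_major["First-year", "other"]  # Part D
p_other_given_fy <- given_year["First-year", "other"]   # Part E

round(c(joint = p_fy_and_other,
        fy_given_other = p_fy_given_other,
        other_given_fy = p_other_given_fy), 3)
```

The joint probability divides by all 247 students, while each conditional probability divides only by the students in the group being conditioned on.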

Part F: Are being a first-year and studying a subject in the “other” category independent events? Briefly explain.
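
A sketch of one way to examine Part F numerically, assuming the usual definition that events A and B are independent when P(A and B) = P(A) P(B), or equivalently when P(B | A) = P(B). The four counts are read off the contingency table above, and the variable names are illustrative. Sample proportions rarely match exactly, so the explanation should address how far apart the two values are.

```r
n_total <- 247  # grand total of the contingency table
n_fy    <- 113  # first-year row total
n_other <- 98   # "other" column total
n_both  <- 39   # first-year and "other" cell

p_fy    <- n_fy / n_total
p_other <- n_other / n_total
p_both  <- n_both / n_total

# Under independence, P(first-year and other) would equal this product.
p_both_if_independent <- p_fy * p_other

# Equivalent check: P(other | first-year) would equal P(other).
p_other_given_fy <- n_both / n_fy

round(c(observed = p_both, if_independent = p_both_if_independent), 3)
round(c(conditional = p_other_given_fy, marginal = p_other), 3)
```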

Part 2: Bayes’ Theorem

The global coronavirus pandemic illustrates the need for accurate testing of COVID-19, as its extreme infectivity poses a significant public health threat. Due to the time-sensitive nature of the situation, the FDA enacted emergency authorization of a number of serological tests for COVID-19 in 2020. Full details of these tests may be found on its website here.

We will define the following events:

The Abbott Alinity test has an estimated sensitivity of 100%, P(Pos | Covid) = 1, and specificity of 99%, P(Neg | No Covid) = 0.99.

Suppose the prevalence of COVID-19 in the general population is about 2%, P(Covid) = 0.02.

Part A: Use the Hypothetical 10,000 to calculate the probability a person has COVID given they get a positive test result, i.e. P(Covid | Pos).

Test Covid No Covid Total
Pos 23
Neg 3
Total 10000
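
The Hypothetical 10,000 approach can also be sketched in base R. The inputs are the prevalence, sensitivity, and specificity stated above; the variable names and the layout of the hyp matrix are illustrative assumptions, not the course's required solution.

```r
n           <- 10000
prevalence  <- 0.02  # P(Covid)
sensitivity <- 1.00  # P(Pos | Covid)
specificity <- 0.99  # P(Neg | No Covid)

# Split the 10,000 hypothetical people by disease status.
covid    <- n * prevalence
no_covid <- n - covid

# Split each group by test result.
true_pos  <- covid * sensitivity
false_neg <- covid - true_pos
true_neg  <- no_covid * specificity
false_pos <- no_covid - true_neg

# Assemble the table with row and column totals.
hyp <- rbind(Pos = c(true_pos, false_pos),
             Neg = c(false_neg, true_neg))
colnames(hyp) <- c("Covid", "No Covid")
hyp <- cbind(hyp, Total = rowSums(hyp))
hyp <- rbind(hyp, Total = colSums(hyp))
hyp

# P(Covid | Pos): people with COVID among all positive tests.
p_covid_given_pos <- true_pos / (true_pos + false_pos)
round(p_covid_given_pos, 3)
```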

Part B: Use Bayes’ Theorem to calculate P(Covid | Pos).
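
For this setting, Bayes’ Theorem reads P(Covid | Pos) = P(Pos | Covid) P(Covid) / [P(Pos | Covid) P(Covid) + P(Pos | No Covid) P(No Covid)]. A minimal sketch of that calculation follows; bayes_ppv is a hypothetical helper name introduced here, not a function from any package.

```r
# Hypothetical helper: P(Covid | Pos) from prevalence, sensitivity, specificity.
bayes_ppv <- function(prevalence, sensitivity, specificity) {
  p_pos_given_covid    <- sensitivity      # P(Pos | Covid)
  p_pos_given_no_covid <- 1 - specificity  # P(Pos | No Covid), by the complement rule
  numerator   <- p_pos_given_covid * prevalence
  denominator <- numerator + p_pos_given_no_covid * (1 - prevalence)
  numerator / denominator
}

# Values stated above for the Abbott Alinity test and a 2% prevalence.
p_covid_given_pos <- bayes_ppv(prevalence = 0.02,
                               sensitivity = 1,
                               specificity = 0.99)
round(p_covid_given_pos, 3)
```

The result should agree with the Hypothetical 10,000 calculation, since both divide the true positives by all positives.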

Part 3: Getting started on Lab 05

The General Social Survey (GSS) has been used to measure trends in attitudes and behaviors in American society since 1972. The survey includes demographic information, questions used to gauge attitudes about government spending priorities, confidence in institutions, lifestyle, and many other topics. A full description of the survey may be found here. You will be analyzing data from the 2018 GSS in the lab.

The goal of the lab is to create visualizations and calculate associated probabilities to analyze respondents’ views about industrial air pollution and government spending on alternative energy sources. The data is in gss2018.csv and the variables are:

  1. How many observations are in this dataset? What does each observation represent?

  2. By default, R will arrange the categories of a categorical variable in alphabetical order in any output and visualizations, but we want the levels for indus and altenergy to be in logical order. To achieve this, we will use the factor() function to make both of these variables factors (categorical variables with ordering) and specify the levels we wish to use.

    The code for indus is below. Use this code to make indus a factor, and write code to make altenergy a factor with the levels in the following order: “Don’t know”, “Too little”, “About right”, “Too much.” Save your result to the gss data frame, so the ordered variables are used throughout the lab.

gss <- gss %>%
  mutate(indus = factor(indus, levels = c("Not dangerous", "Somewhat dangerous", 
                                          "Very dangerous", 
                                          "Extremely dangerous")))
  3. Before looking at the relationship between feelings about the impact of industrial air pollution on the environment and government spending on alternative energy sources, we’ll look at the distribution of each variable individually.

    Make a bar plot to examine the distribution of indus. Then calculate the marginal probabilities for indus. In general, how do survey respondents feel about the impact of industrial air pollution? Write 1 - 2 observations from the visualization and probabilities to support your response.
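
To illustrate how factor() levels control display order, and how marginal probabilities follow from counts, here is a small base R sketch. The responses vector is made up for illustration and is not GSS data. The straight apostrophe in "Don't know" is also an assumption: the level strings must match the spelling in the actual data exactly, or the unmatched values become NA.

```r
# Hypothetical responses, made up for illustration (not GSS data).
responses <- c("Too little", "About right", "Too little", "Don't know",
               "Too much", "Too little", "About right", "Too little")

# Without levels, factor() orders the categories alphabetically.
alphabetical <- factor(responses)
levels(alphabetical)

# Supplying levels sets the order used in tables and plots.
ordered_resp <- factor(responses,
                       levels = c("Don't know", "Too little",
                                  "About right", "Too much"))
levels(ordered_resp)

# Marginal probabilities: count in each level / total number of responses.
marginal <- prop.table(table(ordered_resp))
marginal
```

In the lab, the same factor() call would sit inside mutate(), as in the indus code above, so that the ordered variable is saved back to the gss data frame.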