Lab 08: Inference using the Central Limit Theorem

due Friday, November 05 at 11:59pm

Learning Goals

In this lab you will…

Getting started

Packages

We will use the tidyverse and tidymodels packages in this assignment.

What makes a good burrito?

The goal of today’s lab is to use CLT-based inference to evaluate the synergy of burritos.

The Data

Today’s dataset has been adapted from Scott Cole’s Burritos of San Diego project, located here. The goal of the project was to identify the best and worst burritos in San Diego, characterize variance in burrito quality, and generate predictive models for what makes a burrito great.

As part of this project, 71 participants reviewed burritos from 79 different taco shops. Reviewers captured objective measures of the burrito (such as whether it contains certain ingredients) and reviewed it on a number of metrics (such as quality of the tortilla, the temperature, quality of meat, etc.). For the purposes of this lab, you may consider each of these observations to be an independent and representative sample of all burritos.

The subjective ratings in the dataset are as follows. Each variable is ranked on a 0 to 5 point scale, with 0 being the worst and 5 being the best.

In addition, the reviewers noted the presence of the following burrito components. Each of the following variables is a binary variable taking on values present or none:

The data are available in burritos.csv.

Exercises

Instructions

  • Make sure we see all relevant code and output in the knitted PDF. If you use inline code, make sure we can still see the code used to derive that answer.
  • Write a narrative for each exercise.
  • All narrative should be written in full sentences, and visualizations should have clear titles and axis labels.

The goal of this analysis is to use inference based on the Central Limit Theorem to analyze the mean synergy rating of burritos.


  1. We’ll start by examining the distribution of synergy, a rating indicating how well all the ingredients in the burrito come together.

    • Visualize the distribution of synergy using a histogram with binwidth of 0.5.

    • Calculate the following summary statistics: the mean synergy, standard deviation of synergy, and sample size. Save the summary statistics as summary_stats, then display summary_stats.
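A minimal sketch of both steps, assuming the data are read from burritos.csv and the rating column is named synergy:

```r
library(tidyverse)

burritos <- read_csv("burritos.csv")

# Histogram of synergy with binwidth 0.5
ggplot(burritos, aes(x = synergy)) +
  geom_histogram(binwidth = 0.5) +
  labs(title = "Distribution of burrito synergy ratings",
       x = "Synergy rating (0-5)",
       y = "Count")

# Summary statistics: mean, standard deviation, and sample size
summary_stats <- burritos %>%
  summarise(mean_synergy = mean(synergy),
            sd_synergy   = sd(synergy),
            n            = n())

summary_stats
```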

  2. The goal of this analysis is to use CLT-based inference to understand the true mean synergy rating of all burritos. The idea is that if CLT holds, we can assume the distribution of the sample mean is normal and thus easily generate a normal null distribution to test hypotheses.

Based on the data, what is your “best guess” for the mean synergy rating of burritos?

  3. Is the synergy in burritos generally good? To answer this question, we will conduct a hypothesis test to evaluate whether the mean synergy is greater than 3.

Before conducting inference, we need to check the conditions to make sure the Central Limit Theorem can be applied in this analysis. For each condition, indicate whether it is satisfied and provide a brief explanation supporting your response.

 - Independence? 
 - Sample size / distribution? 
  4. State the null and alternative hypotheses to evaluate the question posed in the previous exercise. Write the hypotheses in words and in statistical notation.

  5. Let \(\bar{x}\) be the mean synergy score in a sample of 330 randomly selected burritos. Given the Central Limit Theorem and the hypotheses from the previous exercise, describe the sampling distribution of \(\bar{x}\) under the null hypothesis, stating its shape, center, and standard error.

  6. Next, use R as a “calculator” to calculate the test statistic, \(T\). Recall the formula for the test statistic:

\[T = \frac{\bar{x}- \mu_{0}}{s/\sqrt{n}}\] where \(\bar{x}\) is the sample mean, \(\mu_0\) is the mean under the null, \(s\) is the sample s.d. and \(n\) is the sample size.
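Plugging into this formula is a one-line calculation. The sketch below uses hypothetical values for \(\bar{x}\) and \(s\); substitute the actual values from your summary_stats in Exercise 1.

```r
# Hypothetical values for illustration -- replace x_bar and s with
# the values from your summary_stats in Exercise 1
x_bar <- 3.5   # sample mean (assumed)
mu_0  <- 3     # mean under the null hypothesis
s     <- 0.9   # sample standard deviation (assumed)
n     <- 330   # sample size

t_stat <- (x_bar - mu_0) / (s / sqrt(n))
t_stat
```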

  7. Now let’s calculate the p-value and draw a conclusion.
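Because the alternative hypothesis is one-sided (\(\mu > 3\)), the p-value is the area under the \(t_{n-1}\) distribution to the right of the test statistic. A sketch using the same hypothetical test statistic as above:

```r
n      <- 330     # sample size
t_stat <- 10.09   # hypothetical test statistic from the previous step

# For H_A: mu > 3, the p-value is the area to the right of t_stat
# under the t distribution with n - 1 degrees of freedom
p_value <- pt(t_stat, df = n - 1, lower.tail = FALSE)
p_value
```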
  8. Now let’s calculate a 90% confidence interval for the mean synergy rating of all burritos. The confidence interval for a population mean is

\[\bar{x} \pm t^*_{n-1} \times \frac{s}{\sqrt{n}}\]

We already know \(\bar{x}\) and \(\frac{s}{\sqrt{n}}\), so let’s focus on calculating \(t^*_{n-1}\). We will use the qt() function to calculate the critical value \(t^*_{n-1}\).

Here is an example: if we want to calculate a 95% confidence interval for the mean, we use qt(0.975, n - 1), where 0.975 is the cumulative probability at the upper bound of the 95% confidence interval (recall we used this value to find the upper bound when calculating bootstrap confidence intervals) and n - 1 is the degrees of freedom.
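Following that example, a 90% interval uses 0.95 as the cumulative probability at the upper bound. A sketch with hypothetical summary values (substitute your own from summary_stats):

```r
n     <- 330   # sample size
x_bar <- 3.5   # sample mean (assumed)
s     <- 0.9   # sample standard deviation (assumed)

t_star <- qt(0.95, df = n - 1)   # critical value for a 90% CI
se     <- s / sqrt(n)            # standard error of the mean

c(lower = x_bar - t_star * se, upper = x_bar + t_star * se)
```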

  9. In the previous exercises, we conducted a hypothesis test and calculated a confidence interval step-by-step. We can also use the infer package for the calculations in CLT-based inference using the t_test() function.

The results should be the same as the calculations you did in the previous exercises.

burritos %>%
  t_test(response = _____, 
         alternative = "______", 
         mu = ______, 
         conf_int = FALSE)

The results should be the same as the calculations from Exercise 8.

burritos %>%
  t_test(response = _____, 
         conf_int = TRUE, 
         conf_level = _____) %>%
  select(lower_ci, upper_ci)
  10. Now let’s compare simulation-based inference with inference using the Central Limit Theorem.

Submission

Knit to PDF to create a PDF document. Knit and commit all remaining changes, and push your work to GitHub. Make sure all files are updated on your GitHub repo.

Please only upload your PDF document to Gradescope. Associate the “Overall” graded section with the first page of your PDF, and mark where each answer is to the exercises. If any answer spans multiple pages, then mark all pages.

Grading (50 points)

| Component             | Points |
|-----------------------|--------|
| Ex 1                  | 5      |
| Ex 2                  | 2      |
| Ex 3                  | 4      |
| Ex 4                  | 4      |
| Ex 5                  | 4      |
| Ex 6                  | 5      |
| Ex 7                  | 6      |
| Ex 8                  | 8      |
| Ex 9                  | 4      |
| Ex 10                 | 4      |
| Workflow & formatting | 4      |