library(tidyverse)
library(tidymodels)
library(scatterplot3d)
library(viridis)
By the end of today, you will be able to…
To begin, we’ll work with a dataset on iris flowers contained with base R
. This dataset comes from an old statisical analysis paper aptly titled The use of multiple measurements in taxonomic problems
data(iris)
glimpse(iris)
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
See ?iris
for more information.
[write here]
Conceptualize \(R^2\)
\[ R^2 = 1 - \frac{\text{sum of squared error}}{\text{sum of square distance from mean in data}} \] Let’s focus on the second term to build intuition.
The numerator “sum of squared error” is the amount of variability not explained by the model.
The denominator is proportional to the variance in the data, i.e. the amount of variability in the data.
Together, this second term represents the proportion of variability not explained by the model.
If the proportion not explained is 0, the model explains all variability and \(R^2 = 1 - 0 = 1\).
If the proportion not explained is \(1\), i.e. the model does not explain any variability, then \(R^2 = 1 - 1 = 0\).
[updated explanation]
Suppose that we want to go out and collect more data on irises, but measuring several parts of each iris flower is time consuming. To save on the time it takes to collect future data, let’s see if one of the observations (petal length) could be predicted from just measuring a flower’s sepal. (This might make sense since the sepal is the part of the flower below the bud that contains the petals before it blooms)
To do this, we will setup three linear models:
predict petal length from sepal length
predict petal length from sepal width
predict petal length from both sepal width and sepal length
[answer here]
# code here
[answer here]
# code here
Returning to previous joint model,
\[ \text{petal length} = \beta_0 + \beta_{\text{sepal width}} \cdot \text{sepal width} + \beta_{\text{sepal length}} \cdot \text{sepal length} \]
Does our data offer sufficient evidence that Sepal.Width
is actually associated with (and therefore might help us predict) Petal.Length
?
iris %>%
ggplot(aes(x = Sepal.Width, y = Petal.Length)) +
geom_point() +
theme_minimal()
Let’s conduct a hypothesis test in a regression framework to find out.
If Sepal.Width
does not help explain Petal.Length
, \(\beta_{\text{sepal width}} = 0\), this is our null hypothesis.
For OLS regression, our test statistic is
\[ T = \frac{\hat{\beta} - 0}{\text{SE}_{\hat{\beta}}} \sim t_{n - 2} \] We want to see if our observed statistic, \(\hat{T}\), falls far in the tail under the null.
R
takes care of much of this behind the scenes with the tidy output and reports a p-value for each \(\beta\) by default.
Fit the regression model and display the tidy output below.
# code here
Is \(\beta_{\text{sepal width}}\) significant?
iris %>%
ggplot(aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
geom_point() +
scale_color_viridis_d()
In the plot above, it appears the relationship between sepal length and petal length, i.e. the slope of Petal.Length ~ Sepal.Length
varies drastically from one species of iris to another.
Sepal.Length
, Species
and an interaction effect between Sepal.Length
and Species
. See here for example from the prep.# code here
[write here]