library(tidyverse)
library(tidymodels)
library(scatterplot3d)
library(viridis)
By the end of today, you will be able to…
To begin, we’ll work with a dataset on iris flowers that is included with base R. This dataset comes from an old statistical analysis paper aptly titled “The use of multiple measurements in taxonomic problems”.
data(iris)
glimpse(iris)
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
See ?iris for more information.
[write here]
Conceptualize $R^2$
$$R^2 = 1 - \frac{\text{sum of squared error}}{\text{sum of squared distance from mean in data}}$$
The numerator “sum of squared error” is the amount of variability not explained by the model.
The denominator is proportional to the variance in the data, i.e. the amount of variability in the data.
Taken together, the ratio in the second term is the proportion of variability not explained by the model.
If the proportion not explained is 0, the model explains all variability and $R^2 = 1 - 0 = 1$.
If the proportion not explained is 1, i.e. the model does not explain any variability, then $R^2 = 1 - 1 = 0$.
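To make the formula concrete, here is a minimal sketch that computes $R^2$ by hand and compares it with the value R reports. The choice of model (petal length predicted from sepal length) and the names `example_fit`, `sse`, `sst`, and `r_squared` are assumptions made for illustration only.

```r
# Illustrative model (an assumption): petal length predicted from sepal length
example_fit <- lm(Petal.Length ~ Sepal.Length, data = iris)

# Numerator: sum of squared error (variability not explained by the model)
sse <- sum(residuals(example_fit)^2)

# Denominator: sum of squared distance from the mean of the outcome
sst <- sum((iris$Petal.Length - mean(iris$Petal.Length))^2)

# R^2 = 1 - (proportion of variability not explained)
r_squared <- 1 - sse / sst
r_squared

# Should agree with the value computed by summary()
all.equal(r_squared, summary(example_fit)$r.squared)
```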
[updated explanation]
Suppose that we want to go out and collect more data on irises, but measuring several parts of each iris flower is time-consuming. To save time when collecting future data, let’s see whether one of the measurements (petal length) could be predicted from measurements of a flower’s sepal alone. (This might make sense, since the sepal is the part of the flower below the bud that contains the petals before it blooms.)
To do this, we will set up three linear models:
predict petal length from sepal length
predict petal length from sepal width
predict petal length from both sepal width and sepal length
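One possible way to set up the three models is sketched below using the tidymodels interface loaded at the top. The object names (`lm_spec`, `fit_length`, `fit_width`, `fit_both`, `r2_table`) are hypothetical, and comparing the models by $R^2$ is just one reasonable choice, not the only one.

```r
library(tidymodels)

# A reusable model specification: ordinary least squares via lm()
lm_spec <- linear_reg() %>%
  set_engine("lm")

# Model 1: petal length from sepal length
fit_length <- lm_spec %>%
  fit(Petal.Length ~ Sepal.Length, data = iris)

# Model 2: petal length from sepal width
fit_width <- lm_spec %>%
  fit(Petal.Length ~ Sepal.Width, data = iris)

# Model 3: petal length from both sepal width and sepal length
fit_both <- lm_spec %>%
  fit(Petal.Length ~ Sepal.Width + Sepal.Length, data = iris)

# Collect R^2 for each model so they can be compared side by side
r2_table <- tibble(
  model = c("sepal length", "sepal width", "both"),
  r_squared = c(
    glance(fit_length)$r.squared,
    glance(fit_width)$r.squared,
    glance(fit_both)$r.squared
  )
)
r2_table
```

Keep in mind that $R^2$ never decreases when a predictor is added, so the joint model will score at least as high as either single-predictor model on this measure.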
[answer here]
# code here
[answer here]
# code here
Returning to the previous joint model,
$$\text{petal length} = \beta_0 + \beta_{\text{sepal width}} \cdot \text{sepal width} + \beta_{\text{sepal length}} \cdot \text{sepal length}$$
Does our data offer sufficient evidence that Sepal.Width is actually associated with (and therefore might help us predict) Petal.Length?
iris %>%
ggplot(aes(x = Sepal.Width, y = Petal.Length)) +
geom_point() +
theme_minimal()
Let’s conduct a hypothesis test in a regression framework to find out.
If Sepal.Width does not help explain Petal.Length, then $\beta_{\text{sepal width}} = 0$. This is our null hypothesis.
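For OLS regression, our test statistic is
$$T = \frac{\hat{\beta} - 0}{SE_{\hat{\beta}}} \sim t_{n-2}$$
The $n - 2$ degrees of freedom apply to a model with a single predictor; in general, the degrees of freedom are $n$ minus the number of estimated coefficients.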
R takes care of much of this behind the scenes with the tidy output and reports a p-value for each $\beta$ by default.
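To see what happens behind the scenes, the sketch below recomputes the test statistic and two-sided p-value for the Sepal.Width coefficient by hand and checks them against the tidy output. The names `joint_fit`, `width_row`, `t_stat`, `resid_df`, and `p_value` are illustrative. The degrees of freedom are taken from the fitted model rather than hard-coded, since the joint model estimates three coefficients.

```r
library(tidymodels)

# Fit the joint model: petal length from sepal width and sepal length
joint_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(Petal.Length ~ Sepal.Width + Sepal.Length, data = iris)

# Pull the row of the tidy output for the Sepal.Width coefficient
width_row <- tidy(joint_fit) %>%
  filter(term == "Sepal.Width")

# Test statistic: (estimate - null value) / standard error
t_stat <- (width_row$estimate - 0) / width_row$std.error

# Residual degrees of freedom of the underlying lm object
resid_df <- df.residual(joint_fit$fit)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_stat), df = resid_df, lower.tail = FALSE)

# Both should match the columns that tidy() reports
all.equal(t_stat, width_row$statistic)
all.equal(p_value, width_row$p.value)
```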
Fit the regression model and display the tidy output below.
# code here
Is $\beta_{\text{sepal width}}$ significant?
iris %>%
ggplot(aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
geom_point() +
scale_color_viridis_d()
In the plot above, it appears the relationship between sepal length and petal length, i.e. the slope of Petal.Length ~ Sepal.Length, varies drastically from one species of iris to another.
Fit a linear model to predict Petal.Length from Sepal.Length, Species, and an interaction effect between Sepal.Length and Species. See here for an example from the prep.
# code here
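As a rough sketch of what an interaction model can look like, the code below uses the formula shorthand `Sepal.Length * Species`, which expands to both main effects plus their interaction. The object names `interaction_fit` and `interaction_terms` are hypothetical, and this is one possible setup rather than the one shown in the prep.

```r
library(tidymodels)

# Sepal.Length * Species expands to
# Sepal.Length + Species + Sepal.Length:Species
interaction_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(Petal.Length ~ Sepal.Length * Species, data = iris)

# One row per coefficient: intercept, slope for the baseline species,
# intercept shifts for the other two species, and slope shifts
# (the interaction terms) for the other two species
interaction_terms <- tidy(interaction_fit)
interaction_terms
```

The interaction terms measure how much the slope on Sepal.Length differs for each species relative to the baseline level (setosa, the first factor level in `iris`).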
[write here]