In this lab you will…
Use the `tidymodels` framework to build a linear model and estimate regression parameters
A repository has already been created for you and your teammates. Everyone in your team has access to the same repo.
Go to the sta199-f21 course organization on GitHub.
You should see a repo with the **lab09** prefix.
Each person on the team should clone the repository and open a new project in RStudio.
We will use the `tidyverse` and `tidymodels` packages in this assignment.
Parasites can cause infectious disease – but not all animals are affected by the same parasites. Some parasites are present in a multitude of species and others are confined to a single host. It is hypothesized that closely related hosts are more likely to share the same parasites. More specifically, it is thought that closely related hosts will live in similar environments and have similar genetic makeup that coincides with optimal conditions for the same parasite to flourish.
In this lab we will see how much evolutionary history predicts parasite similarity.
Today’s dataset comes from an Ecology Letters paper by Cooper et al. (2012) entitled “Phylogenetic host specificity and understanding parasite sharing in primates” located here. The goal of the paper was to identify the ability of evolutionary history and ecological traits to characterize parasite host specificity.
Each row contains two species, `species1` and `species2`. Subsequent columns describe metrics that compare the species.
`divergence_time`: how many millions of years ago the two species diverged, i.e. how many million years ago they were the same species.
`distance`: geodesic distance between the species' geographic range centroids (in kilometers)
`BMdiff`: difference in body mass between the two species (in grams)
`precdiff`: difference in mean annual precipitation across the two species' geographic ranges (in mm)
`parsim`: a measure of parasite similarity (proportion of parasites shared between species, ranges from 0 to 1)
The data are available in `parasites.csv`, located in the `data` folder.
Load the data and save your data frame as `parasites`.
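A minimal sketch of the loading step, assuming the working directory is the root of the cloned RStudio project so that the relative path `data/parasites.csv` resolves. The `file.exists()` guard and the `csv_path` name are illustrative additions, not part of the lab instructions.

```r
library(tidyverse)

# Path relative to the project root (assumes the repo layout described above)
csv_path <- "data/parasites.csv"

if (file.exists(csv_path)) {
  # Read the CSV into a tibble named `parasites`, as the lab asks
  parasites <- read_csv(csv_path)
  glimpse(parasites)
} else {
  # Most often this means the working directory is not the project root
  message("File not found: ", csv_path, " (working directory: ", getwd(), ")")
}
```

If the file is not found, opening the project through its `.Rproj` file usually sets the working directory correctly.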
Let’s start by examining the relationship between `divergence_time` and `parsim`.
Fit the linear regression model and display the results.
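As a rough illustration of fitting and displaying a linear model with `tidymodels`, the sketch below regresses `parsim` on `divergence_time`. The data frame `toy_parasites` and its values are made up so the code runs on its own; they are not the lab data, and in the lab you would pass `parasites` instead.

```r
library(tidymodels)

# Made-up toy values so the sketch runs on its own; these are NOT the lab data.
toy_parasites <- tibble(
  divergence_time = c(5, 10, 20, 35, 50, 65),
  parsim          = c(0.60, 0.45, 0.30, 0.20, 0.10, 0.05)
)

# Specify a linear regression fit by ordinary least squares, then fit it
parsim_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(parsim ~ divergence_time, data = toy_parasites)

# One row per term: the intercept and the slope for divergence_time
tidy(parsim_fit)
```

The `estimate` column of the tidy output holds the fitted intercept and slope.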
Hint: `geom_smooth(method = "lm")` adds a fitted regression line to a plot.
Since `parsim` takes values between 0 and 1, we want to transform this variable so that it can range over \((-\infty, +\infty)\). This will be better suited for fitting a regression model (and interpreting predicted values!).
Create a new variable `transformed_parsim` that is calculated as `log(parsim/(1-parsim))`. Add this variable to your data frame. This is called a “logit” transformation: it takes values in \((0, 1)\) and maps them to \((-\infty, +\infty)\), as we desire, while preserving their order.
Visualize the relationship between `divergence_time` and `transformed_parsim`. Add a regression line to your visualization.
Fit a linear regression model for each of the following pairs of variables:
`divergence_time` and `transformed_parsim`
`distance` and `transformed_parsim`
`BMdiff` and `transformed_parsim`
`precdiff` and `transformed_parsim`
Do not report the model outputs in a `tidy` format, but do save each one as `dt_model`, `dist_model`, `BM_model`, and `prec_model`, respectively.
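The transformation, plot, and four model fits described above could be sketched roughly as follows. The data frame `toy_parasites` and all of its values are invented so the code is self-contained; they are not the lab data. The names `lm_spec` and `logit_plot` are also illustrative choices.

```r
library(tidymodels)

# Made-up toy values so the sketch runs on its own; these are NOT the lab data.
toy_parasites <- tibble(
  divergence_time = c(5, 10, 20, 35, 50, 65),
  distance        = c(300, 1200, 800, 2500, 4000, 3100),
  BMdiff          = c(150, 900, 400, 5200, 2500, 7000),
  precdiff        = c(100, 250, 50, 600, 300, 900),
  parsim          = c(0.60, 0.45, 0.30, 0.20, 0.10, 0.05)
) %>%
  # Logit transform; finite only when parsim is strictly between 0 and 1
  mutate(transformed_parsim = log(parsim / (1 - parsim)))

# Scatterplot with a least-squares line (print logit_plot to display it)
logit_plot <- ggplot(toy_parasites,
                     aes(x = divergence_time, y = transformed_parsim)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# One reusable model specification for all four fits
lm_spec <- linear_reg() %>% set_engine("lm")

dt_model   <- fit(lm_spec, transformed_parsim ~ divergence_time, data = toy_parasites)
dist_model <- fit(lm_spec, transformed_parsim ~ distance,        data = toy_parasites)
BM_model   <- fit(lm_spec, transformed_parsim ~ BMdiff,          data = toy_parasites)
prec_model <- fit(lm_spec, transformed_parsim ~ precdiff,        data = toy_parasites)
```

Note that rows with `parsim` exactly equal to 0 or 1 would give `-Inf` or `Inf` under this transformation, so it is worth checking the transformed column before fitting.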
Would it be useful to compare the slopes in each model to choose the variable that is the strongest predictor of parasite similarity? Why or why not?
As you may have guessed from the name, \(R^2\) can be calculated by squaring the correlation (recall the correlation from Lab 01); this holds for a linear model with a single predictor. The correlation \(r\) takes values from -1 to 1; therefore, \(R^2\) takes values from 0 to 1. Intuitively, if \(r = 1 \text{ or }-1\), then \(R^2 = 1\), indicating the model is a perfect fit for the data. If \(r \approx 0\) then \(R^2 \approx 0\), indicating the model is a very bad fit for the data.
You can calculate \(R^2\) using the `glance` function. For example, you can calculate \(R^2\) for `dt_model` using the code `glance(dt_model)$r.squared`.
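To make the link between \(R^2\) and the correlation concrete, the sketch below fits a one-predictor model and compares `glance()`'s `r.squared` with the squared correlation. The data frame `toy_parasites` holds made-up values for illustration only; it is not the lab data.

```r
library(tidymodels)

# Made-up toy values so the sketch runs on its own; these are NOT the lab data.
toy_parasites <- tibble(
  divergence_time    = c(5, 10, 20, 35, 50, 65),
  transformed_parsim = c(0.41, -0.20, -0.85, -1.39, -2.20, -2.94)
)

dt_model <- linear_reg() %>%
  set_engine("lm") %>%
  fit(transformed_parsim ~ divergence_time, data = toy_parasites)

# R-squared as reported by glance()
r_squared <- glance(dt_model)$r.squared

# Correlation between the predictor and the response
r <- cor(toy_parasites$divergence_time, toy_parasites$transformed_parsim)

# With a single predictor these should agree up to floating-point error
all.equal(r_squared, r^2)
```

The same comparison can be repeated for the other single-predictor models by swapping in the corresponding column.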
Knit to PDF to create a PDF document. Knit and commit all remaining changes, and push your work to GitHub. Make sure all files are updated on your GitHub repo.
Please only upload your PDF document to Gradescope. Associate the “Overall” graded section with the first page of your PDF, and mark the page where the answer to each exercise appears. If any answer spans multiple pages, then mark all pages.
Component | Points |
---|---|
Ex 0 | 2 |
Ex 1 | 8 |
Ex 2 | 12 |
Ex 3 | 8 |
Ex 4 | 8 |
Ex 5 | 7 |
Workflow & formatting | 5 |