Learning Goals

In this lab you will…

Use tidymodels framework to build a linear model and estimate regression parameters
Visualize your linear model

Getting started

A repository has already been created for you and your teammates. Everyone in your team has access to the same repo.
Go to the sta199-f21 course organization on GitHub.
You should see a repo with the *lab09** prefix.
Each person on the team should clone the repository and open a new project in RStudio.

Packages

We will use the tidyverse and tidymodels packages in this assignment

Intro

Parasites can cause infectious disease – but not all animals are affected by the same parasites. Some parasites are present in a multitude of species and others are confined to a single host. It is hypothesized that closely related hosts are more likely to share the same parasites. More specifically, it is thought that closely related hosts will live in similar environments and have similar genetic makeup that coincides with optimal conditions for the same parasite to flourish.

In this lab we will see how much evolutionary history predicts parasite similarity.

The Data

Today’s dataset comes from an Ecology Letters paper by Cooper at al. (2012) entitled “Phylogenetic host specificity and understanding parasite sharing in primates” located here. The goal of the paper was to identify the ability of evolutionary history and ecological traits to characterize parasite host specificity.

Each row contains two species, species1 and species2

Subsequent columns describe metrics that compare the species.

divergence_time: how many (millions) of years ago the two species diverged. i.e. how many million years ago they were the same species.
distance: geodesic distance between species geographic range centroids (in kilometers)
BMdiff: difference in body mass between the two species (in grams)
precdiff: difference in mean annual precipitation across the two species geographic ranges (mm)
parsim: a measure of parasite similiarity (proportion of parasites shared between species, ranges from 0 to 1.)

The data are available in parasites.csv located in the data folder.

Exercises

Instructions

Make sure we see all relevant code and output in the knitted PDF. If you use inline code, make sure we can still see the code used to derive that answer.
Write a narrative for each exercise.
All narrative should be written in full sentences, and visualizations should have clear title and axis labels.

Load the data and save your dataframe as parasites
Let’s start by examining the relationship between divergence_time and parsim.
- Based on the goals of the analysis, what is the response variable?
- Visualize the relationship between the two variables (remember to include axes and title.)
- In one to two sentences, describe what you see.
Fit the linear regression model and display the results.
- Write the regression equation.
- Interpret the slope and the intercept in the context of the data.
- Recreate the visualization from Exercise 1, this time adding a regression line to the visualization geom_smooth(method = "lm").
- What do you notice about the prediction (regression) line that may be strange, particularly for very large divergence times?

⊕This is called a “logit” transformation and takes values between $(0, 1]$ and maps them to $(-\infty, + \infty)$ like we desire while preserving their order.

Since parsim takes values between 0 and 1, we want to transform this variable so that it can range between $(-\infty, + \infty)$. This will be better suited for fitting a regression model (and interpreting predicted values!)

Create a new variable transformed_parsim that is calculated as log(parsim/(1-parsim)). Add this variable to your data frame.
Then, visualize the relationship between divergence_time and transformed_parsim. Add a regression line to your visualization.
Write a 1-2 sentence description of what you observe in the visualization.

Which variable is the strongest individual predictor of parasite similarity between species? To answer this question, begin by fitting a linear regression model to each pair of variables.

divergence_time and transformed_parsim
distance and transformed_parsim
BMdiff and transformed_parsim
precdiff and transformed_parsim

Do not report the model outputs in a tidy format but do save each one as dt_model, dist_model, BM_model and prec_model respectively.

Would it be useful to compare the slopes in each model to choose the variable that is the strongest predictor of parasite similarity? Why or why not?

To compare the explanatory power of each individual predictor, we will look at $R^2$ between the models. $R^2$ is a measure of how much of the variability in the response variable is explained by the model (we will talk more about $R^2$ and the mathematics behind it in an upcoming lecture!).

As you may have guessed from the name $R^2$ can be calculated by squaring the correlation (recall the correlation from Lab 01). The correlation $r$ takes values -1 to 1, therefore, $R^2$ takes values 0 to 1. Intuitively, if $r = 1 \text{ or }-1$, then $R^2 = 1$, indicating the model is a perfect fit for the data. If $r \approx 0$ then $R^2 \approx 0$, indicating the model is a very bad fit for the data.

You can calculate $R^2$ using the glance function. For example, you can calculate $R^2$ for dt_model using the code glance(dt_model)$r.squared.

Calculate $R^2$ for each model fit in the previous exercise.
Which variable is the strongest predictor of parasite similarity? Briefly explain your choice.

Submission

Knit to PDF to create a PDF document. Knit and commit all remaining changes, and push your work to GitHub. Make sure all files are updated on your GitHub repo.

Please only upload your PDF document to Gradescope. Associate the “Overall” graded section with the first page of your PDF, and mark where each answer is to the exercises. If any answer spans multiple pages, then mark all pages.

Grading (50 points)

Component	Points
Ex 0	2
Ex 1	8
Ex 2	12
Ex 3	8
Ex 4	8
Ex 5	7
Workflow & formatting	5

Lab 09: Intro to linear regression

due Tuesday, November 09 at 11:59pm