Lab 09: Intro to linear regression

due Tuesday, November 09 at 11:59pm

Learning Goals

In this lab you will…

Getting started


We will use the tidyverse and tidymodels packages in this assignment.
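A minimal sketch of loading both packages at the top of your R Markdown document, assuming they are already installed in your R environment:

```r
# Load the packages used throughout this lab.
library(tidyverse)   # data wrangling and plotting (dplyr, ggplot2, readr, ...)
library(tidymodels)  # modeling tools (parsnip, broom, ...)
```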


Parasites can cause infectious disease – but not all animals are affected by the same parasites. Some parasites are present in a multitude of species and others are confined to a single host. It is hypothesized that closely related hosts are more likely to share the same parasites. More specifically, it is thought that closely related hosts will live in similar environments and have similar genetic makeup that coincides with optimal conditions for the same parasite to flourish.

In this lab we will see how much evolutionary history predicts parasite similarity.

The Data

Today’s dataset comes from an Ecology Letters paper by Cooper et al. (2012) entitled “Phylogenetic host specificity and understanding parasite sharing in primates” located here. The goal of the paper was to identify the ability of evolutionary history and ecological traits to characterize parasite host specificity.

Each row contains two species, species1 and species2.

Subsequent columns describe metrics that compare the species.

The data are available in parasites.csv located in the data folder.
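One way to read the file is sketched below. The relative path data/parasites.csv is an assumption based on the description above, so adjust it to match your own project layout.

```r
library(tidyverse)

# Assumed path: parasites.csv inside a data/ folder relative to your project.
parasites_path <- "data/parasites.csv"

# Read the file only if it exists at the assumed location.
if (file.exists(parasites_path)) {
  parasites <- read_csv(parasites_path)
  glimpse(parasites)  # quick look at the column names and types
}
```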



  • Make sure we see all relevant code and output in the knitted PDF. If you use inline code, make sure we can still see the code used to derive that answer.
  • Write a narrative for each exercise.
  • All narrative should be written in full sentences, and visualizations should have a clear title and axis labels.
  1. Load the data and save your dataframe as parasites.

  2. Let’s start by examining the relationship between divergence_time and parsim.

    • Based on the goals of the analysis, what is the response variable?
    • Visualize the relationship between the two variables (remember to include axis labels and a title).
    • In one to two sentences, describe what you see.
  3. Fit the linear regression model and display the results.

    • Write the regression equation.
    • Interpret the slope and the intercept in the context of the data.
    • Recreate the visualization from the previous exercise, this time adding a regression line to the visualization using geom_smooth(method = "lm").
    • What do you notice about the prediction (regression) line that may be strange, particularly for very large divergence times?
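If the tidymodels workflow is new to you, the sketch below fits a simple linear regression of the form \(\hat{y} = b_0 + b_1 x\) and overlays the fitted line on a scatterplot. It uses mtcars, a dataset that ships with R, purely as a stand-in. The object names car_fit and car_plot are illustrative, and the variables are not the ones you need for this lab.

```r
library(tidyverse)
library(tidymodels)

# Stand-in data: mtcars ships with R; it is NOT the lab data.
car_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ wt, data = mtcars)

# One row per coefficient: the intercept and the slope for wt.
tidy(car_fit)

# Scatterplot with the fitted least-squares line overlaid.
car_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(
    title = "Fuel efficiency vs. weight",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon"
  )
car_plot
```

The estimate column of the tidy() output holds \(b_0\) (the row labeled (Intercept)) and \(b_1\) (the row labeled with the predictor name), which are the two numbers you need to write a regression equation.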

  4. Since parsim takes values between 0 and 1, we want to transform this variable so that it can range over \((-\infty, +\infty)\). This will be better suited for fitting a regression model (and interpreting predicted values!).

The transformation we will use, \(\log\left(\frac{p}{1 - p}\right)\), is called the “logit” transformation. It takes values in \((0, 1)\) and maps them to \((-\infty, +\infty)\), as desired, while preserving their order.
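A minimal sketch of applying a logit with mutate() is shown here on a small made-up tibble. The names toy, p, and logit_p are hypothetical, and the values are for illustration only.

```r
library(tidyverse)

# Made-up proportions for illustration only; these are not values from the lab data.
toy <- tibble(p = c(0.1, 0.5, 0.9))

# logit(p) = log(p / (1 - p)), using the natural log.
toy_logit <- toy %>%
  mutate(logit_p = log(p / (1 - p)))

toy_logit
```

Note that a proportion of 0.5 maps to 0, and that a proportion of exactly 0 or 1 would map to -Inf or Inf in R, so it is worth checking whether such values occur before fitting a model to the transformed variable.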
  5. Which variable is the strongest individual predictor of parasite similarity between species? To answer this question, begin by fitting a linear regression model to each pair of variables.

Do not report the model outputs in a tidy format, but do save the models as dt_model, dist_model, BM_model, and prec_model, respectively.

Would it be useful to compare the slopes in each model to choose the variable that is the strongest predictor of parasite similarity? Why or why not?

  6. To compare the explanatory power of each individual predictor, we will compare \(R^2\) across the models. \(R^2\) is a measure of how much of the variability in the response variable is explained by the model (we will talk more about \(R^2\) and the mathematics behind it in an upcoming lecture!).

As you may have guessed from the name, in a regression with a single predictor \(R^2\) can be calculated by squaring the correlation (recall the correlation from Lab 01). The correlation \(r\) takes values from -1 to 1; therefore, \(R^2\) takes values from 0 to 1. Intuitively, if \(r = 1 \text{ or } -1\), then \(R^2 = 1\), indicating the model is a perfect fit for the data. If \(r \approx 0\), then \(R^2 \approx 0\), indicating the model explains almost none of the variability in the response.

You can calculate \(R^2\) using the glance function. For example, you can calculate \(R^2\) for dt_model using the code glance(dt_model)$r.squared.
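As a quick check of the relationship between \(R^2\) and the correlation, the sketch below compares the two on mtcars, a dataset that ships with R and is used here only as a stand-in for the lab data. The names car_fit, r_squared, and r are illustrative.

```r
library(tidymodels)

# Stand-in data: mtcars ships with R; it is NOT the lab data.
car_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ wt, data = mtcars)

# R-squared reported by glance() for the fitted model.
r_squared <- glance(car_fit)$r.squared

# Correlation between the predictor and the response.
r <- cor(mtcars$wt, mtcars$mpg)

# With a single predictor, the two should agree up to floating-point error.
all.equal(r_squared, r^2)
```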


Grading (50 points)

Component              Points
Ex 0                   2
Ex 1                   8
Ex 2                   12
Ex 3                   8
Ex 4                   8
Ex 5                   7
Workflow & formatting  5