Bulletin

Main Ideas

Lecture Notes and Exercises

Load the tidyverse and datasauRus packages

library(tidyverse)
library(datasauRus)

There are two types of variables: numeric and categorical.

Types of variables

Numeric variables can be classified as either continuous or discrete. Continuous numeric variables can take any value within an interval, so there are infinitely many possible values between any two values. Discrete numeric variables take a countable number of values, such as whole-number counts.

  • height
  • number of siblings

Categorical variables can be classified as either nominal or ordinal. Ordinal variables have a natural ordering; nominal variables do not.

  • hair color
  • education
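
As a rough illustration of how these four types are commonly represented in R, the sketch below uses small made-up vectors. The values and object names are hypothetical and are not part of the course data.

```r
# Hypothetical toy values for illustration only
height    <- c(170.2, 165.5, 180.1)                 # numeric, continuous (double)
siblings  <- c(0L, 2L, 1L)                          # numeric, discrete (integer)
hair      <- factor(c("brown", "black", "brown"))   # categorical, nominal
education <- factor(c("HS", "BA", "MA"),
                    levels = c("HS", "BA", "MA"),
                    ordered = TRUE)                 # categorical, ordinal

is.ordered(hair)             # FALSE: hair colors have no natural ordering
is.ordered(education)        # TRUE: levels are ranked HS < BA < MA
education[1] < education[3]  # comparisons are meaningful only for ordered factors
```

Storing an ordinal variable as an ordered factor is one reasonable choice; it keeps the ranking of the levels available to plots and summaries.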

Numeric Variables

To describe the distribution of a numeric variable, we will use the properties below.

  • shape
    • skewness: right-skewed, left-skewed, symmetric
    • modality: unimodal, bimodal, multimodal, uniform
  • center: mean (mean), median (median)
  • spread: range (range), standard deviation (sd), interquartile range (IQR)
  • outliers: observations outside the pattern of the data
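
The functions named in the list above can be tried on a small vector. This is a minimal sketch using hypothetical, right-skewed toy values (not the mn_homes data) with one unusually large observation.

```r
# Hypothetical, right-skewed toy values with one unusually large observation
x <- c(1, 2, 2, 3, 3, 3, 4, 20)

mean(x)    # center: pulled upward by the large value
median(x)  # center: resistant to the large value
range(x)   # spread: returns the minimum and the maximum
sd(x)      # spread: standard deviation
IQR(x)     # spread: width of the middle 50% of the data
```

In a right-skewed sample like this one, the mean exceeds the median, which is one quick numeric hint about shape.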

We will continue our investigation of home prices in Minneapolis, Minnesota.

mn_homes <- read_csv("data/mn_homes.csv")

Add a glimpse to the code chunk below and identify the following variables as numeric continuous, numeric discrete, categorical ordinal, or categorical nominal.

  • area
  • beds
  • community

glimpse(mn_homes$community)
##  chr [1:495] "Calhoun-Isles" "Longfellow" "Longfellow" "Southwest" "Camden" ...

The summary() function is also useful for examining numeric variables. Use it to look at the numeric variables from the previous chunk.

summary(mn_homes$beds)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   3.000   3.087   4.000   7.000

We can use a histogram to summarize a numeric variable.

ggplot(data = mn_homes, 
       mapping = aes(x = salesprice)) + 
   geom_histogram(bins = 25)

A density plot is another option. It can be thought of as a smoothed version of a histogram, drawn as a continuous curve.

ggplot(data = mn_homes, 
       mapping = aes(x = salesprice)) + 
   geom_density()
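
To see how the two plots relate, one option is to draw them on the same axes. The sketch below assumes simulated, hypothetical prices (the toy data frame and its price column are not from mn_homes) and rescales the histogram to the density scale with after_stat(density) so the curve and the bars are comparable.

```r
library(ggplot2)

# Hypothetical simulated prices, right-skewed like many price variables
set.seed(1)
toy <- data.frame(price = rlnorm(200, meanlog = 12, sdlog = 0.4))

# Histogram on the density scale, with the density curve drawn on top
p <- ggplot(data = toy, mapping = aes(x = price)) +
  geom_histogram(mapping = aes(y = after_stat(density)), bins = 25) +
  geom_density()

p
```

Changing bins in the histogram or the adjust argument of geom_density() changes how smooth each display looks.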

Side-by-side boxplots are helpful to visualize the distribution of a numeric variable across the levels of a categorical variable.

ggplot(data = mn_homes, 
       mapping = aes(x = community, y = salesprice)) + 
   geom_boxplot() + 
   coord_flip() + 
   labs(title = "Sales Price by Community", x = "Community", y = "Sales Price")

Question: What is coord_flip() doing in the code chunk above? Try removing it to see.

Categorical Variables

Bar plots allow us to visualize categorical variables.

ggplot(data = mn_homes) + 
  geom_bar(mapping = aes(x = community)) + 
  coord_flip() + 
  labs(title = "Homes by Community", x = "Community", y = "Number of Homes")

Segmented bar plots can be used to visualize two categorical variables.

library(viridis)
## Loading required package: viridisLite
ggplot(data = mn_homes, mapping = aes(x = community, fill = fireplace)) + 
  geom_bar() +
  coord_flip() + 
  scale_fill_viridis(discrete=TRUE, option = "D", name="Fireplace?") +
  labs(title = "Fireplaces by Community", x = "Community", y = "Number of Homes")

ggplot(data = mn_homes, mapping = aes(x = community, fill = fireplace)) + 
  geom_bar(position = "fill") + coord_flip() + 
  scale_fill_viridis(discrete=TRUE, option = "D", name="Fireplace?") +
  labs(title = "Proportion of Homes with a Fireplace by Community", 
       x = "Community", y = "Proportion of Homes")

Question: Which of the above two visualizations do you prefer? Why? Is this answer always the same?
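
The two segmented bar plots differ only in what the bar lengths encode: counts in the first, within-community proportions in the second. A minimal sketch of that calculation follows, assuming a hypothetical toy tibble (its communities and fireplace values are made up, and fireplace may be stored differently in mn_homes).

```r
library(dplyr)

# Hypothetical toy data: 3 homes in community A, 2 in community B
toy <- tibble(
  community = c("A", "A", "A", "B", "B"),
  fireplace = c(TRUE, FALSE, FALSE, TRUE, TRUE)
)

props <- toy %>%
  count(community, fireplace) %>%   # n: what geom_bar() stacks by default
  group_by(community) %>%
  mutate(prop = n / sum(n)) %>%     # prop: what position = "fill" displays
  ungroup()

props
```

Within each community the proportions sum to 1, which is why every bar has the same length when position = "fill" is used and the information about community size is lost.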

There is something wrong with each of the plots below. Run the code for each plot, read the error, then identify and fix the problem.

ggplot(data = mn_homes) + 
  geom_point(mapping = aes(x = lotsize, y = salesprice,
                           shape = 21, size = .85))
ggplot(data = mn_homes) + 
  geom_point(x = lotsize, y = area), shape = 21, size = .85)
ggplot(data = mn_homes) +
  geom_point(mapping = aes(x = lotsize, y = area), color=community), size = 0.85)
ggplot(data = mn_homes) +
  geom_point(mapping = aes(x = 1otsize, y = area))

General principles for effective data visualization

  • keep it simple
  • use color effectively
  • tell a story

Why is data visualization important? We will illustrate using the datasaurus_dozen data from the datasauRus package.

glimpse(datasaurus_dozen)
## Rows: 1,846
## Columns: 3
## $ dataset <chr> "dino", "dino", "dino", "dino", "dino", "dino", "dino", "dino"…
## $ x       <dbl> 55.3846, 51.5385, 46.1538, 42.8205, 40.7692, 38.7179, 35.6410,…
## $ y       <dbl> 97.1795, 96.0256, 94.4872, 91.4103, 88.3333, 84.8718, 79.8718,…

The code below calculates the correlation, mean of y, mean of x, standard deviation of x, and standard deviation of y for each of the 13 datasets.

Question: What do you notice?

datasaurus_dozen %>% 
   group_by(dataset) %>%
   summarize(r = cor(x, y), 
             mean_y = mean(y),
             mean_x = mean(x),
             sd_x = sd(x),
             sd_y = sd(y))
## # A tibble: 13 × 6
##    dataset          r mean_y mean_x  sd_x  sd_y
##    <chr>        <dbl>  <dbl>  <dbl> <dbl> <dbl>
##  1 away       -0.0641   47.8   54.3  16.8  26.9
##  2 bullseye   -0.0686   47.8   54.3  16.8  26.9
##  3 circle     -0.0683   47.8   54.3  16.8  26.9
##  4 dino       -0.0645   47.8   54.3  16.8  26.9
##  5 dots       -0.0603   47.8   54.3  16.8  26.9
##  6 h_lines    -0.0617   47.8   54.3  16.8  26.9
##  7 high_lines -0.0685   47.8   54.3  16.8  26.9
##  8 slant_down -0.0690   47.8   54.3  16.8  26.9
##  9 slant_up   -0.0686   47.8   54.3  16.8  26.9
## 10 star       -0.0630   47.8   54.3  16.8  26.9
## 11 v_lines    -0.0694   47.8   54.3  16.8  26.9
## 12 wide_lines -0.0666   47.8   54.3  16.8  26.9
## 13 x_shape    -0.0656   47.8   54.3  16.8  26.9
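
One way to quantify how similar the rows of this table are is to compute, for each statistic, the gap between its largest and smallest value across the 13 datasets. This is only a sketch; the object names stats and spread are introduced here for illustration.

```r
library(dplyr)
library(datasauRus)

# Same per-dataset summaries as in the chunk above
stats <- datasaurus_dozen %>%
  group_by(dataset) %>%
  summarize(r = cor(x, y),
            mean_y = mean(y),
            mean_x = mean(x),
            sd_x = sd(x),
            sd_y = sd(y))

# For each statistic: largest value minus smallest value across datasets
spread <- stats %>%
  summarize(across(-dataset, ~ max(.x) - min(.x)))

spread
```

Small gaps in every column would indicate that these summary statistics alone cannot tell the datasets apart.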

Let’s visualize the relationships.

ggplot(data = datasaurus_dozen, 
       mapping = aes(x = x, y = y)) + 
   geom_point(size = .5) + 
   facet_wrap( ~ dataset)

Question: Why is visualization important?

Practice

  1. Modify the code outline to create a faceted histogram examining the distribution of year built within each community.

When you are finished, remove eval = FALSE and knit the file to see the changes.

Note: depending on the variable you use, you might want to change the bin width as I do below.

ggplot(data = mn_homes, mapping = aes(x = area)) +
  geom_histogram(binwidth = 100) +
  facet_wrap(~ community) +
  labs(x = "Area", 
      title = "Which Communities Have the Largest Homes?", 
      subtitle = "Faceted by Community")