Main Ideas
- There are different types of variables.
- Visualizations and summaries of variables must be consistent with the variable type.
Load the tidyverse and datasauRus packages.
library(tidyverse)
library(datasauRus)
There are two types of variables: numeric and categorical.
Numeric variables can be classified as either continuous or discrete. Continuous numeric variables can take any of the infinitely many values between two given values. Discrete numeric variables take a countable number of values, such as whole-number counts.
Categorical variables can be classified as either nominal or ordinal. Ordinal variables have a natural ordering; nominal variables do not.
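As a rough illustration of how these four types can be stored in R, here is a minimal sketch on a made-up data frame. The column names and values are hypothetical (only the community names are borrowed from the data used later), and `condition` in particular is an invented ordinal variable.

```r
# Made-up data frame; every value here is illustrative, not from mn_homes.
toy <- data.frame(
  price = c(250000.50, 310000.00, 189999.99),                 # numeric continuous
  beds = c(3L, 4L, 2L),                                       # numeric discrete (integer counts)
  community = factor(c("Longfellow", "Camden", "Southwest")), # categorical nominal
  condition = factor(c("good", "fair", "excellent"),
                     levels = c("fair", "good", "excellent"),
                     ordered = TRUE)                          # categorical ordinal (hypothetical)
)

# str() shows how R stores each column.
str(toy)

# Comparisons only make sense for the ordered factor.
below_excellent <- toy$condition < "excellent"
below_excellent
```

Storing an ordinal variable as an ordered factor lets R respect the ordering in comparisons and in plot axes; a nominal factor has levels but no meaningful order.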
To describe the distribution of a numeric variable, we will use the properties below.
- Center: mean (mean), median (median)
- Spread: range (range), standard deviation (sd), interquartile range (IQR)
We will continue our investigation of home prices in Minneapolis, Minnesota.
mn_homes <- read_csv("data/mn_homes.csv")
Add a glimpse to the code chunk below and identify the following variables as numeric continuous, numeric discrete, categorical ordinal, or categorical nominal.
glimpse(mn_homes$community)
## chr [1:495] "Calhoun-Isles" "Longfellow" "Longfellow" "Southwest" "Camden" ...
The summary command is also useful in looking at numeric variables. Use this command to look at the numeric variables from the previous chunk.
summary(mn_homes$beds)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 3.000 3.087 4.000 7.000
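Note that summary() reports the center and the quartiles but not the standard deviation or the interquartile range directly. The individual functions listed earlier fill that gap. The sketch below uses a short made-up vector (not the actual beds column) so the arithmetic is easy to check by hand.

```r
# Made-up values standing in for a small numeric variable.
beds_toy <- c(1, 3, 3, 3, 4, 7)

center_mean <- mean(beds_toy)      # arithmetic mean
center_median <- median(beds_toy)  # middle value
value_range <- range(beds_toy)     # returns the minimum and maximum
spread_range <- diff(value_range)  # maximum minus minimum
spread_sd <- sd(beds_toy)          # sample standard deviation (n - 1 denominator)
spread_iqr <- IQR(beds_toy)        # third quartile minus first quartile

c(mean = center_mean, median = center_median, range = spread_range,
  sd = spread_sd, IQR = spread_iqr)
```

In base R, range() returns the two endpoints rather than a single number, which is why diff() is applied to get the width.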
We can use a histogram to summarize a numeric variable.
ggplot(data = mn_homes,
       mapping = aes(x = salesprice)) +
  geom_histogram(bins = 25)
A density plot is another option. It behaves like a smoothed histogram: the bars are replaced by a smooth curve that estimates the shape of the distribution.
ggplot(data = mn_homes,
       mapping = aes(x = salesprice)) +
  geom_density()
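To see how the two displays relate, one option is to draw them on the same axes. The sketch below is an illustration on simulated right-skewed values, not on the mn_homes data; the column name `price` and the simulation settings are assumptions. The histogram is rescaled with after_stat(density) so its bars share the density curve's vertical scale.

```r
library(ggplot2)

# Simulated, right-skewed values; purely illustrative.
set.seed(42)
toy <- data.frame(price = rlnorm(300, meanlog = 12, sdlog = 0.4))

# Histogram on the density scale with a density curve drawn on top.
p <- ggplot(data = toy, mapping = aes(x = price)) +
  geom_histogram(mapping = aes(y = after_stat(density)),
                 bins = 25, fill = "grey80") +
  geom_density()

p
```

On the density scale the bar areas sum to one, which is what makes the two layers directly comparable.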
Side-by-side boxplots are helpful to visualize the distribution of a numeric variable across the levels of a categorical variable.
ggplot(data = mn_homes,
       mapping = aes(x = community, y = salesprice)) +
  geom_boxplot() +
  coord_flip() +
  labs(title = "Sales Price by Community", x = "Community", y = "Sales Price")
Question: What is coord_flip() doing in the code chunk above? Try removing it to see.
Bar plots allow us to visualize categorical variables.
ggplot(data = mn_homes) +
  geom_bar(mapping = aes(x = community)) +
  coord_flip() +
  labs(title = "Homes by Community", x = "Community", y = "Number of Homes")
Segmented bar plots can be used to visualize two categorical variables.
library(viridis)
## Loading required package: viridisLite
ggplot(data = mn_homes, mapping = aes(x = community, fill = fireplace)) +
  geom_bar() +
  coord_flip() +
  scale_fill_viridis(discrete = TRUE, option = "D", name = "Fireplace?") +
  labs(title = "Fireplaces by Community", x = "Community", y = "Number of Homes")
ggplot(data = mn_homes, mapping = aes(x = community, fill = fireplace)) +
  geom_bar(position = "fill") +
  coord_flip() +
  scale_fill_viridis(discrete = TRUE, option = "D", name = "Fireplace?") +
  labs(title = "Proportion of Homes with a Fireplace by Community",
       x = "Community", y = "Proportion of Homes")
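The two segmented bar plots draw on the same counts; position = "fill" simply rescales each bar so its segments sum to one. The sketch below shows one way to compute those numbers by hand with dplyr. It uses a tiny made-up data frame, and treating fireplace as a logical column is an assumption about the data.

```r
library(dplyr)

# Made-up stand-in for mn_homes; fireplace is assumed to be TRUE/FALSE.
toy_homes <- data.frame(
  community = c("Camden", "Camden", "Camden", "Camden", "Longfellow", "Longfellow"),
  fireplace = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
)

fireplace_props <- toy_homes %>%
  count(community, fireplace) %>%  # n = what the stacked bars show
  group_by(community) %>%
  mutate(prop = n / sum(n)) %>%    # prop = what position = "fill" shows
  ungroup()

fireplace_props
```

Looking at n and prop side by side makes it clear what each plot emphasizes: group sizes in the first, within-group composition in the second.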
Question: Which of the above two visualizations do you prefer? Why? Is this answer always the same?
There is something wrong with each of the plots below. Run the code for each plot, read the error, then identify and fix the problem.
ggplot(data = mn_homes) +
geom_point(mapping = aes(x = lotsize, y = salesprice,
shape = 21, size = .85))
ggplot(data = mn_homes) +
geom_point(x = lotsize, y = area), shape = 21, size = .85)
ggplot(data = mn_homes) +
geom_point(mapping = aes(x = lotsize, y = area), color=community), size = 0.85)
ggplot(data = mn_homes) +
geom_point(mapping = aes(x = 1otsize, y = area))
General principles for effective data visualization
Why is data visualization important? We will illustrate using the datasaurus_dozen data from the datasauRus package.
glimpse(datasaurus_dozen)
## Rows: 1,846
## Columns: 3
## $ dataset <chr> "dino", "dino", "dino", "dino", "dino", "dino", "dino", "dino"…
## $ x <dbl> 55.3846, 51.5385, 46.1538, 42.8205, 40.7692, 38.7179, 35.6410,…
## $ y <dbl> 97.1795, 96.0256, 94.4872, 91.4103, 88.3333, 84.8718, 79.8718,…
The code below calculates the correlation between x and y, the mean of y, the mean of x, the standard deviation of x, and the standard deviation of y for each of the 13 datasets.
Question: What do you notice?
datasaurus_dozen %>%
  group_by(dataset) %>%
  summarize(r = cor(x, y),
            mean_y = mean(y),
            mean_x = mean(x),
            sd_x = sd(x),
            sd_y = sd(y))
## # A tibble: 13 × 6
##    dataset          r mean_y mean_x  sd_x  sd_y
##    <chr>        <dbl>  <dbl>  <dbl> <dbl> <dbl>
##  1 away       -0.0641   47.8   54.3  16.8  26.9
##  2 bullseye   -0.0686   47.8   54.3  16.8  26.9
##  3 circle     -0.0683   47.8   54.3  16.8  26.9
##  4 dino       -0.0645   47.8   54.3  16.8  26.9
##  5 dots       -0.0603   47.8   54.3  16.8  26.9
##  6 h_lines    -0.0617   47.8   54.3  16.8  26.9
##  7 high_lines -0.0685   47.8   54.3  16.8  26.9
##  8 slant_down -0.0690   47.8   54.3  16.8  26.9
##  9 slant_up   -0.0686   47.8   54.3  16.8  26.9
## 10 star       -0.0630   47.8   54.3  16.8  26.9
## 11 v_lines    -0.0694   47.8   54.3  16.8  26.9
## 12 wide_lines -0.0666   47.8   54.3  16.8  26.9
## 13 x_shape    -0.0656   47.8   54.3  16.8  26.9
Let’s visualize the relationships.
ggplot(data = datasaurus_dozen,
       mapping = aes(x = x, y = y)) +
  geom_point(size = .5) +
  facet_wrap(~ dataset)
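The same kind of check can be run on Anscombe's quartet, a classic dataset that ships with base R as anscombe. It is not part of the original exercise and is swapped in here only because it needs no extra packages. The sketch below computes the same five summaries for its four x/y pairs.

```r
# anscombe is in base R's datasets package; columns are x1..x4 and y1..y4.
quartet_stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(r = cor(x, y),
    mean_y = mean(y),
    mean_x = mean(x),
    sd_x = sd(x),
    sd_y = sd(y))
})

# One column per x/y pair, rounded for easier comparison.
round(quartet_stats, 2)
```

Plotting each pair, for example with plot(anscombe$x1, anscombe$y1), is the natural next step for comparing the summaries against the shape of the data.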
Question: Why is visualization important?
When you are finished, remove eval = FALSE and knit the file to see the changes.
Note: depending on the variable you use, you might want to change the bin width as I do below.
ggplot(data = mn_homes, mapping = aes(x = area)) +
  geom_histogram(binwidth = 100) +
  facet_wrap(~ community) +
  labs(x = "Area",
       title = "Which Communities Have the Largest Homes?",
       subtitle = "Faceted by Community")
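To get a feel for how the bin width changes a histogram, the sketch below builds two histograms of the same simulated variable and counts the bars in each. The data are made up (uniform values between 500 and 3000 standing in for area), so only the comparison matters, not the specific numbers.

```r
library(ggplot2)

# Simulated stand-in for a variable such as area; purely illustrative.
set.seed(1)
toy <- data.frame(area = runif(200, min = 500, max = 3000))

p_narrow <- ggplot(data = toy, mapping = aes(x = area)) +
  geom_histogram(binwidth = 100)
p_wide <- ggplot(data = toy, mapping = aes(x = area)) +
  geom_histogram(binwidth = 500)

# layer_data() returns the computed bins, one row per bar.
n_bins_narrow <- nrow(layer_data(p_narrow))
n_bins_wide <- nrow(layer_data(p_wide))
c(narrow = n_bins_narrow, wide = n_bins_wide)
```

A narrow bin width shows more detail but also more noise; a wide one smooths the picture at the risk of hiding features, so it is worth trying a few values for each variable.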