Strings in R



Introduction to Data Science

introds.org


What is a "string"?

A string is a sequence of characters enclosed in matching single or double quotes.

A character is a single input from your keyboard (e.g. a single letter or a single punctuation mark).

string1 <- "Hi!"
string2 <- 'Hello, I am C-3PO, it is a pleasure to meet you.'

You can combine several strings into a character vector.

string3 <- c("It's against", "my programming", "to use inconsistent notation.")
string3
## [1] "It's against" "my programming"
## [3] "to use inconsistent notation."

stringr

library(stringr)

stringr can be loaded on its own, but it is also attached by library(tidyverse).

stringr provides many tools for working with strings, including functions that

  • count the characters in a string: str_length()

  • concatenate string vectors: str_c()

  • detect patterns: str_detect()

  • trim whitespace: str_trim()

All stringr functions begin with str_ and take a vector of strings as their first argument.
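
As a quick illustration of the functions listed above, here is a minimal sketch. The greeting vector is a made-up example, not data from these slides.

```r
library(stringr)

# Hypothetical example vector (not from the slides)
greeting <- c("  Hi!  ", "Hello there")

# Remove leading and trailing whitespace
trimmed <- str_trim(greeting)

# Number of characters in each trimmed string
n_chars <- str_length(trimmed)

# Does each string contain the pattern "Hello"?
has_hello <- str_detect(trimmed, "Hello")

# Join the two strings into one, separated by a space
joined <- str_c(trimmed[1], trimmed[2], sep = " ")
```

Each call takes the string vector as its first argument, which is why the functions chain together easily.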


Include a quotation in a string?

Why doesn't the code below work?

string3 <- "I say "Hello" to the class"
## Error: <text>:1:20: unexpected symbol
## 1: string3 <- "I say "Hello
## ^

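The second double quote ends the string right after "I say ", so R does not know what to do with Hello. To include a double quote in a string, escape it using a backslash \.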

string4 <- "I say \"Hello\" to the class"

What if you want to include an actual backslash?

string5 <- "\\"

This may seem tedious, but it will come up again when we write regular expressions!
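
A small sketch of two points about escapes, using only base R. The variable names string4_single, same_string, and n_backslash are introduced here for illustration.

```r
# Escaped double quotes inside a double-quoted string
string4 <- "I say \"Hello\" to the class"

# Same contents, written with single quotes on the outside
string4_single <- 'I say "Hello" to the class'

# Both spellings produce exactly the same string
same_string <- identical(string4, string4_single)

# "\\" is typed with two backslashes but holds only one character
n_backslash <- nchar("\\")
```

Switching the outer quotes avoids the escapes, but it only works when the string does not also contain the other kind of quote.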


writeLines

writeLines() prints the actual contents of a string: escape sequences are interpreted rather than shown.

string4
## [1] "I say \"Hello\" to the class"
writeLines(string4)
## I say "Hello" to the class
string5
## [1] "\\"
writeLines(string5)
## \
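
The same applies to other escapes such as \n (newline) and \t (tab). A minimal sketch, where string6 is a made-up string for illustration:

```r
# Hypothetical example string containing a tab (\t), a newline (\n),
# and escaped double quotes
string6 <- "user:\tC-3PO\nsays:\t\"Hello\""

# print() shows the escapes as typed
print(string6)

# writeLines() interprets them, so the output spans two lines
writeLines(string6)

# capture.output() collects what writeLines() prints, one element per line
printed <- capture.output(writeLines(string6))
```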

rockyou.txt

  • RockYou developed software for social media platforms such as MySpace and Facebook

  • Stored user passwords in plain text files

  • Hacked in 2009 and over 32 million passwords leaked

Let's look at the first 20 passwords.

rockyou20 <- rockyou[1:20]
rockyou20
## [1] "123456" "12345" "123456789" "password" "iloveyou" "princess"
## [7] "1234567" "rockyou" "12345678" "abc123" "nicole" "daniel"
## [13] "babygirl" "monkey" "lovely" "jessica" "654321" "michael"
## [19] "ashley" "qwerty"

str_length

Given a string, return the number of characters.

password <- "qwerty"
str_length(password)
## [1] 6

Given a vector of strings, return the number of characters in each string.

str_length(rockyou20)
## [1] 6 5 9 8 8 8 7 7 8 6 6 6 8 6 6 7 6 7 6 6
rockyou20
## [1] "123456" "12345" "123456789" "password" "iloveyou" "princess"
## [7] "1234567" "rockyou" "12345678" "abc123" "nicole" "daniel"
## [13] "babygirl" "monkey" "lovely" "jessica" "654321" "michael"
## [19] "ashley" "qwerty"
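
The same idea applies to any character vector, for example U.S. state names (name: number of characters):
  • Alabama: 7
  • Alaska: 6
  • Arizona: 7
  • Arkansas: 8
  • California: 10
  • Colorado: 8
  • Connecticut: 11
  • ...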
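
Because str_length() returns one number per string, it can be used to filter a vector. A minimal sketch, using a few of the passwords shown above (the names pw, long_enough, and na_length are introduced here):

```r
library(stringr)

# A few of the passwords from the slides
pw <- c("123456", "12345", "123456789", "password", "qwerty")

# Keep only passwords with at least 8 characters
long_enough <- pw[str_length(pw) >= 8]

# A missing value stays missing instead of being counted as the text "NA"
na_length <- str_length(NA)
```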

str_c

Combine two or more strings.

str_c("My", "password", "is", "qwerty")
## [1] "Mypasswordisqwerty"

Use sep to specify how the strings are separated.

str_c("My", "password", "is", "qwerty", sep = " ")
## [1] "My password is qwerty"
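
str_c() also works element by element on vectors, and its collapse argument joins the elements of a single vector. A minimal sketch; the "user" labels are made-up illustration values.

```r
library(stringr)

# Vectorized: the length-one "user" is recycled against 1:3
labels <- str_c("user", 1:3, sep = "_")

# collapse joins the elements of one vector into a single string
sentence <- str_c(c("My", "password", "is", "qwerty"), collapse = " ")
```

Use sep when the pieces are separate arguments, and collapse when they are elements of one vector.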

str_to_lower and str_to_upper

Convert a string to all lower case or all upper case.

str_to_upper(rockyou20)
## [1] "123456" "12345" "123456789" "PASSWORD" "ILOVEYOU" "PRINCESS"
## [7] "1234567" "ROCKYOU" "12345678" "ABC123" "NICOLE" "DANIEL"
## [13] "BABYGIRL" "MONKEY" "LOVELY" "JESSICA" "654321" "MICHAEL"
## [19] "ASHLEY" "QWERTY"
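
One common use of case conversion is comparing strings without regard to case. A minimal sketch with made-up spellings of one password:

```r
library(stringr)

# Three spellings of the same password (made-up examples)
typed <- c("QWERTY", "Qwerty", "qwerty")

# Lower-casing makes the three spellings identical
lowered <- str_to_lower(typed)
all_same <- length(unique(lowered)) == 1

# str_to_title() capitalizes the first letter of each word
titled <- str_to_title("my password is qwerty")
```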

str_sub

Extract the part of a string from position start to position end, inclusive. Negative positions count backward from the end of the string.

str_sub(rockyou20, 1, 4)
## [1] "1234" "1234" "1234" "pass" "ilov" "prin" "1234" "rock" "1234" "abc1"
## [11] "nico" "dani" "baby" "monk" "love" "jess" "6543" "mich" "ashl" "qwer"
str_sub(rockyou20, -4, -1)
## [1] "3456" "2345" "6789" "word" "eyou" "cess" "4567" "kyou" "5678" "c123"
## [11] "cole" "niel" "girl" "nkey" "vely" "sica" "4321" "hael" "hley" "erty"
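
A minimal sketch of how positive and negative positions behave on a single string (variable names are introduced here for illustration):

```r
library(stringr)

password <- "qwerty"

# Positive positions count from the start: characters 2 through 4
middle <- str_sub(password, 2, 4)      # "wer"

# Negative positions count from the end: the last two characters
last_two <- str_sub(password, -2, -1)  # "ty"

# Mixing both: drop the first and the last character
inner <- str_sub(password, 2, -2)      # "wert"

# Asking for more characters than exist returns what is available
short <- str_sub("12345", 1, 10)       # "12345"
```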

str_sub and str_to_upper

You can combine str_sub and str_to_upper to capitalize the first letter of each password. Note that this assignment modifies rockyou20 in place.

str_sub(rockyou20, 1, 1) <- str_to_upper(str_sub(rockyou20, 1, 1))
rockyou20
## [1] "123456" "12345" "123456789" "Password" "Iloveyou" "Princess"
## [7] "1234567" "Rockyou" "12345678" "Abc123" "Nicole" "Daniel"
## [13] "Babygirl" "Monkey" "Lovely" "Jessica" "654321" "Michael"
## [19] "Ashley" "Qwerty"
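
If you would rather leave the original vector untouched, one alternative is to build the capitalized version from two pieces. This is a sketch of that approach, not the method used on the slide:

```r
library(stringr)

# A few of the passwords from the slides
passwords <- c("password", "iloveyou", "123456")

# Upper-cased first character, followed by everything from position 2 onward
capitalized <- str_c(
  str_to_upper(str_sub(passwords, 1, 1)),
  str_sub(passwords, 2, -1)
)
```

Here passwords keeps its original values and capitalized holds the result.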

str_sort

Sort a character vector. Here we sort in decreasing alphabetical order.

str_sort(rockyou20, decreasing = TRUE)
## [1] "Rockyou" "Qwerty" "Princess" "Password" "Nicole" "Monkey"
## [7] "Michael" "Lovely" "Jessica" "Iloveyou" "Daniel" "Babygirl"
## [13] "Ashley" "Abc123" "654321" "123456789" "12345678" "1234567"
## [19] "123456" "12345"
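
Notice that the digit-only passwords are compared character by character, not as numbers. A minimal sketch of the difference, using made-up identifiers and the numeric argument of str_sort():

```r
library(stringr)

# Made-up identifiers to show the numeric option
ids <- c("user10", "user2", "user1")

# Default: character-by-character comparison puts "user10" before "user2"
alphabetical <- str_sort(ids)

# numeric = TRUE compares runs of digits as numbers
natural <- str_sort(ids, numeric = TRUE)
```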

Regular Expressions

A regular expression is a sequence of characters that describes a pattern in text. We use regular expressions to search strings for those patterns, for example to

  • extract a phone number from text data
  • determine if an email address is valid
  • determine if a password has the required number of letters, digits, and symbols
  • count the number of times "statistics" occurs in a corpus of text
  • ...
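
Two of the tasks above can be sketched in a few lines. The text and the phone number are made up, and the pattern assumes one specific phone format (three digits, a dash, three digits, a dash, four digits); real phone numbers vary much more.

```r
library(stringr)

# Made-up text; the phone number is fictional
text <- "Call 555-123-4567 to talk statistics. We love statistics!"

# \\d matches a digit and {3} means "exactly three times"
phone <- str_extract(text, "\\d{3}-\\d{3}-\\d{4}")

# Count how often the word "statistics" occurs
n_stats <- str_count(text, "statistics")
```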

Basic Match

To demonstrate the power of regular expressions, let's see whether any of the 32 million leaked passwords contain the exact sequence "dog". str_subset keeps only the strings that match; we show the first 30.

str_subset(rockyou, "dog")[1:30]
## [1] "catdog" "hotdog" "bulldogs" "bulldog" "doggie"
## [6] "bigdog" "maddog" "snoopdogg" "puppydog" "doggy"
## [11] "dog123" "snoopdog" "ilovedogs" "doggies" "luckydog"
## [16] "catdog1" "dogdog" "reddog" "bulldog1" "mollydog"
## [21] "hotdog1" "bulldogs1" "dogcat" "doggy1" "hotdogs"
## [26] "dogsrule" "thedog" "catsanddogs" "topdog" "daisydog"

Basic Match

What about "d", then any single character, then "g"? The . matches any single character (except a newline).

str_subset(rockyou, "d.g")[1:30]
## [1] "asdfgh" "asdfghjkl" "catdog" "hotdog" "bulldogs"
## [6] "bulldog" "asdfg" "doggie" "bigdog" "maddog"
## [11] "digger" "digimon" "digital" "candygirl" "snoopdogg"
## [16] "puppydog" "doggy" "dog123" "snoopdog" "asdfghj"
## [21] "ilovedogs" "doggies" "asdfghjk" "luckydog" "catdog1"
## [26] "indigo" "dogdog" "madagascar" "reddog" "bulldog1"
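
A minimal sketch of the . wildcard on a handful of strings, including how to match a literal period instead. The words vector mixes a few matches from the output above with one made-up non-match.

```r
library(stringr)

# Three matches from the output above, plus one non-match
words <- c("catdog", "digital", "asdfgh", "dg")

# "d.g" needs exactly one character between d and g, so "dg" does not match
dot_match <- str_detect(words, "d.g")

# To match a literal period, escape it: "\\." inside an R string
literal_dot <- str_detect(c("d.g", "dog"), "d\\.g")
```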

Anchors

Match the start of a string using ^. Here rockyou20 still has the capitalized first letters from the str_sub example, so "^P" matches the P in Password and Princess.

str_view_all(rockyou20, "^P")
  • 123456
  • 12345
  • 123456789
  • Password
  • Iloveyou
  • Princess
  • 1234567
  • Rockyou
  • 12345678
  • ...
rockyou20
## [1] "123456" "12345" "123456789" "Password" "Iloveyou" "Princess"
## [7] "1234567" "Rockyou" "12345678" "Abc123" "Nicole" "Daniel"
## [13] "Babygirl" "Monkey" "Lovely" "Jessica" "654321" "Michael"
## [19] "Ashley" "Qwerty"

Anchors

Match the end of a string using $. Setting match = TRUE shows only the strings that match.

str_view_all(rockyou20, "u$", match = TRUE)
  • iloveyou
  • rockyou
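
Anchors change which strings match. A minimal sketch comparing an unanchored pattern, an end anchor, and both anchors together, on a few of the passwords from the slides:

```r
library(stringr)

passwords <- c("iloveyou", "rockyou", "qwerty", "ashley")

# Unanchored: "y" anywhere in the string
any_y <- str_subset(passwords, "y")

# Anchored at the end: only strings that finish with "y"
ends_y <- str_subset(passwords, "y$")

# Both anchors: the whole string must be exactly "rockyou"
exact <- str_detect(passwords, "^rockyou$")
```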

str_detect

Determine whether each string in a character vector matches a pattern. The result is a logical vector.

rockyou20
## [1] "123456" "12345" "123456789" "password" "iloveyou" "princess"
## [7] "1234567" "rockyou" "12345678" "abc123" "nicole" "daniel"
## [13] "babygirl" "monkey" "lovely" "jessica" "654321" "michael"
## [19] "ashley" "qwerty"
str_detect(rockyou20, "a")
## [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
## [13] TRUE FALSE FALSE TRUE FALSE TRUE TRUE FALSE
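
Because the result is a logical vector, it combines naturally with sum(), mean(), and subsetting. A minimal sketch using the same 20 passwords:

```r
library(stringr)

# First 20 passwords, copied from the slide output above
rockyou20 <- c("123456", "12345", "123456789", "password", "iloveyou",
               "princess", "1234567", "rockyou", "12345678", "abc123",
               "nicole", "daniel", "babygirl", "monkey", "lovely",
               "jessica", "654321", "michael", "ashley", "qwerty")

contains_a <- str_detect(rockyou20, "a")

# TRUE counts as 1, so sum() gives the number of matching passwords
n_with_a <- sum(contains_a)

# mean() gives the proportion of matching passwords
prop_with_a <- mean(contains_a)

# A logical vector can also be used to keep only the matches
with_a <- rockyou20[contains_a]
```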

str_count

Count the number of matches in each string.

rockyou20
## [1] "123456" "12345" "123456789" "password" "iloveyou" "princess"
## [7] "1234567" "rockyou" "12345678" "abc123" "nicole" "daniel"
## [13] "babygirl" "monkey" "lovely" "jessica" "654321" "michael"
## [19] "ashley" "qwerty"
str_count(rockyou20, "s")
## [1] 0 0 0 2 0 2 0 0 0 0 0 0 0 0 0 2 0 0 1 0
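
Two details worth knowing: the pattern can be longer than one character, and matches do not overlap. A minimal sketch on a few of the passwords, plus one made-up string ("aaa"):

```r
library(stringr)

passwords <- c("password", "princess", "ashley", "123456")

# Matches of the single character "s" in each string
n_s <- str_count(passwords, "s")

# Matches of the two-character sequence "ss"
n_ss <- str_count(passwords, "ss")

# Matches do not overlap: "aaa" contains "aa" only once
n_aa <- str_count("aaa", "aa")
```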

str_replace_all

Replace all matches with new strings.

str_replace_all(rockyou20, "s", "-")
## [1] "123456" "12345" "123456789" "pa--word" "iloveyou" "prince--"
## [7] "1234567" "rockyou" "12345678" "abc123" "nicole" "daniel"
## [13] "babygirl" "monkey" "lovely" "je--ica" "654321" "michael"
## [19] "a-hley" "qwerty"
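
For comparison, str_replace() changes only the first match in each string, and str_remove_all() deletes every match. A minimal sketch on a single password:

```r
library(stringr)

# Only the first "s" is replaced
first_only <- str_replace("password", "s", "-")       # "pa-sword"

# Every "s" is replaced
all_matches <- str_replace_all("password", "s", "-")  # "pa--word"

# Every "s" is removed
removed <- str_remove_all("password", "s")            # "paword"
```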

Many Matches

Each regular expression below matches any one character from a set of characters.

  • Match any digit using \d or [[:digit:]]
  • Match any whitespace using \s or [[:space:]]
  • Match f, g, or h using [fgh]
  • Match anything but f, g, or h using [^fgh]
  • Match lower-case letters using [a-z] or [[:lower:]]
  • Match upper-case letters using [A-Z] or [[:upper:]]
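  • Match alphabetic characters using [A-Za-z] or [[:alpha:]] (avoid [A-z], which also matches the punctuation characters that sit between Z and a, such as _ and ^)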

Remember that these patterns are written inside R strings, so the backslash itself must be escaped: to match digits, use "\\d", not "\d".
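
A minimal sketch applying several of these classes. The passwords vector is a made-up mix chosen to exercise each pattern; the last two lines show why [A-Za-z] is a safer choice than [A-z].

```r
library(stringr)

# Made-up mix of strings chosen to exercise each pattern
passwords <- c("abc123", "qwerty", "Pass word", "123456")

# Contains at least one digit (note the doubled backslash)
has_digit <- str_detect(passwords, "\\d")

# Contains at least one whitespace character
has_space <- str_detect(passwords, "\\s")

# Contains at least one upper-case letter
has_upper <- str_detect(passwords, "[[:upper:]]")

# Contains nothing except lower-case letters:
# no character outside a-z is found
only_lower <- !str_detect(passwords, "[^a-z]")

# [A-z] spans the punctuation between Z and a, so it matches "_"
underscore_wide <- str_detect("_", "[A-z]")
underscore_safe <- str_detect("_", "[A-Za-z]")
```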


Additional resources
