class: center, middle, inverse, title-slide

# Strings in R
## Introduction to Data Science
### introds.org
---

## What is a "string"?

A string is a sequence of characters placed between quotes. A character is a single input from your keyboard (e.g. a single letter or a single punctuation mark).

```r
string1 <- "Hi!"
string2 <- 'Hello, I am C-3PO, it is a pleasure to meet you.'
```

You can combine strings in a vector.

```r
string3 <- c("It's against", "my programming", "to use inconsistent notation.")
string3
```

```
## [1] "It's against" "my programming"
## [3] "to use inconsistent notation."
```

---

class: middle, center

`stringr`

---

```r
library(stringr)
```

... but it's also included in the `tidyverse`!

--

`stringr` provides many tools to work with strings, including functions that

- count the characters in a string: `str_length()`
- concatenate string vectors: `str_c()`
- detect patterns: `str_detect()`
- trim whitespace: `str_trim()`

--

- All begin with `str_`
- All take a vector of strings as their first argument

---

## Include a quotation in a string?

Why doesn't the code below work?

.midi[
```r
string3 <- "I say "Hello" to the class"
```

```
## Error: <text>:1:20: unexpected symbol
## 1: string3 <- "I say "Hello
##                           ^
```
]

--

To include a double quote in a string, *escape it* using a backslash `\`.

--

.midi[
```r
string4 <- "I say \"Hello\" to the class"
```
]

--

What if you want to include an actual backslash?

--

.midi[
```r
string5 <- "\\"
```
]

This may seem tedious but it will come up later!

---

## `writeLines`

`writeLines` shows the actual contents of a string, without the escape backslashes that printing displays.
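One way to check that an escape's backslash is not stored in the string is to count characters. This is a minimal sketch using base R's `nchar()`, which is not part of the original slides; `string4` and `string5` are redefined here only so the chunk runs on its own.

```r
# Same strings as defined earlier in the deck
string4 <- "I say \"Hello\" to the class"
string5 <- "\\"

# Each escape sequence (\" or \\) is stored as a single character
n4 <- nchar(string4)
n5 <- nchar(string5)

n4  # 26: each escaped quote counts once
n5  # 1: a single backslash
```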
.pull-left[
```r
string4
```

```
## [1] "I say \"Hello\" to the class"
```

```r
writeLines(string4)
```

```
## I say "Hello" to the class
```
]

.pull-right[
```r
string5
```

```
## [1] "\\"
```

```r
writeLines(string5)
```

```
## \
```
]

---

## rockyou.txt

- RockYou developed software for social media platforms such as MySpace and Facebook
- Stored user passwords in plain text files
- Hacked in 2009; over 32 million passwords were leaked

Let's look at the first 20:

```r
rockyou20 <- rockyou[1:20]
rockyou20
```

```
## [1] "123456" "12345" "123456789" "password" "iloveyou" "princess"
## [7] "1234567" "rockyou" "12345678" "abc123" "nicole" "daniel"
## [13] "babygirl" "monkey" "lovely" "jessica" "654321" "michael"
## [19] "ashley" "qwerty"
```

---

## `str_length`

Given a string, return the number of characters.

.midi[
```r
password <- "qwerty"
str_length(password)
```

```
## [1] 6
```
]

Given a vector of strings, return the number of characters in each string.

.midi[
```r
str_length(rockyou20)
```

```
## [1] 6 5 9 8 8 8 7 7 8 6 6 6 8 6 6 7 6 7 6 6
```

```r
rockyou20
```

```
## [1] "123456" "12345" "123456789" "password" "iloveyou" "princess"
## [7] "1234567" "rockyou" "12345678" "abc123" "nicole" "daniel"
## [13] "babygirl" "monkey" "lovely" "jessica" "654321" "michael"
## [19] "ashley" "qwerty"
```
]

--

.pull-left[
- Alabama: 7
- Alaska: 6
- Arizona: 7
- Arkansas: 8
]

.pull-right[
- California: 10
- Colorado: 8
- Connecticut: 11
- ...
]

---

## `str_c`

Combine two or more strings.

```r
str_c("My", "password", "is", "qwerty")
```

```
## [1] "Mypasswordisqwerty"
```

--

Use `sep` to specify how the strings are separated.

```r
str_c("My", "password", "is", "qwerty", sep = " ")
```

```
## [1] "My password is qwerty"
```

---

## `str_to_lower` and `str_to_upper`

Convert a string to lower case or to upper case.
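A minimal sketch of both directions, using a small made-up vector (`mixed` is a hypothetical name and is not part of the RockYou data):

```r
library(stringr)

# Hypothetical mixed-case input, made up for illustration
mixed <- c("PassWord", "ILoveYou", "abc123")

# Digits are left unchanged; only letters change case
lower <- str_to_lower(mixed)
upper <- str_to_upper(mixed)

lower  # "password" "iloveyou" "abc123"
upper  # "PASSWORD" "ILOVEYOU" "ABC123"
```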
.midi[
```r
str_to_upper(rockyou20)
```

```
## [1] "123456" "12345" "123456789" "PASSWORD" "ILOVEYOU" "PRINCESS"
## [7] "1234567" "ROCKYOU" "12345678" "ABC123" "NICOLE" "DANIEL"
## [13] "BABYGIRL" "MONKEY" "LOVELY" "JESSICA" "654321" "MICHAEL"
## [19] "ASHLEY" "QWERTY"
```
]

---

## `str_sub`

Extract parts of a string from `start` to `end`, inclusive. Negative positions count back from the end of the string.

.midi[
```r
str_sub(rockyou20, 1, 4)
```

```
## [1] "1234" "1234" "1234" "pass" "ilov" "prin" "1234" "rock" "1234" "abc1"
## [11] "nico" "dani" "baby" "monk" "love" "jess" "6543" "mich" "ashl" "qwer"
```
]

--

.midi[
```r
str_sub(rockyou20, -4, -1)
```

```
## [1] "3456" "2345" "6789" "word" "eyou" "cess" "4567" "kyou" "5678" "c123"
## [11] "cole" "niel" "girl" "nkey" "vely" "sica" "4321" "hael" "hley" "erty"
```
]

---

## `str_sub` and `str_to_upper`

Combine `str_sub` and `str_to_upper` to capitalize the first letter of each password.

.midi[
```r
str_sub(rockyou20, 1, 1) <- str_to_upper(str_sub(rockyou20, 1, 1))
rockyou20
```

```
## [1] "123456" "12345" "123456789" "Password" "Iloveyou" "Princess"
## [7] "1234567" "Rockyou" "12345678" "Abc123" "Nicole" "Daniel"
## [13] "Babygirl" "Monkey" "Lovely" "Jessica" "654321" "Michael"
## [19] "Ashley" "Qwerty"
```
]

---

## `str_sort`

Sort a character vector. Here we sort in decreasing alphabetical order.

.midi[
```r
str_sort(rockyou20, decreasing = TRUE)
```

```
## [1] "Rockyou" "Qwerty" "Princess" "Password" "Nicole" "Monkey"
## [7] "Michael" "Lovely" "Jessica" "Iloveyou" "Daniel" "Babygirl"
## [13] "Ashley" "Abc123" "654321" "123456789" "12345678" "1234567"
## [19] "123456" "12345"
```
]

---

## Regular Expressions

A .vocab[regular expression] is a sequence of characters that allows you to describe string patterns. We use them to search for patterns in text, for example to:

- extract a phone number from text data
- determine if an email address is valid
- determine if a password has the required number of letters, characters, and symbols
- count the number of times "statistics" occurs in a corpus of text
- ...
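Two of these tasks can be sketched with `stringr` functions. The inputs below (`corpus`, `passwords`) are hypothetical and made up for illustration, and the check shown is only one simple rule (at least one digit), not a full password policy.

```r
library(stringr)

# Hypothetical inputs, made up for illustration
corpus <- "Statistics is fun. I like statistics, and statistics likes me."
passwords <- c("qwerty", "abc123", "iloveyou")

# Count occurrences of lower-case "statistics"; matching is case-sensitive,
# so the capitalized "Statistics" at the start is not counted
n_stats <- str_count(corpus, "statistics")

# Flag passwords that contain at least one digit
has_digit <- str_detect(passwords, "[0-9]")

n_stats    # 2
has_digit  # FALSE TRUE FALSE
```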
---

## Basic Match

To demonstrate the power of regular expressions, let's see if any of the 32 million leaked passwords contain the exact sequence "dog".

```r
str_subset(rockyou, "dog")[1:30]
```

```
## [1] "catdog" "hotdog" "bulldogs" "bulldog" "doggie"
## [6] "bigdog" "maddog" "snoopdogg" "puppydog" "doggy"
## [11] "dog123" "snoopdog" "ilovedogs" "doggies" "luckydog"
## [16] "catdog1" "dogdog" "reddog" "bulldog1" "mollydog"
## [21] "hotdog1" "bulldogs1" "dogcat" "doggy1" "hotdogs"
## [26] "dogsrule" "thedog" "catsanddogs" "topdog" "daisydog"
```

---

## Basic Match

What about "d-g", that is "d", then any one character, then "g"? Match any single character (except a newline) using `.`

```r
str_subset(rockyou, "d.g")[1:30]
```

```
## [1] "asdfgh" "asdfghjkl" "catdog" "hotdog" "bulldogs"
## [6] "bulldog" "asdfg" "doggie" "bigdog" "maddog"
## [11] "digger" "digimon" "digital" "candygirl" "snoopdogg"
## [16] "puppydog" "doggy" "dog123" "snoopdog" "asdfghj"
## [21] "ilovedogs" "doggies" "asdfghjk" "luckydog" "catdog1"
## [26] "indigo" "dogdog" "madagascar" "reddog" "bulldog1"
```

---

## Anchors

Match the start of a string using `^`

.inverse[
```r
str_view_all(rockyou20, "^P")
```
]

```r
rockyou20
```

```
## [1] "123456" "12345" "123456789" "password" "iloveyou" "princess"
## [7] "1234567" "rockyou" "12345678" "abc123" "nicole" "daniel"
## [13] "babygirl" "monkey" "lovely" "jessica" "654321" "michael"
## [19] "ashley" "qwerty"
```

---

## Anchors

Match the end of a string using `$`

.inverse[
```r
str_view_all(rockyou20, "u$", match = TRUE)
```
]

---

## `str_detect`

Determine whether each string in a character vector matches a pattern.

```r
rockyou20
```

```
## [1] "123456" "12345" "123456789" "password" "iloveyou" "princess"
## [7] "1234567" "rockyou" "12345678" "abc123" "nicole" "daniel"
## [13] "babygirl" "monkey" "lovely" "jessica" "654321" "michael"
## [19] "ashley" "qwerty"
```

```r
str_detect(rockyou20, "a")
```

```
## [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
## [13] TRUE FALSE FALSE TRUE FALSE TRUE TRUE FALSE
```

---

## `str_count`

How many matches are there in each string?

```r
rockyou20
```

```
## [1] "123456" "12345" "123456789" "password" "iloveyou" "princess"
## [7] "1234567" "rockyou" "12345678" "abc123" "nicole" "daniel"
## [13] "babygirl" "monkey" "lovely" "jessica" "654321" "michael"
## [19] "ashley" "qwerty"
```

```r
str_count(rockyou20, "s")
```

```
## [1] 0 0 0 2 0 2 0 0 0 0 0 0 0 0 0 2 0 0 1 0
```

---

## `str_replace_all`

Replace all matches with a new string.

```r
str_replace_all(rockyou20, "s", "-")
```

```
## [1] "123456" "12345" "123456789" "pa--word" "iloveyou" "prince--"
## [7] "1234567" "rockyou" "12345678" "abc123" "nicole" "daniel"
## [13] "babygirl" "monkey" "lovely" "je--ica" "654321" "michael"
## [19] "a-hley" "qwerty"
```

---

## Many Matches

Each regular expression below matches any one character from a set of characters.

- Match any digit using `\d` or `[[:digit:]]`
- Match any whitespace using `\s` or `[[:space:]]`
- Match f, g, or h using `[fgh]`
- Match anything but f, g, or h using `[^fgh]`
- Match lower-case letters using `[a-z]` or `[[:lower:]]`
- Match upper-case letters using `[A-Z]` or `[[:upper:]]`
- Match alphabetic characters using `[A-Za-z]` or `[[:alpha:]]`

Remember these are regular expressions!
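A minimal sketch of a few of these character classes in use. The vector `x` is hypothetical, chosen so that each class matches something different; note that the backslash in `\d` is doubled inside the R string.

```r
library(stringr)

# Hypothetical strings, made up for illustration
x <- c("abc123", "QWERTY", "hello world", "f_g")

has_digit <- str_detect(x, "\\d")          # any digit
has_space <- str_detect(x, "[[:space:]]")  # any whitespace
has_fgh   <- str_detect(x, "[fgh]")        # f, g, or h (case-sensitive)
has_upper <- str_detect(x, "[A-Z]")        # upper-case letters

# Count the characters in each string that are NOT f, g, or h
n_not_fgh <- str_count(x, "[^fgh]")

has_digit  # TRUE FALSE FALSE FALSE
has_fgh    # FALSE FALSE TRUE TRUE
n_not_fgh  # 6 6 10 1
```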
To match digits you'll need to *escape* the backslash inside the R string, so use `"\\d"`, not `"\d"`.

---

## Additional resources

- `stringr` website: https://stringr.tidyverse.org/
- `stringr` and `regex` [cheat sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/master/strings.pdf)
- [Chapter 14: Strings](https://r4ds.had.co.nz/strings.html#matching-patterns-with-regular-expressions) in R for Data Science