Tidying Datatidyr 
2020-08-221 / 39

tidyr

Functions for tidying data.

What is tidy data?

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham

2 / 39

Tidy Data

3 / 39

Tidy Data

Each column is a single variable

4 / 39

Tidy Data

Each column is a single variable

Each row is a single observation

5 / 39

Tidy Data

Each column is a single variable

Each row is a single observation

Each cell is a value

6 / 39

`pivot_longer()`

pivot_longer(<DATA>, <NAMES TO>, <VALUES TO>, <VARIABLES>)

7 / 39

Lord of the Rings

lotr <- tribble(
                         ~film,    ~race, ~female, ~male,
  "The Fellowship Of The Ring",    "Elf",   1229L,  971L,
  "The Fellowship Of The Ring", "Hobbit",     14L, 3644L,
  "The Fellowship Of The Ring",    "Man",      0L, 1995L,
              "The Two Towers",    "Elf",    331L,  513L,
              "The Two Towers", "Hobbit",      0L, 2463L,
              "The Two Towers",    "Man",    401L, 3589L,
      "The Return Of The King",    "Elf",    183L,  510L,
      "The Return Of The King", "Hobbit",      2L, 2673L,
      "The Return Of The King",    "Man",    268L, 2459L
)

8 / 39

Lord of the Rings

lotr

## # A tibble: 9 x 4
##   film                       race   female  male
##   <chr>                      <chr>   <int> <int>
## 1 The Fellowship Of The Ring Elf      1229   971
## 2 The Fellowship Of The Ring Hobbit     14  3644
## 3 The Fellowship Of The Ring Man         0  1995
## 4 The Two Towers             Elf       331   513
## 5 The Two Towers             Hobbit      0  2463
## 6 The Two Towers             Man       401  3589
## 7 The Return Of The King     Elf       183   510
## 8 The Return Of The King     Hobbit      2  2673
## 9 The Return Of The King     Man       268  2459

9 / 39

10 / 39

`pivot_longer()`

lotr %>%
  pivot_longer(
    names_to = "sex",
    values_to = "words",
    cols = female:male
  )

11 / 39

lotr %>%
  pivot_longer(
    names_to = "sex",
    values_to = "words",
    cols = female:male
  )

## # A tibble: 18 x 4
##    film                       race   sex    words
##    <chr>                      <chr>  <chr>  <int>
##  1 The Fellowship Of The Ring Elf    female  1229
##  2 The Fellowship Of The Ring Elf    male     971
##  3 The Fellowship Of The Ring Hobbit female    14
##  4 The Fellowship Of The Ring Hobbit male    3644
##  5 The Fellowship Of The Ring Man    female     0
##  6 The Fellowship Of The Ring Man    male    1995
##  7 The Two Towers             Elf    female   331
##  8 The Two Towers             Elf    male     513
##  9 The Two Towers             Hobbit female     0
## 10 The Two Towers             Hobbit male    2463
## # … with 8 more rows

12 / 39

13 / 39

Your Turn 1Use pivot_longer() to reorganize table4a into three columns: country, year, and cases.14 / 39

Your Turn 1

table4a %>%
  pivot_longer(
    names_to = "year",
    values_to = "cases",
    cols = -country
  )

## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Afghanistan 2000    2666
## 3 Brazil      1999   37737
## 4 Brazil      2000   80488
## 5 China       1999  212258
## 6 China       2000  213766

15 / 39

`pivot_wider()`

pivot_wider(<DATA>, <NAMES FROM>, <VALUES FROM>)

16 / 39

`pivot_wider()`

lotr %>%
  pivot_longer(
    names_to = "sex",
    values_to = "words",
    cols = female:male
  ) %>%
  pivot_wider(
    names_from = race,
    values_from = words
  )

17 / 39

`pivot_wider()`

lotr %>%
  pivot_longer(
    names_to = "sex",
    values_to = "words",
    cols = female:male
  ) %>%
  pivot_wider(
    names_from = race,
    values_from = words
  )

18 / 39

`pivot_wider()`

lotr %>%
  pivot_longer(
    names_to = "sex",
    values_to = "words",
    cols = female:male
  ) %>%
  pivot_wider(
    names_from = race,
    values_from = words
  )

## # A tibble: 6 x 5
##   film                       sex      Elf Hobbit   Man
##   <chr>                      <chr>  <int>  <int> <int>
## 1 The Fellowship Of The Ring female  1229     14     0
## 2 The Fellowship Of The Ring male     971   3644  1995
## 3 The Two Towers             female   331      0   401
## 4 The Two Towers             male     513   2463  3589
## 5 The Return Of The King     female   183      2   268
## 6 The Return Of The King     male     510   2673  2459

19 / 39

Your Turn 2

Use `pivot_wider()` to reorganize `table2` into four columns: country, year, cases, and population.

Create a new variable called `prevalence` that divides `cases` by `population` multiplied by 100000.

Pass the data frame to a ggplot. Make a scatter plot with `year` on the x axis and `prevalence` on the y axis. Set the color aesthetic (`aes()`) to `country`. Use `size = 2` for the points. Add a line geom.

table2

20 / 39

Your Turn 2

table2 %>%
  pivot_wider(
    names_from = type,
    values_from = count
  ) %>% 
  mutate(prevalence = (cases / population) * 100000)

## # A tibble: 6 x 5
##   country      year  cases population prevalence
##   <chr>       <int>  <int>      <int>      <dbl>
## 1 Afghanistan  1999    745   19987071       3.73
## 2 Afghanistan  2000   2666   20595360      12.9 
## 3 Brazil       1999  37737  172006362      21.9 
## 4 Brazil       2000  80488  174504898      46.1 
## 5 China        1999 212258 1272915272      16.7 
## 6 China        2000 213766 1280428583      16.7

21 / 39

Your Turn 2

table2 %>%
  pivot_wider(
    names_from = type,
    values_from = count
  ) %>%
  mutate(prevalence = (cases / population) * 100000) %>%
  ggplot(aes(x = year, y = prevalence, color = country)) +
  geom_point(size = 2) +
  geom_line() +
  scale_x_continuous(breaks = c(1999L, 2000L))

22 / 39

Your Turn 2

23 / 39

`pivot_longer()` and `pivot_wider()`

Animation: Garrick Aden-Buie

24 / 39

Your Turn 3

Pivot the 5th through 60th columns of `who` into a key column: value column pair named codes and n. Then select just the `county`, `year`, `codes` and `n` variables.

who

25 / 39

Your Turn 3

who %>%
  pivot_longer(
    names_to = "codes",
    values_to = "n",
    cols = 5:60
  ) %>% 
  select(country, year, codes, n)

26 / 39

Your Turn 3

## # A tibble: 405,440 x 4
##    country      year codes            n
##    <chr>       <int> <chr>        <int>
##  1 Afghanistan  1980 new_sp_m014     NA
##  2 Afghanistan  1980 new_sp_m1524    NA
##  3 Afghanistan  1980 new_sp_m2534    NA
##  4 Afghanistan  1980 new_sp_m3544    NA
##  5 Afghanistan  1980 new_sp_m4554    NA
##  6 Afghanistan  1980 new_sp_m5564    NA
##  7 Afghanistan  1980 new_sp_m65      NA
##  8 Afghanistan  1980 new_sp_f014     NA
##  9 Afghanistan  1980 new_sp_f1524    NA
## 10 Afghanistan  1980 new_sp_f2534    NA
## # … with 405,430 more rows

27 / 39

`separate()`/`unite()`

separate(<DATA>, <VARIABLE>, into = c("<VARIABLE1>", "<VARIABLE2>"))
unite(<DATA>, <VARIABLES>)

28 / 39

Your Turn 4

Use the `cases` data below. Separate the `sex_age` column into sex and age columns.

cases <- tribble(
   ~id,     ~sex_age,
   "1",    "male_56",
   "2",  "female_77",    
   "3",  "female_49"
)
separate(_______, _______, into = c("_______", "_______"))

29 / 39

Your Turn 4

cases <- tribble(
   ~id,     ~sex_age,
   "1",    "male_56",
   "2",  "female_77",    
   "3",  "female_49"
)
separate(cases, sex_age, into = c("sex", "age"))

30 / 39

Your Turn 4

cases <- tribble(
   ~id,     ~sex_age,
   "1",    "male_56",
   "2",  "female_77",    
   "3",  "female_49"
)
separate(cases, sex_age, into = c("sex", "age"))

## # A tibble: 3 x 3
##   id    sex    age  
##   <chr> <chr>  <chr>
## 1 1     male   56   
## 2 2     female 77   
## 3 3     female 49

31 / 39

Your Turn 4

cases <- tribble(
   ~id,     ~sex_age,
   "1",    "male_56",
   "2",  "female_77",    
   "3",  "female_49"
)
separate(cases, sex_age, into = c("sex", "age"))

## # A tibble: 3 x 3
##   id    sex    age  
##   <chr> <chr>  <chr>
## 1 1     male   56   
## 2 2     female 77   
## 3 3     female 49

32 / 39

Your Turn 4

cases <- tribble(
   ~id,     ~sex_age,
   "1",    "male_56",
   "2",  "female_77",    
   "3",  "female_49"
)
separate(cases, sex_age, into = c("sex", "age"), convert = TRUE)

## # A tibble: 3 x 3
##   id    sex      age
##   <chr> <chr>  <int>
## 1 1     male      56
## 2 2     female    77
## 3 3     female    49

33 / 39

Your Turn 5: Challenge!There are two CSV files in this folder containing SEER data in breast cancer incidence in white and black women. For both sets of data:Import the dataPivot the columns into 2 new columns called year and incidenceAdd a new variable called race. Remember that each data set corresponds to a single race.Bind the data sets together using bind_rows() from the dplyr package. Either save it as a new object or pipe the result directly into the ggplot2 code.Plot the data using the code below. Fill in the blanks to have year on the x-axis, incidence on the y-axis, and race as the color aesthetic.34 / 39

Other neat tidyr tools

Uncounting frequency tables

lotr %>%
  pivot_longer(
    names_to = "sex",
    values_to = "count",
    cols = c(female, male)
  ) %>% 
  uncount(count)

35 / 39

Other neat tidyr tools

Uncounting frequency tables

## # A tibble: 21,245 x 3
##    film                       race  sex   
##    <chr>                      <chr> <chr> 
##  1 The Fellowship Of The Ring Elf   female
##  2 The Fellowship Of The Ring Elf   female
##  3 The Fellowship Of The Ring Elf   female
##  4 The Fellowship Of The Ring Elf   female
##  5 The Fellowship Of The Ring Elf   female
##  6 The Fellowship Of The Ring Elf   female
##  7 The Fellowship Of The Ring Elf   female
##  8 The Fellowship Of The Ring Elf   female
##  9 The Fellowship Of The Ring Elf   female
## 10 The Fellowship Of The Ring Elf   female
## # … with 21,235 more rows

36 / 39

Other neat tidyr toolsWork with data framescrossing() and expand()nest() and unnest()37 / 39

Other neat tidyr toolsWork with missing datacomplete()drop_na() and replace_na()38 / 39

Resources

R for Data Science: A comprehensive but friendly introduction to the tidyverse. Free online.

RStudio Primers: Free interactive courses in the Tidyverse

39 / 39

Help

Keyboard shortcuts

↑, ←, Pg Up, k

Go to previous slide

↓, →, Pg Dn, Space, j

Go to next slide

Home

Go to first slide

End

Go to last slide

Number + Return

Go to specific slide

b / m / f

Toggle blackout / mirrored / fullscreen mode

Clone slideshow

Toggle presenter mode

Restart the presentation timer

?, h

Toggle this help

Tidying Data

tidyr

2020-08-22

tidyr

Functions for tidying data.

What is tidy data?

Tidy Data

Tidy Data

Each column is a single variable

Tidy Data

Each column is a single variable

Each row is a single observation

Tidy Data

Each column is a single variable

Each row is a single observation

Each cell is a value

pivot_longer()

Lord of the Rings

Lord of the Rings

pivot_longer()

Your Turn 1

Use pivot_longer() to reorganize table4a into three columns: country, year, and cases.

Your Turn 1

pivot_wider()

pivot_wider()

pivot_wider()

pivot_wider()

Your Turn 2

Use pivot_wider() to reorganize table2 into four columns: country, year, cases, and population.

Create a new variable called prevalence that divides cases by population multiplied by 100000.

Pass the data frame to a ggplot. Make a scatter plot with year on the x axis and prevalence on the y axis. Set the color aesthetic (aes()) to country. Use size = 2 for the points. Add a line geom.

Your Turn 2

Your Turn 2

Your Turn 2

pivot_longer() and pivot_wider()

Your Turn 3

Pivot the 5th through 60th columns of who into a key column: value column pair named codes and n. Then select just the county, year, codes and n variables.

Your Turn 3

Your Turn 3

separate()/unite()

Your Turn 4

Use the cases data below. Separate the sex_age column into sex and age columns.

Your Turn 4

Your Turn 4

Your Turn 4

Your Turn 4

Your Turn 5: Challenge!

There are two CSV files in this folder containing SEER data in breast cancer incidence in white and black women. For both sets of data:

Import the data

Pivot the columns into 2 new columns called year and incidence

Add a new variable called race. Remember that each data set corresponds to a single race.

Bind the data sets together using bind_rows() from the dplyr package. Either save it as a new object or pipe the result directly into the ggplot2 code.

Plot the data using the code below. Fill in the blanks to have year on the x-axis, incidence on the y-axis, and race as the color aesthetic.

Other neat tidyr tools

Uncounting frequency tables

Other neat tidyr tools

Uncounting frequency tables

Other neat tidyr tools

Work with data frames

crossing() and expand()

nest() and unnest()

Other neat tidyr tools

Work with missing data

complete()

drop_na() and replace_na()

Resources

R for Data Science: A comprehensive but friendly introduction to the tidyverse. Free online.

RStudio Primers: Free interactive courses in the Tidyverse

tidyr

Functions for tidying data.

What is tidy data?

Help

`pivot_longer()`

`pivot_longer()`

Use `pivot_longer()` to reorganize `table4a` into three columns: country, year, and cases.

`pivot_wider()`

`pivot_wider()`

`pivot_wider()`

`pivot_wider()`

Use `pivot_wider()` to reorganize `table2` into four columns: country, year, cases, and population.

Create a new variable called `prevalence` that divides `cases` by `population` multiplied by 100000.

Pass the data frame to a ggplot. Make a scatter plot with `year` on the x axis and `prevalence` on the y axis. Set the color aesthetic (`aes()`) to `country`. Use `size = 2` for the points. Add a line geom.

`pivot_longer()` and `pivot_wider()`

Pivot the 5th through 60th columns of `who` into a key column: value column pair named codes and n. Then select just the `county`, `year`, `codes` and `n` variables.

`separate()`/`unite()`

Use the `cases` data below. Separate the `sex_age` column into sex and age columns.

Pivot the columns into 2 new columns called `year` and `incidence`

Add a new variable called `race`. Remember that each data set corresponds to a single race.

Bind the data sets together using `bind_rows()` from the dplyr package. Either save it as a new object or pipe the result directly into the ggplot2 code.

Plot the data using the code below. Fill in the blanks to have `year` on the x-axis, `incidence` on the y-axis, and `race` as the color aesthetic.

`crossing()` and `expand()`

`nest()` and `unnest()`

`complete()`

`drop_na()` and `replace_na()`