Wrangling data with dplyr2020-08-221 / 75

Art by Allison Horst

2 / 75

The main verbs of dplyrselect()filter()mutate()arrange()summarize()group_by()3 / 75

The main verbs of dplyrselect() = Subset columns (variables)filter()mutate()arrange()summarize()group_by()4 / 75

`select()`

select(<DATA>, <VARIABLES>)

5 / 75

`select()`

select(<DATA>, <VARIABLES>)

diamonds

## # A tibble: 53,940 x 10
##    carat cut     color clarity depth table price     x     y
##    <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl>
##  1 0.23  Ideal   E     SI2      61.5    55   326  3.95  3.98
##  2 0.21  Premium E     SI1      59.8    61   326  3.89  3.84
##  3 0.23  Good    E     VS1      56.9    65   327  4.05  4.07
##  4 0.290 Premium I     VS2      62.4    58   334  4.2   4.23
##  5 0.31  Good    J     SI2      63.3    58   335  4.34  4.35
##  6 0.24  Very G… J     VVS2     62.8    57   336  3.94  3.96
##  7 0.24  Very G… I     VVS1     62.3    57   336  3.95  3.98
##  8 0.26  Very G… H     SI1      61.9    55   337  4.07  4.11
##  9 0.22  Fair    E     VS2      65.1    61   337  3.87  3.78
## 10 0.23  Very G… H     VS1      59.4    61   338  4     4.05
## # … with 53,930 more rows, and 1 more variable: z <dbl>

6 / 75

7 / 75

`select()`

select(diamonds, carat, cut, color, clarity)

8 / 75

`select()`

select(diamonds, carat, cut, color, clarity)

## # A tibble: 53,940 x 4
##    carat cut       color clarity
##    <dbl> <ord>     <ord> <ord>  
##  1 0.23  Ideal     E     SI2    
##  2 0.21  Premium   E     SI1    
##  3 0.23  Good      E     VS1    
##  4 0.290 Premium   I     VS2    
##  5 0.31  Good      J     SI2    
##  6 0.24  Very Good J     VVS2   
##  7 0.24  Very Good I     VVS1   
##  8 0.26  Very Good H     SI1    
##  9 0.22  Fair      E     VS2    
## 10 0.23  Very Good H     VS1    
## # … with 53,930 more rows

9 / 75

`select()`

select(diamonds, carat, cut, color, clarity)
select(diamonds, carat:clarity)
select(diamonds, 1:4)
select(diamonds, starts_with("c"))
?select_helpers

10 / 75

gapminder

library(gapminder)
gapminder

## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # … with 1,694 more rows

11 / 75

12 / 75

Your turn 1

Alter the code to select just the `pop` column:

select(gapminder, year, lifeExp)

13 / 75

Your Turn 1

select(gapminder, pop)

## # A tibble: 1,704 x 1
##         pop
##       <int>
##  1  8425333
##  2  9240934
##  3 10267083
##  4 11537966
##  5 13079460
##  6 14880372
##  7 12881816
##  8 13867957
##  9 16317921
## 10 22227415
## # … with 1,694 more rows

14 / 75

Show of Hands

Which of these is NOT a way to select the `country` and `continent` columns together?

select(gapminder, -c(year, lifeExp, pop, gdpPercap))
select(gapminder, country:continent)
select(gapminder, starts_with("c"))
select(gapminder, ends_with("t"))

15 / 75

Show of Hands

Which of these is NOT a way to select the `country` and `continent` columns together?

select(gapminder, ends_with("t"))

## # A tibble: 1,704 x 1
##    continent
##    <fct>    
##  1 Asia     
##  2 Asia     
##  3 Asia     
##  4 Asia     
##  5 Asia     
##  6 Asia     
##  7 Asia     
##  8 Asia     
##  9 Asia     
## 10 Asia     
## # … with 1,694 more rows

16 / 75

The main verbs of dplyrselect()filter() = Subset rows by valuemutate()arrange()summarize()group_by()17 / 75

`filter()`

filter(<DATA>, <PREDICATES>)

18 / 75

`filter()`

filter(<DATA>, <PREDICATES>)

Predicates: `TRUE` or `FALSE` statements

18 / 75

`filter()`

filter(<DATA>, <PREDICATES>)

Predicates: `TRUE` or `FALSE` statements

Comparisons: `>`, `>=`, `<`, `<=`, `!=` (not equal), and `==` (equal).

19 / 75

`filter()`

filter(<DATA>, <PREDICATES>)

Predicates: `TRUE` or `FALSE` statements

Comparisons: `>`, `>=`, `<`, `<=`, `!=` (not equal), and `==` (equal).

Operators: `&` is "and", `|` is "or", and `!` is "not"

20 / 75

`filter()`

filter(<DATA>, <PREDICATES>)

Predicates: `TRUE` or `FALSE` statements

Comparisons: `>`, `>=`, `<`, `<=`, `!=` (not equal), and `==` (equal).

Operators: `&` is "and", `|` is "or", and `!` is "not"

`%in%`

21 / 75

`filter()`

filter(<DATA>, <PREDICATES>)

Predicates: `TRUE` or `FALSE` statements

Comparisons: `>`, `>=`, `<`, `<=`, `!=` (not equal), and `==` (equal).

Operators: `&` is "and", `|` is "or", and `!` is "not"

`%in%`

"a" %in% c("a", "b", "c")

## [1] TRUE

21 / 75

`filter()`

filter(diamonds, cut == "Ideal", carat > 3)

22 / 75

`filter()`

filter(diamonds, cut == "Ideal", carat > 3)

## # A tibble: 4 x 10
##   carat cut   color clarity depth table price     x     y
##   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl>
## 1  3.22 Ideal I     I1       62.6    55 12545  9.49  9.42
## 2  3.5  Ideal H     I1       62.8    57 12587  9.65  9.59
## 3  3.01 Ideal J     SI2      61.7    58 16037  9.25  9.2 
## 4  3.01 Ideal J     I1       65.4    60 16538  8.99  8.93
## # … with 1 more variable: z <dbl>

23 / 75

Your turn 2Show:All of the rows where pop is greater than or equal to 100000All of the rows for El SalvadorAll of the rows that have a missing value for year  (no need to edit this code)24 / 75

Your turn 2

Show:

All of the rows where `pop` is greater than or equal to 100000

All of the rows for El Salvador

All of the rows that have a missing value for `year` (no need to edit this code)

filter(gapminder, pop >= 100000)
filter(gapminder, country == "El Salvador")
filter(gapminder, is.na(year))

25 / 75

`filter()`

filter(diamonds, cut == "Ideal" | cut == "Very Good", carat > 3)

## # A tibble: 6 x 10
##   carat cut      color clarity depth table price     x     y
##   <dbl> <ord>    <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl>
## 1  3.22 Ideal    I     I1       62.6    55 12545  9.49  9.42
## 2  3.5  Ideal    H     I1       62.8    57 12587  9.65  9.59
## 3  3.04 Very Go… I     SI2      63.2    59 15354  9.14  9.07
## 4  4    Very Go… I     I1       63.3    58 15984 10.0   9.94
## 5  3.01 Ideal    J     SI2      61.7    58 16037  9.25  9.2 
## 6  3.01 Ideal    J     I1       65.4    60 16538  8.99  8.93
## # … with 1 more variable: z <dbl>

26 / 75

Your turn 3

Use Boolean operators to alter the code below to return only the rows that contain:

El Salvador

Countries that had populations over 100000 in 1960 or earlier

filter(gapminder, country == "El Salvador" | country == "Oman")
filter(______, _______)

27 / 75

Your turn 3

Use Boolean operators to alter the code below to return only the rows that contain:

El Salvador

Countries that had populations over 100000 in 1960 or earlier

filter(gapminder, country == "El Salvador")
filter(gapminder, pop > 100000, year <= 1960)

28 / 75

The main verbs of dplyrselect()filter()mutate() = Change or add a variablearrange()summarize()group_by()29 / 75

`mutate()`

mutate(<DATA>, <NAME> = <FUNCTION>)

30 / 75

`mutate()`

mutate(diamonds, log_price = log(price), log_pricesq = log_price^2)

31 / 75

`mutate()`

mutate(diamonds, log_price = log(price), log_pricesq = log_price^2)

## # A tibble: 53,940 x 12
##    carat cut     color clarity depth table price     x     y
##    <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl>
##  1 0.23  Ideal   E     SI2      61.5    55   326  3.95  3.98
##  2 0.21  Premium E     SI1      59.8    61   326  3.89  3.84
##  3 0.23  Good    E     VS1      56.9    65   327  4.05  4.07
##  4 0.290 Premium I     VS2      62.4    58   334  4.2   4.23
##  5 0.31  Good    J     SI2      63.3    58   335  4.34  4.35
##  6 0.24  Very G… J     VVS2     62.8    57   336  3.94  3.96
##  7 0.24  Very G… I     VVS1     62.3    57   336  3.95  3.98
##  8 0.26  Very G… H     SI1      61.9    55   337  4.07  4.11
##  9 0.22  Fair    E     VS2      65.1    61   337  3.87  3.78
## 10 0.23  Very G… H     VS1      59.4    61   338  4     4.05
## # … with 53,930 more rows, and 3 more variables: z <dbl>,
## #   log_price <dbl>, log_pricesq <dbl>

32 / 75

`mutate(across())`

mutate(
  <DATA>, 
  across(c(<VARIABLES>), list(<NAMES> = <FUNCTIONS>))
)

33 / 75

mutate(
  diamonds, 
  across(c("carat", "depth"), list(sd = sd, mean = mean))
)

34 / 75

mutate(
  diamonds, 
  across(c("carat", "depth"), list(sd = sd, mean = mean))
)

## # A tibble: 53,940 x 14
##    carat cut     color clarity depth table price     x     y
##    <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl>
##  1 0.23  Ideal   E     SI2      61.5    55   326  3.95  3.98
##  2 0.21  Premium E     SI1      59.8    61   326  3.89  3.84
##  3 0.23  Good    E     VS1      56.9    65   327  4.05  4.07
##  4 0.290 Premium I     VS2      62.4    58   334  4.2   4.23
##  5 0.31  Good    J     SI2      63.3    58   335  4.34  4.35
##  6 0.24  Very G… J     VVS2     62.8    57   336  3.94  3.96
##  7 0.24  Very G… I     VVS1     62.3    57   336  3.95  3.98
##  8 0.26  Very G… H     SI1      61.9    55   337  4.07  4.11
##  9 0.22  Fair    E     VS2      65.1    61   337  3.87  3.78
## 10 0.23  Very G… H     VS1      59.4    61   338  4     4.05
## # … with 53,930 more rows, and 5 more variables: z <dbl>,
## #   carat_sd <dbl>, carat_mean <dbl>, depth_sd <dbl>,
## #   depth_mean <dbl>

35 / 75

`mutate(across(where()))`

mutate(
  <DATA>, 
  across(where(<CONDITION>), list(<NAMES> = <FUNCTIONS>))
)

36 / 75

mutate(
  gapminder, 
  across(where(is.numeric), list(mean = mean, median = median))
)

37 / 75

mutate(
  gapminder, 
  across(where(is.numeric), list(mean = mean, median = median))
)

## # A tibble: 1,704 x 14
##    country continent  year lifeExp    pop gdpPercap
##    <fct>   <fct>     <int>   <dbl>  <int>     <dbl>
##  1 Afghan… Asia       1952    28.8 8.43e6      779.
##  2 Afghan… Asia       1957    30.3 9.24e6      821.
##  3 Afghan… Asia       1962    32.0 1.03e7      853.
##  4 Afghan… Asia       1967    34.0 1.15e7      836.
##  5 Afghan… Asia       1972    36.1 1.31e7      740.
##  6 Afghan… Asia       1977    38.4 1.49e7      786.
##  7 Afghan… Asia       1982    39.9 1.29e7      978.
##  8 Afghan… Asia       1987    40.8 1.39e7      852.
##  9 Afghan… Asia       1992    41.7 1.63e7      649.
## 10 Afghan… Asia       1997    41.8 2.22e7      635.
## # … with 1,694 more rows, and 8 more variables:
## #   year_mean <dbl>, year_median <dbl>, lifeExp_mean <dbl>,
## #   lifeExp_median <dbl>, pop_mean <dbl>, pop_median <dbl>,
## #   …

38 / 75

The main verbs of dplyrselect()filter()mutate()arrange() = Sort the data setsummarize()group_by()39 / 75

`arrange()`

arrange(<DATA>, <SORTING VARIABLE>)

40 / 75

`arrange()`

arrange(diamonds, price)

## # A tibble: 53,940 x 10
##    carat cut     color clarity depth table price     x     y
##    <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl>
##  1 0.23  Ideal   E     SI2      61.5    55   326  3.95  3.98
##  2 0.21  Premium E     SI1      59.8    61   326  3.89  3.84
##  3 0.23  Good    E     VS1      56.9    65   327  4.05  4.07
##  4 0.290 Premium I     VS2      62.4    58   334  4.2   4.23
##  5 0.31  Good    J     SI2      63.3    58   335  4.34  4.35
##  6 0.24  Very G… J     VVS2     62.8    57   336  3.94  3.96
##  7 0.24  Very G… I     VVS1     62.3    57   336  3.95  3.98
##  8 0.26  Very G… H     SI1      61.9    55   337  4.07  4.11
##  9 0.22  Fair    E     VS2      65.1    61   337  3.87  3.78
## 10 0.23  Very G… H     VS1      59.4    61   338  4     4.05
## # … with 53,930 more rows, and 1 more variable: z <dbl>

41 / 75

`arrange()`

arrange(diamonds, cut, price)

## # A tibble: 53,940 x 10
##    carat cut   color clarity depth table price     x     y
##    <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl>
##  1  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78
##  2  0.25 Fair  E     VS1      55.2    64   361  4.21  4.23
##  3  0.23 Fair  G     VVS2     61.4    66   369  3.87  3.91
##  4  0.27 Fair  E     VS1      66.4    58   371  3.99  4.02
##  5  0.3  Fair  J     VS2      64.8    58   416  4.24  4.16
##  6  0.3  Fair  F     SI1      63.1    58   496  4.3   4.22
##  7  0.34 Fair  J     SI1      64.5    57   497  4.38  4.36
##  8  0.37 Fair  F     SI1      65.3    56   527  4.53  4.47
##  9  0.3  Fair  D     SI2      64.6    54   536  4.29  4.25
## 10  0.25 Fair  D     VS1      61.2    55   563  4.09  4.11
## # … with 53,930 more rows, and 1 more variable: z <dbl>

42 / 75

`desc()`

arrange(diamonds, cut, desc(price))

## # A tibble: 53,940 x 10
##    carat cut   color clarity depth table price     x     y
##    <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl>
##  1  2.01 Fair  G     SI1      70.6    64 18574  7.43  6.64
##  2  2.02 Fair  H     VS2      64.5    57 18565  8     7.95
##  3  4.5  Fair  J     I1       65.8    58 18531 10.2  10.2 
##  4  2    Fair  G     VS2      67.6    58 18515  7.65  7.61
##  5  2.51 Fair  H     SI2      64.7    57 18308  8.44  8.5 
##  6  3.01 Fair  I     SI2      65.8    56 18242  8.99  8.94
##  7  3.01 Fair  I     SI2      65.8    56 18242  8.99  8.94
##  8  2.32 Fair  H     SI1      62      62 18026  8.47  8.31
##  9  5.01 Fair  J     I1       65.5    59 18018 10.7  10.5 
## 10  1.93 Fair  F     VS1      58.9    62 17995  8.17  7.97
## # … with 53,930 more rows, and 1 more variable: z <dbl>

43 / 75

Your turn 4Arrange gapminder by year. Add lifeExp as a second (tie breaking) variable to arrange on.Which country had the lowest life expectancy in 1952?44 / 75

Your turn 4

arrange(gapminder, year, lifeExp)

## # A tibble: 1,704 x 6
##    country       continent  year lifeExp     pop gdpPercap
##    <fct>         <fct>     <int>   <dbl>   <int>     <dbl>
##  1 Afghanistan   Asia       1952    28.8 8425333      779.
##  2 Gambia        Africa     1952    30    284320      485.
##  3 Angola        Africa     1952    30.0 4232095     3521.
##  4 Sierra Leone  Africa     1952    30.3 2143249      880.
##  5 Mozambique    Africa     1952    31.3 6446316      469.
##  6 Burkina Faso  Africa     1952    32.0 4469979      543.
##  7 Guinea-Bissau Africa     1952    32.5  580653      300.
##  8 Yemen, Rep.   Asia       1952    32.5 4963829      782.
##  9 Somalia       Africa     1952    33.0 2526994     1136.
## 10 Guinea        Africa     1952    33.6 2664249      510.
## # … with 1,694 more rows

45 / 75

Your turn 5Use desc() to find the country with the highest gdpPercap.46 / 75

Your turn 5

arrange(gapminder, desc(gdpPercap))

## # A tibble: 1,704 x 6
##    country   continent  year lifeExp     pop gdpPercap
##    <fct>     <fct>     <int>   <dbl>   <int>     <dbl>
##  1 Kuwait    Asia       1957    58.0  212846   113523.
##  2 Kuwait    Asia       1972    67.7  841934   109348.
##  3 Kuwait    Asia       1952    55.6  160000   108382.
##  4 Kuwait    Asia       1962    60.5  358266    95458.
##  5 Kuwait    Asia       1967    64.6  575003    80895.
##  6 Kuwait    Asia       1977    69.3 1140357    59265.
##  7 Norway    Europe     2007    80.2 4627926    49357.
##  8 Kuwait    Asia       2007    77.6 2505559    47307.
##  9 Singapore Asia       2007    80.0 4553009    47143.
## 10 Norway    Europe     2002    79.0 4535591    44684.
## # … with 1,694 more rows

47 / 75

Detour: The Pipe%>%Passes the result on one function to another function48 / 75

Detour: The Pipe

diamonds <- arrange(diamonds, price)
diamonds <- filter(diamonds, price > 300)
diamonds <- mutate(diamonds, log_price = log(price))
diamonds

49 / 75

Detour: The Pipe

diamonds <- diamonds %>% 
  arrange(price) %>% 
  filter(price > 300) %>% 
  mutate(log_price = log(price))
diamonds

50 / 75

Keyboard shortcutsInsert <- with alt/opt + -51 / 75

Keyboard shortcutsInsert <- with alt/opt + -Insert %>% with ctrl/cmd + shift + m51 / 75

Your turn 6Use %>% to write a sequence of functions that:1. Filter only countries that are in the continent of Oceania.2. Select the country, year and lifeExp columns3. Arrange the results so that the highest life expetency is at the top.52 / 75

Your turn 6

gapminder %>% 
  filter(continent == "Oceania") %>% 
  select(country, year, lifeExp) %>% 
  arrange(desc(lifeExp))

## # A tibble: 24 x 3
##    country      year lifeExp
##    <fct>       <int>   <dbl>
##  1 Australia    2007    81.2
##  2 Australia    2002    80.4
##  3 New Zealand  2007    80.2
##  4 New Zealand  2002    79.1
##  5 Australia    1997    78.8
##  6 Australia    1992    77.6
##  7 New Zealand  1997    77.6
##  8 New Zealand  1992    76.3
##  9 Australia    1987    76.3
## 10 Australia    1982    74.7
## # … with 14 more rows

53 / 75

Challenge!1. Import the diabetes data from the importing data. A copy of the CSV file is available in this folder.2. Add the variable bmi to the data set using height and weight using the formula: (weight / height^2) * 7033. Select just id, glyhb, and the new variable you created.4. Filter rows that have BMI > 35. How many rows and columns are in your new data set?54 / 75

diabetes <- read_csv("diabetes.csv")
diabetes %>% 
  mutate(bmi = (weight / height^2) * 703) %>% 
  select(id, glyhb, bmi) %>% 
  filter(bmi > 35)

55 / 75

diabetes <- read_csv("diabetes.csv")
diabetes %>% 
  mutate(bmi = (weight / height^2) * 703) %>% 
  select(id, glyhb, bmi) %>% 
  filter(bmi > 35)

## # A tibble: 61 x 3
##       id glyhb   bmi
##    <dbl> <dbl> <dbl>
##  1  1001  4.44  37.4
##  2  1002  4.64  48.4
##  3  1022  5.78  35.8
##  4  1029  4.97  40.8
##  5  1253  4.67  36.0
##  6  1254 12.7   42.5
##  7  1280  5.10  38.3
##  8  1501  4.41  40.0
##  9  2753  5.57  35.3
## 10  2757  6.33  35.3
## # … with 51 more rows

56 / 75

The main verbs of dplyrselect()filter()mutate()arrange()summarize() = Summarize the datagroup_by() = Group the data57 / 75

`summarize()`

summarize(<DATA>, <NAME> = <FUNCTION>)

58 / 75

`summarize()`

summarize(diamonds, n = n(), mean_price = mean(price))

## # A tibble: 1 x 2
##       n mean_price
##   <int>      <dbl>
## 1 53940      3933.

59 / 75

Your turn 7Use summarise() to compute three statistics about the gapminder data set:1. The first (min()) year in the data2. The last (max()) year in the data3. The total number of observations (n()) and the total number of unique countries in the data (n_distinct())60 / 75

Your turn 7

gapminder %>% 
  summarize(
    first = min(year), 
    last = max(year), 
    n = n(), 
    n_countries = n_distinct(country)
  )

## # A tibble: 1 x 4
##   first  last     n n_countries
##   <int> <int> <int>       <int>
## 1  1952  2007  1704         142

61 / 75

`group_by()`

group_by(<DATA>, <VARIABLE>)

62 / 75

`group_by()`

diamonds %>% 
  group_by(cut)

63 / 75

`group_by()`

diamonds %>% 
  group_by(cut)

## # A tibble: 53,940 x 10
## # Groups:   cut [5]
##    carat cut     color clarity depth table price     x     y
##    <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl>
##  1 0.23  Ideal   E     SI2      61.5    55   326  3.95  3.98
##  2 0.21  Premium E     SI1      59.8    61   326  3.89  3.84
##  3 0.23  Good    E     VS1      56.9    65   327  4.05  4.07
##  4 0.290 Premium I     VS2      62.4    58   334  4.2   4.23
##  5 0.31  Good    J     SI2      63.3    58   335  4.34  4.35
##  6 0.24  Very G… J     VVS2     62.8    57   336  3.94  3.96
##  7 0.24  Very G… I     VVS1     62.3    57   336  3.95  3.98
##  8 0.26  Very G… H     SI1      61.9    55   337  4.07  4.11
##  9 0.22  Fair    E     VS2      65.1    61   337  3.87  3.78
## 10 0.23  Very G… H     VS1      59.4    61   338  4     4.05
## # … with 53,930 more rows, and 1 more variable: z <dbl>

64 / 75

`group_by()`

diamonds %>% 
  group_by(cut) %>%
  summarize(n = n(), mean_price = mean(price))

65 / 75

`group_by()`

diamonds %>% 
  group_by(cut) %>% 
  summarize(n = n(), mean_price = mean(price))

## # A tibble: 5 x 3
##   cut           n mean_price
##   <ord>     <int>      <dbl>
## 1 Fair       1610      4359.
## 2 Good       4906      3929.
## 3 Very Good 12082      3982.
## 4 Premium   13791      4584.
## 5 Ideal     21551      3458.

66 / 75

`group_by()`

diamonds %>% 
  group_by(cut) %>% 
  mutate(n = n(), mean_price = mean(price))

67 / 75

`group_by()`

diamonds %>% 
  group_by(cut) %>% 
  mutate(n = n(), mean_price = mean(price))

## # A tibble: 53,940 x 12
## # Groups:   cut [5]
##    carat cut     color clarity depth table price     x     y
##    <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl>
##  1 0.23  Ideal   E     SI2      61.5    55   326  3.95  3.98
##  2 0.21  Premium E     SI1      59.8    61   326  3.89  3.84
##  3 0.23  Good    E     VS1      56.9    65   327  4.05  4.07
##  4 0.290 Premium I     VS2      62.4    58   334  4.2   4.23
##  5 0.31  Good    J     SI2      63.3    58   335  4.34  4.35
##  6 0.24  Very G… J     VVS2     62.8    57   336  3.94  3.96
##  7 0.24  Very G… I     VVS1     62.3    57   336  3.95  3.98
##  8 0.26  Very G… H     SI1      61.9    55   337  4.07  4.11
##  9 0.22  Fair    E     VS2      65.1    61   337  3.87  3.78
## 10 0.23  Very G… H     VS1      59.4    61   338  4     4.05
## # … with 53,930 more rows, and 3 more variables: z <dbl>,
## #   n <int>, mean_price <dbl>

68 / 75

Your turn 8Extract the rows where continent == "Europe". Then use group_by() to group by country. Finally, use summarize() to compute:1. The total number of observations for each country in Europe2. The lowest observed life expectancy for each country69 / 75

Your turn 8

gapminder %>% 
  filter(continent == "Europe") %>% 
  group_by(country) %>% 
  summarize(n = n(), min_le = min(lifeExp))

## # A tibble: 30 x 3
##    country                    n min_le
##    <fct>                  <int>  <dbl>
##  1 Albania                   12   55.2
##  2 Austria                   12   66.8
##  3 Belgium                   12   68  
##  4 Bosnia and Herzegovina    12   53.8
##  5 Bulgaria                  12   59.6
##  6 Croatia                   12   61.2
##  7 Czech Republic            12   66.9
##  8 Denmark                   12   70.8
##  9 Finland                   12   66.6
## 10 France                    12   67.4
## # … with 20 more rows

70 / 75

Your turn 9

Use grouping to calculate the mean life expectancy for each continent and year. Call the mean life expectancy variable `mean_le`. Plot the life expectancy over time (no need to change the plot code).

gapminder %>% 
  __________ %>% 
  __________ %>% 
  ggplot(aes(x = year, y = mean_le, col = continent)) +
    geom_line() + 
    scale_color_manual(values = continent_colors)

71 / 75

Your turn 9

Use grouping to calculate the mean life expectancy for each continent and year. Call the mean life expectancy variable `mean_le`. Plot the life expectancy over time (no need to change the plot code).

gapminder %>% 
  group_by(continent, year) %>% 
  summarize(mean_le = mean(lifeExp)) %>% 
  ggplot(aes(x = year, y = mean_le, col = continent)) +
    geom_line() + 
    scale_color_manual(values = continent_colors)

72 / 75

73 / 75

Joining dataUse left_join(), right_join(), full_join(), or inner_join() to join datasetsUse semi_join() or anti_join() to filter datasets against each other74 / 75

Resources

R for Data Science: A comprehensive but friendly introduction to the tidyverse. Free online.

RStudio Primers: Free interactive courses in the Tidyverse

10 dplyr tips: a Twitter thread on other useful aspects of dplyr

75 / 75

Help

Keyboard shortcuts

↑, ←, Pg Up, k

Go to previous slide

↓, →, Pg Dn, Space, j

Go to next slide

Home

Go to first slide

End

Go to last slide

Number + Return

Go to specific slide

b / m / f

Toggle blackout / mirrored / fullscreen mode

Clone slideshow

Toggle presenter mode

Restart the presentation timer

?, h

Toggle this help

Wrangling data with dplyr

2020-08-22

The main verbs of dplyr

select()

filter()

mutate()

arrange()

summarize()

group_by()

The main verbs of dplyr

select() = Subset columns (variables)

filter()

mutate()

arrange()

summarize()

group_by()

select()

select()

select()

select()

select()

gapminder

Your turn 1

Alter the code to select just the pop column:

Your Turn 1

Show of Hands

Which of these is NOT a way to select the country and continent columns together?

Show of Hands

Which of these is NOT a way to select the country and continent columns together?

The main verbs of dplyr

select()

filter() = Subset rows by value

mutate()

arrange()

summarize()

group_by()

filter()

filter()

Predicates: TRUE or FALSE statements

filter()

Predicates: TRUE or FALSE statements

Comparisons: >, >=, <, <=, != (not equal), and == (equal).

filter()

Predicates: TRUE or FALSE statements

Comparisons: >, >=, <, <=, != (not equal), and == (equal).

Operators: & is "and", | is "or", and ! is "not"

filter()

Predicates: TRUE or FALSE statements

Comparisons: >, >=, <, <=, != (not equal), and == (equal).

Operators: & is "and", | is "or", and ! is "not"

%in%

filter()

Predicates: TRUE or FALSE statements

Comparisons: >, >=, <, <=, != (not equal), and == (equal).

Operators: & is "and", | is "or", and ! is "not"

%in%

filter()

filter()

Your turn 2

Show:

All of the rows where pop is greater than or equal to 100000

All of the rows for El Salvador

All of the rows that have a missing value for year (no need to edit this code)

Your turn 2

Show:

All of the rows where pop is greater than or equal to 100000

All of the rows for El Salvador

All of the rows that have a missing value for year (no need to edit this code)

filter()

Your turn 3

Use Boolean operators to alter the code below to return only the rows that contain:

El Salvador

Countries that had populations over 100000 in 1960 or earlier

Your turn 3

Use Boolean operators to alter the code below to return only the rows that contain:

El Salvador

Countries that had populations over 100000 in 1960 or earlier

The main verbs of dplyr

select()

filter()

The main verbs of `dplyr`

`select()`

`filter()`

`mutate()`

`arrange()`

`summarize()`

`group_by()`

The main verbs of `dplyr`

`select()` = Subset columns (variables)

`filter()`

`mutate()`

`arrange()`

`summarize()`

`group_by()`

`select()`

`select()`

`select()`

`select()`

`select()`

Alter the code to select just the `pop` column:

Which of these is NOT a way to select the `country` and `continent` columns together?

Which of these is NOT a way to select the `country` and `continent` columns together?

The main verbs of `dplyr`

`select()`

`filter()` = Subset rows by value

`mutate()`

`arrange()`

`summarize()`

`group_by()`

`filter()`

`filter()`

Predicates: `TRUE` or `FALSE` statements

`filter()`

Predicates: `TRUE` or `FALSE` statements

Comparisons: `>`, `>=`, `<`, `<=`, `!=` (not equal), and `==` (equal).

`filter()`

Predicates: `TRUE` or `FALSE` statements

Comparisons: `>`, `>=`, `<`, `<=`, `!=` (not equal), and `==` (equal).

Operators: `&` is "and", `|` is "or", and `!` is "not"

`filter()`

Predicates: `TRUE` or `FALSE` statements

Comparisons: `>`, `>=`, `<`, `<=`, `!=` (not equal), and `==` (equal).

Operators: `&` is "and", `|` is "or", and `!` is "not"

`%in%`

`filter()`

Predicates: `TRUE` or `FALSE` statements

Comparisons: `>`, `>=`, `<`, `<=`, `!=` (not equal), and `==` (equal).

Operators: `&` is "and", `|` is "or", and `!` is "not"

`%in%`

`filter()`

`filter()`

All of the rows where `pop` is greater than or equal to 100000

All of the rows that have a missing value for `year` (no need to edit this code)

All of the rows where `pop` is greater than or equal to 100000

All of the rows that have a missing value for `year` (no need to edit this code)

`filter()`

The main verbs of `dplyr`

`select()`

`filter()`

`mutate()` = Change or add a variable

`arrange()`

`summarize()`

`group_by()`

`mutate()`

`mutate()`

`mutate()`

`mutate(across())`

`mutate(across(where()))`

The main verbs of `dplyr`

`select()`

`filter()`

`mutate()`

`arrange()` = Sort the data set

`summarize()`

`group_by()`

`arrange()`

`arrange()`

`arrange()`

`desc()`

Arrange gapminder by `year`. Add `lifeExp` as a second (tie breaking) variable to arrange on.