+ - 0:00:00
Notes for current slide
Notes for next slide

Wrangling data with dplyr

2020-08-22

1 / 75

Art by Allison Horst

2 / 75

The main verbs of dplyr

select()

filter()

mutate()

arrange()

summarize()

group_by()

3 / 75

The main verbs of dplyr

select() = Subset columns (variables)

filter()

mutate()

arrange()

summarize()

group_by()

4 / 75

select()

select(<DATA>, <VARIABLES>)
5 / 75

select()

select(<DATA>, <VARIABLES>)
diamonds
## # A tibble: 53,940 x 10
## carat cut color clarity depth table price x y
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35
## 6 0.24 Very G… J VVS2 62.8 57 336 3.94 3.96
## 7 0.24 Very G… I VVS1 62.3 57 336 3.95 3.98
## 8 0.26 Very G… H SI1 61.9 55 337 4.07 4.11
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78
## 10 0.23 Very G… H VS1 59.4 61 338 4 4.05
## # … with 53,930 more rows, and 1 more variable: z <dbl>
6 / 75

7 / 75

select()

select(diamonds, carat, cut, color, clarity)
8 / 75

select()

select(diamonds, carat, cut, color, clarity)
## # A tibble: 53,940 x 4
## carat cut color clarity
## <dbl> <ord> <ord> <ord>
## 1 0.23 Ideal E SI2
## 2 0.21 Premium E SI1
## 3 0.23 Good E VS1
## 4 0.290 Premium I VS2
## 5 0.31 Good J SI2
## 6 0.24 Very Good J VVS2
## 7 0.24 Very Good I VVS1
## 8 0.26 Very Good H SI1
## 9 0.22 Fair E VS2
## 10 0.23 Very Good H VS1
## # … with 53,930 more rows
9 / 75

select()

select(diamonds, carat, cut, color, clarity)
select(diamonds, carat:clarity)
select(diamonds, 1:4)
select(diamonds, starts_with("c"))
?select_helpers
10 / 75

gapminder

library(gapminder)
gapminder
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # … with 1,694 more rows
11 / 75

12 / 75

Your turn 1

Alter the code to select just the pop column:

select(gapminder, year, lifeExp)
13 / 75

Your Turn 1

select(gapminder, pop)
## # A tibble: 1,704 x 1
## pop
## <int>
## 1 8425333
## 2 9240934
## 3 10267083
## 4 11537966
## 5 13079460
## 6 14880372
## 7 12881816
## 8 13867957
## 9 16317921
## 10 22227415
## # … with 1,694 more rows
14 / 75

Show of Hands

Which of these is NOT a way to select the country and continent columns together?

select(gapminder, -c(year, lifeExp, pop, gdpPercap))
select(gapminder, country:continent)
select(gapminder, starts_with("c"))
select(gapminder, ends_with("t"))
15 / 75

Show of Hands

Which of these is NOT a way to select the country and continent columns together?

select(gapminder, ends_with("t"))
## # A tibble: 1,704 x 1
## continent
## <fct>
## 1 Asia
## 2 Asia
## 3 Asia
## 4 Asia
## 5 Asia
## 6 Asia
## 7 Asia
## 8 Asia
## 9 Asia
## 10 Asia
## # … with 1,694 more rows
16 / 75

The main verbs of dplyr

select()

filter() = Subset rows by value

mutate()

arrange()

summarize()

group_by()

17 / 75

filter()

filter(<DATA>, <PREDICATES>)
18 / 75

filter()

filter(<DATA>, <PREDICATES>)

Predicates: TRUE or FALSE statements

18 / 75

filter()

filter(<DATA>, <PREDICATES>)

Predicates: TRUE or FALSE statements

Comparisons: >, >=, <, <=, != (not equal), and == (equal).

19 / 75

filter()

filter(<DATA>, <PREDICATES>)

Predicates: TRUE or FALSE statements

Comparisons: >, >=, <, <=, != (not equal), and == (equal).

Operators: & is "and", | is "or", and ! is "not"

20 / 75

filter()

filter(<DATA>, <PREDICATES>)

Predicates: TRUE or FALSE statements

Comparisons: >, >=, <, <=, != (not equal), and == (equal).

Operators: & is "and", | is "or", and ! is "not"

%in%

21 / 75

filter()

filter(<DATA>, <PREDICATES>)

Predicates: TRUE or FALSE statements

Comparisons: >, >=, <, <=, != (not equal), and == (equal).

Operators: & is "and", | is "or", and ! is "not"

%in%

"a" %in% c("a", "b", "c")
## [1] TRUE
21 / 75

filter()

filter(diamonds, cut == "Ideal", carat > 3)
22 / 75

filter()

filter(diamonds, cut == "Ideal", carat > 3)
## # A tibble: 4 x 10
## carat cut color clarity depth table price x y
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl>
## 1 3.22 Ideal I I1 62.6 55 12545 9.49 9.42
## 2 3.5 Ideal H I1 62.8 57 12587 9.65 9.59
## 3 3.01 Ideal J SI2 61.7 58 16037 9.25 9.2
## 4 3.01 Ideal J I1 65.4 60 16538 8.99 8.93
## # … with 1 more variable: z <dbl>
23 / 75

Your turn 2

Show:

All of the rows where pop is greater than or equal to 100000

All of the rows for El Salvador

All of the rows that have a missing value for year (no need to edit this code)

24 / 75

Your turn 2

Show:

All of the rows where pop is greater than or equal to 100000

All of the rows for El Salvador

All of the rows that have a missing value for year (no need to edit this code)

filter(gapminder, pop >= 100000)
filter(gapminder, country == "El Salvador")
filter(gapminder, is.na(year))
25 / 75

filter()

filter(diamonds, cut == "Ideal" | cut == "Very Good", carat > 3)
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl>
## 1 3.22 Ideal I I1 62.6 55 12545 9.49 9.42
## 2 3.5 Ideal H I1 62.8 57 12587 9.65 9.59
## 3 3.04 Very Go… I SI2 63.2 59 15354 9.14 9.07
## 4 4 Very Go… I I1 63.3 58 15984 10.0 9.94
## 5 3.01 Ideal J SI2 61.7 58 16037 9.25 9.2
## 6 3.01 Ideal J I1 65.4 60 16538 8.99 8.93
## # … with 1 more variable: z <dbl>
26 / 75

Your turn 3

Use Boolean operators to alter the code below to return only the rows that contain:

El Salvador

Countries that had populations over 100000 in 1960 or earlier

filter(gapminder, country == "El Salvador" | country == "Oman")
filter(______, _______)
27 / 75

Your turn 3

Use Boolean operators to alter the code below to return only the rows that contain:

El Salvador

Countries that had populations over 100000 in 1960 or earlier

filter(gapminder, country == "El Salvador")
filter(gapminder, pop > 100000, year <= 1960)
28 / 75

The main verbs of dplyr

select()

filter()

mutate() = Change or add a variable

arrange()

summarize()

group_by()

29 / 75

mutate()

mutate(<DATA>, <NAME> = <FUNCTION>)
30 / 75

mutate()

mutate(diamonds, log_price = log(price), log_pricesq = log_price^2)
31 / 75

mutate()

mutate(diamonds, log_price = log(price), log_pricesq = log_price^2)
## # A tibble: 53,940 x 12
## carat cut color clarity depth table price x y
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35
## 6 0.24 Very G… J VVS2 62.8 57 336 3.94 3.96
## 7 0.24 Very G… I VVS1 62.3 57 336 3.95 3.98
## 8 0.26 Very G… H SI1 61.9 55 337 4.07 4.11
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78
## 10 0.23 Very G… H VS1 59.4 61 338 4 4.05
## # … with 53,930 more rows, and 3 more variables: z <dbl>,
## # log_price <dbl>, log_pricesq <dbl>
32 / 75

mutate(across())

mutate(
<DATA>,
across(c(<VARIABLES>), list(<NAMES> = <FUNCTIONS>))
)
33 / 75
mutate(
diamonds,
across(c("carat", "depth"), list(sd = sd, mean = mean))
)
34 / 75
mutate(
diamonds,
across(c("carat", "depth"), list(sd = sd, mean = mean))
)
## # A tibble: 53,940 x 14
## carat cut color clarity depth table price x y
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35
## 6 0.24 Very G… J VVS2 62.8 57 336 3.94 3.96
## 7 0.24 Very G… I VVS1 62.3 57 336 3.95 3.98
## 8 0.26 Very G… H SI1 61.9 55 337 4.07 4.11
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78
## 10 0.23 Very G… H VS1 59.4 61 338 4 4.05
## # … with 53,930 more rows, and 5 more variables: z <dbl>,
## # carat_sd <dbl>, carat_mean <dbl>, depth_sd <dbl>,
## # depth_mean <dbl>
35 / 75

mutate(across(where()))

mutate(
<DATA>,
across(where(<CONDITION>), list(<NAMES> = <FUNCTIONS>))
)
36 / 75
mutate(
gapminder,
across(where(is.numeric), list(mean = mean, median = median))
)
37 / 75
mutate(
gapminder,
across(where(is.numeric), list(mean = mean, median = median))
)
## # A tibble: 1,704 x 14
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghan… Asia 1952 28.8 8.43e6 779.
## 2 Afghan… Asia 1957 30.3 9.24e6 821.
## 3 Afghan… Asia 1962 32.0 1.03e7 853.
## 4 Afghan… Asia 1967 34.0 1.15e7 836.
## 5 Afghan… Asia 1972 36.1 1.31e7 740.
## 6 Afghan… Asia 1977 38.4 1.49e7 786.
## 7 Afghan… Asia 1982 39.9 1.29e7 978.
## 8 Afghan… Asia 1987 40.8 1.39e7 852.
## 9 Afghan… Asia 1992 41.7 1.63e7 649.
## 10 Afghan… Asia 1997 41.8 2.22e7 635.
## # … with 1,694 more rows, and 8 more variables:
## # year_mean <dbl>, year_median <dbl>, lifeExp_mean <dbl>,
## # lifeExp_median <dbl>, pop_mean <dbl>, pop_median <dbl>,
## # …
38 / 75

The main verbs of dplyr

select()

filter()

mutate()

arrange() = Sort the data set

summarize()

group_by()

39 / 75

arrange()

arrange(<DATA>, <SORTING VARIABLE>)
40 / 75

arrange()

arrange(diamonds, price)
## # A tibble: 53,940 x 10
## carat cut color clarity depth table price x y
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35
## 6 0.24 Very G… J VVS2 62.8 57 336 3.94 3.96
## 7 0.24 Very G… I VVS1 62.3 57 336 3.95 3.98
## 8 0.26 Very G… H SI1 61.9 55 337 4.07 4.11
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78
## 10 0.23 Very G… H VS1 59.4 61 338 4 4.05
## # … with 53,930 more rows, and 1 more variable: z <dbl>
41 / 75

arrange()

arrange(diamonds, cut, price)
## # A tibble: 53,940 x 10
## carat cut color clarity depth table price x y
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl>
## 1 0.22 Fair E VS2 65.1 61 337 3.87 3.78
## 2 0.25 Fair E VS1 55.2 64 361 4.21 4.23
## 3 0.23 Fair G VVS2 61.4 66 369 3.87 3.91
## 4 0.27 Fair E VS1 66.4 58 371 3.99 4.02
## 5 0.3 Fair J VS2 64.8 58 416 4.24 4.16
## 6 0.3 Fair F SI1 63.1 58 496 4.3 4.22
## 7 0.34 Fair J SI1 64.5 57 497 4.38 4.36
## 8 0.37 Fair F SI1 65.3 56 527 4.53 4.47
## 9 0.3 Fair D SI2 64.6 54 536 4.29 4.25
## 10 0.25 Fair D VS1 61.2 55 563 4.09 4.11
## # … with 53,930 more rows, and 1 more variable: z <dbl>
42 / 75

desc()

arrange(diamonds, cut, desc(price))
## # A tibble: 53,940 x 10
## carat cut color clarity depth table price x y
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl>
## 1 2.01 Fair G SI1 70.6 64 18574 7.43 6.64
## 2 2.02 Fair H VS2 64.5 57 18565 8 7.95
## 3 4.5 Fair J I1 65.8 58 18531 10.2 10.2
## 4 2 Fair G VS2 67.6 58 18515 7.65 7.61
## 5 2.51 Fair H SI2 64.7 57 18308 8.44 8.5
## 6 3.01 Fair I SI2 65.8 56 18242 8.99 8.94
## 7 3.01 Fair I SI2 65.8 56 18242 8.99 8.94
## 8 2.32 Fair H SI1 62 62 18026 8.47 8.31
## 9 5.01 Fair J I1 65.5 59 18018 10.7 10.5
## 10 1.93 Fair F VS1 58.9 62 17995 8.17 7.97
## # … with 53,930 more rows, and 1 more variable: z <dbl>
43 / 75

Your turn 4

Arrange gapminder by year. Add lifeExp as a second (tie breaking) variable to arrange on.

Which country had the lowest life expectancy in 1952?

44 / 75

Your turn 4

arrange(gapminder, year, lifeExp)
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Gambia Africa 1952 30 284320 485.
## 3 Angola Africa 1952 30.0 4232095 3521.
## 4 Sierra Leone Africa 1952 30.3 2143249 880.
## 5 Mozambique Africa 1952 31.3 6446316 469.
## 6 Burkina Faso Africa 1952 32.0 4469979 543.
## 7 Guinea-Bissau Africa 1952 32.5 580653 300.
## 8 Yemen, Rep. Asia 1952 32.5 4963829 782.
## 9 Somalia Africa 1952 33.0 2526994 1136.
## 10 Guinea Africa 1952 33.6 2664249 510.
## # … with 1,694 more rows
45 / 75

Your turn 5

Use desc() to find the country with the highest gdpPercap.

46 / 75

Your turn 5

arrange(gapminder, desc(gdpPercap))
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Kuwait Asia 1957 58.0 212846 113523.
## 2 Kuwait Asia 1972 67.7 841934 109348.
## 3 Kuwait Asia 1952 55.6 160000 108382.
## 4 Kuwait Asia 1962 60.5 358266 95458.
## 5 Kuwait Asia 1967 64.6 575003 80895.
## 6 Kuwait Asia 1977 69.3 1140357 59265.
## 7 Norway Europe 2007 80.2 4627926 49357.
## 8 Kuwait Asia 2007 77.6 2505559 47307.
## 9 Singapore Asia 2007 80.0 4553009 47143.
## 10 Norway Europe 2002 79.0 4535591 44684.
## # … with 1,694 more rows
47 / 75

Detour: The Pipe

%>%

Passes the result on one function to another function

48 / 75

Detour: The Pipe

diamonds <- arrange(diamonds, price)
diamonds <- filter(diamonds, price > 300)
diamonds <- mutate(diamonds, log_price = log(price))
diamonds
49 / 75

Detour: The Pipe

diamonds <- diamonds %>%
arrange(price) %>%
filter(price > 300) %>%
mutate(log_price = log(price))
diamonds
50 / 75

Keyboard shortcuts

Insert <- with alt/opt + -

51 / 75

Keyboard shortcuts

Insert <- with alt/opt + -

Insert %>% with ctrl/cmd + shift + m

51 / 75

Your turn 6

Use %>% to write a sequence of functions that:

1. Filter only countries that are in the continent of Oceania.

2. Select the country, year and lifeExp columns

3. Arrange the results so that the highest life expetency is at the top.

52 / 75

Your turn 6

gapminder %>%
filter(continent == "Oceania") %>%
select(country, year, lifeExp) %>%
arrange(desc(lifeExp))
## # A tibble: 24 x 3
## country year lifeExp
## <fct> <int> <dbl>
## 1 Australia 2007 81.2
## 2 Australia 2002 80.4
## 3 New Zealand 2007 80.2
## 4 New Zealand 2002 79.1
## 5 Australia 1997 78.8
## 6 Australia 1992 77.6
## 7 New Zealand 1997 77.6
## 8 New Zealand 1992 76.3
## 9 Australia 1987 76.3
## 10 Australia 1982 74.7
## # … with 14 more rows
53 / 75

Challenge!

1. Import the diabetes data from the importing data. A copy of the CSV file is available in this folder.

2. Add the variable bmi to the data set using height and weight using the formula: (weight / height^2) * 703

3. Select just id, glyhb, and the new variable you created.

4. Filter rows that have BMI > 35. How many rows and columns are in your new data set?

54 / 75
diabetes <- read_csv("diabetes.csv")
diabetes %>%
mutate(bmi = (weight / height^2) * 703) %>%
select(id, glyhb, bmi) %>%
filter(bmi > 35)
55 / 75
diabetes <- read_csv("diabetes.csv")
diabetes %>%
mutate(bmi = (weight / height^2) * 703) %>%
select(id, glyhb, bmi) %>%
filter(bmi > 35)
## # A tibble: 61 x 3
## id glyhb bmi
## <dbl> <dbl> <dbl>
## 1 1001 4.44 37.4
## 2 1002 4.64 48.4
## 3 1022 5.78 35.8
## 4 1029 4.97 40.8
## 5 1253 4.67 36.0
## 6 1254 12.7 42.5
## 7 1280 5.10 38.3
## 8 1501 4.41 40.0
## 9 2753 5.57 35.3
## 10 2757 6.33 35.3
## # … with 51 more rows
56 / 75

The main verbs of dplyr

select()

filter()

mutate()

arrange()

summarize() = Summarize the data

group_by() = Group the data

57 / 75

summarize()

summarize(<DATA>, <NAME> = <FUNCTION>)
58 / 75

summarize()

summarize(diamonds, n = n(), mean_price = mean(price))
## # A tibble: 1 x 2
## n mean_price
## <int> <dbl>
## 1 53940 3933.
59 / 75

Your turn 7

Use summarise() to compute three statistics about the gapminder data set:

1. The first (min()) year in the data

2. The last (max()) year in the data

3. The total number of observations (n()) and the total number of unique countries in the data (n_distinct())

60 / 75

Your turn 7

gapminder %>%
summarize(
first = min(year),
last = max(year),
n = n(),
n_countries = n_distinct(country)
)
## # A tibble: 1 x 4
## first last n n_countries
## <int> <int> <int> <int>
## 1 1952 2007 1704 142
61 / 75

group_by()

group_by(<DATA>, <VARIABLE>)
62 / 75

group_by()

diamonds %>%
group_by(cut)
63 / 75

group_by()

diamonds %>%
group_by(cut)
## # A tibble: 53,940 x 10
## # Groups: cut [5]
## carat cut color clarity depth table price x y
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35
## 6 0.24 Very G… J VVS2 62.8 57 336 3.94 3.96
## 7 0.24 Very G… I VVS1 62.3 57 336 3.95 3.98
## 8 0.26 Very G… H SI1 61.9 55 337 4.07 4.11
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78
## 10 0.23 Very G… H VS1 59.4 61 338 4 4.05
## # … with 53,930 more rows, and 1 more variable: z <dbl>
64 / 75

group_by()

diamonds %>%
group_by(cut) %>%
summarize(n = n(), mean_price = mean(price))
65 / 75

group_by()

diamonds %>%
group_by(cut) %>%
summarize(n = n(), mean_price = mean(price))
## # A tibble: 5 x 3
## cut n mean_price
## <ord> <int> <dbl>
## 1 Fair 1610 4359.
## 2 Good 4906 3929.
## 3 Very Good 12082 3982.
## 4 Premium 13791 4584.
## 5 Ideal 21551 3458.
66 / 75

group_by()

diamonds %>%
group_by(cut) %>%
mutate(n = n(), mean_price = mean(price))
67 / 75

group_by()

diamonds %>%
group_by(cut) %>%
mutate(n = n(), mean_price = mean(price))
## # A tibble: 53,940 x 12
## # Groups: cut [5]
## carat cut color clarity depth table price x y
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35
## 6 0.24 Very G… J VVS2 62.8 57 336 3.94 3.96
## 7 0.24 Very G… I VVS1 62.3 57 336 3.95 3.98
## 8 0.26 Very G… H SI1 61.9 55 337 4.07 4.11
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78
## 10 0.23 Very G… H VS1 59.4 61 338 4 4.05
## # … with 53,930 more rows, and 3 more variables: z <dbl>,
## # n <int>, mean_price <dbl>
68 / 75

Your turn 8

Extract the rows where continent == "Europe". Then use group_by() to group by country. Finally, use summarize() to compute:

1. The total number of observations for each country in Europe

2. The lowest observed life expectancy for each country

69 / 75

Your turn 8

gapminder %>%
filter(continent == "Europe") %>%
group_by(country) %>%
summarize(n = n(), min_le = min(lifeExp))
## # A tibble: 30 x 3
## country n min_le
## <fct> <int> <dbl>
## 1 Albania 12 55.2
## 2 Austria 12 66.8
## 3 Belgium 12 68
## 4 Bosnia and Herzegovina 12 53.8
## 5 Bulgaria 12 59.6
## 6 Croatia 12 61.2
## 7 Czech Republic 12 66.9
## 8 Denmark 12 70.8
## 9 Finland 12 66.6
## 10 France 12 67.4
## # … with 20 more rows
70 / 75

Your turn 9

Use grouping to calculate the mean life expectancy for each continent and year. Call the mean life expectancy variable mean_le. Plot the life expectancy over time (no need to change the plot code).

gapminder %>%
__________ %>%
__________ %>%
ggplot(aes(x = year, y = mean_le, col = continent)) +
geom_line() +
scale_color_manual(values = continent_colors)
71 / 75

Your turn 9

Use grouping to calculate the mean life expectancy for each continent and year. Call the mean life expectancy variable mean_le. Plot the life expectancy over time (no need to change the plot code).

gapminder %>%
group_by(continent, year) %>%
summarize(mean_le = mean(lifeExp)) %>%
ggplot(aes(x = year, y = mean_le, col = continent)) +
geom_line() +
scale_color_manual(values = continent_colors)
72 / 75

73 / 75

Joining data

Use left_join(), right_join(), full_join(), or inner_join() to join datasets

Use semi_join() or anti_join() to filter datasets against each other

74 / 75

Resources

R for Data Science: A comprehensive but friendly introduction to the tidyverse. Free online.

RStudio Primers: Free interactive courses in the Tidyverse

10 dplyr tips: a Twitter thread on other useful aspects of dplyr

75 / 75

Art by Allison Horst

2 / 75
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow