Lesson 3 of 7 · Course overview
Most data analysis is the same six things, over and over: load data, filter rows, pick columns, derive new columns, group, summarise. The tidyverse is a family of R packages designed around these operations, and once you learn the vocabulary it becomes second-nature.
In this lesson we’ll work with the built-in mtcars
dataset (32 cars, fuel efficiency stats from a 1974 issue of Motor
Trend). Tiny, but enough to show every concept.
Install the tidyverse if you haven’t already:
install.packages("tidyverse")
And load it:
library(dplyr)
library(tidyr)
library(readr)
We’re loading three packages: dplyr (the manipulation
verbs), tidyr (reshaping), and readr (fast CSV
reading). library(tidyverse) would load all of these and a
few more in one shot.
library() vs ::
You can also call any function with package::function()
— e.g. dplyr::filter(mtcars, mpg > 25) — without ever
calling library(). That’s what production R code typically
does, because it’s explicit about where every function comes from. For
learning, library() is fine and what most tutorials do.
For most analyses you’ll start with a CSV.
readr::read_csv() reads a CSV into a tibble (a
tidyverse-flavored data frame):
my_data <- read_csv("path/to/file.csv")
read_csv() accepts a URL too, which is great for
tutorials and reproducible examples:
penguins <- read_csv(
"https://raw.githubusercontent.com/allisonhorst/palmerpenguins/main/inst/extdata/penguins.csv"
)
For Excel files use readxl::read_excel(). For databases
use DBI plus a driver package. For R’s native binary format
use readRDS() and saveRDS().
For this lesson we’ll use the built-in mtcars. Let’s
convert it to a tibble and add a model column from the row
names:
cars <- mtcars |>
tibble::rownames_to_column("model") |>
tibble::as_tibble()
cars
#> # A tibble: 32 × 12
#> model mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 Mazda RX4 … 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 Hornet 4 D… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 Hornet Spo… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ℹ 22 more rows
A tibble prints nicer than a base data frame (only the first 10 rows by default, with column types in the header) and is otherwise mostly the same thing.
You just saw |>. That’s R’s pipe
operator, and it’s what makes tidyverse code readable. It takes
the thing on its left and feeds it to the function on its right as the
first argument:
mean(c(1, 2, 3))
# is equivalent to
c(1, 2, 3) |> mean()
By itself that’s a wash. The win is when you chain operations:
# without the pipe — read inside-out:
arrange(filter(select(cars, model, mpg, cyl), cyl == 4), desc(mpg))
# with the pipe — read top-to-bottom:
cars |>
select(model, mpg, cyl) |>
filter(cyl == 4) |>
arrange(desc(mpg))
The pipe version reads almost like English: “take cars, select these columns, filter to 4-cylinder cars, sort by mpg descending.”
RStudio has a built-in shortcut for the pipe: Cmd/Ctrl + Shift + M. Use it. You’ll type the pipe thousands of times.
You may also see %>% in older code — that’s the
original pipe from the magrittr package. They’re nearly
equivalent; the new built-in |> is the modern
default.
These are the verbs you’ll use 95% of the time. Each takes a data frame and returns a data frame.
filter() — keep rowscars |> filter(cyl == 6)
#> # A tibble: 7 × 12
#> model mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 Mazda RX4 W… 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 Hornet 4 Dr… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 4 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 5 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> 6 Merc 280C 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
#> 7 Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
cars |> filter(mpg > 25, hp < 100)
#> # A tibble: 5 × 12
#> model mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Fiat 128 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
#> 2 Honda Civic 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
#> 3 Toyota Coro… 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
#> 4 Fiat X1-9 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
#> 5 Porsche 914… 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
cars |> filter(cyl == 4 | cyl == 6)
#> # A tibble: 18 × 12
#> model mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 Mazda RX4 … 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 Hornet 4 D… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 6 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 7 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 8 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> 9 Merc 280C 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
#> 10 Fiat 128 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
#> 11 Honda Civic 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
#> 12 Toyota Cor… 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
#> 13 Toyota Cor… 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
#> 14 Fiat X1-9 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
#> 15 Porsche 91… 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
#> 16 Lotus Euro… 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
#> 17 Ferrari Di… 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
#> 18 Volvo 142E 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
Multiple conditions separated by commas are combined with
& (AND).
select() — pick columnscars |> select(model, mpg, cyl)
#> # A tibble: 32 × 3
#> model mpg cyl
#> <chr> <dbl> <dbl>
#> 1 Mazda RX4 21 6
#> 2 Mazda RX4 Wag 21 6
#> 3 Datsun 710 22.8 4
#> 4 Hornet 4 Drive 21.4 6
#> 5 Hornet Sportabout 18.7 8
#> 6 Valiant 18.1 6
#> 7 Duster 360 14.3 8
#> 8 Merc 240D 24.4 4
#> 9 Merc 230 22.8 4
#> 10 Merc 280 19.2 6
#> # ℹ 22 more rows
cars |> select(-disp, -hp) # drop columns
#> # A tibble: 32 × 10
#> model mpg cyl drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4 21 6 3.9 2.62 16.5 0 1 4 4
#> 2 Mazda RX4 Wag 21 6 3.9 2.88 17.0 0 1 4 4
#> 3 Datsun 710 22.8 4 3.85 2.32 18.6 1 1 4 1
#> 4 Hornet 4 Drive 21.4 6 3.08 3.22 19.4 1 0 3 1
#> 5 Hornet Sportabout 18.7 8 3.15 3.44 17.0 0 0 3 2
#> 6 Valiant 18.1 6 2.76 3.46 20.2 1 0 3 1
#> 7 Duster 360 14.3 8 3.21 3.57 15.8 0 0 3 4
#> 8 Merc 240D 24.4 4 3.69 3.19 20 1 0 4 2
#> 9 Merc 230 22.8 4 3.92 3.15 22.9 1 0 4 2
#> 10 Merc 280 19.2 6 3.92 3.44 18.3 1 0 4 4
#> # ℹ 22 more rows
cars |> select(model, mpg:hp) # range
#> # A tibble: 32 × 5
#> model mpg cyl disp hp
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4 21 6 160 110
#> 2 Mazda RX4 Wag 21 6 160 110
#> 3 Datsun 710 22.8 4 108 93
#> 4 Hornet 4 Drive 21.4 6 258 110
#> 5 Hornet Sportabout 18.7 8 360 175
#> 6 Valiant 18.1 6 225 105
#> 7 Duster 360 14.3 8 360 245
#> 8 Merc 240D 24.4 4 147. 62
#> 9 Merc 230 22.8 4 141. 95
#> 10 Merc 280 19.2 6 168. 123
#> # ℹ 22 more rows
cars |> select(starts_with("d")) # by name pattern
#> # A tibble: 32 × 2
#> disp drat
#> <dbl> <dbl>
#> 1 160 3.9
#> 2 160 3.9
#> 3 108 3.85
#> 4 258 3.08
#> 5 360 3.15
#> 6 225 2.76
#> 7 360 3.21
#> 8 147. 3.69
#> 9 141. 3.92
#> 10 168. 3.92
#> # ℹ 22 more rows
mutate() — create or change columnscars |>
mutate(
kpl = mpg * 0.425, # km per liter
weight_lbs = wt * 1000,
is_efficient = mpg > 25
) |>
select(model, mpg, kpl, weight_lbs, is_efficient)
#> # A tibble: 32 × 5
#> model mpg kpl weight_lbs is_efficient
#> <chr> <dbl> <dbl> <dbl> <lgl>
#> 1 Mazda RX4 21 8.92 2620 FALSE
#> 2 Mazda RX4 Wag 21 8.92 2875 FALSE
#> 3 Datsun 710 22.8 9.69 2320 FALSE
#> 4 Hornet 4 Drive 21.4 9.09 3215 FALSE
#> 5 Hornet Sportabout 18.7 7.95 3440 FALSE
#> 6 Valiant 18.1 7.69 3460 FALSE
#> 7 Duster 360 14.3 6.08 3570 FALSE
#> 8 Merc 240D 24.4 10.4 3190 FALSE
#> 9 Merc 230 22.8 9.69 3150 FALSE
#> 10 Merc 280 19.2 8.16 3440 FALSE
#> # ℹ 22 more rows
mutate() can use any R function — including ones you
write yourself.
arrange() — sort rowscars |> arrange(mpg) # ascending
#> # A tibble: 32 × 12
#> model mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Cadillac F… 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
#> 2 Lincoln Co… 10.4 8 460 215 3 5.42 17.8 0 0 3 4
#> 3 Camaro Z28 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
#> 4 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 5 Chrysler I… 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
#> 6 Maserati B… 15 8 301 335 3.54 3.57 14.6 0 1 5 8
#> 7 Merc 450SLC 15.2 8 276. 180 3.07 3.78 18 0 0 3 3
#> 8 AMC Javelin 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2
#> 9 Dodge Chal… 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2
#> 10 Ford Pante… 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
#> # ℹ 22 more rows
cars |> arrange(desc(mpg)) # descending
#> # A tibble: 32 × 12
#> model mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Toyota Cor… 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
#> 2 Fiat 128 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
#> 3 Honda Civic 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
#> 4 Lotus Euro… 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
#> 5 Fiat X1-9 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
#> 6 Porsche 91… 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
#> 7 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 8 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 Toyota Cor… 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
#> # ℹ 22 more rows
cars |> arrange(cyl, desc(mpg)) # multi-key
#> # A tibble: 32 × 12
#> model mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Toyota Cor… 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
#> 2 Fiat 128 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
#> 3 Honda Civic 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
#> 4 Lotus Euro… 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
#> 5 Fiat X1-9 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
#> 6 Porsche 91… 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
#> 7 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 8 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 Toyota Cor… 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
#> # ℹ 22 more rows
summarise() + group_by() — collapse
rowssummarise() reduces a data frame to a single row. By
itself it’s not that interesting:
cars |> summarise(mean_mpg = mean(mpg), n = n())
#> # A tibble: 1 × 2
#> mean_mpg n
#> <dbl> <int>
#> 1 20.1 32
But pair it with group_by() and you have a pivot
table:
cars |>
group_by(cyl) |>
summarise(
n = n(),
mean_mpg = mean(mpg),
sd_mpg = sd(mpg),
.groups = "drop"
)
#> # A tibble: 3 × 4
#> cyl n mean_mpg sd_mpg
#> <dbl> <int> <dbl> <dbl>
#> 1 4 11 26.7 4.51
#> 2 6 7 19.7 1.45
#> 3 8 14 15.1 2.56
This is one of the most-used patterns in all of data analysis. “For each group, compute X.”
You can group by multiple variables:
cars |>
group_by(cyl, am) |> # am: 0 = automatic, 1 = manual
summarise(mean_mpg = mean(mpg), n = n(), .groups = "drop")
#> # A tibble: 6 × 4
#> cyl am mean_mpg n
#> <dbl> <dbl> <dbl> <int>
#> 1 4 0 22.9 3
#> 2 4 1 28.1 8
#> 3 6 0 19.1 4
#> 4 6 1 20.6 3
#> 5 8 0 15.0 12
#> 6 8 1 15.4 2
Suppose you want to know: for cars with more than 100 horsepower, what’s the average MPG by number of cylinders, sorted from worst to best?
cars |>
filter(hp > 100) |>
group_by(cyl) |>
summarise(
mean_mpg = mean(mpg),
n = n(),
.groups = "drop"
) |>
arrange(mean_mpg)
#> # A tibble: 3 × 3
#> cyl mean_mpg n
#> <dbl> <dbl> <int>
#> 1 8 15.1 14
#> 2 6 19.7 7
#> 3 4 25.9 2
That’s the whole answer in 6 lines. No loops, no manual indexing, no temp variables. This is why people like the tidyverse.
count() — quick frequency tablecars |> count(cyl)
#> # A tibble: 3 × 2
#> cyl n
#> <dbl> <int>
#> 1 4 11
#> 2 6 7
#> 3 8 14
cars |> count(cyl, sort = TRUE)
#> # A tibble: 3 × 2
#> cyl n
#> <dbl> <int>
#> 1 8 14
#> 2 4 11
#> 3 6 7
cars |> count(cyl, am)
#> # A tibble: 6 × 3
#> cyl am n
#> <dbl> <dbl> <int>
#> 1 4 0 3
#> 2 4 1 8
#> 3 6 0 4
#> 4 6 1 3
#> 5 8 0 12
#> 6 8 1 2
case_when() — multi-branch ifInside mutate(), case_when() is the right
tool for “if-elseif-else” logic. Each ~ line is “condition
~ value”:
cars |>
mutate(
mpg_class = case_when(
mpg < 15 ~ "thirsty",
mpg < 22 ~ "average",
mpg < 28 ~ "good",
TRUE ~ "great"
)
) |>
count(mpg_class)
#> # A tibble: 4 × 2
#> mpg_class n
#> <chr> <int>
#> 1 average 18
#> 2 good 5
#> 3 great 4
#> 4 thirsty 5
The final TRUE ~ ... is the catch-all “else.”
slice_*() — top/bottom/random rowscars |> slice_max(mpg, n = 3) # 3 highest mpg
#> # A tibble: 4 × 12
#> model mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Toyota Coro… 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
#> 2 Fiat 128 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
#> 3 Honda Civic 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
#> 4 Lotus Europa 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
cars |> slice_min(hp, n = 3) # 3 lowest hp
#> # A tibble: 3 × 12
#> model mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Honda Civic 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
#> 2 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 3 Toyota Coro… 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
cars |> slice_sample(n = 3) # 3 random rows
#> # A tibble: 3 × 12
#> model mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
#> 2 Lincoln Con… 10.4 8 460 215 3 5.42 17.8 0 0 3 4
#> 3 Merc 450SLC 15.2 8 276. 180 3.07 3.78 18 0 0 3 3
When you have two related tables, *_join() combines them
on a shared key:
classes <- tibble::tibble(
cyl = c(4, 6, 8),
class = c("compact", "midsize", "muscle")
)
cars |>
select(model, cyl, mpg) |>
left_join(classes, by = "cyl") |>
head()
#> # A tibble: 6 × 4
#> model cyl mpg class
#> <chr> <dbl> <dbl> <chr>
#> 1 Mazda RX4 6 21 midsize
#> 2 Mazda RX4 Wag 6 21 midsize
#> 3 Datsun 710 4 22.8 compact
#> 4 Hornet 4 Drive 6 21.4 midsize
#> 5 Hornet Sportabout 8 18.7 muscle
#> 6 Valiant 6 18.1 midsize
left_join() keeps every row from the left table and
pulls in matching rows from the right. There’s also
inner_join(), right_join(), and
full_join() for different behaviors when keys don’t match
on both sides.
tidyrSometimes data comes in wide format (one column per
category) when you need it long (one row per
observation), or vice versa. tidyr has
pivot_longer() and pivot_wider() for that.
wide <- tibble::tibble(
city = c("LA", "NYC", "Chicago"),
jan = c(58, 32, 25),
jul = c(83, 78, 76)
)
wide
#> # A tibble: 3 × 3
#> city jan jul
#> <chr> <dbl> <dbl>
#> 1 LA 58 83
#> 2 NYC 32 78
#> 3 Chicago 25 76
long <- wide |>
pivot_longer(cols = c(jan, jul), names_to = "month", values_to = "temp_f")
long
#> # A tibble: 6 × 3
#> city month temp_f
#> <chr> <chr> <dbl>
#> 1 LA jan 58
#> 2 LA jul 83
#> 3 NYC jan 32
#> 4 NYC jul 78
#> 5 Chicago jan 25
#> 6 Chicago jul 76
long |> pivot_wider(names_from = month, values_from = temp_f)
#> # A tibble: 3 × 3
#> city jan jul
#> <chr> <dbl> <dbl>
#> 1 LA 58 83
#> 2 NYC 32 78
#> 3 Chicago 25 76
pivot_longer() is the one you’ll reach for most — most
analysis, plotting, and modeling functions expect long data.
Here’s a more realistic mini-analysis. We’ll classify cars by efficiency, count how many fall into each class by transmission type, and report it as a clean wide table.
cars |>
mutate(
transmission = if_else(am == 1, "manual", "automatic"),
efficiency = case_when(
mpg < 18 ~ "low",
mpg < 24 ~ "medium",
TRUE ~ "high"
)
) |>
count(transmission, efficiency) |>
pivot_wider(names_from = efficiency, values_from = n, values_fill = 0)
#> # A tibble: 2 × 4
#> transmission high low medium
#> <chr> <int> <int> <int>
#> 1 automatic 1 11 7
#> 2 manual 6 2 5
The whole pipeline is one connected thought, and you can comment out any line to see the intermediate result.
Using cars, find the 5 cars with the worst MPG. Show
only the model, mpg, and hp
columns.
cars |>
arrange(mpg) |>
select(model, mpg, hp) |>
head(5)
#> # A tibble: 5 × 3
#> model mpg hp
#> <chr> <dbl> <dbl>
#> 1 Cadillac Fleetwood 10.4 205
#> 2 Lincoln Continental 10.4 215
#> 3 Camaro Z28 13.3 245
#> 4 Duster 360 14.3 245
#> 5 Chrysler Imperial 14.7 230
Or with slice_min:
cars |>
slice_min(mpg, n = 5) |>
select(model, mpg, hp)
#> # A tibble: 5 × 3
#> model mpg hp
#> <chr> <dbl> <dbl>
#> 1 Cadillac Fleetwood 10.4 205
#> 2 Lincoln Continental 10.4 215
#> 3 Camaro Z28 13.3 245
#> 4 Duster 360 14.3 245
#> 5 Chrysler Imperial 14.7 230
For each unique combination of cylinders (cyl) and gears
(gear), compute:
n)Sort the result with the highest mean MPG first.
cars |>
group_by(cyl, gear) |>
summarise(
n = n(),
mean_mpg = mean(mpg),
mean_hp = mean(hp),
.groups = "drop"
) |>
arrange(desc(mean_mpg))
#> # A tibble: 8 × 5
#> cyl gear n mean_mpg mean_hp
#> <dbl> <dbl> <int> <dbl> <dbl>
#> 1 4 5 2 28.2 102
#> 2 4 4 8 26.9 76
#> 3 4 3 1 21.5 97
#> 4 6 3 2 19.8 108.
#> 5 6 4 4 19.8 116.
#> 6 6 5 1 19.7 175
#> 7 8 5 2 15.4 300.
#> 8 8 3 12 15.0 194.
Add two new columns to cars:
power_to_weight — horsepower per 1000 lbs
(i.e. hp / wt).category — "sporty" if
power_to_weight > 50, otherwise
"normal".Then count the cars in each category.
cars |>
mutate(
power_to_weight = hp / wt,
category = if_else(power_to_weight > 50, "sporty", "normal")
) |>
count(category)
#> # A tibble: 2 × 2
#> category n
#> <chr> <int>
#> 1 normal 25
#> 2 sporty 7
You now have the toolkit to load, clean, and summarise data — easily
60% of any real analysis. Next up: making the results visible. Lesson 4
is about plotting with ggplot2, which pairs cleanly with
everything you just learned.