Lesson 3 of 7 · Course overview

Data Manipulation in R

Most data analysis is the same six things, over and over: load data, filter rows, pick columns, derive new columns, group, summarise. The tidyverse is a family of R packages designed around these operations, and once you learn the vocabulary it becomes second-nature.

In this lesson we’ll work with the built-in mtcars dataset (32 cars, fuel efficiency stats from a 1974 issue of Motor Trend). Tiny, but enough to show every concept.

Setup

Install the tidyverse if you haven’t already:

install.packages("tidyverse")

And load it:

library(dplyr)
library(tidyr)
library(readr)

We’re loading three packages: dplyr (the manipulation verbs), tidyr (reshaping), and readr (fast CSV reading). library(tidyverse) would load all of these and a few more in one shot.

📝 library() vs ::

You can also call any function with package::function() — e.g. dplyr::filter(mtcars, mpg > 25) — without ever calling library(). That’s what production R code typically does, because it’s explicit about where every function comes from. For learning, library() is fine and what most tutorials do.

Reading data

For most analyses you’ll start with a CSV. readr::read_csv() reads a CSV into a tibble (a tidyverse-flavored data frame):

my_data <- read_csv("path/to/file.csv")

read_csv() accepts a URL too, which is great for tutorials and reproducible examples:

penguins <- read_csv(
  "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/main/inst/extdata/penguins.csv"
)

For Excel files use readxl::read_excel(). For databases use DBI plus a driver package. For R’s native binary format use readRDS() and saveRDS().

For this lesson we’ll use the built-in mtcars. Let’s convert it to a tibble and add a model column from the row names:

cars <- mtcars |>
  tibble::rownames_to_column("model") |>
  tibble::as_tibble()
cars
#> # A tibble: 32 × 12
#>    model         mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Mazda RX4    21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2 Mazda RX4 …  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3 Datsun 710   22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4 Hornet 4 D…  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5 Hornet Spo…  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6 Valiant      18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7 Duster 360   14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8 Merc 240D    24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9 Merc 230     22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10 Merc 280     19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # ℹ 22 more rows

A tibble prints nicer than a base data frame (only the first 10 rows by default, with column types in the header) and is otherwise mostly the same thing.

The pipe

You just saw |>. That’s R’s pipe operator, and it’s what makes tidyverse code readable. It takes the thing on its left and feeds it to the function on its right as the first argument:

mean(c(1, 2, 3))
# is equivalent to
c(1, 2, 3) |> mean()

By itself that’s a wash. The win is when you chain operations:

# without the pipe — read inside-out:
arrange(filter(select(cars, model, mpg, cyl), cyl == 4), desc(mpg))

# with the pipe — read top-to-bottom:
cars |>
  select(model, mpg, cyl) |>
  filter(cyl == 4) |>
  arrange(desc(mpg))

The pipe version reads almost like English: “take cars, select these columns, filter to 4-cylinder cars, sort by mpg descending.”

💡 The pipe shortcut

RStudio has a built-in shortcut for the pipe: Cmd/Ctrl + Shift + M. Use it. You’ll type the pipe thousands of times.

You may also see %>% in older code — that’s the original pipe from the magrittr package. They’re nearly equivalent; the new built-in |> is the modern default.

The five core dplyr verbs

These are the verbs you’ll use 95% of the time. Each takes a data frame and returns a data frame.

filter() — keep rows

cars |> filter(cyl == 6)
#> # A tibble: 7 × 12
#>   model          mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4     21       6  160    110  3.9   2.62  16.5     0     1     4     4
#> 2 Mazda RX4 W…  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#> 3 Hornet 4 Dr…  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#> 4 Valiant       18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#> 5 Merc 280      19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> 6 Merc 280C     17.8     6  168.   123  3.92  3.44  18.9     1     0     4     4
#> 7 Ferrari Dino  19.7     6  145    175  3.62  2.77  15.5     0     1     5     6
cars |> filter(mpg > 25, hp < 100)
#> # A tibble: 5 × 12
#>   model          mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Fiat 128      32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
#> 2 Honda Civic   30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2
#> 3 Toyota Coro…  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
#> 4 Fiat X1-9     27.3     4  79      66  4.08  1.94  18.9     1     1     4     1
#> 5 Porsche 914…  26       4 120.     91  4.43  2.14  16.7     0     1     5     2
cars |> filter(cyl == 4 | cyl == 6)
#> # A tibble: 18 × 12
#>    model         mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Mazda RX4    21       6 160     110  3.9   2.62  16.5     0     1     4     4
#>  2 Mazda RX4 …  21       6 160     110  3.9   2.88  17.0     0     1     4     4
#>  3 Datsun 710   22.8     4 108      93  3.85  2.32  18.6     1     1     4     1
#>  4 Hornet 4 D…  21.4     6 258     110  3.08  3.22  19.4     1     0     3     1
#>  5 Valiant      18.1     6 225     105  2.76  3.46  20.2     1     0     3     1
#>  6 Merc 240D    24.4     4 147.     62  3.69  3.19  20       1     0     4     2
#>  7 Merc 230     22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2
#>  8 Merc 280     19.2     6 168.    123  3.92  3.44  18.3     1     0     4     4
#>  9 Merc 280C    17.8     6 168.    123  3.92  3.44  18.9     1     0     4     4
#> 10 Fiat 128     32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
#> 11 Honda Civic  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2
#> 12 Toyota Cor…  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
#> 13 Toyota Cor…  21.5     4 120.     97  3.7   2.46  20.0     1     0     3     1
#> 14 Fiat X1-9    27.3     4  79      66  4.08  1.94  18.9     1     1     4     1
#> 15 Porsche 91…  26       4 120.     91  4.43  2.14  16.7     0     1     5     2
#> 16 Lotus Euro…  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2
#> 17 Ferrari Di…  19.7     6 145     175  3.62  2.77  15.5     0     1     5     6
#> 18 Volvo 142E   21.4     4 121     109  4.11  2.78  18.6     1     1     4     2

Multiple conditions separated by commas are combined with & (AND).

select() — pick columns

cars |> select(model, mpg, cyl)
#> # A tibble: 32 × 3
#>    model               mpg   cyl
#>    <chr>             <dbl> <dbl>
#>  1 Mazda RX4          21       6
#>  2 Mazda RX4 Wag      21       6
#>  3 Datsun 710         22.8     4
#>  4 Hornet 4 Drive     21.4     6
#>  5 Hornet Sportabout  18.7     8
#>  6 Valiant            18.1     6
#>  7 Duster 360         14.3     8
#>  8 Merc 240D          24.4     4
#>  9 Merc 230           22.8     4
#> 10 Merc 280           19.2     6
#> # ℹ 22 more rows
cars |> select(-disp, -hp)              # drop columns
#> # A tibble: 32 × 10
#>    model               mpg   cyl  drat    wt  qsec    vs    am  gear  carb
#>    <chr>             <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Mazda RX4          21       6  3.9   2.62  16.5     0     1     4     4
#>  2 Mazda RX4 Wag      21       6  3.9   2.88  17.0     0     1     4     4
#>  3 Datsun 710         22.8     4  3.85  2.32  18.6     1     1     4     1
#>  4 Hornet 4 Drive     21.4     6  3.08  3.22  19.4     1     0     3     1
#>  5 Hornet Sportabout  18.7     8  3.15  3.44  17.0     0     0     3     2
#>  6 Valiant            18.1     6  2.76  3.46  20.2     1     0     3     1
#>  7 Duster 360         14.3     8  3.21  3.57  15.8     0     0     3     4
#>  8 Merc 240D          24.4     4  3.69  3.19  20       1     0     4     2
#>  9 Merc 230           22.8     4  3.92  3.15  22.9     1     0     4     2
#> 10 Merc 280           19.2     6  3.92  3.44  18.3     1     0     4     4
#> # ℹ 22 more rows
cars |> select(model, mpg:hp)           # range
#> # A tibble: 32 × 5
#>    model               mpg   cyl  disp    hp
#>    <chr>             <dbl> <dbl> <dbl> <dbl>
#>  1 Mazda RX4          21       6  160    110
#>  2 Mazda RX4 Wag      21       6  160    110
#>  3 Datsun 710         22.8     4  108     93
#>  4 Hornet 4 Drive     21.4     6  258    110
#>  5 Hornet Sportabout  18.7     8  360    175
#>  6 Valiant            18.1     6  225    105
#>  7 Duster 360         14.3     8  360    245
#>  8 Merc 240D          24.4     4  147.    62
#>  9 Merc 230           22.8     4  141.    95
#> 10 Merc 280           19.2     6  168.   123
#> # ℹ 22 more rows
cars |> select(starts_with("d"))        # by name pattern
#> # A tibble: 32 × 2
#>     disp  drat
#>    <dbl> <dbl>
#>  1  160   3.9 
#>  2  160   3.9 
#>  3  108   3.85
#>  4  258   3.08
#>  5  360   3.15
#>  6  225   2.76
#>  7  360   3.21
#>  8  147.  3.69
#>  9  141.  3.92
#> 10  168.  3.92
#> # ℹ 22 more rows

mutate() — create or change columns

cars |>
  mutate(
    kpl = mpg * 0.425,                  # km per liter
    weight_lbs = wt * 1000,
    is_efficient = mpg > 25
  ) |>
  select(model, mpg, kpl, weight_lbs, is_efficient)
#> # A tibble: 32 × 5
#>    model               mpg   kpl weight_lbs is_efficient
#>    <chr>             <dbl> <dbl>      <dbl> <lgl>       
#>  1 Mazda RX4          21    8.92       2620 FALSE       
#>  2 Mazda RX4 Wag      21    8.92       2875 FALSE       
#>  3 Datsun 710         22.8  9.69       2320 FALSE       
#>  4 Hornet 4 Drive     21.4  9.09       3215 FALSE       
#>  5 Hornet Sportabout  18.7  7.95       3440 FALSE       
#>  6 Valiant            18.1  7.69       3460 FALSE       
#>  7 Duster 360         14.3  6.08       3570 FALSE       
#>  8 Merc 240D          24.4 10.4        3190 FALSE       
#>  9 Merc 230           22.8  9.69       3150 FALSE       
#> 10 Merc 280           19.2  8.16       3440 FALSE       
#> # ℹ 22 more rows

mutate() can use any R function — including ones you write yourself.

arrange() — sort rows

cars |> arrange(mpg)              # ascending
#> # A tibble: 32 × 12
#>    model         mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Cadillac F…  10.4     8  472    205  2.93  5.25  18.0     0     0     3     4
#>  2 Lincoln Co…  10.4     8  460    215  3     5.42  17.8     0     0     3     4
#>  3 Camaro Z28   13.3     8  350    245  3.73  3.84  15.4     0     0     3     4
#>  4 Duster 360   14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  5 Chrysler I…  14.7     8  440    230  3.23  5.34  17.4     0     0     3     4
#>  6 Maserati B…  15       8  301    335  3.54  3.57  14.6     0     1     5     8
#>  7 Merc 450SLC  15.2     8  276.   180  3.07  3.78  18       0     0     3     3
#>  8 AMC Javelin  15.2     8  304    150  3.15  3.44  17.3     0     0     3     2
#>  9 Dodge Chal…  15.5     8  318    150  2.76  3.52  16.9     0     0     3     2
#> 10 Ford Pante…  15.8     8  351    264  4.22  3.17  14.5     0     1     5     4
#> # ℹ 22 more rows
cars |> arrange(desc(mpg))        # descending
#> # A tibble: 32 × 12
#>    model         mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Toyota Cor…  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
#>  2 Fiat 128     32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
#>  3 Honda Civic  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2
#>  4 Lotus Euro…  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2
#>  5 Fiat X1-9    27.3     4  79      66  4.08  1.94  18.9     1     1     4     1
#>  6 Porsche 91…  26       4 120.     91  4.43  2.14  16.7     0     1     5     2
#>  7 Merc 240D    24.4     4 147.     62  3.69  3.19  20       1     0     4     2
#>  8 Datsun 710   22.8     4 108      93  3.85  2.32  18.6     1     1     4     1
#>  9 Merc 230     22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2
#> 10 Toyota Cor…  21.5     4 120.     97  3.7   2.46  20.0     1     0     3     1
#> # ℹ 22 more rows
cars |> arrange(cyl, desc(mpg))   # multi-key
#> # A tibble: 32 × 12
#>    model         mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Toyota Cor…  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
#>  2 Fiat 128     32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
#>  3 Honda Civic  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2
#>  4 Lotus Euro…  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2
#>  5 Fiat X1-9    27.3     4  79      66  4.08  1.94  18.9     1     1     4     1
#>  6 Porsche 91…  26       4 120.     91  4.43  2.14  16.7     0     1     5     2
#>  7 Merc 240D    24.4     4 147.     62  3.69  3.19  20       1     0     4     2
#>  8 Datsun 710   22.8     4 108      93  3.85  2.32  18.6     1     1     4     1
#>  9 Merc 230     22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2
#> 10 Toyota Cor…  21.5     4 120.     97  3.7   2.46  20.0     1     0     3     1
#> # ℹ 22 more rows

summarise() + group_by() — collapse rows

summarise() reduces a data frame to a single row. By itself it’s not that interesting:

cars |> summarise(mean_mpg = mean(mpg), n = n())
#> # A tibble: 1 × 2
#>   mean_mpg     n
#>      <dbl> <int>
#> 1     20.1    32

But pair it with group_by() and you have a pivot table:

cars |>
  group_by(cyl) |>
  summarise(
    n = n(),
    mean_mpg = mean(mpg),
    sd_mpg = sd(mpg),
    .groups = "drop"
  )
#> # A tibble: 3 × 4
#>     cyl     n mean_mpg sd_mpg
#>   <dbl> <int>    <dbl>  <dbl>
#> 1     4    11     26.7   4.51
#> 2     6     7     19.7   1.45
#> 3     8    14     15.1   2.56

This is one of the most-used patterns in all of data analysis. “For each group, compute X.”

You can group by multiple variables:

cars |>
  group_by(cyl, am) |>     # am: 0 = automatic, 1 = manual
  summarise(mean_mpg = mean(mpg), n = n(), .groups = "drop")
#> # A tibble: 6 × 4
#>     cyl    am mean_mpg     n
#>   <dbl> <dbl>    <dbl> <int>
#> 1     4     0     22.9     3
#> 2     4     1     28.1     8
#> 3     6     0     19.1     4
#> 4     6     1     20.6     3
#> 5     8     0     15.0    12
#> 6     8     1     15.4     2

A recipe-style example

Suppose you want to know: for cars with more than 100 horsepower, what’s the average MPG by number of cylinders, sorted from worst to best?

cars |>
  filter(hp > 100) |>
  group_by(cyl) |>
  summarise(
    mean_mpg = mean(mpg),
    n = n(),
    .groups = "drop"
  ) |>
  arrange(mean_mpg)
#> # A tibble: 3 × 3
#>     cyl mean_mpg     n
#>   <dbl>    <dbl> <int>
#> 1     8     15.1    14
#> 2     6     19.7     7
#> 3     4     25.9     2

That’s the whole answer in 6 lines. No loops, no manual indexing, no temp variables. This is why people like the tidyverse.

A few more useful verbs

count() — quick frequency table

cars |> count(cyl)
#> # A tibble: 3 × 2
#>     cyl     n
#>   <dbl> <int>
#> 1     4    11
#> 2     6     7
#> 3     8    14
cars |> count(cyl, sort = TRUE)
#> # A tibble: 3 × 2
#>     cyl     n
#>   <dbl> <int>
#> 1     8    14
#> 2     4    11
#> 3     6     7
cars |> count(cyl, am)
#> # A tibble: 6 × 3
#>     cyl    am     n
#>   <dbl> <dbl> <int>
#> 1     4     0     3
#> 2     4     1     8
#> 3     6     0     4
#> 4     6     1     3
#> 5     8     0    12
#> 6     8     1     2

case_when() — multi-branch if

Inside mutate(), case_when() is the right tool for “if-elseif-else” logic. Each ~ line is “condition ~ value”:

cars |>
  mutate(
    mpg_class = case_when(
      mpg < 15  ~ "thirsty",
      mpg < 22  ~ "average",
      mpg < 28  ~ "good",
      TRUE      ~ "great"
    )
  ) |>
  count(mpg_class)
#> # A tibble: 4 × 2
#>   mpg_class     n
#>   <chr>     <int>
#> 1 average      18
#> 2 good          5
#> 3 great         4
#> 4 thirsty       5

The final TRUE ~ ... is the catch-all “else.”

slice_*() — top/bottom/random rows

cars |> slice_max(mpg, n = 3)         # 3 highest mpg
#> # A tibble: 4 × 12
#>   model          mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Toyota Coro…  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
#> 2 Fiat 128      32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
#> 3 Honda Civic   30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2
#> 4 Lotus Europa  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2
cars |> slice_min(hp, n = 3)          # 3 lowest hp
#> # A tibble: 3 × 12
#>   model          mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Honda Civic   30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2
#> 2 Merc 240D     24.4     4 147.     62  3.69  3.19  20       1     0     4     2
#> 3 Toyota Coro…  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
cars |> slice_sample(n = 3)           # 3 random rows
#> # A tibble: 3 × 12
#>   model          mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Ferrari Dino  19.7     6  145    175  3.62  2.77  15.5     0     1     5     6
#> 2 Lincoln Con…  10.4     8  460    215  3     5.42  17.8     0     0     3     4
#> 3 Merc 450SLC   15.2     8  276.   180  3.07  3.78  18       0     0     3     3

Joining tables

When you have two related tables, *_join() combines them on a shared key:

classes <- tibble::tibble(
  cyl = c(4, 6, 8),
  class = c("compact", "midsize", "muscle")
)

cars |>
  select(model, cyl, mpg) |>
  left_join(classes, by = "cyl") |>
  head()
#> # A tibble: 6 × 4
#>   model               cyl   mpg class  
#>   <chr>             <dbl> <dbl> <chr>  
#> 1 Mazda RX4             6  21   midsize
#> 2 Mazda RX4 Wag         6  21   midsize
#> 3 Datsun 710            4  22.8 compact
#> 4 Hornet 4 Drive        6  21.4 midsize
#> 5 Hornet Sportabout     8  18.7 muscle 
#> 6 Valiant               6  18.1 midsize

left_join() keeps every row from the left table and pulls in matching rows from the right. There’s also inner_join(), right_join(), and full_join() for different behaviors when keys don’t match on both sides.

Reshaping with tidyr

Sometimes data comes in wide format (one column per category) when you need it long (one row per observation), or vice versa. tidyr has pivot_longer() and pivot_wider() for that.

wide <- tibble::tibble(
  city = c("LA", "NYC", "Chicago"),
  jan = c(58, 32, 25),
  jul = c(83, 78, 76)
)
wide
#> # A tibble: 3 × 3
#>   city      jan   jul
#>   <chr>   <dbl> <dbl>
#> 1 LA         58    83
#> 2 NYC        32    78
#> 3 Chicago    25    76
long <- wide |>
  pivot_longer(cols = c(jan, jul), names_to = "month", values_to = "temp_f")
long
#> # A tibble: 6 × 3
#>   city    month temp_f
#>   <chr>   <chr>  <dbl>
#> 1 LA      jan       58
#> 2 LA      jul       83
#> 3 NYC     jan       32
#> 4 NYC     jul       78
#> 5 Chicago jan       25
#> 6 Chicago jul       76
long |> pivot_wider(names_from = month, values_from = temp_f)
#> # A tibble: 3 × 3
#>   city      jan   jul
#>   <chr>   <dbl> <dbl>
#> 1 LA         58    83
#> 2 NYC        32    78
#> 3 Chicago    25    76

pivot_longer() is the one you’ll reach for most — most analysis, plotting, and modeling functions expect long data.

Putting it together

Here’s a more realistic mini-analysis. We’ll classify cars by efficiency, count how many fall into each class by transmission type, and report it as a clean wide table.

cars |>
  mutate(
    transmission = if_else(am == 1, "manual", "automatic"),
    efficiency = case_when(
      mpg < 18 ~ "low",
      mpg < 24 ~ "medium",
      TRUE     ~ "high"
    )
  ) |>
  count(transmission, efficiency) |>
  pivot_wider(names_from = efficiency, values_from = n, values_fill = 0)
#> # A tibble: 2 × 4
#>   transmission  high   low medium
#>   <chr>        <int> <int>  <int>
#> 1 automatic        1    11      7
#> 2 manual           6     2      5

The whole pipeline is one connected thought, and you can comment out any line to see the intermediate result.

✏️ Exercise 3.1 — Filter and sort

Using cars, find the 5 cars with the worst MPG. Show only the model, mpg, and hp columns.

Show solution
cars |>
  arrange(mpg) |>
  select(model, mpg, hp) |>
  head(5)
#> # A tibble: 5 × 3
#>   model                 mpg    hp
#>   <chr>               <dbl> <dbl>
#> 1 Cadillac Fleetwood   10.4   205
#> 2 Lincoln Continental  10.4   215
#> 3 Camaro Z28           13.3   245
#> 4 Duster 360           14.3   245
#> 5 Chrysler Imperial    14.7   230

Or with slice_min:

cars |>
  slice_min(mpg, n = 5) |>
  select(model, mpg, hp)
#> # A tibble: 5 × 3
#>   model                 mpg    hp
#>   <chr>               <dbl> <dbl>
#> 1 Cadillac Fleetwood   10.4   205
#> 2 Lincoln Continental  10.4   215
#> 3 Camaro Z28           13.3   245
#> 4 Duster 360           14.3   245
#> 5 Chrysler Imperial    14.7   230
✏️ Exercise 3.2 — Group and summarise

For each unique combination of cylinders (cyl) and gears (gear), compute:

  • the number of cars (n)
  • the mean MPG
  • the mean horsepower

Sort the result with the highest mean MPG first.

Show solution
cars |>
  group_by(cyl, gear) |>
  summarise(
    n = n(),
    mean_mpg = mean(mpg),
    mean_hp = mean(hp),
    .groups = "drop"
  ) |>
  arrange(desc(mean_mpg))
#> # A tibble: 8 × 5
#>     cyl  gear     n mean_mpg mean_hp
#>   <dbl> <dbl> <int>    <dbl>   <dbl>
#> 1     4     5     2     28.2    102 
#> 2     4     4     8     26.9     76 
#> 3     4     3     1     21.5     97 
#> 4     6     3     2     19.8    108.
#> 5     6     4     4     19.8    116.
#> 6     6     5     1     19.7    175 
#> 7     8     5     2     15.4    300.
#> 8     8     3    12     15.0    194.
✏️ Exercise 3.3 — Build a feature

Add two new columns to cars:

  1. power_to_weight — horsepower per 1000 lbs (i.e. hp / wt).
  2. category"sporty" if power_to_weight > 50, otherwise "normal".

Then count the cars in each category.

Show solution
cars |>
  mutate(
    power_to_weight = hp / wt,
    category = if_else(power_to_weight > 50, "sporty", "normal")
  ) |>
  count(category)
#> # A tibble: 2 × 2
#>   category     n
#>   <chr>    <int>
#> 1 normal      25
#> 2 sporty       7

What’s next

You now have the toolkit to load, clean, and summarise data — easily 60% of any real analysis. Next up: making the results visible. Lesson 4 is about plotting with ggplot2, which pairs cleanly with everything you just learned.

Feel free to contact me: