Lesson 2 of 7
This lesson covers the building blocks: variables, data types, vectors, operators, control flow, and functions. None of it is exciting on its own, but every single thing in R is built out of these pieces. Get comfortable here and the rest of the course is mostly pattern-matching.
A variable is a name that points to a value. Assign one with the <- operator:
x <- 5
y <- 12
x + y
#> [1] 17
= works too, but <- is the convention in
R for assignment. Stick with it.
You can reassign a variable as often as you like — R doesn’t care:
x <- 5
x <- x + 1
x
#> [1] 6
R has a small set of basic types you’ll use constantly.
n <- 3.14 # numeric (double)
i <- 42L # integer (the L makes it an integer, not a double)
s <- "hello" # character (string)
b <- TRUE # logical (TRUE / FALSE)
m <- NA # missing value
To check what type something is, use class() or one of
the is.* family:
class(n)
#> [1] "numeric"
class(s)
#> [1] "character"
class(b)
#> [1] "logical"
is.numeric(n)
#> [1] TRUE
is.character(s)
#> [1] TRUE
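As a small sketch of how class() relates to the way a value is actually stored, the snippet below also calls typeof() and is.integer(), which this lesson doesn't otherwise use. The variable names simply mirror the ones above.

```r
n <- 3.14
i <- 42L

class(n)
#> [1] "numeric"
typeof(n)        # how the value is stored internally
#> [1] "double"
class(i)
#> [1] "integer"
is.integer(42)   # without the L suffix, 42 is stored as a double
#> [1] FALSE
is.numeric(i)    # integers count as numeric too
#> [1] TRUE
```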
Missing values (NA) are everywhere in real data. R takes
them seriously — most functions return NA if their input
has any NAs, unless you tell them otherwise:
mean(c(1, 2, NA))
#> [1] NA
mean(c(1, 2, NA), na.rm = TRUE)
#> [1] 1.5
NA is contagious
NA == NA is NA, not TRUE. To
check for missingness use is.na(x). This trips up beginners
constantly.
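Here is a minimal sketch of that pitfall on a short vector. The vector x is made up for illustration.

```r
x <- c(4, NA, 7)

x == NA            # comparing with NA never gives TRUE or FALSE
#> [1] NA NA NA
is.na(x)           # the right way to test for missingness
#> [1] FALSE  TRUE FALSE
sum(is.na(x))      # count the missing values
#> [1] 1
x[!is.na(x)]       # keep only the non-missing values
#> [1] 4 7
```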
The usual suspects:
10 + 3
#> [1] 13
10 - 3
#> [1] 7
10 * 3
#> [1] 30
10 / 3
#> [1] 3.333333
10 %% 3 # remainder (modulo)
#> [1] 1
10 %/% 3 # integer division
#> [1] 3
10 ^ 3 # exponentiation
#> [1] 1000
Plus a stack of built-in math functions:
sqrt(16)
#> [1] 4
log(100) # natural log
#> [1] 4.60517
log(100, base = 10)
#> [1] 2
exp(1) # e
#> [1] 2.718282
abs(-7)
#> [1] 7
round(3.14159, 2)
#> [1] 3.14
ceiling(2.1)
#> [1] 3
floor(2.9)
#> [1] 2
Comparisons return TRUE or FALSE:
5 > 3
#> [1] TRUE
5 == 5 # equality is double-equals; single-equals is assignment
#> [1] TRUE
5 != 6
#> [1] TRUE
5 >= 5
#> [1] TRUE
"a" < "b" # alphabetical
#> [1] TRUE
Logical operators combine TRUE/FALSE
values:
TRUE & FALSE # AND
#> [1] FALSE
TRUE | FALSE # OR
#> [1] TRUE
!TRUE # NOT
#> [1] FALSE
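The single-character versions (&, |) work elementwise on vectors. The double-character versions (&&, ||) expect a single TRUE/FALSE value on each side and short-circuit: the right side is only evaluated if the left side doesn't already settle the answer. Since R 4.3.0, giving them a vector longer than one is an error (older versions silently used just the first element). Use them inside if statements, not for filtering data.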
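The difference is easier to see side by side. This is a small sketch with a made-up vector x. The stop() call is only there to show that the right-hand side is never evaluated.

```r
x <- c(1, 5, 10)

# Elementwise: one result per element, suitable for filtering
x > 2 & x < 8
#> [1] FALSE  TRUE FALSE
x[x > 2 & x < 8]
#> [1] 5

# Short-circuit: the left side already decides the answer,
# so stop() is never called
FALSE && stop("never evaluated")
#> [1] FALSE

# Typical use inside if: check the length before touching x[1]
if (length(x) > 0 && x[1] > 0) {
  print("first element is positive")
}
#> [1] "first element is positive"
```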
A vector is an ordered collection of values, all of the same type. It’s the most fundamental data structure in R — even a single number is technically a vector of length 1.
Create a vector with c() (“combine”):
ages <- c(23, 31, 27, 45, 19)
fruit <- c("apple", "banana", "cherry")
flags <- c(TRUE, FALSE, TRUE, TRUE)
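Because a vector holds only one type, c() quietly converts mixed inputs to the most flexible type among them. A quick sketch of what that looks like:

```r
c(1, "a", TRUE)      # everything becomes character
#> [1] "1"    "a"    "TRUE"
c(1, TRUE, FALSE)    # logicals become numbers (TRUE is 1, FALSE is 0)
#> [1] 1 1 0
class(c(1L, 2.5))    # integer and double together give double
#> [1] "numeric"
```

If a column you expected to be numeric shows up as character, a stray string somewhere in the input is a likely cause.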
A few handy ways to create vectors:
1:5 # integer sequence
#> [1] 1 2 3 4 5
seq(0, 1, by = 0.25)
#> [1] 0.00 0.25 0.50 0.75 1.00
rep("a", times = 3)
#> [1] "a" "a" "a"
Subsetting a vector by position uses [ ]. R is
1-indexed — the first element is at position 1, not 0.
ages[1] # first element
#> [1] 23
ages[c(1, 3)] # first and third
#> [1] 23 27
ages[-1] # everything except the first (negative indexing drops)
#> [1] 31 27 45 19
ages[ages > 25] # elements that match a condition
#> [1] 31 27 45
That last one — subsetting by a logical vector — is one of the
most-used patterns in R. The expression ages > 25
returns a logical vector the same length as ages, and
ages[...] keeps the elements where it’s
TRUE.
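To make the two steps explicit, the sketch below stores the logical vector in an intermediate variable (keep is just an illustrative name) before using it to subset.

```r
ages <- c(23, 31, 27, 45, 19)

keep <- ages > 25    # step 1: build the logical vector
keep
#> [1] FALSE  TRUE  TRUE  TRUE FALSE
ages[keep]           # step 2: keep the positions where it is TRUE
#> [1] 31 27 45
which(keep)          # positions of the TRUE values
#> [1] 2 3 4
sum(keep)            # TRUE counts as 1, so this counts the matches
#> [1] 3
```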
R operations work on entire vectors at once. You almost never need a loop for elementwise math:
ages
#> [1] 23 31 27 45 19
ages + 1 # add 1 to every element
#> [1] 24 32 28 46 20
ages * 2
#> [1] 46 62 54 90 38
ages > 30
#> [1] FALSE TRUE FALSE TRUE FALSE
sum(ages)
#> [1] 145
mean(ages)
#> [1] 29
length(ages)
#> [1] 5
This is the single biggest stylistic difference between R and most other languages, and getting comfortable with it is half the battle.
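For comparison, here is roughly what the loop-based version of ages * 2 would look like. The names doubled_loop and doubled_vec are made up for this sketch.

```r
ages <- c(23, 31, 27, 45, 19)

# Loop version: preallocate the result, then fill it one element at a time
doubled_loop <- numeric(length(ages))
for (k in seq_along(ages)) {
  doubled_loop[k] <- ages[k] * 2
}

# Vectorized version: one expression, same result
doubled_vec <- ages * 2

identical(doubled_loop, doubled_vec)
#> [1] TRUE
```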
Vectors can have names attached to each element:
prices <- c(apple = 1.20, banana = 0.50, cherry = 3.00)
prices
#> apple banana cherry
#> 1.2 0.5 3.0
prices["banana"]
#> banana
#> 0.5
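A few more things you can do with names, sketched on the same prices vector. The new banana price is an arbitrary value chosen for the example.

```r
prices <- c(apple = 1.20, banana = 0.50, cherry = 3.00)

names(prices)
#> [1] "apple"  "banana" "cherry"
two <- prices[c("apple", "cherry")]   # select several elements by name
names(two)
#> [1] "apple"  "cherry"
prices[["banana"]]                    # [[ ]] returns the value without its name
#> [1] 0.5
prices["banana"] <- 0.60              # update an element by name
prices[["banana"]]
#> [1] 0.6
```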
A list is like a vector, but each element can be a different type — even another list.
person <- list(name = "Ada", age = 36, hobbies = c("math", "logic"))
person$name
#> [1] "Ada"
person$hobbies
#> [1] "math" "logic"
person[["age"]]
#> [1] 36
You’ll meet lists most often as the return value of model-fitting
functions like lm() (we’ll get there in Lesson 5).
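One detail that is worth seeing early is that single and double brackets behave differently on a list. This sketch reuses the person list from above.

```r
person <- list(name = "Ada", age = 36, hobbies = c("math", "logic"))

class(person["age"])      # single brackets return a smaller list
#> [1] "list"
class(person[["age"]])    # double brackets return the element itself
#> [1] "numeric"
person$hobbies[2]         # chain subsetting to reach inside an element
#> [1] "logic"
names(person)
#> [1] "name"    "age"     "hobbies"
```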
A data frame is a table — rows are observations, columns are variables, and each column is a vector of the same length. It’s the workhorse data structure for most analysis.
df <- data.frame(
name = c("Alice", "Bob", "Charlie"),
age = c(25, 30, 22),
score = c(95, 88, 75)
)
df
#> name age score
#> 1 Alice 25 95
#> 2 Bob 30 88
#> 3 Charlie 22 75
Subset by column with $ or [[ ]], and by
row/column with [ , ]:
df$age
#> [1] 25 30 22
df[1, ] # first row
#> name age score
#> 1 Alice 25 95
df[, "name"] # name column
#> [1] "Alice" "Bob" "Charlie"
df[df$age > 23, ] # all rows where age > 23
#> name age score
#> 1 Alice 25 95
#> 2 Bob 30 88
We’ll spend most of Lesson 3 working with data frames using a much
nicer toolset (dplyr). For now, just know they exist and
look like a spreadsheet.
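Since each column is just a vector, everything from the vector sections carries over. Here is a short base-R sketch on the same data frame. The passed column and the cutoff of 80 are made up for illustration.

```r
df <- data.frame(
  name  = c("Alice", "Bob", "Charlie"),
  age   = c(25, 30, 22),
  score = c(95, 88, 75)
)

nrow(df)
#> [1] 3
names(df)
#> [1] "name"  "age"   "score"
df$passed <- df$score >= 80    # add a column from a vectorized comparison
df[df$passed, "name"]          # names of the rows where passed is TRUE
#> [1] "Alice" "Bob"
mean(df$score)
#> [1] 86
```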
if / else
score <- 78
if (score >= 90) {
"A"
} else if (score >= 80) {
"B"
} else if (score >= 70) {
"C"
} else {
"F"
}
#> [1] "C"
for loops
for (n in 1:5) {
print(n^2)
}
#> [1] 1
#> [1] 4
#> [1] 9
#> [1] 16
#> [1] 25
In R, most things you’d loop over in another language can be done
with vectorized operations or with
sapply/lapply. Loops aren’t wrong — they’re
just often unnecessary. We’ll see this in Lesson 3.
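As a rough sketch, here are three loop-free ways to get the same squares as the for loop above:

```r
(1:5)^2                          # vectorized arithmetic
#> [1]  1  4  9 16 25
sapply(1:5, function(n) n^2)     # apply a function to each element, get a vector back
#> [1]  1  4  9 16 25
squares_list <- lapply(1:5, function(n) n^2)   # same idea, but returns a list
class(squares_list)
#> [1] "list"
```

When plain vectorized arithmetic does the job, it is usually the shortest and fastest of the three.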
while loops
n <- 1
while (n < 100) {
n <- n * 2
}
n
#> [1] 128
Writing your own functions is how you go from “running examples” to “actually using R for something.” Syntax:
celsius_to_fahrenheit <- function(c) {
c * 9 / 5 + 32
}
celsius_to_fahrenheit(0)
#> [1] 32
celsius_to_fahrenheit(c(0, 20, 100)) # works on a vector for free
#> [1] 32 68 212
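A few rules:
- A function returns the value of its last expression. You can call return(...) explicitly, but it's not required.
- Arguments can have default values: function(c, offset = 0) { ... }.
A slightly more useful example computes summary statistics for a numeric vector: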
quick_summary <- function(x, na_rm = TRUE) {
list(
n = length(x),
n_missing = sum(is.na(x)),
mean = mean(x, na.rm = na_rm),
sd = sd(x, na.rm = na_rm),
min = min(x, na.rm = na_rm),
max = max(x, na.rm = na_rm)
)
}
quick_summary(c(2, 4, 6, 8, 10, NA))
#> $n
#> [1] 6
#>
#> $n_missing
#> [1] 1
#>
#> $mean
#> [1] 6
#>
#> $sd
#> [1] 3.162278
#>
#> $min
#> [1] 2
#>
#> $max
#> [1] 10
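To illustrate the points about default arguments and return(), here is a small sketch. It extends celsius_to_fahrenheit with the offset argument mentioned above and adds a made-up helper, safe_sqrt, that uses return() to exit early.

```r
# A default value means callers can leave offset out
celsius_to_fahrenheit <- function(c, offset = 0) {
  c * 9 / 5 + 32 + offset    # the last expression is the return value
}
celsius_to_fahrenheit(100)
#> [1] 212
celsius_to_fahrenheit(100, offset = 1)
#> [1] 213

# Explicit return() is handy for leaving a function early
# (safe_sqrt is a hypothetical helper that expects a single number)
safe_sqrt <- function(x) {
  if (x < 0) {
    return(NA)
  }
  sqrt(x)
}
safe_sqrt(-4)
#> [1] NA
safe_sqrt(16)
#> [1] 4
```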
Let’s combine everything in this lesson into a tiny analysis. Given a vector of test scores, classify each into a letter grade and report how many of each.
scores <- c(72, 88, 95, 67, 81, 59, 91, 78, 84, 100)
grade <- function(score) {
if (score >= 90) {
"A"
} else if (score >= 80) {
"B"
} else if (score >= 70) {
"C"
} else if (score >= 60) {
"D"
} else {
"F"
}
}
grades <- sapply(scores, grade)
grades
#> [1] "C" "B" "A" "D" "B" "F" "A" "C" "B" "A"
table(grades)
#> grades
#> A B C D F
#> 3 3 2 1 1
sapply applies a function to each element of a vector
and returns the results. We need it here because grade()
uses if, which only works on a single value at a time. (In
Lesson 3 we’ll see a much cleaner way to do this with
dplyr::case_when().)
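If you'd rather stay in base R, one possible vectorized alternative is cut(), which bins a numeric vector into labeled intervals. This is a sketch rather than the approach this course uses. The name grades_cut is made up, and the breaks assume the same cutoffs as grade().

```r
scores <- c(72, 88, 95, 67, 81, 59, 91, 78, 84, 100)

grades_cut <- cut(
  scores,
  breaks = c(-Inf, 60, 70, 80, 90, Inf),
  labels = c("F", "D", "C", "B", "A"),
  right  = FALSE    # intervals are [lower, upper), so a score of 90 is an "A"
)
table(grades_cut)
#> grades_cut
#> F D C B A
#> 1 1 2 3 3
```

cut() returns a factor, so the table is ordered by the labels (F to A) rather than alphabetically. Wrap the result in as.character() if you want plain strings.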
Given the temperatures vector below (in Fahrenheit), convert all of them to Celsius without writing a loop.
temps_f <- c(32, 50, 68, 86, 104)
temps_f <- c(32, 50, 68, 86, 104)
temps_c <- (temps_f - 32) * 5 / 9
temps_c
#> [1] 0 10 20 30 40
Or wrap it in a function so you can reuse it:
fahrenheit_to_celsius <- function(f) (f - 32) * 5 / 9
fahrenheit_to_celsius(temps_f)
#> [1] 0 10 20 30 40
Using the built-in mtcars dataset:
1. Extract the mpg column.
2. Keep only the rows where mpg > 25.
3. What is the mean mpg of cars with 4 cylinders (cyl == 4)?
mtcars$mpg
#> [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
#> [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
#> [31] 15.0 21.4
mtcars[mtcars$mpg > 25, ]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#> Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#> Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
#> Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
#> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
mean(mtcars$mpg[mtcars$cyl == 4])
#> [1] 26.66364
Write a function is_leap_year(year) that returns
TRUE if year is a leap year and
FALSE otherwise. (A year is a leap year if it’s divisible
by 4, except century years are only leap years if also
divisible by 400. So 2000 is, 1900 isn’t.)
Test it on c(2000, 1900, 2020, 2023, 2024).
is_leap_year <- function(year) {
(year %% 4 == 0 & year %% 100 != 0) | (year %% 400 == 0)
}
is_leap_year(c(2000, 1900, 2020, 2023, 2024))
#> [1] TRUE FALSE TRUE FALSE TRUE
Note we used & and | (single character)
so it works on a whole vector at once — exactly the vectorization point
from earlier.
You now have variables, types, vectors, control flow, and functions. That’s enough to write real code. Lesson 3 introduces the tidyverse — a much nicer way to manipulate data frames than what you saw above.