Lesson 2 of 7

Basic Operations in R

This lesson covers the building blocks: variables, data types, vectors, operators, control flow, and functions. None of it is exciting on its own, but every single thing in R is built out of these pieces. Get comfortable here and the rest of the course is mostly pattern-matching.

Variables and assignment

A variable is a name that points to a value. Assign with <-:

x <- 5
y <- 12
x + y
#> [1] 17

= works too, but <- is the convention in R for assignment. Stick with it.

You can reassign a variable as often as you like — R doesn’t care:

x <- 5
x <- x + 1
x
#> [1] 6

Data types

R has a small set of basic types you’ll use constantly.

n  <- 3.14         # numeric (double)
i  <- 42L          # integer (the L makes it an integer, not a double)
s  <- "hello"      # character (string)
b  <- TRUE         # logical (TRUE / FALSE)
m  <- NA           # missing value

To check what type something is, use class() or one of the is.* family:

class(n)
#> [1] "numeric"
class(s)
#> [1] "character"
class(b)
#> [1] "logical"
is.numeric(n)
#> [1] TRUE
is.character(s)
#> [1] TRUE
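
The comments above call n a "double" while class(n) prints "numeric". A small sketch of the difference, assuming the same n and i values as above: class() reports how R treats a value, and typeof() (a base R function not otherwise used in this lesson) reports how it is stored.

```r
n <- 3.14
i <- 42L

class(n)
#> [1] "numeric"
typeof(n)        # storage type: numbers are doubles unless you say otherwise
#> [1] "double"
class(i)
#> [1] "integer"
typeof(i)
#> [1] "integer"
is.integer(42)   # without the L suffix, 42 is stored as a double
#> [1] FALSE
```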

Missing values (NA) are everywhere in real data. R takes them seriously — most functions return NA if their input has any NAs, unless you tell them otherwise:

mean(c(1, 2, NA))
#> [1] NA
mean(c(1, 2, NA), na.rm = TRUE)
#> [1] 1.5

⚠️ NA is contagious

NA == NA is NA, not TRUE. To check for missingness use is.na(x). This trips up beginners constantly.
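
A short sketch of that warning in practice. The vector x is made-up example data; the point is that == never answers the question "is this missing?", while is.na() does.

```r
x <- c(4, NA, 7)   # hypothetical data with one missing value

x == NA            # comparing with NA never gives TRUE or FALSE
#> [1] NA NA NA
is.na(x)           # the reliable test for missingness
#> [1] FALSE  TRUE FALSE
sum(is.na(x))      # count the missing values
#> [1] 1
x[!is.na(x)]       # keep only the non-missing values
#> [1] 4 7
```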

Arithmetic and math

The usual suspects:

10 + 3
#> [1] 13
10 - 3
#> [1] 7
10 * 3
#> [1] 30
10 / 3
#> [1] 3.333333
10 %% 3   # remainder (modulo)
#> [1] 1
10 %/% 3  # integer division
#> [1] 3
10 ^ 3    # exponentiation
#> [1] 1000

Plus a stack of built-in math functions:

sqrt(16)
#> [1] 4
log(100)        # natural log
#> [1] 4.60517
log(100, base = 10)
#> [1] 2
exp(1)          # e
#> [1] 2.718282
abs(-7)
#> [1] 7
round(3.14159, 2)
#> [1] 3.14
ceiling(2.1)
#> [1] 3
floor(2.9)
#> [1] 2

Comparison and logical operators

Comparisons return TRUE or FALSE:

5 > 3
#> [1] TRUE
5 == 5     # equality is double-equals; single-equals is assignment
#> [1] TRUE
5 != 6
#> [1] TRUE
5 >= 5
#> [1] TRUE
"a" < "b"  # alphabetical
#> [1] TRUE

Logical operators combine TRUE/FALSE values:

TRUE & FALSE   # AND
#> [1] FALSE
TRUE | FALSE   # OR
#> [1] TRUE
!TRUE          # NOT
#> [1] FALSE

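The single-character versions (&, |) work elementwise on vectors. The double-character versions (&&, ||) expect a single TRUE or FALSE on each side and short-circuit, so the right-hand side is evaluated only when it is needed. Since R 4.3.0 they raise an error if given a vector longer than one element (older versions silently used only the first element). Use them inside if statements, not for filtering data.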
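
A minimal sketch of when to reach for each form. The names temps, x, and result are made up for illustration.

```r
temps <- c(12, 25, 31)   # hypothetical example data

# Elementwise: one TRUE/FALSE per element, suitable for filtering
temps > 15 & temps < 30
#> [1] FALSE  TRUE FALSE
temps[temps > 15 & temps < 30]
#> [1] 25

# Scalar: one TRUE/FALSE overall, suitable for if ()
x <- NULL
if (!is.null(x) && x > 0) {   # x > 0 is never evaluated because the left side is FALSE
  result <- "positive"
} else {
  result <- "missing or not positive"
}
result
#> [1] "missing or not positive"
```

The short-circuit matters here: with a NULL x, the comparison x > 0 would produce an empty logical vector, which if cannot use.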

Vectors

A vector is an ordered collection of values, all of the same type. It’s the most fundamental data structure in R — even a single number is technically a vector of length 1.

Create a vector with c() (“combine”):

ages <- c(23, 31, 27, 45, 19)
fruit <- c("apple", "banana", "cherry")
flags <- c(TRUE, FALSE, TRUE, TRUE)
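
Two consequences of the definition above, sketched with made-up values: a lone number really is a vector, and because every element must share one type, c() converts mixed inputs to a common type rather than raising an error.

```r
length(5)                     # a single number is a vector of length 1
#> [1] 1

mixed <- c(1, "two", TRUE)    # mixing types forces a common type
mixed
#> [1] "1"    "two"  "TRUE"
class(mixed)
#> [1] "character"

c(1, TRUE, FALSE)             # logicals become numbers: TRUE is 1, FALSE is 0
#> [1] 1 1 0
```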

A few handy ways to create vectors:

1:5                  # integer sequence
#> [1] 1 2 3 4 5
seq(0, 1, by = 0.25)
#> [1] 0.00 0.25 0.50 0.75 1.00
rep("a", times = 3)
#> [1] "a" "a" "a"

Indexing vectors

Subsetting a vector by position uses [ ]. R is 1-indexed — the first element is at position 1, not 0.

ages[1]            # first element
#> [1] 23
ages[c(1, 3)]      # first and third
#> [1] 23 27
ages[-1]           # everything except the first (negative indexing drops)
#> [1] 31 27 45 19
ages[ages > 25]    # elements that match a condition
#> [1] 31 27 45

That last one — subsetting by a logical vector — is one of the most-used patterns in R. The expression ages > 25 returns a logical vector the same length as ages, and ages[...] keeps the elements where it’s TRUE.
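
The two steps can be pulled apart to see what is happening. This sketch reuses the ages vector; keep is a name chosen here for illustration.

```r
ages <- c(23, 31, 27, 45, 19)

keep <- ages > 25   # step 1: one TRUE/FALSE per element
keep
#> [1] FALSE  TRUE  TRUE  TRUE FALSE
ages[keep]          # step 2: keep the positions that are TRUE
#> [1] 31 27 45
which(keep)         # the matching positions themselves
#> [1] 2 3 4
sum(keep)           # TRUE counts as 1, so this counts the matches
#> [1] 3
```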

Vectorized operations

R operations work on entire vectors at once. You almost never need a loop for elementwise math:

ages
#> [1] 23 31 27 45 19
ages + 1            # add 1 to every element
#> [1] 24 32 28 46 20
ages * 2
#> [1] 46 62 54 90 38
ages > 30
#> [1] FALSE  TRUE FALSE  TRUE FALSE
sum(ages)
#> [1] 145
mean(ages)
#> [1] 29
length(ages)
#> [1] 5

This is the single biggest stylistic difference between R and most other languages, and getting comfortable with it is half the battle.

Named vectors

Vectors can have names attached to each element:

prices <- c(apple = 1.20, banana = 0.50, cherry = 3.00)
prices
#>  apple banana cherry 
#>    1.2    0.5    3.0
prices["banana"]
#> banana 
#>    0.5
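
A few more things you can do with names, sketched on the same prices vector. All functions used are base R.

```r
prices <- c(apple = 1.20, banana = 0.50, cherry = 3.00)

names(prices)
#> [1] "apple"  "banana" "cherry"
prices[c("apple", "cherry")]   # several names at once
#>  apple cherry 
#>    1.2    3.0
prices[["banana"]]             # [[ ]] returns the bare value, without the name
#> [1] 0.5
```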

Lists

A list is like a vector, but each element can be a different type — even another list.

person <- list(name = "Ada", age = 36, hobbies = c("math", "logic"))
person$name
#> [1] "Ada"
person$hobbies
#> [1] "math"  "logic"
person[["age"]]
#> [1] 36

You’ll meet lists most often as the return value of model-fitting functions like lm() (we’ll get there in Lesson 5).
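
Two details worth seeing once, sketched on the person list from above: single brackets return a smaller list while double brackets return the element itself, and a list element can be another list. The address element is hypothetical, added only to show nesting.

```r
person <- list(name = "Ada", age = 36, hobbies = c("math", "logic"))

class(person["age"])      # single brackets return a smaller list
#> [1] "list"
class(person[["age"]])    # double brackets return the element itself
#> [1] "numeric"

person$address <- list(city = "London")   # hypothetical nested list
person$address$city
#> [1] "London"
names(person)
#> [1] "name"    "age"     "hobbies" "address"
```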

Data frames

A data frame is a table — rows are observations, columns are variables, and each column is a vector of the same length. It’s the workhorse data structure for most analysis.

df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 22),
  score = c(95, 88, 75)
)
df
#>      name age score
#> 1   Alice  25    95
#> 2     Bob  30    88
#> 3 Charlie  22    75

Subset by column with $ or [[ ]], and by row/column with [ , ]:

df$age
#> [1] 25 30 22
df[1, ]            # first row
#>    name age score
#> 1 Alice  25    95
df[, "name"]       # name column
#> [1] "Alice"   "Bob"     "Charlie"
df[df$age > 23, ]  # all rows where age > 23
#>    name age score
#> 1 Alice  25    95
#> 2   Bob  30    88

We’ll spend most of Lesson 3 working with data frames using a much nicer toolset (dplyr). For now, just know they exist and look like a spreadsheet.
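
Because each column is an ordinary vector, everything from the vector sections applies to columns too. A minimal base R sketch using the same df; the passed column is a hypothetical addition for illustration.

```r
df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 22),
  score = c(95, 88, 75)
)

nrow(df)
#> [1] 3
ncol(df)
#> [1] 3
mean(df$score)                # a column is an ordinary vector
#> [1] 86
df$passed <- df$score >= 80   # new column built from a vectorized comparison
df
#>      name age score passed
#> 1   Alice  25    95   TRUE
#> 2     Bob  30    88   TRUE
#> 3 Charlie  22    75  FALSE
```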

Control flow

if / else

score <- 78
if (score >= 90) {
  "A"
} else if (score >= 80) {
  "B"
} else if (score >= 70) {
  "C"
} else {
  "F"
}
#> [1] "C"

for loops

for (n in 1:5) {
  print(n^2)
}
#> [1] 1
#> [1] 4
#> [1] 9
#> [1] 16
#> [1] 25

💡 You’ll write fewer loops than you think

In R, most things you’d loop over in another language can be done with vectorized operations or with sapply/lapply. Loops aren’t wrong — they’re just often unnecessary. We’ll see this in Lesson 3.
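
As a rough illustration of that tip, here is the squares example from the loop above written three ways. The variable names are chosen here for illustration; all three produce the same numbers.

```r
# Loop version: fill a result vector one element at a time
squares_loop <- numeric(5)
for (n in 1:5) {
  squares_loop[n] <- n^2
}

# Vectorized version: the operator already works elementwise
squares_vec <- (1:5)^2

# sapply version: apply a function to each element
squares_sapply <- sapply(1:5, function(n) n^2)

squares_vec
#> [1]  1  4  9 16 25
identical(squares_loop, squares_vec)
#> [1] TRUE
identical(squares_vec, squares_sapply)
#> [1] TRUE
```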

while loops

n <- 1
while (n < 100) {
  n <- n * 2
}
n
#> [1] 128

Functions

Writing your own functions is how you go from “running examples” to “actually using R for something.” Syntax:

celsius_to_fahrenheit <- function(temp_c) {
  temp_c * 9 / 5 + 32   # argument is not named c, to avoid shadowing c()
}

celsius_to_fahrenheit(0)
#> [1] 32
celsius_to_fahrenheit(c(0, 20, 100))   # works on a vector for free
#> [1]  32  68 212

A few rules:

  • The last expression in the body is what the function returns. You can also write return(...) explicitly, but it’s not required.
  • Arguments can have default values: function(temp_c, offset = 0) { ... }.
  • Functions are first-class — you can pass them around as values.
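
The three rules above, sketched in one place. The offset argument and the apply_twice helper are hypothetical, introduced only to illustrate defaults and passing functions as values.

```r
# `offset` is a hypothetical argument used only to show a default value
celsius_to_fahrenheit <- function(temp_c, offset = 0) {
  result <- temp_c * 9 / 5 + 32 + offset
  return(result)   # explicit return; ending the body with `result` does the same
}

celsius_to_fahrenheit(100)                # offset falls back to 0
#> [1] 212
celsius_to_fahrenheit(100, offset = -2)   # default overridden by name
#> [1] 210

# Functions are values: pass one to another function
apply_twice <- function(f, x) f(f(x))     # hypothetical helper
apply_twice(sqrt, 16)
#> [1] 2
```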

A slightly more useful one — given a numeric vector, compute summary stats:

quick_summary <- function(x, na_rm = TRUE) {
  list(
    n = length(x),
    n_missing = sum(is.na(x)),
    mean = mean(x, na.rm = na_rm),
    sd = sd(x, na.rm = na_rm),
    min = min(x, na.rm = na_rm),
    max = max(x, na.rm = na_rm)
  )
}

quick_summary(c(2, 4, 6, 8, 10, NA))
#> $n
#> [1] 6
#> 
#> $n_missing
#> [1] 1
#> 
#> $mean
#> [1] 6
#> 
#> $sd
#> [1] 3.162278
#> 
#> $min
#> [1] 2
#> 
#> $max
#> [1] 10

Putting it together

Let’s combine everything in this lesson into a tiny analysis. Given a vector of test scores, classify each into a letter grade and report how many of each.

scores <- c(72, 88, 95, 67, 81, 59, 91, 78, 84, 100)

grade <- function(score) {
  if (score >= 90) {
    "A"
  } else if (score >= 80) {
    "B"
  } else if (score >= 70) {
    "C"
  } else if (score >= 60) {
    "D"
  } else {
    "F"
  }
}

grades <- sapply(scores, grade)
grades
#>  [1] "C" "B" "A" "D" "B" "F" "A" "C" "B" "A"
table(grades)
#> grades
#> A B C D F 
#> 3 3 2 1 1

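sapply() applies a function to each element of a vector and collects the results, here into a character vector. We need it because grade() uses if, which accepts only a single TRUE or FALSE: passing the whole scores vector to grade() fails with an error in R 4.2.0 and later, and in older versions it only checks the first score (with a warning). (In Lesson 3 we’ll see a much cleaner way to do this with dplyr::case_when().)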
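
If you want to stay in base R, one possible alternative is cut(), which bins a whole numeric vector in a single call. This is a sketch rather than the approach this course uses; the breaks are chosen here to reproduce the cutoffs in grade().

```r
scores <- c(72, 88, 95, 67, 81, 59, 91, 78, 84, 100)

grades_cut <- cut(
  scores,
  breaks = c(-Inf, 60, 70, 80, 90, Inf),   # same cutoffs as grade()
  labels = c("F", "D", "C", "B", "A"),
  right = FALSE                            # intervals are [lower, upper), so 90 is an A
)
as.character(grades_cut)
#>  [1] "C" "B" "A" "D" "B" "F" "A" "C" "B" "A"
```

cut() returns a factor rather than a character vector, which is why the result is converted with as.character() before printing.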

✏️ Exercise 2.1 — Vectorized math

Given the temperatures vector below (in Fahrenheit), convert all of them to Celsius without writing a loop.

temps_f <- c(32, 50, 68, 86, 104)

Solution:

temps_f <- c(32, 50, 68, 86, 104)
temps_c <- (temps_f - 32) * 5 / 9
temps_c
#> [1]  0 10 20 30 40

Or wrap it in a function so you can reuse it:

fahrenheit_to_celsius <- function(f) (f - 32) * 5 / 9
fahrenheit_to_celsius(temps_f)
#> [1]  0 10 20 30 40

✏️ Exercise 2.2 — Subsetting

Using the built-in mtcars dataset:

  1. Print the mpg column.
  2. Print only the cars with mpg > 25.
  3. What’s the mean mpg of cars with 4 cylinders (cyl == 4)?

Solution:

mtcars$mpg
#>  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
#> [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
#> [31] 15.0 21.4
mtcars[mtcars$mpg > 25, ]
#>                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
#> Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
#> Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
#> Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
#> Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#> Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
mean(mtcars$mpg[mtcars$cyl == 4])
#> [1] 26.66364

✏️ Exercise 2.3 — Write a function

Write a function is_leap_year(year) that returns TRUE if year is a leap year and FALSE otherwise. (A year is a leap year if it’s divisible by 4, except century years are only leap years if also divisible by 400. So 2000 is, 1900 isn’t.)

Test it on c(2000, 1900, 2020, 2023, 2024).

Solution:

is_leap_year <- function(year) {
  (year %% 4 == 0 & year %% 100 != 0) | (year %% 400 == 0)
}

is_leap_year(c(2000, 1900, 2020, 2023, 2024))
#> [1]  TRUE FALSE  TRUE FALSE  TRUE

Note we used & and | (single character) so it works on a whole vector at once — exactly the vectorization point from earlier.

What’s next

You now have variables, types, vectors, control flow, and functions. That’s enough to write real code. Lesson 3 introduces the tidyverse — a much nicer way to manipulate data frames than what you saw above.
