Introduction to R

Session 2

Session Overview

  1. Objects
  2. Vectors
  3. Matrices
  4. Lists
  5. Data frames

R Training Team Today

Martin Schumann

  • Assistant Professor at QE
  • Research interests: panel data, nonlinear models, difference-in-differences, network data, innovation
  • Website
  • Sessions 1 and 2

Stephan Smeekes

  • Professor of Econometrics at QE
  • Research interests: econometrics, time series, high-dimensional statistics, bootstrap, macro- and climate econometrics
  • Website
  • Sessions 2, 3 and 4

Objects

  • In R, everything is an object.

  • Objects have a name that is assigned with <- (recommended) or =.

  • Names have to start with a letter (or a “.” that is not followed by a number) and may contain only letters, numbers, “.” and “_”.

  • R is case sensitive: \(\Rightarrow Name\neq name\)!

  • Objects can store vectors, matrices, lists, data frames, functions…

# generate object x (no output):
x <- 5
# display log(x)
log(x)
[1] 1.609438
# object X is not defined => error message 
X
Error: object 'X' not found
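
A minimal sketch of the naming rules above. The object names (my_value, my.value, name, Name) are made up for illustration.

```r
# both assignment operators create an object; '<-' is the recommended style
my_value <- 5
my.value = 10

# names are case sensitive: 'name' and 'Name' are two different objects
name <- "lower case"
Name <- "upper case"
identical(name, Name)   # FALSE

# exists() checks whether an object with a given name is defined
exists("my_value")      # TRUE
exists("My_Value")      # FALSE (different capitalization)

# class() reports what kind of object is stored under a name
class(my_value)         # "numeric"
class(name)             # "character"
class(log)              # "function"
```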

Vectors

  • A vector stores several values of the same type (e.g., all numbers or all characters).
  • To define a vector of length 3 named “vec”, use vec <- c(value1, value2, value3).
  • Operators and functions can be applied to vectors, which means they are applied to each of the elements individually.
# define vector named 'vec'
vec <- c(1, 2, 3)
# take the square root of 'vec' and store the result in 'sqrt_vec'
sqrt_vec <- sqrt(vec)
# display sqrt_vec
print(sqrt_vec)
[1] 1.000000 1.414214 1.732051
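
Element-wise application also holds for arithmetic operators. A short sketch using the same vector as above; the second vector is an arbitrary example.

```r
vec <- c(1, 2, 3)

# a single number is combined with every element
vec * 2               # 2 4 6
vec^2                 # 1 4 9

# two vectors of the same length are combined element by element
vec + c(10, 20, 30)   # 11 22 33

# length() returns the number of elements
length(vec)           # 3
```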

Vectors - some helpful shortcuts

  • R has built-in functions that generate sequences (useful for loops or plots, among other things).
  • We can also repeat elements using rep().
# generate sequence 5,6,...,10
5:10
[1]  5  6  7  8  9 10
# generate sequence from 1 to 5 in steps of 0.5
seq(from = 1, to = 5, by = 0.5)
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
# generate a vector of four ones
rep(1, 4)
[1] 1 1 1 1
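
Both seq() and rep() take further arguments that are often handy. The sketch below shows a few of them; the values are arbitrary.

```r
# fix the number of elements instead of the step size
seq(from = 0, to = 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00

# seq_len(n) generates 1, 2, ..., n
seq_len(4)                              # 1 2 3 4

# 'times' repeats the whole vector, 'each' repeats every element in turn
rep(c("a", "b"), times = 2)             # "a" "b" "a" "b"
rep(c("a", "b"), each = 2)              # "a" "a" "b" "b"
```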

Order matters!

  • Be aware of the order of operations!
  • Compare the following:
1+2:3^2 # '^2' evaluated before ':', only then '+1' is evaluated
[1]  3  4  5  6  7  8  9 10
1+2:3*4 # first ':', then '*4', then '+1'
[1]  9 13
# use brackets to avoid confusion or mistakes
(1+2):(3*4)
 [1]  3  4  5  6  7  8  9 10 11 12
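
Two further cases where precedence around ':' often surprises. This is a sketch in which n is a made-up example value.

```r
n <- 5

# ':' is evaluated before the binary '-', so this is (1:n) - 1
1:n - 1      # 0 1 2 3 4
# brackets give the sequence from 1 to n - 1
1:(n - 1)    # 1 2 3 4

# a leading minus sign is evaluated before ':', so this is (-1):3
-1:3         # -1 0 1 2 3
# brackets negate the whole sequence instead
-(1:3)       # -1 -2 -3
```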

Summarizing vectors

  • R has built-in functions to summarize the information stored in vectors.
  • Remark: R is very good at generating random numbers! (Such functions are studied in more detail in the next session.)
# Example: generate 100 random draws from a normal distribution with mean 1
# and standard deviation 2
norm.vec <- rnorm(n = 100, mean = 1, sd = 2)
# get mean 
mean(norm.vec)
[1] 0.9772157
# get standard deviation 
sd(norm.vec)
[1] 1.694135
# get maximum
max(norm.vec)
[1] 5.060195
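
Because the draws are random, the numbers above change every time the code is run. A sketch of how to make them reproducible with set.seed(); the seed value 123 and the names draws1 and draws2 are arbitrary choices.

```r
# fixing the seed makes the random draws reproducible
set.seed(123)
draws1 <- rnorm(n = 100, mean = 1, sd = 2)

# resetting the same seed gives exactly the same draws
set.seed(123)
draws2 <- rnorm(n = 100, mean = 1, sd = 2)
identical(draws1, draws2)   # TRUE

# summary() reports minimum, quartiles, mean and maximum in one call
summary(draws1)

# the variance is the squared standard deviation
all.equal(var(draws1), sd(draws1)^2)   # TRUE
```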

Exercise 2.1: generating and summarizing vectors

  • Draw 50 random numbers from a normal distribution with mean 0 and variance 1. Store your results in the object norm.vec.
  • Calculate the mean and standard deviation of norm.vec.
  • Use rep() to repeat each element of norm.vec 3 times. Store the result in the object norm.vec.rep.
  • Is mean(norm.vec.rep^2) equal to mean(norm.vec.rep)^2?

Logical operators

  • Logical values are either TRUE or FALSE.
  • Extremely useful for conditional statements, e.g. if(condition is TRUE){do this}else{do that}.
  • We can check whether two objects are equal with ==, whether they differ with !=, or compare them with <, >, <= and >=.
  • We can combine logical statements with & (“AND”) and | (“OR”).
# define objects
obj1 <- 1
obj2 <- 2
obj3 <- 1 # same value as obj1
obj1 == obj2 # false statement
[1] FALSE
obj1 != obj2 # true statement
[1] TRUE
obj1 == obj2 & obj1 == obj3 # FALSE AND TRUE => FALSE
[1] FALSE
obj1 == obj2 | obj1 == obj3 # FALSE OR TRUE => TRUE
[1] TRUE
  • We can also use logical operators in vectors.
  • The AND and OR operators & and | are then applied element-wise.
vec2 <- 1:5 # defines vector vec2=(1,2,3,4,5)
vec2 == 3  # =FALSE if element is not 3, =TRUE if element is 3
[1] FALSE FALSE  TRUE FALSE FALSE
vec2 >= 2 & vec2 < 5 # Only TRUE for elements >=2 and <5
[1] FALSE  TRUE  TRUE  TRUE FALSE
vec2 >= 2 | vec2 < 5 # TRUE for all elements since either >=2 or <5
[1] TRUE TRUE TRUE TRUE TRUE
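
A sketch of how logical values are typically used: in a conditional statement and to summarize a vector. The object names x, msg and in_range are made up; inside if(), where the condition is a single value, && is the usual form of “AND”.

```r
# conditional statement: the condition must be a single TRUE or FALSE
x <- 3
if (x > 2 && x < 5) {
  msg <- "x is between 2 and 5"
} else {
  msg <- "x is not between 2 and 5"
}
print(msg)        # "x is between 2 and 5"

# logical vectors can be counted and located
vec2 <- 1:5
in_range <- vec2 >= 2 & vec2 < 5
sum(in_range)     # 3 (TRUE counts as 1, FALSE as 0)
which(in_range)   # 2 3 4 (positions of the TRUE elements)
any(vec2 > 4)     # TRUE  (at least one element is > 4)
all(vec2 > 4)     # FALSE (not every element is > 4)
```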

Characters

  • Vectors can also store characters.
  • Character strings are enclosed in "" or ''.
# define a vector of 3 cities
cities <- c('Maastricht', "Amsterdam", 'Rotterdam')
print(cities)
[1] "Maastricht" "Amsterdam"  "Rotterdam" 

Exercise 2.2: type coercion

  • R tries to make objects comparable by coercing one object into the type of another.
  • This can sometimes be handy, but sometimes it leads to unforeseen errors (e.g., when loading new data). To illustrate this, do the following:
    • compare the character "1" to the numeric 1.
    • try computing the sum of "1" and "2".
    • try computing the sum of as.numeric("1") and as.numeric("2"). What happened?
    • create a mixed vector containing the numeric 1 and the character "2". Of which type are the elements of the vector?

Factors

  • Many variables are qualitative rather than quantitative.
  • While they are often coded using numbers, they don’t have a numerical meaning.
  • Examples: gender, nationality…
  • Can also be ordinal, i.e., the outcomes can be ranked (e.g., “bad”, “meh”, “great”).
x <- c(1, 3, 3, 2, 1, 3)
xf <- factor(x, labels = c("bad", "ok", "good")) # no ranking
xf
[1] bad  good good ok   bad  good
Levels: bad ok good
# now with ranking
xf.ordered <- factor(x, labels = c("bad", "ok", "good"), ordered = TRUE)
xf.ordered
[1] bad  good good ok   bad  good
Levels: bad < ok < good
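
A short sketch of what can be done with a factor once it is defined, repeating the definition of xf.ordered from above so that the code runs on its own.

```r
x <- c(1, 3, 3, 2, 1, 3)
xf.ordered <- factor(x, labels = c("bad", "ok", "good"), ordered = TRUE)

# levels() returns the possible outcomes
levels(xf.ordered)       # "bad" "ok" "good"

# table() counts how often each outcome occurs
table(xf.ordered)        # bad: 2, ok: 1, good: 3

# ordered factors can be compared using the ranking of the levels
xf.ordered >= "ok"       # FALSE TRUE TRUE TRUE FALSE TRUE

# internally, a factor stores integer codes that point to the levels
as.integer(xf.ordered)   # 1 3 3 2 1 3
```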

Names

  • You can give the elements of your vector names either directly or using the names() command.
  • This is very useful for accessing elements (see next slide)
avg_temp <- c(Maastricht = 14.2, Amsterdam = 13.4, Rotterdam = 13.7)
print(avg_temp) # names appear above the elements
Maastricht  Amsterdam  Rotterdam 
      14.2       13.4       13.7 
names(avg_temp) # returns names of elements
[1] "Maastricht" "Amsterdam"  "Rotterdam" 
# Alternatively, we can define data and names separately
temp <- c(14.2, 13.4, 13.7)
names(temp) <- cities # recall that we have defined "cities" earlier!
print(temp)
Maastricht  Amsterdam  Rotterdam 
      14.2       13.4       13.7 

Accessing elements

  • One can access the elements of a vector either by name or position.
# return the second element of "avg_temp" defined before
avg_temp[2] 
Amsterdam 
     13.4 
# return the element corresponding to "Maastricht"
avg_temp["Maastricht"]
Maastricht 
      14.2 
# trying to access a non-existing element yields "NA"
# ( for "not available"), i.e., a missing value
avg_temp[4]
<NA> 
  NA 
  • By using the minus sign [-k], we can get the vector except for the \(k\)-th element.
  • We can also add elements to an existing vector.
# get the vector except for the third element
avg_temp[-3]
Maastricht  Amsterdam 
      14.2       13.4 
# now add another city to avg_temp
avg_temp["Tilburg"] <- 14.7
# now the fourth element is defined!
avg_temp[4]
Tilburg 
   14.7 
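
Several elements can be accessed at once by passing a vector of positions, names or logical values. A sketch, repeating the original definition of avg_temp so that the code runs on its own; the replacement value 13.5 is made up.

```r
avg_temp <- c(Maastricht = 14.2, Amsterdam = 13.4, Rotterdam = 13.7)

# several positions at once
avg_temp[c(1, 3)]                       # Maastricht and Rotterdam

# several names at once
avg_temp[c("Amsterdam", "Rotterdam")]

# a logical vector keeps the elements where the condition is TRUE
avg_temp[avg_temp > 13.5]               # Maastricht and Rotterdam

# the same syntax can be used to replace an existing element
avg_temp["Amsterdam"] <- 13.5
avg_temp["Amsterdam"]
```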

More on NA, NaN, Inf

  • NA (“not available”) indicates missing values.
  • Almost any operation involving NA yields NA.
  • NaN (“not a number”) indicates the result of a mathematically undefined operation.
# define another vector
vec3 <- c(-1.2, NA, 0)
# combine avg_temp and vec3
vec4 <- c(avg_temp, vec3) 
# divide elements by 0; notice the different outcomes
vec4 / 0 
Maastricht  Amsterdam  Rotterdam    Tilburg                                  
       Inf        Inf        Inf        Inf       -Inf         NA        NaN 
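
A sketch of how missing values are usually detected and handled when summarizing a vector, using the vector vec3 from above.

```r
vec3 <- c(-1.2, NA, 0)

# summary functions return NA as soon as one element is missing ...
mean(vec3)                      # NA
# ... unless missing values are removed first
mean(vec3, na.rm = TRUE)        # -0.6

# is.na() flags missing values; sum() counts them
is.na(vec3)                     # FALSE TRUE FALSE
sum(is.na(vec3))                # 1

# NaN also counts as missing, Inf does not
is.na(0 / 0)                    # TRUE
is.na(1 / 0)                    # FALSE
is.finite(c(1, Inf, NA, NaN))   # TRUE FALSE FALSE FALSE
```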

Matrices

  • We can create a matrix with m rows directly using matrix(vector,nrow=m).
# create matrix with 3 rows; fill numbers by row
mat1 <- matrix(1:12, nrow = 3, byrow = TRUE) # by default, R fills matrices by column
mat1
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
  • We can also combine vectors of the same length by row with rbind(v1,v2,...) or by column with cbind(v1,v2,...).
# create vectors v1, v2 and v3 and combine them for same result
v1 <- 1:4
v2 <- 5:8
v3 <- 9:12
mat2 <- rbind(v1, v2, v3)
mat2
   [,1] [,2] [,3] [,4]
v1    1    2    3    4
v2    5    6    7    8
v3    9   10   11   12
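
For comparison, a sketch of the column-wise counterparts: cbind() and the default column-wise filling of matrix(). The names mat3 and mat_by_col are made up.

```r
v1 <- 1:4
v2 <- 5:8
v3 <- 9:12

# cbind() uses each vector as a column, giving a 4 x 3 matrix
mat3 <- cbind(v1, v2, v3)
dim(mat3)           # 4 3

# transposing it gives the same numbers as rbind(v1, v2, v3)
all(t(mat3) == rbind(v1, v2, v3))   # TRUE

# without byrow = TRUE, matrix() fills column by column
mat_by_col <- matrix(1:12, nrow = 3)
mat_by_col[1, ]     # 1 4 7 10
```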

Matrix indexing

# assign names to columns
colnames(mat1) <- c("col1", "col2", "col3", "col4")
mat1
     col1 col2 col3 col4
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
# assign names to rows
rownames(mat1) <- c("row1","row2","row3")
mat1
     col1 col2 col3 col4
row1    1    2    3    4
row2    5    6    7    8
row3    9   10   11   12

Accessing elements

  • We can access single elements by [rownumber,colnumber], the k-th row by [k,] and the k-th column by [,k].
# get element in second row in third column
mat1[2,3]
[1] 7
# get second row
mat1[2,]
col1 col2 col3 col4 
   5    6    7    8 
# get third column
mat1[,3]
row1 row2 row3 
   3    7   11 
  • If rows/columns have names, we can also use those.
  • Using vectors, we can also create more complicated subsets of matrices.
# get sub-matrix using vectors
mat1[c(2,3),c(1,3)]
     col1 col3
row2    5    7
row3    9   11
# get second row using names (recall definition of mat2)
mat2["v2",]
[1] 5 6 7 8
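
A few more ways to index the named matrix mat1, sketched with the definition repeated so that the code runs on its own.

```r
mat1 <- matrix(1:12, nrow = 3, byrow = TRUE)
colnames(mat1) <- c("col1", "col2", "col3", "col4")
rownames(mat1) <- c("row1", "row2", "row3")

# single element by row and column name
mat1["row2", "col3"]           # 7

# several columns by name
mat1[, c("col1", "col4")]

# a negative index drops rows (here: everything except the first row)
mat1[-1, ]

# a logical condition selects rows (here: rows whose col1 entry exceeds 1)
mat1[mat1[, "col1"] > 1, ]     # row2 and row3

# a single row is returned as a vector; drop = FALSE keeps it a 1 x 4 matrix
dim(mat1[2, , drop = FALSE])   # 1 4
```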

Exercise 2.3: creating matrices, accessing elements

  1. Create the 3x3 identity matrix “by hand”. To do so:
    1. create 3 vectors with zeros and ones in the appropriate spots.
    2. use rbind() or cbind() to combine them into the identity matrix.
    3. store the identity matrix as the object “I_mat”.
    4. R makes your life easy: type diag(3) in your console.
  2. Replicate the following Excel-matrix:


  3. Get the data for April and May by
    • including only the first and second row
    • excluding the third row
    • using the names

Bonus: Matrix algebra

  • R can do “regular” matrix algebra, and even lets you do operations that are not well-defined mathematically.

  • t(A) is the transpose of the matrix A.

# define matrix containing normal data
data.vec <- rnorm(9, mean = 0, sd = 1)
A <- matrix(data.vec, nrow = 3)
A # return A
           [,1]       [,2]       [,3]
[1,]  0.2716964  0.2826514  0.7622134
[2,] -1.5595010 -1.1265315 -0.2985537
[3,] -0.3099461 -1.3427173 -1.3471083
t(A) # return the transpose
          [,1]       [,2]       [,3]
[1,] 0.2716964 -1.5595010 -0.3099461
[2,] 0.2826514 -1.1265315 -1.3427173
[3,] 0.7622134 -0.2985537 -1.3471083
  • solve(A) returns the inverse of an invertible matrix.
solve(A) # return the inverse of A
          [,1]       [,2]       [,3]
[1,]  1.047873 -0.6030714  0.7265578
[2,] -1.884525 -0.1217632 -1.0393055
[3,]  1.637285  0.2601225  0.1264188
  • * does element-wise multiplication.
  • %*% does matrix multiplication.
# element-wise multiplication
A * solve(A) # NOT the identity matrix
           [,1]      [,2]       [,3]
[1,]  0.2847033 -0.170459  0.5537921
[2,]  2.9389180  0.137170  0.3102885
[3,] -0.5074700 -0.349271 -0.1702998
# matrix multiplication
A %*% solve(A) 
              [,1]          [,2]          [,3]
[1,]  1.000000e+00 -5.551115e-17  2.775558e-17
[2,]  0.000000e+00  1.000000e+00 -3.469447e-17
[3,] -4.440892e-16 -5.551115e-17  1.000000e+00
# yields the identity (up to a small error due to the
# numerical computation of the inverse)
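
Because of this small numerical error, a comparison with == is not a reliable check for the identity matrix. A sketch using all.equal(), which compares up to a tolerance, and solve(A, b) for linear systems. It uses a fixed, made-up matrix instead of random draws so that the results are reproducible; the names I3, b and x_sol are also made up.

```r
# a fixed invertible 3 x 3 matrix
A <- matrix(c(2, 1, 0,
              1, 3, 1,
              0, 1, 4), nrow = 3)
I3 <- diag(3)   # 3 x 3 identity matrix

# compare up to a small numerical tolerance
isTRUE(all.equal(A %*% solve(A), I3))          # TRUE

# solve(A, b) solves the linear system A x = b directly
b <- c(1, 2, 3)
x_sol <- solve(A, b)
isTRUE(all.equal(as.vector(A %*% x_sol), b))   # TRUE
```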

Lists

  • A list is a generic collection of objects.
  • Unlike vectors, the components can have different types (e.g., numeric and character).
  • Many functions output lists, so knowing how to access elements is very useful.
  • Generate a list with mylist <- list(name1=component1, name2=component2,...).
mylist <- list(num.vec = 1:3, city = "Maastricht") 
print(mylist)
$num.vec
[1] 1 2 3

$city
[1] "Maastricht"
  • Get names of the components with names(mylist).
  • You can access components with the $ (dollar sign) operator, e.g., mylist$name1, or by position with [[]].
mylist$city
[1] "Maastricht"
mylist[[2]] # same result
[1] "Maastricht"
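
A sketch of some further ways to work with list components, using mylist from above. The added component country is made up.

```r
mylist <- list(num.vec = 1:3, city = "Maastricht")

# names of the components
names(mylist)             # "num.vec" "city"

# double brackets also accept a name and return the component itself
mylist[["num.vec"]]       # 1 2 3

# single brackets return a (shorter) list instead
class(mylist["city"])     # "list"
class(mylist[["city"]])   # "character"

# elements of a component can be accessed as usual
mylist$num.vec[2]         # 2

# new components can be added with $
mylist$country <- "Netherlands"
length(mylist)            # 3

# str() gives a compact overview of the structure
str(mylist)
```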

Data frames


  • Data frames are simply data sets in R terminology.
  • So-called data files can contain multiple data sets.
  • We can create a data frame by data.frame() or transform a matrix mat into a data frame by as.data.frame(mat).
  • Many functions (e.g. lm() for regressions) need a data frame as input (see later sessions).
# generate a data frame
ID <- 1:4
hourly_wage <- rnorm(n = 4, mean = 20, sd = 1) # create 4 draws from N(20,1)
city <- c("Maastricht", "Eindhoven", "Amsterdam", NA)
dats <- data.frame(ID, hourly_wage, city) # combine the variables into a data frame
dats
  ID hourly_wage       city
1  1    20.18216 Maastricht
2  2    20.37298  Eindhoven
3  3    20.68820  Amsterdam
4  4    20.96485       <NA>
  • As with lists, we can access variables using the $ operator.
  • We can also add new variables using the $ operator.
  • View() opens a data-viewer. Very useful (but difficult to demonstrate on these slides).
dats$city # "city" is NA for ID 4.
[1] "Maastricht" "Eindhoven"  "Amsterdam"  NA          
dats$city[4] <- 'Tilburg' # assign city to ID 4
dats$educ <- c(12, 21, 9, 10)
dats
  ID hourly_wage       city educ
1  1    20.18216 Maastricht   12
2  2    20.37298  Eindhoven   21
3  3    20.68820  Amsterdam    9
4  4    20.96485    Tilburg   10
  • Using subset(data_frame,condition), we can easily get a subset of the original data frame where condition is TRUE.
# only keep individuals with more than 10 years of education
sub_dats <- subset(dats, educ > 10)
sub_dats
  ID hourly_wage       city educ
1  1    20.18216 Maastricht   12
2  2    20.37298  Eindhoven   21
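
The same subsets can be obtained with the bracket notation known from matrices. A sketch with a simplified version of dats (the random hourly_wage column is left out so that the results are reproducible); the names sub1, sub2 and df_from_mat are made up.

```r
dats <- data.frame(ID = 1:4,
                   city = c("Maastricht", "Eindhoven", "Amsterdam", "Tilburg"),
                   educ = c(12, 21, 9, 10))

# rows where the condition is TRUE, all columns
sub1 <- dats[dats$educ > 10, ]
sub1$ID                         # 1 2

# '>=' also keeps the individual with exactly 10 years; select two columns only
sub2 <- dats[dats$educ >= 10, c("ID", "educ")]
sub2$ID                         # 1 2 4

# number of rows and columns
nrow(dats)                      # 4
ncol(dats)                      # 3

# a matrix without column names gets the default names V1, V2, ...
df_from_mat <- as.data.frame(matrix(1:6, nrow = 2))
names(df_from_mat)              # "V1" "V2" "V3"
```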

Exercise 2.4

  • Create your own data frame:
    • create a vector ID that contains the sequence 1,2,…,100.
    • create a vector income that contains 100 random draws from N(10,1).
    • create a dummy female that is 1 for ID=1,...,50 and 0 otherwise. (hint: you can achieve this by using rep() twice and combining two vectors with c())
    • collect the variables in a data frame my_df.
    • inspect your data with View(my_df)
    • bonus: create a subset sub_my_df that contains only individuals with income larger than 10.

Bonus: teaching regression with R

  • To give students a feeling for the behavior of the least squares estimator, it can be very useful to use simulated data.
  • This allows teachers to visualize the effects of various quantities of interest, e.g., sample size, variation in the observed and unobserved variables, or omitted variables.
n <- 100 # set the sample size
X <- rnorm(n, mean = 1, sd = 2) # define the observed covariate X
epsilon <- rnorm(n, mean = 0, sd = 1) # define the model error
beta0 <- 1 # define true intercept
beta1 <- 2 # define true slope
Y <- beta0 + beta1 * X + epsilon # generate Y according to a linear model
# recall the formula in a bivariate model
beta1.hat <- cov(X,Y) / var(X)
beta0.hat <- mean(Y) - beta1.hat * mean(X)
# print estimators
beta0.hat
[1] 0.9852093
beta1.hat
[1] 1.974165
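
The hand-computed estimates can be checked against lm(), the built-in regression function mentioned earlier. A sketch that repeats the simulation with a fixed seed (the value 42 is arbitrary, so the numbers differ from those shown above); the name fit is made up.

```r
set.seed(42)   # arbitrary seed for reproducibility
n <- 100
X <- rnorm(n, mean = 1, sd = 2)
epsilon <- rnorm(n, mean = 0, sd = 1)
Y <- 1 + 2 * X + epsilon   # true intercept 1, true slope 2

# estimates from the formulas for the bivariate model
beta1.hat <- cov(X, Y) / var(X)
beta0.hat <- mean(Y) - beta1.hat * mean(X)

# the same estimates from lm()
fit <- lm(Y ~ X)
coef(fit)   # intercept first, then the slope on X

# both approaches agree up to numerical precision
all.equal(unname(coef(fit)), c(beta0.hat, beta1.hat))   # TRUE
```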

Bonus exercise: simulate!

  • Repeat the previous simulation, but
    • change the sample size
    • change the mean of X. What is the effect on beta1.hat?
    • change the mean of epsilon. What is the effect on beta0.hat?
    • change the variance of X and epsilon. What is the effect?