Introduction to R

Session 3

Session Overview

  1. Functions in R
  2. Bonus: Looking Inside Functions
  3. Objects Revisited
  4. Packages
  5. Bonus: Flow Control

Today

Stephan Smeekes

  • Professor of Econometrics at QE
  • Research interests: econometrics, time series, high-dimensional statistics, bootstrap, macro- and climate econometrics
  • Website
  • Sessions 2, 3 and 4

Ines Wilms

  • Associate Professor at QE
  • Research interests: econometrics, time series, high-dimensional statistics, forecasting, outlier robustness
  • Website
  • Sessions 1, 3 and 5

Functions

Functions

We have already encountered several functions in R:

Function Overview

A function consists of three parts:

Arguments

  • A function is called by specifying the function name followed by one or more comma-separated arguments in parentheses:
function_name(argument1 = value1, argument2 = value2, ...)
  • This gives the same:
function_name(argument2 = value2, argument1 = value1, ...)
  • This also gives the same:
function_name(value1, value2, ...)
  • This does not!
function_name(value2, value1, ...)
  • Why?

More about Arguments

  • If you don’t give argument names, R assumes arguments are given in the order as defined.

  • There are also default arguments, which do not always need to be specified.

  • Let us look at an example: calculating the logarithm using the log function:

fake_data <- c(1, 1, 2, 3, 5, 8, 13, 21)
1log_exp_data <- log(fake_data)
2log_2_data <- log(fake_data, base = 2)
1
Calculates the natural logarithm;
2
Calculates the logarithm with base 2.
  • Use the help files to understand function arguments

R help pages

  • provide detailed descriptions, arguments, return values, etc. for R packages, functions

  • include usage examples that demonstrate how to use the function effectively

  • are directly accessible in RStudio

R Help

Help files are displayed in the Help window in the bottom right window below

Accessing R Help

To open the help file for a specific R function, for instance the function log(), you can use either of the following two commands:

help(log)
?log

Both commands above to the same: they open the help file of the function you specified. Note that R searches for an exact! match and will only open the R Help page if the function exists.

Exercise 3.1: R Help

  • Execute the following code:
?log

R should open the help page for the function with name log. No need to go over the help page in detail, we explore this in detail soon.

  • Execute the following code:
?logarithm

R should tell you that no documentation for the requested function exists. It did not find an exact match to the function logarithm(). It does give you a recommendation to try something instead…

  • Namely to try this:
??logarithm

R performs a full-text search of the help system for all documentation that mentions the term logarithm.

While the search with the command ? is useful to directly access the help page of an R function you know exists, the broader help search with the command ?? is useful if you are not sure about the exact name of a function.

Understanding the log function

Execute the command ?log. It gives something like this:

Understanding the log function

Execute the command ?log. It gives something like this:

Understanding the log function

Execute the command ?log. It gives something like this:

Understanding the log function

Execute the command ?log. It gives something like this:

Understanding the log function

Execute the command ?log. It gives something like this:

Understanding the log function

Execute the command ?log. It gives something like this:

Understanding the log function

Execute the command ?log. It gives something like this:

Understanding the log function

Execute the command ?log. It gives something like this:

Input and output of the log function

  • Note how the log function preserves the object type of the input:
fake_data <- c(1, 1, 2, 3, 5, 8, 13, 21)
log(fake_data)
[1] 0.0000000 0.0000000 0.6931472 1.0986123 1.6094379 2.0794415 2.5649494
[8] 3.0445224
fake_data_matrix <- matrix(fake_data, nrow = 4)
log(fake_data_matrix)
          [,1]     [,2]
[1,] 0.0000000 1.609438
[2,] 0.0000000 2.079442
[3,] 0.6931472 2.564949
[4,] 1.0986123 3.044522

Exercise 3.2: Function Arguments

The function rnorm() can be used to simulate normally distributed data. The function mean() can be used to calculate the sample mean of the data, while the function sd() can be used to calculate the sample standard deviation. You will need to use the help files of these functions to complete the question.

  1. Simulate 100 numbers from a normal distribution with mean 0 and standard deviation 1 and store these in a vector called x1. Do this with the least amount of explicitly specified arguments as possible.

  2. Simulate 200 numbers from a normal distribution with mean 0 and standard deviation 5 and store these in a vector called x2. Do this with the least amount of explicitly specified arguments as possible.

  3. Simulate 80 numbers from a normal distribution with mean -6 and standard deviation 4 and store these in a vector called x3. Do this with the least amount of explicitly specified arguments as possible.

  4. Calculate the mean and standard deviation of each of the three series.

  5. Set the 10th value of x1 to NA. Calculate the mean and standard deviation again. What do you observe? Learn from the help function how we can fix this.

The … argument

  • If you opened the help function of mean() before, you saw the last argument is

...      further arguments passed to or from other methods”

  • ... is a special argument: it allows you to put in different arguments that are then passed on to an other function internally.

  • To know how to put them correctly, you may need to look at the other function; not always easy!

  • Since the function accepts any arguments in ..., an error message will typically be given at a ‘deeper level’, which can be very confusing. Or you even may not get an error message at all.

Exercise 3.2’: function arguments revisited

  1. Repeat the last part of the previous exercise, calculating the standard deviation for x1. Make sure to name the second argument of the function explicitly. Now intentionally misspell the name of the second argument and look at the error message.

  2. Now, do the same for calculating the mean; first do it correctly, then make an intentional mistake in the name of the argument.

  3. Can you explain the difference in results?

Useful functions: summary / descriptive statistics

  • often you want to have a quick look at your data. Here are some useful functions for this purpose:

  • For (numerical) vectors:

Function Description
length(v) Number of elements in \(v\)
max(v) Largest value in \(v\)
min(v) Smallest value in \(v\)
sum(v) Sum of the elements of \(v\)
prod(v) Product of the elements of \(v\)
mean(v) mean of the elements of \(v\)!
sd(v) Standard deviation of the elements of \(v\)

Useful functions: summary / descriptive statistics

  • Often you want to have a quick look at your data. Here are some useful functions for this purpose:

  • For matrices and data frames:

Function Description
nrow(D) Number of rows in \(D\)
ncol(D) Number of columns in \(D\)
head(D) Displays the first few rows of \(D\)
tail(D) Displays the last few rows of \(D\)
summary(D) Gives a summary of \(D\)
  • For numerical matrices and data frames:
Function Description
colSums(D) Sum of the elements in each column of \(D\)
rowSums(D) Sum of the elements in each row of \(D\)
colMeans(D) Mean of the elements in each column of \(D\)
rowMeans(D) Mean of the elements in each row of \(D\)

Bonus: Looking Inside Functions

Structure of functions

  • Functions in R have the following structure:
function_name <- function(argument1, argument2 = default_value, ...) {
  Function body           # the operations that the function should do
  return(function_output) # the output to be returned by the function
}

A simple example function

  • This function converts miles to kilometres:
miles_to_km <- function(miles = 1){
  km <- 1.609344 * miles
  return(km)
}
  • How many kilometres is 60 miles?
miles_to_km(60)
[1] 96.56064
  • The default value of miles is set to one, so executing the function without argument, gives you how many kilometres is equal to 1 mile:
miles_to_km()
[1] 1.609344

Multiple function outputs

  • R can only give one object as output.

  • if you have multiple outputs, you have to combine them in one object.

  • Often, the most natural choice for that is to use a list, as this can combine outputs of different nature.

An example function with multiple outputs

  • This function converts miles to kilometres and to metres:
miles_to_metric <- function(miles = 1){
  km_miles <- 1.609344 * miles
  m_miles <- 1000 * km_miles
  out <- list(km = km_miles, m = m_miles)
  return(out)
}
  • How many (kilo)metres is 60 miles?
miles_to_metric(60)
$km
[1] 96.56064

$m
[1] 96560.64

Why create your own functions?

  • You can perfectly get around R without ever creating your own functions.

  • But there are good reasons to do so:

    • Easy if you need the same code repetitively;

    • Decreases the probability of making mistakes, as you only need to write that piece of code once;

    • You can use your functions in different script later on.

  • Creating functions is a basic skill well worth investing in.

  • Note: make sure to first execute the lines containing your function before you use it (and execute again after an update to the code).

Exercise 3.3: Create your own function

  1. Load the courses data frame, which contains information about several courses at a school of business and economics. If the file is located in a folder called “data” within your working directory, you can load it as
load("data/courses.RData")
  1. Create a function that takes as input the courses data frame. The function output should be the total number of courses that use student tutors.
  • There are many ways to count the number of courses that require student tutors. Probably the easiest is to directly do calculations with the logical values: TRUE is treated as 1, FALSE as 0:
c(TRUE + FALSE, TRUE + TRUE, TRUE * FALSE)
[1] 1 2 0

Exercise 3.3: Create your own function

  1. We now want to extend the function to give a second output. This output should contain a data frame with two columns: the first is the course code, the second is the number of tutorials for each course.
  • The number of tutorials should be calculated as the number of students divided by the number of students per tutorial group, and then rounded up.
  • The number of students per tutorial group should be a second argument of the function, with a default value of 15.
  • To round up, we can use the function ceiling().

Objects revisited

Objects and Their Labels

  • Objects can be seen as a package: there is the actual content, and there is a label explaining the purpose of the object.
  • We can obtain the label from the function class()

  • Many functions act differently, depending on the class of the input.

The summary function

  • We can illustrate this using the summary() function.

  • Summary on vectors:

x <- rnorm(100)
summary(x)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-2.78503 -0.62320 -0.07553 -0.09672  0.53134  1.85892 

The summary function

  • Summary on data frames:
load("data/courses.RData")
summary(courses)
    Course          Coordinator            Code          
 Length:30          Length:30          Length:30         
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         
                  Programme  Period  Enrollments     StudentTutors  
 Econometrics          :10   1:5    Min.   : 57.00   Mode :logical  
 Economics             :10   2:6    1st Qu.: 65.75   FALSE:19       
 International Business:10   3:4    Median :290.00   TRUE :11       
                             4:6    Mean   :292.17                  
                             5:5    3rd Qu.:412.50                  
                             6:4    Max.   :610.00                  

The summary function

  • Similar ideas for functions such as plot(), print(), etc.

  • The output of summary is an object in itself:

summ_x <- summary(x)
class(summ_x)
[1] "summaryDefault" "table"         

In terms of content, this is just a (named) vector.

Behind the Scenes: the function.class structure

  • Behind the scenes, R works as follows:

Behind the Scenes: the function.class structure

  • If you start typing the summary() function name in RStudio, you will see that there are actually many summary functions …

  • … among which summary.data.frame() and summary.matrix()

  • In principle, we need not bother with the system, but it can be useful to keep in mind for two cases:

  1. Help files and documentation may differ depending on classes.

  2. Applying it to an object for which the class function has not been defined (“Why does summary() not work on my object X?”).

Exercise 3.4: Using the plot function for different objects

  1. Remember your vector x1 of normally distributed random variables created in Exercise 3.2.

  2. Make a basic plot of this vector using the simple command plot(x1).

  3. Now transform your vector x1 into a time series object using the function ts(). Save it as x1ts.

  4. Make a plot of this new object using the command plot(x1ts). Why does it look different from before?

  5. Make a plot of the variable Period from the courses data frame (see Exercise 3.3). Explain why this looks yet again different.

Packages

R packages

  • Until now, we have explored some basic functionality in R. But much of the functionality of R is extended by its big and active community.

  • These extensions are called R packages.

  • One of R’s defining features is the richness of its package management system:

    • Packages are constantly developed and adjusted by the large user community making many state-of-the-art methods quickly available

    • Packages can be installed for just about everything you want to do in statistics and data science

    • Packages offer pre-defined functions which makes it possible to use R without a deep understanding of the language and programming skills.

Overview of R packages

  • The standard distribution of R comes already with a number of packages.

  • A list of the currently installed packages can be obtained from the Packages window in the bottom right window below

  • You can install new packages depending on your needs

Installing R packages

  • To install a new R package, click on the Install button at the top of the Packages window.

You can then type the name of the R package you want to install. Let us install the package this.path

You can use this R code also directly to install a package. The package is now added to to your package list and ready to be used.

install.packages("this.path", dependencies = TRUE)

Loading R package

  • Installing an R package as we just did, only needs to be done once.

  • Yet each time you want to use the functionality of the R package in your current R session, you need to activate it. You can do this via the following command:

  • After activating the package nothing immediately happens, but R simply has access to the functionality that this.path offers.

Using functions from a package without loading the package

  • It is also possible to use functions from a package without actually loading the package.

  • Although generally you may not want to do this, it can sometimes be useful if you only need a single function; especially for a big package, you don’t need all functions stored in memory.

  • This can be done using the command

package_name::function_name()

which can be interpreted as “use the function function_name() from the package package_name.”

Exercise 3.5: R package

  • If you have not yet done, so, install the this.path package.

  • Why did we install the this.path package? Because it offers functionality to set your working directory, as an alternative to the menu-based approach we did before.

  • To set the working directory to the folder where your current R script is located, you can simply use:

setwd(this.path::here())
  • Can you explain every step in this line?

  • Can you think of an alternative way to achieve the same thing with the same functions?

  • It may be convenient to enter this at the top of your R script to ensure that you are always working in the directory where your R Script is located without having to type the name of the path!

Digging Deeper into Packages

  • We have now seen that installing packages is straightforward. But how do I know which package I need?

  • And how do packages function? Do they rely on other packages? Should I be bothered with this?

  • Let’s have a deeper look.

Finding good packages

  • Sometimes packages come recommended in books or articles.

  • If not, do a Google or ChatGPT search: ‘R package for reading Excel’; top search results are typically the most popular packages.

  • Unsure about the quality? Some advice:

    • Consider the authors. Are they respected academics? Or have a good track record of developing packages? These are good signs.

    • Read the manuals / help files. Do they make sense? Are they written by someone who understands the important aspects of the methods?

    • Install and try them, and see if they match your expectations.

More about packages

  • Most of the time you need not need to be bothered by understanding the deeper meanings of the package, and you can just install it directly. But in case you do want to have a look, here are some things you can pay attention to.

  • ‘Official’ packages are hosted on the Comprehensive R Archive Network (CRAN).

  • The package homepage can be found as https://cran.r-project.org/package=package_name

  • You can find the same info after installing a package by clicking on the package name, then on ‘DESCRIPTION file’.

  • Important fields:

    • Maintainer & Author: especially helpful for specialised packages, for which you expect the authors to be experts in the field.
    • URL: Is there dedicated documentation? What is the quality of the documentation (often correlates with the quality of the package)?
    • Imports, Depends & LinkingTo: the packages needed to make the package work (installation is normally taken care of automatically).
    • Suggests: packages that are not necessarily needed for using the package, but are needed for specific functionalities.

Checking out the bootUR package

Checking out the bootUR package

Checking out the bootUR package

Suggested packages

  • It is up to the author to decide what packages to suggest, and what packages to list as imports.

  • It is also up to the author to protect the user from strange error messages while not having installed a suggested package.

  • This is not always done properly!

  • Suggested packages are not always installed automatically!

Installing suggested packages

  • The command
install.packages("package-name")

does not install suggested packages. Use instead

install.packages("package-name", dependencies = TRUE)
  • In RStudio, make sure that the box “Install dependencies” is checked.

Bonus Exercise 3.6: Package information & installation

  1. Find out who the maintainer is of the package bigtime.

  2. Check if you have the suggested packages for bootUR installed. If you have ggplot2 installed, remove it from your installation. (This can be done by clicking at the right spot in RStudio’s Packages tab.)

  3. Install the package bootUR in the ‘naive’ way using install.packages("bootUR").

  4. This package has a function to plot missing values in a time series dataset; apply this function to the data set that comes with the package. (This obviously implies you need to find both the function and the dataset.)

  5. You will get an error message. Fix the error and repeat the steps above to produce a plot of the missing values.

Bonus: Installing packages from (other) source(s)

  • Sometimes you might find a package not available on CRAN, but hosted on different platforms, such as GitHub.

  • Such packages have to be installed from ‘source’.

  • Occasionally, CRAN may also ask you if you want to install a package from source, if a newer version is available than the standard binary package.

  • Installing from source is problematic on Windows and Mac, as it requires the installation of additional software for packages that are built on C/C++/Fortran code that needs compilation (see e.g. ?install.packages for details).

  • Advice for beginners: do not install from source!

Bonus Exercise 3.7: Installing packages from source

Important

This exercise should only be completed if you feel up for a challenge and want to get your system ready for installing from source. Please SKIP unless you know what you are doing!

  1. Go to https://github.com/Marga8/HDGCvar and follow the installation instructions to install the package HDGCvar from source. This should work on all systems.

  2. Go to https://github.com/RobertAdamek/desla and follow the installation instructions to install the package desla from source.

Caution

This will most likely not work and result in errors.

  1. Go to https://github.com/smeekes/bootUR and read the extended installation instructions for installing from source. Install the missing software and then try to install the package desla again.

Warning

On Windows, this should be relatively safe, but on Mac things still can go wrong. Be warned.

Bonus: Flow Control

Flow Control

  • Often, expressions (calculations, estimations, simulations, plots…) should only be executed under certain conditions and/or repeated multiple times.

  • In such cases, we need:

    • Conditional execution (if, else)

    • Loops (for, while)

Conditional Execution

  • The general syntax is as follows:
if ( expr ) {
  ## some code evaluated if expr == TRUE
} else {
  ## some other code
}
  • A simple example:
x <- 5
if (x > 0) {
  y <- 1       ## if x is positive, y should get value 1
} else {
  y <- -1
}
print(y)
[1] 1

Conditional Execution

  • Be careful with vector-valued expressions:
y <- c(5, 3, 2)

if (y > 4) {
    print("All elements of y a larger than 4")
}
Error in if (y > 4) {: the condition has length > 1

If you want to check whether all elements of a vector satisfy a certain property, the quantifiers any() and all() are useful:

if (all(y > 4) ) {        ## all() is "TRUE" if the argument is a vector of all TRUE values
    print("All y values are greater than 4")
} else {
    print("At least one element of y is less than or equal to 4!")
}
[1] "At least one element of y is less than or equal to 4!"

Conditional Execution with Multipe Options

  • There may be more than two cases to consider, for example: \[y = \cases{0 \text{ for } x \leq 0\\ 4 \text{ for } 0< x \leq 5\\ 6 \text{ otherwise.}}\]
if ( x <= 0 ) {
  y <- 0
} else if (x <= 5 ) {
  y <- 4
} else {
  y <- 6
}

Conditional Execution per Element

  • Sometimes we want to check vectors and return a separate value for each entry. For this, we use the ifelse() function:
x <- c(3, NA, 2)
## is.na() checks if the i put is a missing value
ifelse( is.na(x), "Missing", "Not Missing")
[1] "Not Missing" "Missing"     "Not Missing"
## an elaborate way to calculate absolute values!
x <- c(-3, 5, -8, 2)
ifelse( x < 0, -x, x)
[1] 3 5 8 2

Exercise 3.8: Conditional Execution

  • Write a function if_test that takes two objects, x and y, and checks whether x is numeric and y is of type character.

  • If both conditions are met, the function should print “Super”; otherwise, it should indicate which of the two objects x or y does not meet the required condition.

  • Test your function with:

if_test(5, "char")

if_test("abc", 2)

if_test(list(1,2), matrix("a", 3, 3))

Loops

Frequently, small code blocks need to be executed multiple times. Loops help with this. In R, there are multiple types of loops:

  • while( Condition) { Code }: The code Code executes as long as Condition is TRUE.

  • for (counter in values) { Code }: The code Code runs once for each element in the vector values. The variable counter takes on the value of each element during each iteration.

Loops: Examples

  • Some examples
vec <- c("One", "Two", "Three")
for (v in vec) {
    print(v)
}
[1] "One"
[1] "Two"
[1] "Three"
for (i in 1:5) {
  print(i + 2)
}
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7

Combining for and if

  • The combination of loops and conditional execution makes for a simple yet very powerful programming interface.

  • The major advantage is that it allows to automate certain repetitive operations.

  • This code checks for every column of a matrix (or data frame) dataset whether it is skewed to the right, and if so, applies a log transformation. (Which statistically is not the brightest idea…)

  • We use the moments package for calculating the skewness.

library(moments)
n_col <- ncol(dataset)      # check how many columns there are
transformed_data <- dataset # copy the data into a new matrix holding the transformed data
for (i in 1:n_col) {        # loop over all columns
    skew <- skewness(dataset[, i])  # compute the skewness
    if (skew > 1) {
        transformed_data[, i] <- log(dataset[, i]) # log-transform the data
    }
}

Exercise 3.9: Impute Missing Values

  • Write a function that checks for every element of a vector whether it is a missing value. If a missing value is found, it is to be imputed as the average of the value before it and the value after it.

  • If the first value in the vector is missing, it should get the value of the second element. If the last value is missing, it should get the value of the preceding element.

  • To keep things simple, we assume that there are never two consecutive missing values, such that only the rules above have to be programmed.

  • Although there are many ways to create such a function, we focus on using a for loop and if statements.

  • Test your function on the following vector:

c(NA, 2, 3, NA, 5, NA)