Introduction to R

Session 4

Session Overview

  1. Inputs and Outputs
  2. Graphics
  3. Bonus: Advanced Graphics with ggplot

Today

Nalan Bastürk

  • Associate Professor at QE
  • Research interests: econometrics, Bayesian statistics, financial econometrics
  • Website
  • Sessions 4 and 5

Stephan Smeekes

  • Professor of Econometrics at QE
  • Research interests: econometrics, time series, high-dimensional statistics, bootstrap, macro- and climate econometrics
  • Website
  • Sessions 2, 3 and 4

Inputs and outputs of your R session

  • Common inputs / outputs of an R session are datasets or R scripts.
  • For this meeting we focus on datasets as inputs to the R session, loading data and saving data.
  • The file format (csv, RData, xls, Stata …) and the directory of the files are important to keep in mind.
  • We will go over a few options, potential issues, and how to avoid the need to type in a long directory name when managing the input and output of the R session.

Working directories and R projects

  • Whenever we provide R with a file name, it can include the full path on the computer.
  • An alternative is to work on a specified directory.
  • Another alternative is to work within a ‘project’, so that all paths can be given relative to the project folder.
  • If we do not provide any path, R will use the current “working directory” for reading or writing files. It can be obtained with the command getwd().

Using the correct directory to get input / output of the R session

  • Navigating through the menus in RStudio is easy (click and go), but requires using the menu every time the user runs the code.

  • Go to Session -> Set Working Directory. Two convenient options are:

    • Choose Directory…: Choose the directory yourself

    • To Source File Location: Set the working directory to the directory where your R Script (the source file) is saved

Using the correct directory to get input / output of the R session

  • An alternative is to use the function setwd() at the beginning of your script. This line then has to be changed when the code runs on another machine.
 setwd("~/R_training")
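A minimal sketch of how the working directory affects relative file names, using base R only. The temporary folder and the file name example.csv are placeholders chosen for illustration; they are not part of the course material.

```r
# Remember the current working directory so that it can be restored later
old_wd <- getwd()

# Switch to a temporary folder (a stand-in for your own project folder)
setwd(tempdir())

# A relative file name is now resolved against the new working directory
write.csv(data.frame(MONTH = 1:3), "example.csv", row.names = FALSE)
found_in_wd <- file.exists("example.csv")

# file.path() builds a full path without changing the working directory
full_path <- file.path(tempdir(), "example.csv")

# Restore the original working directory
setwd(old_wd)
print(found_in_wd)
```

Storing the old directory and restoring it at the end keeps the rest of a script unaffected by the change.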

Using the correct directory to get input / output of the R session

  • Recall: To set the working directory to the folder where your current R script is located, you can simply use:
library(this.path)
setwd(here())
  • Recall: We could also make the function call explicitly from the package, without loading it first:
setwd(this.path::here())

Types of input or data that can be loaded in R

R interacts with files in several ways.

  • You can load, save, import, or export a data file.
  • You can save a generated figure as a graphics file or store regression tables as text, spreadsheet, or LaTeX tables.
  • You can load or save the full workspace (environment) you are working with, to continue another time.

Datasets can come in different formats.

  • RData files: Files that can be loaded directly in R.
  • Other file formats (SPSS, csv, xls, …) can also be loaded in R. This often requires the use of packages.

Loading RData files

  • RData is a file format specific to R.
  • They can store a single object or several objects.
  • These files are the easiest to manage as input or output in R, since they don’t require library calls.

Load climate data from RData format:

  • The load() function is used to load data in RData format.
  • load() restores all objects stored in the input RData file, under the names they were saved with.
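A small sketch of the save / load round trip, assuming two toy objects (temps and cities are hypothetical names, not objects from the course data) and a temporary file instead of the data folder.

```r
# Two toy objects (hypothetical names, for illustration only)
temps  <- c(10.6, 7.1, 10.2)
cities <- c("EINDHOVEN", "MAASTRICHT")

# Store both objects in a single RData file
rdata_file <- tempfile(fileext = ".RData")
save(temps, cities, file = rdata_file)

# Remove the objects, then restore them from the file
rm(temps, cities)
loaded_names <- load(rdata_file)

# load() returns (invisibly) the names of the objects it restored
print(loaded_names)
```

Because load() restores objects under their original names, it can silently overwrite objects with the same name that already exist in the workspace.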

Example data: average temperatures for Maastricht and Eindhoven during 2024, on the 2nd day of each month

load("data/climate_long.Rdata")
print(long_data)
             NAME MONTH TEMP
99091   EINDHOVEN     1 10.6
99122   EINDHOVEN     2  7.1
99151   EINDHOVEN     3 10.2
99178   EINDHOVEN     4  8.9
99207   EINDHOVEN     5 18.5
99238   EINDHOVEN     6 15.0
99268   EINDHOVEN     7 15.0
99299   EINDHOVEN     8 20.7
99329   EINDHOVEN     9 22.6
99359   EINDHOVEN    10 10.0
99390   EINDHOVEN    11 10.5
99415   EINDHOVEN    12  9.7
99801  MAASTRICHT     1  9.7
99832  MAASTRICHT     2  5.9
99861  MAASTRICHT     3  9.9
99888  MAASTRICHT     4  9.0
99917  MAASTRICHT     5 15.7
99948  MAASTRICHT     6 14.1
99978  MAASTRICHT     7 14.9
100009 MAASTRICHT     8 20.5
100039 MAASTRICHT     9 21.6
100069 MAASTRICHT    10 10.3
100100 MAASTRICHT    11 10.2
100125 MAASTRICHT    12  9.2

Notice the data are loaded as a dataframe:

is.data.frame(long_data)
[1] TRUE
typeof(long_data)
[1] "list"

There is a distinction between a long and wide dataframe:

load("data/climate_wide.Rdata")
print(wide_data)
   MONTH EINDHOVEN MAASTRICHT
1      1      10.6        9.7
2      2       7.1        5.9
3      3      10.2        9.9
4      4       8.9        9.0
5      5      18.5       15.7
6      6      15.0       14.1
7      7      15.0       14.9
8      8      20.7       20.5
9      9      22.6       21.6
10    10      10.0       10.3
11    11      10.5       10.2
12    12       9.7        9.2
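The two layouts hold the same information. As a rough sketch, base R's reshape() can turn a long data frame into a wide one. The toy data frame below copies the first three months from the tables above; the object names long_toy and wide_toy are made up for this illustration.

```r
# Long format: one row per city and month (first three months only)
long_toy <- data.frame(
  NAME  = rep(c("EINDHOVEN", "MAASTRICHT"), each = 3),
  MONTH = rep(1:3, times = 2),
  TEMP  = c(10.6, 7.1, 10.2, 9.7, 5.9, 9.9)
)

# Wide format: one row per month, one temperature column per city
wide_toy <- reshape(long_toy, idvar = "MONTH", timevar = "NAME",
                    direction = "wide")

# reshape() names the new columns TEMP.<city>; drop the prefix
names(wide_toy) <- sub("^TEMP\\.", "", names(wide_toy))
print(wide_toy)
```

The long layout is usually more convenient for grouping by city (as in the ggplot2 examples later), while the wide layout makes it easy to plot one city against another.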

Loading other formats of data in R

Option 1: Using menus within RStudio is the easiest (click and go) but requires using the menu every time the user runs the code.

Loading other formats of data in R

Option 1: Using menus within RStudio (cont’d)

Loading other formats of data in R

Advice for option 1:

  • Copy the command that appears after loading the data from the menus.

Loading other formats of data in R

Advice for option 1 (cont’d):

  • Paste the command on top of your script.
  • This way, next time you do not need the menu navigation.
library(readxl)
climate <- read_excel("data/climate_wide.xlsx")
  • You can view the data by clicking on it in the ‘Environment’ pane at the top right of RStudio.

General way of importing and exporting of other data formats

  • Using the correct libraries for different data formats can be tedious.
  • The R package rio is very convenient for data import and export. It figures out the data format from the file name extension, e.g. .csv for CSV, .dta for Stata, or .sav for SPSS data sets.
  • For a complete list of supported formats, see help(rio).
  • It calls an appropriate package to do the actual importing or exporting.

Loading Stata and other file types

library('rio')
import("data/climate_wide.dta")

Loading csv files

import("data/climate.csv")
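For CSV files specifically, base R can also read and write without any extra package. A minimal sketch, assuming a toy data frame and a temporary file (neither is part of the course data):

```r
# Toy data frame with the first three months for Maastricht
toy <- data.frame(MONTH = 1:3, MAASTRICHT = c(9.7, 5.9, 9.9))

# Write to a CSV file; row.names = FALSE avoids an extra index column
csv_file <- tempfile(fileext = ".csv")
write.csv(toy, csv_file, row.names = FALSE)

# Read the file back into a data frame
toy_back <- read.csv(csv_file)
print(toy_back)
```

Unlike RData files, a CSV file stores only the table itself, so the object name is chosen when reading (here toy_back).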

Loading data from APIs

  • It is possible to automatically load data from a web source using APIs.
  • An API (Application Programming Interface) acts as a bridge between your R code and an external data source: a website, database, or an online platform with permissions.
  • Advantages: Automation, efficiency and real-time access.
  • Disadvantages: No offline access to data (can be important for replication), dependency on external services (API can go down).
  • Suggestion: Save the downloaded data in RData format to mitigate disadvantages.
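The suggestion above can be sketched as a simple caching pattern: download only if no local copy exists, otherwise load the saved copy. In this sketch fetch_prices() is a hypothetical stand-in for a real API call (it returns three closing prices typed in by hand), and the cache location is a temporary file.

```r
# Hypothetical stand-in for an API call; a real script would download here
fetch_prices <- function() {
  data.frame(date  = as.Date("2024-01-02") + 0:2,
             close = c(185.64, 184.25, 181.91))
}

# Return the data, using a local RData copy when one is available
get_prices <- function(cache_file) {
  if (file.exists(cache_file)) {
    load(cache_file)                  # restores the object `prices`
  } else {
    prices <- fetch_prices()          # only reached on the first call
    save(prices, file = cache_file)   # keep an offline copy
  }
  prices
}

cache_file  <- tempfile(fileext = ".RData")
first_call  <- get_prices(cache_file)   # "downloads" and saves
second_call <- get_prices(cache_file)   # reads the saved copy
```

With this pattern the script still runs when the external service is unavailable, and a replication uses exactly the data that was saved.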

Example: AAPL daily prices from Yahoo Finance

  • As an example we will download AAPL daily prices from Yahoo Finance.
  • Convert loaded data to a vector.
  • Save the data as an RData file for future offline access.
library(quantmod)
# Get Apple Inc. (AAPL) stock data from Yahoo Finance
getSymbols("AAPL", src = "yahoo", from = "2024-01-01", to = "2025-06-13")
# Save the workspace (including AAPL) as an RData file
save.image("data/AAPL.RData")

Example: AAPL daily prices from Yahoo Finance (cont’d)

  • View and plot data.
# View the first few rows
head(AAPL)
           AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume AAPL.Adjusted
2024-01-02    187.15    188.44   183.89     185.64    82488700      184.2904
2024-01-03    184.22    185.88   183.43     184.25    58414500      182.9105
2024-01-04    182.15    183.09   180.88     181.91    71983600      180.5876
2024-01-05    181.99    182.76   180.17     181.18    62303300      179.8628
2024-01-08    182.09    185.60   181.50     185.56    59144500      184.2110
2024-01-09    183.92    185.15   182.73     185.14    42841800      183.7941

Outputs

  • Outputs work very similarly to the inputs above.
  • The most relevant output formats are the R output formats.
  • save() saves a selection of objects as an .RData file.
  • save.image() saves the entire workspace as an .RData file.

Exercise 4.1: Saving data

  • Save your current workspace using function save.image().
  • Save only one variable in the workspace using function save().
  • Make a list of two variables from long_data, and save this list using function save().

R Base Graphics

  • We will cover R base graphics.
  • Other alternatives include ‘ggplot2’…

To create plots with R’s standard graphics package, there are high-level and low-level plotting functions.

  • High-level functions generate a new graphic (and open a device).
  • Low-level functions add elements to an existing graphic.

Simple plots

plot(long_data$TEMP)     ## Plotting a single variable

Simple plots

plot(x = long_data$MONTH, y = long_data$TEMP)     ## Scatter plot wrt month

Multiple plots

par(mfrow = c(1,2)) # multiple plots in a row
plot(long_data$TEMP)     ## Plotting a single variable
plot(x = long_data$MONTH, y = long_data$TEMP)     ## Scatter plot wrt month

Functions calling methods

Notice that plot() is a generic function that calls methods.

It performs different operations depending on the class of the object passed to it. (We study the lm() function in detail in the next session!)

ols_result <- lm(TEMP~MONTH, data = long_data)
plot(ols_result)
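To illustrate how a function can behave differently depending on the class of its argument, here is a small sketch of the same mechanism (S3 method dispatch) with a made-up generic. The function describe() and its methods are hypothetical and exist only for this example.

```r
# A generic function: it only decides which method to call
describe <- function(x, ...) UseMethod("describe")

# Methods are named <generic>.<class>
describe.numeric    <- function(x, ...) "a numeric vector"
describe.data.frame <- function(x, ...) "a data frame"
describe.default    <- function(x, ...) "something else"

res_numeric <- describe(c(1.5, 2.5))          # uses describe.numeric
res_df      <- describe(data.frame(a = 1))    # uses describe.data.frame
res_other   <- describe("some text")          # falls back to describe.default
print(c(res_numeric, res_df, res_other))
```

In the same way, plot(ols_result) ends up in a method written for objects of class "lm"; methods(plot) lists the plot methods that are currently available.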

Exercise 4.2: Load data from Yahoo Finance, and see how plot function behaves.

  • Plot the whole dataset
  • Plot a selected column, for example closing prices AAPL$AAPL.Close
  • Comment on the x axis of the data and how it is different from the plots we considered earlier.
  • The difference arises because quantmod returns the data as an xts time series object, for which plot() uses a different method.

Creating and saving a graph

load("data/climate_wide.Rdata")
pdf("figures/plot_data_short.pdf")
hist(wide_data$MAASTRICHT, breaks = 20)
dev.off()
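A variation on the example above that does not need the data folder: the Maastricht temperatures are typed in from the table shown earlier, the output goes to a temporary folder, and the figure size is set explicitly. The file name maastricht_hist.pdf is a placeholder.

```r
# Maastricht temperatures copied from the wide data shown earlier
maastricht <- c(9.7, 5.9, 9.9, 9.0, 15.7, 14.1, 14.9, 20.5, 21.6, 10.3, 10.2, 9.2)

out_file <- file.path(tempdir(), "maastricht_hist.pdf")

pdf(out_file, width = 6, height = 4)   # open the file device; size in inches
hist(maastricht, breaks = 20,
     main = "Temperatures in Maastricht", xlab = "Temperature")
dev.off()                              # close the device, which writes the file

file.exists(out_file)
```

Nothing appears on screen while the pdf device is open; the plot commands are sent to the file until dev.off() is called.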

Customizing Graphics

  • Adding lines or points to an existing plot
  • When a file device such as pdf() is open, the function dev.off() is called after all the plotting, to save the file and return control to the screen.
load("data/climate_wide.Rdata")
plot(wide_data$MAASTRICHT) # temperatures in Maastricht
lines(wide_data$EINDHOVEN) # temperatures in Eindhoven in lines

Customizing Graphics

  • The plot() function takes many arguments that can change the layout of the plots. See ?par for all graphical options; there are many!

  • Some examples:

    • col: color of lines / points
    • lty, lwd: Line type and thickness
    • pch: Point type (0-25)
    • main, sub: Title, subtitle
    • xlab, ylab: x and y axis labels
    • log, xlog and ylog for logarithmic scales
    • xlim, ylim: x and y axis limits (for overriding R’s default choices)
    • mfcol, mfrow: Multiple plots in one graphics window (column-wise/row-wise)
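A short sketch combining several of these arguments in one call. The Eindhoven temperatures are typed in from the table shown earlier; the particular colors, line types and limits are arbitrary choices for illustration, and the plot is written to a temporary PDF so that the code also runs without a screen device.

```r
month     <- 1:12
eindhoven <- c(10.6, 7.1, 10.2, 8.9, 18.5, 15.0, 15.0, 20.7, 22.6, 10.0, 10.5, 9.7)

plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)
plot(x = month, y = eindhoven,
     type = "b",                      # both points and lines
     col  = "darkgreen",              # color of lines and points
     lty  = 2, lwd = 2,               # dashed line, double thickness
     pch  = 19,                       # filled circle
     main = "Temperatures in Eindhoven", sub = "2024",
     xlab = "Month", ylab = "Temperature",
     ylim = c(0, 25))                 # override the default y axis limits
dev.off()
```

Leaving out the pdf() and dev.off() lines shows the same plot in the RStudio plot window instead.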

Low-Level Graphic Functions

  • lines: Draw lines
  • abline: Quickly add horizontal, vertical lines, and lines using equation \(y = bx + a\)
  • points: Add points
  • arrows: Add arrows
  • title: Add a title
  • legend: Add a legend
  • text: Add text at \((x,y)\) coordinates
  • mtext: Add text with positional specification like side=1,...,4
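A sketch of how these low-level functions build on one high-level plot() call. The temperatures are typed in from the tables shown earlier; the linear fit, the label text and the styling are arbitrary choices for illustration, and the output goes to a temporary PDF.

```r
month      <- 1:12
maastricht <- c(9.7, 5.9, 9.9, 9.0, 15.7, 14.1, 14.9, 20.5, 21.6, 10.3, 10.2, 9.2)
eindhoven  <- c(10.6, 7.1, 10.2, 8.9, 18.5, 15.0, 15.0, 20.7, 22.6, 10.0, 10.5, 9.7)

plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)

# High-level call: creates the plot
plot(month, maastricht, type = "l", col = "blue",
     xlab = "Month", ylab = "Temperature",
     ylim = range(maastricht, eindhoven))

# Low-level calls: add elements to the existing plot
lines(month, eindhoven, col = "red", lty = 2)
points(month, eindhoven, col = "red", pch = 4)

fit <- lm(maastricht ~ month)                              # simple linear fit
abline(a = coef(fit)[1], b = coef(fit)[2], col = "grey40") # line y = bx + a

warmest <- which.max(maastricht)
text(x = warmest, y = maastricht[warmest], labels = "warmest", pos = 4)
mtext("Climate data, 2024", side = 3)                      # text in the top margin
legend("topleft", legend = c("Maastricht", "Eindhoven", "Linear fit"),
       col = c("blue", "red", "grey40"), lty = c(1, 2, 1))

dev.off()
```

The order matters: low-level functions fail with an error if no plot has been created yet.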

Exercise 4.3: Plot temperatures for Maastricht

We want to visualize the temperatures in the climate data specifically for Maastricht. First, make a basic plot of temperatures in Maastricht, then customise the plot in the following ways:

  1. The title of the X-axis should say ‘Month’, the title of the Y-axis ‘Average Temperature’.

  2. Make the plot a line plot with a blue line. (Hint: specifying the colour literally as "blue" works)

  3. Make the tick marks appear on the inside of the figure rather than the outside.

  4. Calculate the average temperature.

  5. Add a horizontal line at the average temperature.

You will need to consult the help files for this exercise; see it therefore more as an exercise in navigating R’s help system than as an exercise in plotting (which we will cover in more detail later).

You may want to ask ChatGPT for help.

Manually saving R plots

  • Use the plot functions without opening a file device such as pdf().
  • Use the ‘Plots’ pane in RStudio (Export) to save the image manually.

Different plot types

You can manually save graphs in several formats.

Best practice is to save a graph through a device such as pdf or similar:

  • pdf(): Adobe PDF (easily integrated into LaTeX).
  • svg(): Scalable Vector Graphics (commonly used on websites).
  • png(), jpeg(), tiff(), bmp(): Various bitmap formats.
jpeg("figures/MaasTemperature.jpeg")
plot(x = wide_data$MONTH, y = wide_data$MAASTRICHT)  # month on the x axis, temperature on the y axis
dev.off()

A more complex example for plotting data over time gradually

  • Especially when data is large, a gradual illustration of data over time can be handy.
  • In this exercise, we plot the temperatures in Maastricht as if the data are becoming available gradually.
  • For this, we will use a loop that iterates over time points (months).
  • The concept of a loop was only briefly mentioned in Session 2.
# Start with a plot of the first 3 observations
plot(1:3, wide_data$MAASTRICHT[1:3],
     xlab = "Time", ylab = "Value", main = "Adding Data Over Time", type = 'l')

# Gradually plot more and more of the data using a `for loop`
for (i in 4:nrow(wide_data)) {
  plot(1:i, wide_data$MAASTRICHT[1:i], # notice index i is increasing the number of plotted points
     xlab = "Time", ylab = "Value", main = "Adding Data Over Time", type = 'l')
}

Exercise 4.4: Make a continuous plot of temperatures

  • Use the last loop example to plot temperatures gradually.
  • Start with an initial number of 3 observations, as in the example.
  • Make sure that the range of the x and y axes match with the whole dataset in the first plot.
  • Within the for loop, add lines to the first plot, instead of plotting the data again.
  • Pause the program within the for loop to simulate the “gradual” effect.
  • You can use ChatGPT or help functions for ?ylim, ?Sys.sleep

Bonus: Advanced Graphics using ggplot

  • R has several advanced graphics packages such as ggplot2 (see book by Hadley Wickham), plotly, Rgnuplot,…
  • We will focus on ggplot2 as this is widely used.
  • The ggplot2 package in R lets you build complex plots from data in a structured and layered manner.
  • The plot is based on a specified data frame, aesthetic mappings (like x and y axes), and layers such as points, lines, or bars (geom_* functions).
  • Advantages: flexible, consistent syntax, and fancy graphics.
  • Disadvantages: Very different syntax, e.g. compared to the base R plotting functions.

Typical ggplot help function sections

Each section of the help file and what it gives you:

  • Title: Short description
  • Description: What the function does
  • Usage: The function’s syntax
  • Arguments: What each argument means
  • Details: Extra explanation and special behavior
  • Aesthetics: Visual mappings (like x, y, color, etc.)
  • Examples: Example code to learn from
?ggplot2::geom_line

Using ggplot: Grammar of graphics

  • Graphics with ggplot2 are built step-by-step, adding new elements as layers
  • A plot starts with the function ggplot(). This is the main object we will add layers to.
  • Each layer is added with a plus sign (+) between layers. This allows for extensive flexibility and customization of plots.
  • Three components need to be specified for the plot:
    • data: data to feed in
    • aesthetics: how you will connect variables (columns) from your data to a visual dimension. Horizontal positioning, size, color etc.
    • geometries: This is a specification of what object will actually be drawn on the plot. This could be a point, a line, a bar, etc.

Using ggplot: Grammar of graphics (cont’d)

  • Several optional additional layers customize the graphics and help make flexible graphs.
    • Scales: How a variable is mapped to its aesthetic. Can be linear, in log scale etc.
    • Statistical transformations: Specification of whether and how the data are combined/transformed before being plotted.
    • Coordinate system: Specification of how the position aesthetics (x and y) are depicted, for example cartesian or polar coordinates.
    • Facet: This is a specification of data variables that partition the data into smaller “sub plots”, or panels.
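A sketch that puts the required components (data, aesthetics, geometries) together with two of the optional ones (a scale and a facet). It assumes the ggplot2 package is installed; the toy data frame copies the first three months from the long data shown earlier, and the names toy_long and p are made up for this illustration.

```r
library(ggplot2)

# Toy long-format data: first three months for both cities
toy_long <- data.frame(
  NAME  = rep(c("EINDHOVEN", "MAASTRICHT"), each = 3),
  MONTH = rep(1:3, times = 2),
  TEMP  = c(10.6, 7.1, 10.2, 9.7, 5.9, 9.9)
)

p <- ggplot(toy_long, aes(x = MONTH, y = TEMP)) +  # data and aesthetics
  geom_line() +                                    # geometry: lines
  geom_point() +                                   # geometry: points
  scale_x_continuous(breaks = 1:3) +               # scale: ticks at whole months
  facet_wrap(~ NAME) +                             # facet: one panel per city
  labs(x = "Month", y = "Temperature")
```

The plot is stored in the object p and is only drawn when p is printed, for example by typing p or print(p) in the console.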

Example ggplot (cont’d)

  • Help files of ggplot2 are also slightly different from the standard R help files
?ggplot # a bit complicated help file
help(package = "ggplot2") # a nicer list of all layer functions, see 'geom_line'

Example: Histogram of temperatures in Maastricht

  • Notice the syntax difference in parentheses and use of + for layers
  • Notice the data wide_data is a data frame with an index
library('ggplot2')
load("data/climate_wide.Rdata") # load data
wide_data$index <- 1:nrow(wide_data) # add an index column
ggplot(wide_data, aes(x = MAASTRICHT)) +
  geom_histogram(bins = 20) # geom_histogram() takes bins or binwidth; it has no bandwidth argument

Example: Plot of temperatures in Maastricht

  • We will make a similar plot to exercise 4.3: a plot of temperatures in Maastricht
  • Line width and color are set as fixed values inside geom_line(); only x and y are mapped in aes().
  • The geometry is defined by the function geom_line (line plot)
load("data/climate_wide.Rdata") # load data
wide_data$index <- 1:nrow(wide_data) # add an index column
ggplot(wide_data, aes(x = index, y = MAASTRICHT)) +
  geom_line(color = "blue", linewidth = 1) # adds a blue line

Example: Possible confusion with aesthetics

  • An easy mistake to make: mapping a constant inside aes().
  • Below, color is defined inside aes(), unlike on the earlier slide.
  • Check the unexpected legend that appears in the plot, and note that the line is red rather than blue: inside aes(), "blue" is treated as a data value, not as a color.
  • See the help file for ggplot2::geom_line.
  • There is a lot of information in the help file as a result of flexibility, but the proper use is explained.
load("data/climate_wide.Rdata") # load data
wide_data$index <- 1:nrow(wide_data) # add an index column
ggplot(wide_data, aes(x = index, y = MAASTRICHT, color = "blue")) +
  geom_line(linewidth = 1) # adds a line; its color comes from the default scale

Example: Adding additional geometrics and aesthetics

  • Adding layers, such as a second line, is simple:
load("data/climate_wide.Rdata") # load data
wide_data$index <- 1:nrow(wide_data) # add an index column
ggplot(wide_data, aes(x = index, y = MAASTRICHT)) +
  geom_line(color = "blue", linewidth = 2) +
  geom_line(aes(x = index, y = EINDHOVEN), linewidth = 0.3) # Adds Eindhoven data to the last plot

Example: Adding more layers

  • Adding more layers for labels and legends
  • A legend only appears when color is mapped inside aes(); the labels used there are matched to colors in scale_color_manual().
load("data/climate_wide.Rdata") # load data
wide_data$index <- 1:nrow(wide_data) # add an index column
ggplot(wide_data, aes(x = index)) +
  geom_line(aes(y = MAASTRICHT, color = "Maastricht"), linewidth = 2) +
  geom_line(aes(y = EINDHOVEN, color = "Eindhoven"), linewidth = 0.3) + # Adds Eindhoven data to the last plot
  labs(
    x = "X Axis", y = "Y Axis", color = "Legend Title",   # Axis labels and legend title
    title = "Line Plot with Two Variables"
  ) +
  scale_color_manual(values = c("Maastricht" = "blue", "Eindhoven" = "red")) +  # Custom colors
  theme_minimal()  # Minimal theme for a clean look

Example ggplot: Color points for Maastricht and Eindhoven temperatures

  • For this example, we will use the long dataframe since we will plot values for both cities.
  • This is a more complex graph: the aesthetics defined inside the geom color the points according to a variable in the data frame.
library('ggplot2')
load("data/climate_long.Rdata") # load data
long_data$index <- 1:nrow(long_data) # add an index column
ggplot(data = long_data) +
  geom_point(aes(x = index, y = TEMP, color = NAME))

Further references for ggplot2

R for data science

Collaborative and Reproducible Data Science in R

Exercise 4.5: Use ggplot2 for more flexible and advanced plots

  • Make the same plot as in Exercise 4.3 using package ‘ggplot2’
  • We suggest using the wide data
  • Use help functions for geom_point, geom_hline, labs.
  • You can try using ChatGPT, but the outcome is difficult to understand if you are not familiar with ggplot to begin with.