Session 5
In the last session, we learned how to load data from various sources into R.
Today’s first part is about how to manipulate data in R: we will learn how to select variables, subset observations, and transform and add variables.
We will work with a fictitious data set of student grades. Let us start by loading the data:
data_grades <- read.table("data/grades.csv",
header = TRUE, sep = ",", stringsAsFactors = FALSE)
We begin by selecting interesting variables from a data set. For our grades data set, we want to preserve the information about ID, Name, and Exam_Score, and drop all other information.
Variables can be selected by name, after which we inspect the first and last three rows in the data set:
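For example, selection by name could look like this (the object name data is our choice; head() and tail() display the first and last rows):

```r
# select variables by name
data <- data_grades[, c("ID", "Name", "Exam_Score")]
head(data, 3)  # first three rows
tail(data, 3)  # last three rows
```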
Or by variable indexes
data1 <- data_grades[, c(1, 2, 7)]
though this is not that convenient unless you know the column numbers of the variables you want to select.
A more convenient alternative is to use the following function:
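One such option is the subset() function with its select argument (the object name data2 is our choice):

```r
# select variables with subset()
data2 <- subset(data_grades, select = c(ID, Name, Exam_Score))
```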
The objects data, data1 and data2 are all identical, so you can use whichever approach you prefer!
Next, we want to subset the data set, i.e. preserve interesting rows while removing the others.
For the grades data set, we might be interested in information about students in tutorial group 1:
# select tutorial 1 students only
data_tutorial1 <- data_grades[data_grades$Tutorial == 1, ]
or alternatively:
data_tutorial1 <- subset(data_grades, Tutorial == 1)
Subsetting also works with character values. For instance, to retrieve only the information for female students:
# select female students only
data_females <- data_grades[data_grades$Gender == 'Female', ]
Inspect your new data sets!
Use the grades data set.
1. Generate a data set that contains information about the student ID, student name, their tutorial group, participation grade and their exam score.
2. Further reduce the data set obtained under 1 to only display information of students in tutorial group 4.
3. Further reduce the data set obtained under 2 to only display information of students with an exam score of more than 80. How many such students are there?
Let us continue with further data manipulations. The variable Tutorial is currently an integer:
class(data_grades$Tutorial)
[1] "integer"
but it should be a factor (a categorical variable). This can be easily changed in R:
data_grades$Tutorial <- as.factor(data_grades$Tutorial)
after which you can inspect its new class:
class(data_grades$Tutorial)
[1] "factor"
When inspecting the variable itself, R now mentions the different levels of the factor:
data_grades$Tutorial
[1] 2 3 4 2 1 4 3 1 1 3 4 2 4 4 3 1 3 4 1 2 1 1 3 1 3 3 2 3 2 1 4 4 4 2 3 2 2 4
[39] 4 2 1 3
Levels: 1 2 3 4
which you can also directly retrieve via:
levels(data_grades$Tutorial)
[1] "1" "2" "3" "4"
Sometimes we want to add a variable to an existing data set. For instance, we want to add the exam score on a scale of 10 instead of 100. To add a new variable, use the $ operator and specify a new variable name:
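A minimal sketch (the variable name Exam_Score_10 is our choice; it matches the dplyr example later in this session):

```r
# add the exam score on a scale of 10
data_grades$Exam_Score_10 <- data_grades$Exam_Score / 10
```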
1. What is the class of the variable Tutor? Transform it into a factor. How many tutors are there?
2. Add a new variable to compute the final score of each student, which is the weighted average of their participation grade (20%) and their exam score (80%).
3. Retrieve the final scores of the students in tutorial group 2. Obtain summary statistics of their final scores.
4. What are the lowest and the highest final scores in tutorial group 2? Retrieve this information from the summary statistics as well as by using a dedicated function.
In what follows, we consider the linear regression model
\[y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \dots + \beta_p x_{p,i} + u_i,\ i=1,\dots,n.\]
We will first estimate the following simple regression for the grades data set: \[Final\_Score = \beta_0 + \beta_1 Participation\_Grade + u,\] assuming the Final_Score variable was added to the grades data set in Exercise 5.2:
data_grades$Final_Score <- 0.2*data_grades$Participation_Grade + 0.08*data_grades$Exam_Score
R is designed to easily estimate various statistical models. It provides a specific object class to symbolically describe statistical models, called formula objects. See ?formula for more details.
Our regression model \[Final\_Score = \beta_0 + \beta_1 Participation\_Grade + u\] can be specified in R as a formula like this:
Final_Score ~ Participation_Grade
where ~ is the basis for all models: y ~ model specifies that the dependent variable y is modeled using the linear predictors described in model.
The standard function for estimating linear models is lm(). Estimating a regression model in R then only requires one line of code!
# example simple regression:
reg_1 <- lm(Final_Score ~ Participation_Grade, data = data_grades)
Let us now inspect the summary output of our estimated regression model:
# regression summary output:
summary(reg_1)
Call:
lm(formula = Final_Score ~ Participation_Grade, data = data_grades)
Residuals:
Min 1Q Median 3Q Max
-1.5354 -0.3763 -0.0813 0.6147 1.4246
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.1396 0.5925 0.236 0.815
Participation_Grade 0.9920 0.0898 11.047 1.01e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7683 on 40 degrees of freedom
Multiple R-squared: 0.7531, Adjusted R-squared: 0.747
F-statistic: 122 on 1 and 40 DF, p-value: 1.013e-13
Going from a simple to a multiple regression model is easy. For example, let us add the variable GPA as a second predictor. This can be done by adjusting the formula to:
Final_Score ~ Participation_Grade + GPA
where + now separates the different predictors included in the model.
Estimating the multiple regression model in R can be done as follows:
# example multiple regression:
reg_2 <- lm(Final_Score ~ Participation_Grade + GPA, data = data_grades)
The summary output of our newly estimated regression model:
# regression summary output:
summary(reg_2)
Call:
lm(formula = Final_Score ~ Participation_Grade + GPA, data = data_grades)
Residuals:
Min 1Q Median 3Q Max
-1.41979 -0.32087 -0.04297 0.46119 1.29582
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.7780 0.5663 -1.374 0.177347
Participation_Grade 0.7196 0.1056 6.813 3.87e-08 ***
GPA 0.4016 0.1056 3.804 0.000489 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6645 on 39 degrees of freedom
Multiple R-squared: 0.82, Adjusted R-squared: 0.8107
F-statistic: 88.8 on 2 and 39 DF, p-value: 3.021e-15
Imagine you want to estimate a regression model on log-transformed variables, for example: \[log(Final\_Score) = \beta_0 + \beta_1 log(Participation\_Grade) + u\] This can be done by directly using the log function inside the lm function:
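A sketch, assuming the grades data set from above (the object name reg_3 is our choice):

```r
# example regression with log-transformed variables:
reg_3 <- lm(log(Final_Score) ~ log(Participation_Grade), data = data_grades)
summary(reg_3)
```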
What if you want to include the square of a predictor? For example: \[Final\_Score = \beta_0 + \beta_1 Participation\_Grade + \beta_2 Participation\_Grade^2 + u\] You CANNOT use Final_Score ~ Participation_Grade + Participation_Grade^2 since ^2 has a special (different) meaning in a formula object. Instead, you should use the function I(). More specifically:
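One possible call (the object name reg_4 is our choice):

```r
# example regression with a squared predictor:
reg_4 <- lm(Final_Score ~ Participation_Grade + I(Participation_Grade^2), data = data_grades)
```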
does the job! Inspect its summary output.
Let us now add a dummy variable to the estimated regression model: \[Final\_Score = \beta_0 + \beta_1 Participation\_Grade + \beta_2 Male + u\] where Male is a dummy variable that takes on the value 1 for male students, and 0 otherwise.
Let us start by transforming the variable Gender into a factor:
data_grades$Gender <- as.factor(data_grades$Gender)
Factor variables in formulas are then automatically dummy coded.
# example dummy variables
reg_5 <- lm(Final_Score ~ Participation_Grade + Gender, data = data_grades)
Inspect the summary output of the regression model with a continuous predictor and a dummy variable; you will notice that R has estimated the regression model with female students as the baseline:
# example dummy variables
summary(reg_5)
Call:
lm(formula = Final_Score ~ Participation_Grade + Gender, data = data_grades)
Residuals:
Min 1Q Median 3Q Max
-1.52400 -0.38274 -0.09068 0.62019 1.41370
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.12697 0.61516 0.206 0.838
Participation_Grade 0.99219 0.09096 10.907 2.07e-13 ***
GenderMale 0.02230 0.24017 0.093 0.927
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.778 on 39 degrees of freedom
Multiple R-squared: 0.7532, Adjusted R-squared: 0.7405
F-statistic: 59.51 on 2 and 39 DF, p-value: 1.417e-12
Finally, let us investigate how interaction terms can be included in a regression model. We consider the model: \[Final\_Score = \beta_0 + \beta_1 Participation\_Grade + \beta_2 Male + \beta_3 Male\cdot Participation\_Grade + u\] The regression can be estimated in R via the command:
# example dummy variables
reg_6 <- lm(Final_Score ~ Participation_Grade + Gender + Participation_Grade:Gender, data = data_grades)
where : creates interaction terms between variables.
Or in short:
# example dummy variables
reg_7 <- lm(Final_Score ~ Participation_Grade*Gender, data = data_grades)
does the same, since a*b in a formula object is equivalent to a + b + a:b.
Inspect the summary output of the estimated regression model with interaction terms:
# example dummy variables
summary(reg_6)
Call:
lm(formula = Final_Score ~ Participation_Grade + Gender + Participation_Grade:Gender,
data = data_grades)
Residuals:
Min 1Q Median 3Q Max
-1.45415 -0.39360 -0.09823 0.64314 1.38103
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.3271 1.0452 -0.313 0.756
Participation_Grade 1.0620 0.1586 6.696 6.37e-08 ***
GenderMale 0.7025 1.2827 0.548 0.587
Participation_Grade:GenderMale -0.1050 0.1945 -0.540 0.592
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7851 on 38 degrees of freedom
Multiple R-squared: 0.7551, Adjusted R-squared: 0.7357
F-statistic: 39.05 on 3 and 38 DF, p-value: 1.083e-11
1. Estimate the following simple regression model in R: \[Final\_Score = \beta_0 + \beta_1 Participation\_Grade + u\] Save your regression model in the object my_reg1, and inspect the summary output.
2. Make a scatterplot of Final_Score (y-axis) against Participation_Grade (x-axis). Verify that adding the line of code abline(my_reg1) after you have created your scatterplot adds the regression line to it!
3. Estimate the multiple regression model: \[Final\_Score = \beta_0 + \beta_1 Participation\_Grade + \beta_2 Chang + u\] where Chang is a dummy variable taking the value 1 for students having tutor “Chang, Stevens” and 0 otherwise. Is the dummy variable significant?
4. Imagine you want to include the tutor “Chang, Stevens” as the baseline level. Re-estimate the regression model after you have re-specified your factor variable Tutor, thereby explicitly defining “Chang, Stevens” as the baseline level. Hint: use the function relevel to this end.
5. Finally, estimate the following regression model and inspect the summary output: \[Final\_Score = \beta_0 + \beta_1 Participation\_Grade + \beta_2 Chang + \beta_3 Participation\_Grade \cdot Chang + u.\]
Finally, we discuss how important output of a regression analysis can be accessed directly in R.
First, assume you want to access the estimated coefficients of an estimated regression model. This can be done using the function coefficients():
# accessing coefficients:
reg_2 <- lm(Final_Score ~ Participation_Grade + GPA, data = data_grades)
coefficients(reg_2)
(Intercept) Participation_Grade GPA
-0.7780420 0.7196242 0.4015845
Alternatively, you can directly access the coefficients in the list of the lm object reg_2:
# accessing coefficients:
reg_2$coefficients
(Intercept) Participation_Grade GPA
-0.7780420 0.7196242 0.4015845
Note that you can do the same for accessing the fitted values (function fitted() or slot $fitted.values) or the residuals (function residuals() or slot $residuals) of your estimated regression model.
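For example, assuming the object reg_2 from above:

```r
# fitted values and residuals, via function or list slot
head(fitted(reg_2), 3)        # first three fitted values
head(reg_2$fitted.values, 3)  # identical, via the list slot
head(residuals(reg_2), 3)     # first three residuals
```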
What if you want to access the \(R^2\), or the \(t\)-stats and \(p\)-values? Unfortunately, the list object reg_2 does not seem to contain this information:
# information stored in lm-object
names(reg_2)
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "xlevels" "call" "terms" "model"
This does not mean that you cannot access it. Instead, this information can be accessed via the slots in the list object of summary(reg_2)! In particular:
# summary object contains additional information:
sum_reg_2 <- summary(reg_2)
sum_reg_2$coefficients ## matrix with estimates, standard errors, t-stat, p-value
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.7780420 0.5663404 -1.373806 1.773473e-01
Participation_Grade 0.7196242 0.1056222 6.813193 3.872012e-08
GPA 0.4015845 0.1055576 3.804411 4.891059e-04
sum_reg_2$sigma ## residual standard error estimate
[1] 0.6644583
sum_reg_2$r.squared ## R^2 of regression
[1] 0.8199501
sum_reg_2$adj.r.squared ## adjusted R^2 of regression
[1] 0.8107167
1. Estimate the regression model \[Exam\_Score = \beta_0 + \beta_1 Participation\_Grade + \beta_2 Male + \beta_3 Male \cdot Participation\_Grade + u\]
2. Is the dummy Male significant?
3. What are the values of the \(t\)-stats for the 3 predictors? Retrieve this information from the summary output, but also save their values in a new variable called my_tstats. Note: save ONLY the values of the \(t\)-stats!
4. What are the estimated coefficient, standard error, \(t\)-value and \(p\)-value of the predictor \(Participation\_Grade\)? Save this (and ONLY this) information in a new variable called my_grade_info and display the information, rounding to two digits.
5. What is the value of the adjusted \(R^2\)? Retrieve this information from the summary output, but also save its value in a new variable called my_adjR2.
6. Store the residuals in a new variable called my_resid. Make a scatter plot of the residuals, labeling the x-axis as ‘Student index’ and the y-axis as ‘Residuals’, and displaying the dots in red.
7. Store the fitted values in a new variable called my_fitted. Make a scatter plot of the actual exam scores on the x-axis and the fitted values on the y-axis. Label the x-axis as ‘Exam scores’, the y-axis as ‘Fitted values’, and give the plot the title ‘Fitted versus Actual’.
Next, we will (superficially) cover the package dplyr. This package is part of the tidyverse, a collection of R packages designed to provide a consistent approach to working with data. The following packages belong to the tidyverse:
dplyr: “Grammar of Data Manipulation”
ggplot2: “Grammar of Graphics”
readr: “Fast and friendly way to read rectangular data”
tibble: “A tibble, or tbl_df, is a modern reimagining of the data.frame”
tidyr: “Create tidy data. Tidy data is data where every column is a variable, every row is an observation, and every cell is a single value”
purrr: “Enhance R’s functional programming toolkit”
Note: The philosophy (and syntax) of the tidyverse differs completely from base R and is somewhat similar to Python’s pandas. Some argue tidyverse code is more readable and intuitive; others find it rather unwieldy. R code written by AI models typically utilizes packages from the tidyverse.
An important component of working with data and dplyr is the pipe operator %>%. The goal of this operator (also found in many other languages) is to make function composition more readable in code.
Example:
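A minimal illustration (the numbers are arbitrary):

```r
library(dplyr)

# without the pipe: nested calls read inside-out
round(mean(c(1, 5, 8)), 1)

# with the pipe: the same computation reads left to right
c(1, 5, 8) %>% mean() %>% round(1)

# both lines return 4.7
```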
You can use dplyr with %>% to select variables or subset rows:
# Select Student ID, Name and Exam_Score
data_grades %>% dplyr::select(ID, Name, Exam_Score)
# Select Exam_Score from data set and display its summary
data_grades %>% dplyr::select(Exam_Score) %>% summary()
# Subset on students belonging to tutorial group 1
data_grades %>% dplyr::filter(Tutorial==1)
# Subset on female students
data_grades %>% dplyr::filter(Gender=='Female')
# Adding variables
data_grades <- data_grades %>% mutate(Exam_Score_10 = Exam_Score/10)
dplyr
Some key functions of dplyr
are:
mutate(): Add new variables to a data set
select(): Select variables (columns)
filter(): Select observations (rows)
You can revisit some of the earlier exercises and now try to execute them using dplyr!
Additional resources:
Wickham, H., Çetinkaya-Rundel, M. and Grolemund, G. (2023), R for Data Science (2e)
Heiss, F. (2020) Using R for Introductory Econometrics (2e)
Hanck, C., Arnold, M., Gerber, A., and Schmelzer, M. (2024) Introduction to Econometrics with R
Many more resources are available online.
Finally, thank you for attending the R training! We are happy to receive your feedback on this training through the following survey.