Throughout this course we will be using the statistical package
R
, and the friendly interface of Rstudio, to
conduct data analysis. The first part of today’s seminar is about
getting familiar with this piece of software. If you have never used
R
before, worry not! No prior knowledge is assumed, and we
will walk you through all the necessary steps to conduct analysis on the
topics discussed in the lectures. Of course, learning a new statistical
software is not simple, and we don’t aim to fully introduce you to the
world of R
– a wide and dynamic world that goes way beyond
what we are covering here. This is just a gentle introduction to some
specific R coding that corresponds to some methods often used to infer
causality from observational data (our main focus here). Our suggestion
is that you build on this and keep practicing. Coding is like learning a
foreign language: if you don’t use you lose it.
The first part of this assignment is a general introduction to R, starting from scratch. If you already have some experience with this software, you are welcome to join the second part of the assignment, where we analyse some longitudinal data.
R
is a statistical software that allows us to manipulate
data and estimate a wide variety of statistical models. It is one of the
fastest growing statistical software, one of the most popular data
science software, and, perhaps most importantly, it is open source
(free!). Crucially, most of the cutting edge causal inference packages
are implemented in R so, should you want to keep up with the latest
developments, the basic knowledge you get here can help you with that.
We will be using the RStudio user-interface, which makes
operating R
somewhat easier.
You should install both R
and Rstudio
onto your personal computers. You can download them from the following
sources:
You can watch this video to see how you can install R and Rstudio.
Let’s see what we have here. After installing R
and
Rstudio, we start Rstudio and see three panels. A screen-long
panel on the left-hand side called the console, a smaller panel
on the top right-hand side called the environment, and a last
one on the bottom right-hand side called Plots & Help. The
console is the simplest way to interact with R: you can type in some
code (after the arrow \(>\)) and
press Enter, and R
will run it and provide an
output. It is easy to visualize this if we simply use R as a calculator:
when we type in mathematical operations in the console, R immediately
returns the outcomes. Let’s see:
7 + 7
12 - 4
3 * 9
610 / 377
5^2
(0.31415 + 1.61803) / 3^3
## [1] 14
## [1] 8
## [1] 27
## [1] 1.618037
## [1] 25
## [1] 0.07156222
Directly typing code into the console is certainly easy, but often not the most efficient strategy. A better approach is to have a document where we can save all our code – that’s what scripts are for. R scripts are just plain text files that contain some R code, which we can edit the same way we would in Word or any other text file. These are similar to what you might have encountered using SPSS (syntax), STATA (do), or other statistical software. We can open R scripts within Rstudio: just go to File –> New File –> R Script (or just press Cmd/Ctrl + shift + N for a shortcut).
We can now see a new panel popping up taking up the space in the top left-hand side of Rstudio, just above the console. You can now type all your code into the script, save it, and open it again whenever you want. If you want to run a piece of code from the script, you can always copy and paste it on the console, though this is of course very inefficient. Instead, you can ask R to run any piece of code from the script directly. There are a few different ways to do it.
You should always work from an R script! Our suggestion is that you create a different script for each seminar, and save them using reasonable names such as “seminar1.R”, “seminar2.R”, and so on.
When working with R, we store information by creating “objects”. We create objects all the time; it is a simple way of labelling some piece of information so that we can use it in subsequent tasks. Say we want to know the outcome of \(3 * 5\), and then we want to divide that outcome by \(2\). We could, of course, simply ask R do the calculations directly:
(3 * 5) / 2
But we could also first, create an object that stores the result of
\(3*5\), and then divide said object by
\(2\). We can give objects any name we
like.1 To
create objects, we need to use the assignment operator
<-
. If we want to name the outcome of \(3*5\) outcome, then we simply need
to use the assignment operator:
outcome <- 3 * 5
Notice that R does not provide any output after we create an object.
That is because we are not asking for any output, we are simply creating
an object. Note, however, that the environment panel lists all the
objects we create in the current session. In case we want to confirm
what piece of information is stored under a given label, apart from
checking the environment panel, we can also just run the name of the
object and see R’s output. For instance, when we run
outcome
in the console, R returns the number 15, which is
the outcome of \(3*5\).
outcome
## [1] 15
This is useful because, once we have created objects, we can use to perform subsequent calculations. For instance:
outcome / 2
## [1] 7.5
And we can even use previously created objects to create a new object:
my_new_object <- outcome ^ 2
my_new_object
## [1] 225
Both objects that we have just created (outcome
and
my_new_object
) contain just a single number. But we can
create objects that contain more information as well. Often times, we
want to create a long list of numbers in one specific order. Think of a
regular spreadsheet, where we can enter numbers in a single column which
are separated (and ordered) according to their rows. In the R language,
those single columns are equivalent to vectors. A
vector is simply a set of information contained together in a specific
order. In order to create a vector in R, we use the c()
function: instead of including new information at every row (as we
would, were we using a regular spreadsheet), we separate new information
by commas inside the parentheses. For instance, we could create a new
vector by concatenating the following numbers:
new_vector <- c(0, 3, 1, 4, 1, 5, 9, 2)
new_vector
## [1] 0 3 1 4 1 5 9 2
Recall that vectors store information (in this case, numbers) in a
specific order. This is important. We can use this order to access
individuals elements of a vector. We do this by subsetting a
vector: we just need to use square brackets [ ] and include the number
corresponding to the position we want to access. For instance, if we
want to access the second element in our vector new_vector
,
we simply do the following:
new_vector[2]
## [1] 3
We can see that by using new_vector[2]
, R returns the
second element of the vector new_vector
, which is the
number 3. If we want to access the seventh element of the vector
new_vector
, we just use new_vector[7]
which
returns the number 9, and so on.
Now that we have our first vector, we can see how
functions work. Functions are a set of instructions:
you provide an input and R generates some output – the backbone of R
programming. A function is a command followed by round brackets
( )
. Inputs are arguments that go inside the brackets; if a
function requires more than one argument, these are separated by commas.
For instance, we can add all elements of the vector
new_vector
together by using the function
sum()
. Here the input is the name of the vector:
sum(new_vector)
## [1] 25
And 25 is the output. As always, we can save the result of the output
as an object using assignment operator <-
.
sum_of_our_vector <- sum(new_vector)
Here, sum_of_our_vector
is also an object! So we have
performed a calculation (sum()
) on some data
(new_vector
), and stored the result
(sum_of_our_vector
). Let’s try some new functions such as
mean()
, median()
, and summary()
.
What are they calculating?
mean(new_vector)
## [1] 3.125
median(new_vector)
## [1] 2.5
summary(new_vector)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 2.500 3.125 4.250 9.000
mean()
returns the average value of a vector,
median()
returns the median, and summary()
returns a set of useful statistics, such as the minimum and the maximum
values of the vector, the interquartile range, the median, and the mean.
You could create your own functions using the function()
function – e.g. you could come up with a new function to calculate the
mean of a vector. Say you create a set of new functions. They must be
useful if you spent the time and effort to develop them, so you would
like to let anyone in the world use them as well by making them publicly
available. This is the basic idea of R packages. People create
new functions and make them publicly available. We will be using various
packages throughout the course. These will help us conduct data
analysis. An example of a useful package is pysch
, which
contains the describe()
function: like the
summary()
function, it describes the content of a variable
or data set, but it provides more details. Let’s see how it works
describe(new_vector)
## Error in describe(new_vector): could not find function "describe"
We got an Error message! Why? Well, that is because the
describe()
function does not exist in base R. We
first need to install and then load the psych
package, only
then this command with its new functions will be available to us. This
is a two-step process. First step involves installing the relevant
package in your computer using the install.packages()
function; you only need to this once. The second step involves loading
the relevant package using the library()
function; you need
to do this every time you start a new session (i.e. open Rstudio). If
you are not sure which packages are installed in your computer, you can
simply run installed.packages()
; if you are not sure which
packages are loaded in the current R session, you can simply run
(.packages())
.
install.packages("psych") # you only need to use this once in your computer
library(psych) # making the package available in the current session
describe(new_vector) # look, now we can use the describe() function
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 8 3.12 2.9 2.5 3.12 2.22 0 9 9 0.81 -0.63 1.03
The describe()
function is useful because it provides us
with a wider set of useful statistics than the summary()
function, such as the number of observations, standard deviation, range,
skew and kurtosis. Note that R ignores everything that comes after the
#
. This is extremely useful to make comments throughout our
script. We would like to encourage you to take notes throughout the
seminar as you might forget what each of the commands stand for.
Data frames are the workhorse when conducting data analysis with R. A
data.frame
object is the R equivalent to a spreadsheet:
each row represents a unit, each column represents a variable. Nearly
every time we conduct data analysis with R, we will be working with
data.frames
. In most cases (including the second part of
this seminar), we will load a data set from a spreadsheet-based external
file (.csv, .xls, .dta, .sav, among others) onto R; for now however, we
will use a dataset that comes pre-installed with R just to see how it
works. Let’s use the data()
function to load the
USArrests
data set, which contains statistics in arrests
per 100,000 residents for assault, murder, and rape in each of the 50 US
states in 1973. (Sorry for the grim example!)
data("USArrests")
We can use the help()
function to read more about this
data set, which contains 50 observations (i.e., rows) on 4 variables
(i.e., columns).
help(USArrests)
The data.frame
is listed as a new object in the
environment panel. We can click on it to see it as a spreadsheet; we can
also type in the name of the data set to see what it looks like. Because
data sets are often very long, instead of seeing all of it we can opt to
look at just the first few rows using the head()
function:
head(USArrests, 10) # the second argument specifies the number of rows we want to see
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
## Connecticut 3.3 110 77 11.1
## Delaware 5.9 238 72 15.8
## Florida 15.4 335 80 31.9
## Georgia 17.4 211 60 25.8
$
and [,]
The easiest way to access a single variable (i.e., a column) of a
data.frame
is using the dollar sign $
. For
instance, to access the murder rate in US states in 1973:
USArrests$Murder
## [1] 13.2 10.0 8.1 8.8 9.0 7.9 3.3 5.9 15.4 17.4 5.3 2.6 10.4 7.2 2.2
## [16] 6.0 9.7 15.4 2.1 11.3 4.4 12.1 2.7 16.1 9.0 6.0 4.3 12.2 2.1 7.4
## [31] 11.4 11.1 13.0 0.8 7.3 6.6 4.9 6.3 3.4 14.4 3.8 13.2 12.7 3.2 2.2
## [46] 8.5 4.0 5.7 2.6 6.8
R returns all the observations for the column Murder
.
What is this? It is a vector! A list of information (here, numbers) in
one specific order. We can therefore apply everything we learned about
vectors here. For example, we can access the third element of this
vector:
USArrests$Murder[3]
## [1] 8.1
Which corresponds to the murder rate in Arizona. Let’s practice using the dollar sign to access the Assault variable. What are the first, tenth, and fifteenth elements?
USArrests$Assault[1]
## [1] 236
USArrests$Assault[10]
## [1] 211
USArrests$Assault[15]
## [1] 56
We saw earlier that we can subset a vector by using square brackets:
[ ]
. When dealing with data.frames, we often want to access
certain observations (rows) or certain columns (variables) or a
combination of the two without looking at the entire data set all at
once. We can also use square brackets ([,]
) to subset
data.frames.
In square brackets we put a row and a column coordinates separated by
a comma. The row coordinate goes first and the column coordinate second.
So USArrests[23, 3]
returns the 23rd row and third column
of the data frame. If we leave the column coordinate empty this means we
would like all columns. So, USArrests[10,]
returns the 10th
row of the data set. If we leave the row coordinate empty, R returns the
entire column. So, USArrests[,4]
returns the fourth column
of the data set.
USArrests[23, 3] # element in 23rd row, 3rd column
## [1] 66
USArrests[10,] # entire 10th row
## Murder Assault UrbanPop Rape
## Georgia 17.4 211 60 25.8
USArrests[,4] # entire fourth column
## [1] 21.2 44.5 31.0 19.5 40.6 38.7 11.1 15.8 31.9 25.8 20.2 14.2 24.0 21.0 11.3
## [16] 18.0 16.3 22.2 7.8 27.8 16.3 35.1 14.9 17.1 28.2 16.4 16.5 46.0 9.5 18.8
## [31] 32.1 26.1 16.1 7.3 21.4 20.0 29.3 14.9 8.3 22.5 12.8 26.9 25.5 22.9 11.2
## [46] 20.7 26.2 9.3 10.8 15.6
We can look at a selected number of rows of a dataset with the colon
in brackets: USArrests[1:7,]
returns the first seven rows
and all columns of the data.frame USArrests
. We could
display the second and fourth columns of the dataset by using the
c()
function in brackets like so:
USArrests[, c(2,4)]
.
Display all columns of the USArrests
dataset and show
rows 10 to 15. Next display all columns of the dataset but only for rows
10 and 15.
USArrests[10:15,]
## Murder Assault UrbanPop Rape
## Georgia 17.4 211 60 25.8
## Hawaii 5.3 46 83 20.2
## Idaho 2.6 120 54 14.2
## Illinois 10.4 249 83 24.0
## Indiana 7.2 113 65 21.0
## Iowa 2.2 56 57 11.3
USArrests[c(10, 15),]
## Murder Assault UrbanPop Rape
## Georgia 17.4 211 60 25.8
## Iowa 2.2 56 57 11.3
We can also subset by using logical values and logical operators. R
has two special representations for logical values: TRUE
and FALSE
. R also has many logical operators, such as
greater than (>
), less than (<
), or
equal to (==
).
When we apply a logical operator to an object, the value returned
should be a logical value (i.e. T
or F
). For
instance:
5 > 3
## [1] TRUE
7 < 4
## [1] FALSE
2 == 1
## [1] FALSE
Here, when we ask R whether 5 is greater than 3, R returns the
logical value TRUE
. When we ask if 7 is less than 4, R
returns the logical value FALSE
. When we ask R whether 2 is
equal to 1, R returns the logical value FALSE
.
For the purposes of subsetting, logical operations are useful because
they can be used to specify which elements of a vector or data.frame we
would like returned. For instance, let’s subset the
USArrests
and keep only states with a murder rate less than
5 per 100,000:
USArrests[USArrests$Murder < 5, ]
## Murder Assault UrbanPop Rape
## Connecticut 3.3 110 77 11.1
## Idaho 2.6 120 54 14.2
## Iowa 2.2 56 57 11.3
## Maine 2.1 83 51 7.8
## Massachusetts 4.4 149 85 16.3
## Minnesota 2.7 72 66 14.9
## Nebraska 4.3 102 62 16.5
## New Hampshire 2.1 57 56 9.5
## North Dakota 0.8 45 44 7.3
## Oregon 4.9 159 67 29.3
## Rhode Island 3.4 174 87 8.3
## South Dakota 3.8 86 45 12.8
## Utah 3.2 120 80 22.9
## Vermont 2.2 48 32 11.2
## Washington 4.0 145 73 26.2
## Wisconsin 2.6 53 66 10.8
Let’s go through this code slowly to see what is going on here.
First, we are asking R to display the USArrests
data.frame.
But not all of it: we are using square brackets [ ]
, so
only a subset of the dataset is displayed. There is some information
before but nothing after the comma inside the square brackets, which
means that only a fraction of rows but all columns should be displayed.
Which rows? Let’s take a closer look at the code before the comma inside
the square brackets. R should only display the rows for which the
expression USArrests$Murder < 5
is TRUE
,
i.e. states with a murder rate less than 5 (per 100,000).
mean_murder <- mean(USArrests$Murder)
median_murder <- median(USArrests$Murder)
mean_assault <- mean(USArrests$Assault)
median_assault <- median(USArrests$Assault)
mean_urban <- mean(USArrests$UrbanPop)
median_urban <- median(USArrests$UrbanPop)
mean_rape <- mean(USArrests$Rape)
median_rape <- median(USArrests$Rape)
urban_states <- USArrests[USArrests$UrbanPop >= median_urban, ]
rural_states <- USArrests[USArrests$UrbanPop < median_urban, ]
mean_assault_urban <- mean(urban_states$Assault)
mean_assault_rural <- mean(rural_states$Assault)
mean_assault_urban
## [1] 187.4643
mean_assault_rural
## [1] 149.5
The average assault rate in urban states is 187.46 (per 100,000), considerably larger than the average assault rate in rural states of 149.5.
In this section, we will explore what longitudinal data looks like
and how to manipulate it using tidyverse
tools.
We will use the sleepstudy
dataset from the
lme4
package. This dataset contains repeated measures of
reaction time for individuals over several days of sleep deprivation.
Let’s start by installing the lme4
package, if you have
never done so, and then loading the package.
# Install the lme4 package; you only need to do that once
## install.packages("lme4") # (uncomment if you need to install)
# Load the lme4 package
library(lme4)
## We are also going to use the tidyverse package, so let's load it now
# Install the tidyverse package; you only need to do that once
## install.packages("tidyverse") # (uncomment if you need to install)
# Load the tidyverse package
library(tidyverse)
Now, we can use the data()
function to load the
sleepstudy
dataset. Usually, we will use external data
sources and will import them into R
. Today, let’s use a
longitudinal dataset already pre-loaded in R
:
# load the 'sleepstudy' dataset
data("sleepstudy", package = "lme4")
# If you want to read more about the dataset:
help("sleepstudy")
The variables available in the sleepstudy
dataset are
the following:
Name | Description |
---|---|
Reaction |
Reaction time in milliseconds |
Days |
Number of days of sleep deprivation (0 to 9) |
Subject |
Identifier for each person |
Is this a longitudinal dataset? Let’s investigate the number of observations and and the number of unique individuals observed in the dataset.
# we can use the nrow() function to check the number of observations
nrow(sleepstudy)
## [1] 180
Now, if the dataset contains \(n=\) 180 observations (i.e., rows), how many unique individuals were part of the study?
# checking how many individuals were part of the study
length(unique(sleepstudy$Subject))
## [1] 18
Only 18 individuals. Indeed, each person was observed on 10 different
days. We can double check that using the count()
function
# we can use the count() function
count(sleepstudy, Subject)
## Subject n
## 1 308 10
## 2 309 10
## 3 310 10
## 4 330 10
## 5 331 10
## 6 332 10
## 7 333 10
## 8 334 10
## 9 335 10
## 10 337 10
## 11 349 10
## 12 350 10
## 13 351 10
## 14 352 10
## 15 369 10
## 16 370 10
## 17 371 10
## 18 372 10
This dataset is therefore in the long format. Although we have only 18 individuals, the dataset contains 180 observations. In the other words, the unit of analysis of this dataset is the person-day observation.
Some analytic strategies require data to be in the wide format. When
we want to pivot long datasets to wide format, we can use the
pivot_wider()
function, from tidyverse
. If we
do that, each subject should have only one row, with days appearing as
separate columns.
# pivot: from long to wide format
sleep_wide <-
sleepstudy %>%
pivot_wider(names_from = Days,
values_from = Reaction)
# checking that pivoting wider worked
head(sleep_wide)
## # A tibble: 6 × 11
## Subject `0` `1` `2` `3` `4` `5` `6` `7` `8` `9`
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 308 250. 259. 251. 321. 357. 415. 382. 290. 431. 466.
## 2 309 223. 205. 203. 205. 208. 216. 214. 218. 224. 237.
## 3 310 199. 194. 234. 233. 229. 220. 235. 256. 261. 248.
## 4 330 322. 300. 284. 285. 286. 298. 280. 318. 305. 354.
## 5 331 288. 285 302. 320. 316. 293. 290. 335. 294. 372.
## 6 332 235. 243. 273. 310. 317. 310. 454. 347. 330. 254.
Suppose we now want to go back to long format for modeling. Here’s how we can do that.
# pivot: from wide to long
sleep_long <-
sleep_wide %>%
pivot_longer(
cols = c('0', '1', '2', '3', '4', '5', '6', '7', '8', '9'),
names_to = "Days",
values_to = "Reaction"
)
# check that pivoting longer worked
head(sleep_long)
## # A tibble: 6 × 3
## Subject Days Reaction
## <fct> <chr> <dbl>
## 1 308 0 250.
## 2 308 1 259.
## 3 308 2 251.
## 4 308 3 321.
## 5 308 4 357.
## 6 308 5 415.
One benefit of long format is ease of plotting with
ggplot2
.
# plotting trajectories
ggplot(sleepstudy, aes(x = Days, y = Reaction, group = Subject, color = Subject)) +
geom_line(show.legend = FALSE) +
labs(title = "Reaction Time by Day of Sleep Deprivation",
x = "Days",
y = "Reaction Time (ms)") +
theme_minimal()
Well, they need to start with a letter. But otherwise they may contain numbers, upper and lower case letters (R distinguishes between them), and punctuation such as dots ( . ) and underscores ( _ )↩︎