Throughout this course we will be using the statistical package R, and the friendly interface of Rstudio, to conduct data analysis. The first part of today’s seminar is about getting familiar with this piece of software. If you have never used R before, worry not! No prior knowledge is assumed, and we will walk you through all the necessary steps to conduct analysis on the topics discussed in the lectures. Of course, learning a new statistical software is not simple, and we don’t aim to fully introduce you to the world of R – a wide and dynamic world that goes way beyond what we are covering here. This is just a gentle introduction to some specific R coding that corresponds to some methods often used to infer causality from observational data (our main focus here). Our suggestion is that you build on this and keep practicing. Coding is like learning a foreign language: if you don’t use you lose it.

The first part of this assignment is a general introduction to R, starting from scratch. If you already have some experience with this software, you are welcome to join the second part of the assignment, where we analyse some longitudinal data.

Why R?

R is a statistical software that allows us to manipulate data and estimate a wide variety of statistical models. It is one of the fastest growing statistical software, one of the most popular data science software, and, perhaps most importantly, it is open source (free!). Crucially, most of the cutting edge causal inference packages are implemented in R so, should you want to keep up with the latest developments, the basic knowledge you get here can help you with that. We will be using the RStudio user-interface, which makes operating R somewhat easier.

Installing R and Rstudio

You should install both R and Rstudio onto your personal computers. You can download them from the following sources:

You can watch this video to see how you can install R and Rstudio.

1st part: Introduction to R

Let’s see what we have here. After installing R and Rstudio, we start Rstudio and see three panels. A screen-long panel on the left-hand side called the console, a smaller panel on the top right-hand side called the environment, and a last one on the bottom right-hand side called Plots & Help. The console is the simplest way to interact with R: you can type in some code (after the arrow \(>\)) and press Enter, and R will run it and provide an output. It is easy to visualize this if we simply use R as a calculator: when we type in mathematical operations in the console, R immediately returns the outcomes. Let’s see:

7 + 7
12 - 4
3 * 9
610 / 377
5^2
(0.31415 + 1.61803) / 3^3
See results!
## [1] 14
## [1] 8
## [1] 27
## [1] 1.618037
## [1] 25
## [1] 0.07156222


Directly typing code into the console is certainly easy, but often not the most efficient strategy. A better approach is to have a document where we can save all our code – that’s what scripts are for. R scripts are just plain text files that contain some R code, which we can edit the same way we would in Word or any other text file. These are similar to what you might have encountered using SPSS (syntax), STATA (do), or other statistical software. We can open R scripts within Rstudio: just go to File –> New File –> R Script (or just press Cmd/Ctrl + shift + N for a shortcut).

We can now see a new panel popping up taking up the space in the top left-hand side of Rstudio, just above the console. You can now type all your code into the script, save it, and open it again whenever you want. If you want to run a piece of code from the script, you can always copy and paste it on the console, though this is of course very inefficient. Instead, you can ask R to run any piece of code from the script directly. There are a few different ways to do it.

  • To run an entire line, place the cursor on the line you want to run and use the Run button (on the top right-hand side of the script), or just press Ctrl/Cmd + Enter
  • To run multiple lines (or even just part of a single line), highlight the text you want to run and use the Run button, or press Ctrl/Cmd + Enter
  • To run the entire script, use the Source button, or press Ctrl/Cmd + Shift + S

You should always work from an R script! Our suggestion is that you create a different script for each seminar, and save them using reasonable names such as “seminar1.R”, “seminar2.R”, and so on.

Objects, vectors, functions

When working with R, we store information by creating “objects”. We create objects all the time; it is a simple way of labelling some piece of information so that we can use it in subsequent tasks. Say we want to know the outcome of \(3 * 5\), and then we want to divide that outcome by \(2\). We could, of course, simply ask R do the calculations directly:

(3 * 5) / 2

But we could also first, create an object that stores the result of \(3*5\), and then divide said object by \(2\). We can give objects any name we like.1 To create objects, we need to use the assignment operator <-. If we want to name the outcome of \(3*5\) outcome, then we simply need to use the assignment operator:

outcome <- 3 * 5

Notice that R does not provide any output after we create an object. That is because we are not asking for any output, we are simply creating an object. Note, however, that the environment panel lists all the objects we create in the current session. In case we want to confirm what piece of information is stored under a given label, apart from checking the environment panel, we can also just run the name of the object and see R’s output. For instance, when we run outcome in the console, R returns the number 15, which is the outcome of \(3*5\).

outcome
## [1] 15

This is useful because, once we have created objects, we can use to perform subsequent calculations. For instance:

outcome / 2
## [1] 7.5

And we can even use previously created objects to create a new object:

my_new_object <- outcome ^ 2
my_new_object
## [1] 225

Both objects that we have just created (outcome and my_new_object) contain just a single number. But we can create objects that contain more information as well. Often times, we want to create a long list of numbers in one specific order. Think of a regular spreadsheet, where we can enter numbers in a single column which are separated (and ordered) according to their rows. In the R language, those single columns are equivalent to vectors. A vector is simply a set of information contained together in a specific order. In order to create a vector in R, we use the c() function: instead of including new information at every row (as we would, were we using a regular spreadsheet), we separate new information by commas inside the parentheses. For instance, we could create a new vector by concatenating the following numbers:

new_vector <- c(0, 3, 1, 4, 1, 5, 9, 2)
new_vector
## [1] 0 3 1 4 1 5 9 2

Recall that vectors store information (in this case, numbers) in a specific order. This is important. We can use this order to access individuals elements of a vector. We do this by subsetting a vector: we just need to use square brackets [ ] and include the number corresponding to the position we want to access. For instance, if we want to access the second element in our vector new_vector, we simply do the following:

new_vector[2]
## [1] 3

We can see that by using new_vector[2], R returns the second element of the vector new_vector, which is the number 3. If we want to access the seventh element of the vector new_vector, we just use new_vector[7] which returns the number 9, and so on.

Now that we have our first vector, we can see how functions work. Functions are a set of instructions: you provide an input and R generates some output – the backbone of R programming. A function is a command followed by round brackets ( ). Inputs are arguments that go inside the brackets; if a function requires more than one argument, these are separated by commas. For instance, we can add all elements of the vector new_vector together by using the function sum(). Here the input is the name of the vector:

sum(new_vector)
## [1] 25

And 25 is the output. As always, we can save the result of the output as an object using assignment operator <-.

sum_of_our_vector <- sum(new_vector)

Here, sum_of_our_vector is also an object! So we have performed a calculation (sum()) on some data (new_vector), and stored the result (sum_of_our_vector). Let’s try some new functions such as mean(), median(), and summary(). What are they calculating?

See results!
mean(new_vector)
## [1] 3.125
median(new_vector)
## [1] 2.5
summary(new_vector)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   2.500   3.125   4.250   9.000
mean() returns the average value of a vector, median() returns the median, and summary() returns a set of useful statistics, such as the minimum and the maximum values of the vector, the interquartile range, the median, and the mean.


You could create your own functions using the function() function – e.g. you could come up with a new function to calculate the mean of a vector. Say you create a set of new functions. They must be useful if you spent the time and effort to develop them, so you would like to let anyone in the world use them as well by making them publicly available. This is the basic idea of R packages. People create new functions and make them publicly available. We will be using various packages throughout the course. These will help us conduct data analysis. An example of a useful package is pysch, which contains the describe() function: like the summary() function, it describes the content of a variable or data set, but it provides more details. Let’s see how it works

describe(new_vector)
## Error in describe(new_vector): could not find function "describe"

We got an Error message! Why? Well, that is because the describe() function does not exist in base R. We first need to install and then load the psych package, only then this command with its new functions will be available to us. This is a two-step process. First step involves installing the relevant package in your computer using the install.packages() function; you only need to this once. The second step involves loading the relevant package using the library() function; you need to do this every time you start a new session (i.e. open Rstudio). If you are not sure which packages are installed in your computer, you can simply run installed.packages(); if you are not sure which packages are loaded in the current R session, you can simply run (.packages()).

install.packages("psych")  # you only need to use this once in your computer
library(psych)             # making the package available in the current session

describe(new_vector)       # look, now we can use the describe() function 
##    vars n mean  sd median trimmed  mad min max range skew kurtosis   se
## X1    1 8 3.12 2.9    2.5    3.12 2.22   0   9     9 0.81    -0.63 1.03

The describe() function is useful because it provides us with a wider set of useful statistics than the summary() function, such as the number of observations, standard deviation, range, skew and kurtosis. Note that R ignores everything that comes after the #. This is extremely useful to make comments throughout our script. We would like to encourage you to take notes throughout the seminar as you might forget what each of the commands stand for.

data.frames

Data frames are the workhorse when conducting data analysis with R. A data.frame object is the R equivalent to a spreadsheet: each row represents a unit, each column represents a variable. Nearly every time we conduct data analysis with R, we will be working with data.frames. In most cases (including the second part of this seminar), we will load a data set from a spreadsheet-based external file (.csv, .xls, .dta, .sav, among others) onto R; for now however, we will use a dataset that comes pre-installed with R just to see how it works. Let’s use the data() function to load the USArrests data set, which contains statistics in arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973. (Sorry for the grim example!)

data("USArrests")

We can use the help() function to read more about this data set, which contains 50 observations (i.e., rows) on 4 variables (i.e., columns).

help(USArrests)

The data.frame is listed as a new object in the environment panel. We can click on it to see it as a spreadsheet; we can also type in the name of the data set to see what it looks like. Because data sets are often very long, instead of seeing all of it we can opt to look at just the first few rows using the head() function:

head(USArrests, 10) # the second argument specifies the number of rows we want to see
##             Murder Assault UrbanPop Rape
## Alabama       13.2     236       58 21.2
## Alaska        10.0     263       48 44.5
## Arizona        8.1     294       80 31.0
## Arkansas       8.8     190       50 19.5
## California     9.0     276       91 40.6
## Colorado       7.9     204       78 38.7
## Connecticut    3.3     110       77 11.1
## Delaware       5.9     238       72 15.8
## Florida       15.4     335       80 31.9
## Georgia       17.4     211       60 25.8

Subsetting with $ and [,]

The easiest way to access a single variable (i.e., a column) of a data.frame is using the dollar sign $. For instance, to access the murder rate in US states in 1973:

USArrests$Murder
##  [1] 13.2 10.0  8.1  8.8  9.0  7.9  3.3  5.9 15.4 17.4  5.3  2.6 10.4  7.2  2.2
## [16]  6.0  9.7 15.4  2.1 11.3  4.4 12.1  2.7 16.1  9.0  6.0  4.3 12.2  2.1  7.4
## [31] 11.4 11.1 13.0  0.8  7.3  6.6  4.9  6.3  3.4 14.4  3.8 13.2 12.7  3.2  2.2
## [46]  8.5  4.0  5.7  2.6  6.8

R returns all the observations for the column Murder. What is this? It is a vector! A list of information (here, numbers) in one specific order. We can therefore apply everything we learned about vectors here. For example, we can access the third element of this vector:

USArrests$Murder[3]
## [1] 8.1

Which corresponds to the murder rate in Arizona. Let’s practice using the dollar sign to access the Assault variable. What are the first, tenth, and fifteenth elements?

See results!
USArrests$Assault[1]
## [1] 236
USArrests$Assault[10]
## [1] 211
USArrests$Assault[15]
## [1] 56


We saw earlier that we can subset a vector by using square brackets: [ ]. When dealing with data.frames, we often want to access certain observations (rows) or certain columns (variables) or a combination of the two without looking at the entire data set all at once. We can also use square brackets ([,]) to subset data.frames.

In square brackets we put a row and a column coordinates separated by a comma. The row coordinate goes first and the column coordinate second. So USArrests[23, 3] returns the 23rd row and third column of the data frame. If we leave the column coordinate empty this means we would like all columns. So, USArrests[10,] returns the 10th row of the data set. If we leave the row coordinate empty, R returns the entire column. So, USArrests[,4] returns the fourth column of the data set.

USArrests[23, 3]  # element in 23rd row, 3rd column
## [1] 66
USArrests[10,]    # entire 10th row
##         Murder Assault UrbanPop Rape
## Georgia   17.4     211       60 25.8
USArrests[,4]     # entire fourth column
##  [1] 21.2 44.5 31.0 19.5 40.6 38.7 11.1 15.8 31.9 25.8 20.2 14.2 24.0 21.0 11.3
## [16] 18.0 16.3 22.2  7.8 27.8 16.3 35.1 14.9 17.1 28.2 16.4 16.5 46.0  9.5 18.8
## [31] 32.1 26.1 16.1  7.3 21.4 20.0 29.3 14.9  8.3 22.5 12.8 26.9 25.5 22.9 11.2
## [46] 20.7 26.2  9.3 10.8 15.6

We can look at a selected number of rows of a dataset with the colon in brackets: USArrests[1:7,] returns the first seven rows and all columns of the data.frame USArrests. We could display the second and fourth columns of the dataset by using the c() function in brackets like so: USArrests[, c(2,4)].

Display all columns of the USArrests dataset and show rows 10 to 15. Next display all columns of the dataset but only for rows 10 and 15.

See results!
USArrests[10:15,]
##          Murder Assault UrbanPop Rape
## Georgia    17.4     211       60 25.8
## Hawaii      5.3      46       83 20.2
## Idaho       2.6     120       54 14.2
## Illinois   10.4     249       83 24.0
## Indiana     7.2     113       65 21.0
## Iowa        2.2      56       57 11.3
USArrests[c(10, 15),]
##         Murder Assault UrbanPop Rape
## Georgia   17.4     211       60 25.8
## Iowa       2.2      56       57 11.3


Logical operators

We can also subset by using logical values and logical operators. R has two special representations for logical values: TRUE and FALSE. R also has many logical operators, such as greater than (>), less than (<), or equal to (==).

When we apply a logical operator to an object, the value returned should be a logical value (i.e. T or F). For instance:

5 > 3
## [1] TRUE
7 < 4
## [1] FALSE
2 == 1
## [1] FALSE

Here, when we ask R whether 5 is greater than 3, R returns the logical value TRUE. When we ask if 7 is less than 4, R returns the logical value FALSE. When we ask R whether 2 is equal to 1, R returns the logical value FALSE.

For the purposes of subsetting, logical operations are useful because they can be used to specify which elements of a vector or data.frame we would like returned. For instance, let’s subset the USArrests and keep only states with a murder rate less than 5 per 100,000:

USArrests[USArrests$Murder < 5, ]
##               Murder Assault UrbanPop Rape
## Connecticut      3.3     110       77 11.1
## Idaho            2.6     120       54 14.2
## Iowa             2.2      56       57 11.3
## Maine            2.1      83       51  7.8
## Massachusetts    4.4     149       85 16.3
## Minnesota        2.7      72       66 14.9
## Nebraska         4.3     102       62 16.5
## New Hampshire    2.1      57       56  9.5
## North Dakota     0.8      45       44  7.3
## Oregon           4.9     159       67 29.3
## Rhode Island     3.4     174       87  8.3
## South Dakota     3.8      86       45 12.8
## Utah             3.2     120       80 22.9
## Vermont          2.2      48       32 11.2
## Washington       4.0     145       73 26.2
## Wisconsin        2.6      53       66 10.8

Let’s go through this code slowly to see what is going on here. First, we are asking R to display the USArrests data.frame. But not all of it: we are using square brackets [ ], so only a subset of the dataset is displayed. There is some information before but nothing after the comma inside the square brackets, which means that only a fraction of rows but all columns should be displayed. Which rows? Let’s take a closer look at the code before the comma inside the square brackets. R should only display the rows for which the expression USArrests$Murder < 5 is TRUE, i.e. states with a murder rate less than 5 (per 100,000).

A few questions about data analysis with R

  1. Calculate the mean and median of each of the variables included in the data set. Assign each of the results of these calculations to objects (choose sensible names!).
See results!
mean_murder <- mean(USArrests$Murder)
median_murder <- median(USArrests$Murder)

mean_assault <- mean(USArrests$Assault)
median_assault <- median(USArrests$Assault)

mean_urban <- mean(USArrests$UrbanPop)
median_urban <- median(USArrests$UrbanPop)

mean_rape <- mean(USArrests$Rape)
median_rape <- median(USArrests$Rape)


  1. Is there a difference in the assault rate for urban and rural states? Define an urban state as one for which the urban population is greater than or equal to the median across all states. Define a rural state as one for which the urban population is less than the median.
See results!
urban_states <- USArrests[USArrests$UrbanPop >= median_urban, ]
rural_states <- USArrests[USArrests$UrbanPop < median_urban, ]

mean_assault_urban <- mean(urban_states$Assault)
mean_assault_rural <- mean(rural_states$Assault)

mean_assault_urban
## [1] 187.4643
mean_assault_rural
## [1] 149.5

The average assault rate in urban states is 187.46 (per 100,000), considerably larger than the average assault rate in rural states of 149.5.


2nd part: Analysing longitudinal data

In this section, we will explore what longitudinal data looks like and how to manipulate it using tidyverse tools.

We will use the sleepstudy dataset from the lme4 package. This dataset contains repeated measures of reaction time for individuals over several days of sleep deprivation. Let’s start by installing the lme4 package, if you have never done so, and then loading the package.

# Install the lme4 package; you only need to do that once
## install.packages("lme4") # (uncomment if you need to install)

# Load the lme4 package
library(lme4)

## We are also going to use the tidyverse package, so let's load it now

# Install the tidyverse package; you only need to do that once
## install.packages("tidyverse") # (uncomment if you need to install)

# Load the tidyverse package
library(tidyverse)

Now, we can use the data() function to load the sleepstudy dataset. Usually, we will use external data sources and will import them into R. Today, let’s use a longitudinal dataset already pre-loaded in R:

# load the 'sleepstudy' dataset
data("sleepstudy", package = "lme4")

# If you want to read more about the dataset:
help("sleepstudy")

The variables available in the sleepstudy dataset are the following:

Name Description
Reaction Reaction time in milliseconds
Days Number of days of sleep deprivation (0 to 9)
Subject Identifier for each person

Is this a longitudinal dataset? Let’s investigate the number of observations and and the number of unique individuals observed in the dataset.

# we can use the nrow() function to check the number of observations
nrow(sleepstudy)
## [1] 180

Now, if the dataset contains \(n=\) 180 observations (i.e., rows), how many unique individuals were part of the study?

# checking how many individuals were part of the study
length(unique(sleepstudy$Subject))
## [1] 18

Only 18 individuals. Indeed, each person was observed on 10 different days. We can double check that using the count() function

# we can use the count() function
count(sleepstudy, Subject)
##    Subject  n
## 1      308 10
## 2      309 10
## 3      310 10
## 4      330 10
## 5      331 10
## 6      332 10
## 7      333 10
## 8      334 10
## 9      335 10
## 10     337 10
## 11     349 10
## 12     350 10
## 13     351 10
## 14     352 10
## 15     369 10
## 16     370 10
## 17     371 10
## 18     372 10

This dataset is therefore in the long format. Although we have only 18 individuals, the dataset contains 180 observations. In the other words, the unit of analysis of this dataset is the person-day observation.

Pivoting: from long to wide format

Some analytic strategies require data to be in the wide format. When we want to pivot long datasets to wide format, we can use the pivot_wider() function, from tidyverse. If we do that, each subject should have only one row, with days appearing as separate columns.

# pivot: from long to wide format
sleep_wide <- 
  sleepstudy %>% 
  pivot_wider(names_from = Days, 
              values_from = Reaction)

# checking that pivoting wider worked
head(sleep_wide)
## # A tibble: 6 × 11
##   Subject   `0`   `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`
##   <fct>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 308      250.  259.  251.  321.  357.  415.  382.  290.  431.  466.
## 2 309      223.  205.  203.  205.  208.  216.  214.  218.  224.  237.
## 3 310      199.  194.  234.  233.  229.  220.  235.  256.  261.  248.
## 4 330      322.  300.  284.  285.  286.  298.  280.  318.  305.  354.
## 5 331      288.  285   302.  320.  316.  293.  290.  335.  294.  372.
## 6 332      235.  243.  273.  310.  317.  310.  454.  347.  330.  254.

Pivoting: from wide to long format

Suppose we now want to go back to long format for modeling. Here’s how we can do that.

# pivot: from wide to long
sleep_long <- 
  sleep_wide %>%
  pivot_longer(
    cols = c('0', '1', '2', '3', '4', '5', '6', '7', '8', '9'),
    names_to = "Days",
    values_to = "Reaction"
  )

# check that pivoting longer worked
head(sleep_long)
## # A tibble: 6 × 3
##   Subject Days  Reaction
##   <fct>   <chr>    <dbl>
## 1 308     0         250.
## 2 308     1         259.
## 3 308     2         251.
## 4 308     3         321.
## 5 308     4         357.
## 6 308     5         415.

Visualising longitudinal data

One benefit of long format is ease of plotting with ggplot2.

# plotting trajectories
ggplot(sleepstudy, aes(x = Days, y = Reaction, group = Subject, color = Subject)) +
  geom_line(show.legend = FALSE) +
  labs(title = "Reaction Time by Day of Sleep Deprivation",
       x = "Days",
       y = "Reaction Time (ms)") +
  theme_minimal()


  1. Well, they need to start with a letter. But otherwise they may contain numbers, upper and lower case letters (R distinguishes between them), and punctuation such as dots ( . ) and underscores ( _ )↩︎