Throughout this course we will be using the statistical package R, and the friendly interface of RStudio, to conduct data analysis. The first part of today’s seminar is about getting familiar with this piece of software. If you have never used R before, worry not! No prior knowledge is assumed, and we will walk you through all the necessary steps to conduct analysis on the topics discussed in the lectures. Of course, learning a new statistical package is not simple, and we don’t aim to fully introduce you to the world of R – a wide and dynamic world that goes way beyond what we are covering here. This is just a gentle introduction to some specific R coding that corresponds to methods often used to infer causality from observational data (our main focus here). Our suggestion is that you build on this and keep practicing. Coding is like learning a foreign language: if you don’t use it, you lose it.

The first part of this assignment is a general introduction to R, starting from scratch. If you already have some experience with this software, you are welcome to join the second part of the assignment, where we analyse some experimental data.

Why R?

R is statistical software that allows us to manipulate data and estimate a wide variety of statistical models. It is one of the fastest-growing statistical software packages, one of the most popular data science tools, and, perhaps most importantly, it is open source (free!). We will also be using the RStudio user interface, which makes operating R somewhat easier.

Installing R and RStudio

You should install both R and RStudio onto your personal computers. You can download them from the following sources:

  • R: https://cran.r-project.org
  • RStudio: https://www.rstudio.com

1st part: Introduction to R

Let’s see what we have. After installing R and RStudio, we start RStudio and see three panels: a screen-long panel on the left-hand side called the console, a smaller panel on the top right-hand side called the environment, and a last one on the bottom right-hand side called Plots & Help. The console is the simplest way to interact with R: you can type in some code (after the arrow \(>\)) and press Enter, and R will run it and provide an output. It is easy to visualize this if we simply use R as a calculator: when we type mathematical operations into the console, R immediately returns the outcomes. Let’s see:

7 + 7
12 - 4
3 * 9
610 / 377
5^2
(0.31415 + 1.61803) / 3^3

See results!

## [1] 14
## [1] 8
## [1] 27
## [1] 1.618037
## [1] 25
## [1] 0.07156222


Directly typing code into the console is certainly easy, but often not the most efficient strategy. A better approach is to have a document where we can save all our code – that’s what scripts are for. R scripts are just plain text files that contain some R code, which we can edit the same way we would in Word or any other text editor. We can open R scripts within RStudio: just go to File –> New File –> R Script (or press Ctrl/Cmd + Shift + N for a shortcut).

We can now see a new panel popping up in the top left-hand side of RStudio, just above the console. You can now type all your code into the script, save it, and open it again whenever you want. If you want to run a piece of code from the script, you could always copy and paste it into the console, though this is of course very inefficient. Instead, you can ask R to run any piece of code from the script directly. There are a few different ways to do it.

  • To run an entire line, place the cursor on the line you want to run and use the Run button (on the top right-hand side of the script), or just press Ctrl/Cmd + Enter
  • To run multiple lines (or even just part of a single line), highlight the text you want to run and use the Run button, or press Ctrl/Cmd + Enter
  • To run the entire script, use the Source button, or press Ctrl/Cmd + Shift + S

You should always work from an R script! Our suggestion is that you create a different script for each seminar, and save them using reasonable names such as “seminar1.R”, “seminar2.R”, and so on.

Objects, vectors, functions

When working with R, we store information by creating “objects”. We create objects all the time; it is a simple way of labelling some piece of information so that we can use it in subsequent tasks. Say we want to know the outcome of \(3 * 5\), and then we want to divide that outcome by \(2\). We could, of course, simply ask R to do the calculations directly:

(3 * 5) / 2

But we could also first create an object that stores the result of \(3*5\) and then divide said object by \(2\). We can give objects any name we like.1 To create objects, we use the assignment operator <-. If we want to name the outcome of \(3*5\) outcome, we simply write:

outcome <- 3 * 5

Notice that R does not provide any output after we create an object. That is because we are not asking for any output, we are simply creating an object. Note, however, that the environment panel lists all the objects we create in the current session. In case we want to confirm what piece of information is stored under a given label, apart from checking the environment panel, we can also just run the name of the object and see R’s output. For instance, when we run outcome in the console, R returns the number 15, which is the outcome of \(3*5\).

outcome
## [1] 15

This is useful because, once we have created objects, we can use them to perform subsequent calculations. For instance:

outcome / 2
## [1] 7.5

And we can even use previously created objects to create a new object:

my_new_object <- outcome ^ 2
my_new_object
## [1] 225

Both objects that we have just created (outcome and my_new_object) contain just single numbers. But we can create objects that contain more information as well. Oftentimes, we want to create a long list of numbers in one specific order. Think of a regular spreadsheet, where we can enter numbers in a single column, separated (and ordered) according to their rows. In R, those single columns are equivalent to vectors. A vector is simply a set of information held together in a specific order. To create a vector in R, we use the c() function: instead of including new information at every row (as we would, were we using a regular spreadsheet), we separate new information by commas inside the parentheses. For instance, we could create a new vector by concatenating the following numbers:

new_vector <- c(0, 3, 1, 4, 1, 5, 9, 2)
new_vector
## [1] 0 3 1 4 1 5 9 2

Recall that vectors store information (in this case, numbers) in a specific order. This is important. We can use this order to access individual elements of a vector. We do this by subsetting the vector: we just need to use square brackets [ ] and include the number corresponding to the position we want to access. For instance, if we want to access the second element of our vector new_vector, we simply do the following:

new_vector[2]
## [1] 3

We can see that by using new_vector[2], R returns the second element of the vector new_vector, which is the number 3. If we want to access the seventh element of the vector new_vector, we just use new_vector[7] which returns the number 9, and so on.
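
As an aside (not shown above, but useful for the exercises below), we can also access several elements at once, either by concatenating positions with c() or by using the colon operator to specify a range of positions:

new_vector[c(2, 7)]  # second and seventh elements
## [1] 3 9
new_vector[2:4]      # second through fourth elements
## [1] 3 1 4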

Now that we have our first vector, we can see how functions work. Functions are a set of instructions: you provide an input and R generates some output – the backbone of R programming. A function is a command followed by round brackets ( ). Inputs are arguments that go inside the brackets; if a function requires more than one argument, these are separated by commas. For instance, we can add all elements of the vector new_vector together by using the function sum(). Here the input is the name of the vector:

sum(new_vector)
## [1] 25

And 25 is the output. As always, we can save the output as an object using the assignment operator <-.

sum_of_our_vector <- sum(new_vector)

Here, sum_of_our_vector is also an object! So we have performed a calculation (sum()) on some data (new_vector), and stored the result (sum_of_our_vector). Let’s try some new functions such as mean(), median(), and summary(). What are they calculating?

See results!

mean(new_vector)
## [1] 3.125
median(new_vector)
## [1] 2.5
summary(new_vector)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   2.500   3.125   4.250   9.000
mean() returns the average value of a vector, median() returns the median, and summary() returns a set of useful statistics, such as the minimum and the maximum values of the vector, the interquartile range, the median, and the mean.


You could create your own functions using the function() function – e.g. you could come up with a new function to calculate the mean of a vector (a minimal sketch appears at the end of this subsection). Say you create a set of new functions. They are very useful functions, so you would like to let anyone in the world use them as well by making them publicly available. This is the basic idea of R packages. People create new functions and make them publicly available. We will be using various packages throughout the course that help us conduct data analysis. An example of a useful package is psych, which contains the describe() function: like the summary() function, it describes the content of a variable or data set, but provides more detail. Let’s see how it looks.

describe(new_vector)
## Error in describe(new_vector): could not find function "describe"

We got an Error message! Why? Well, that is because the describe() function does not exist in base R. We first need to load the psych package; only then will its functions be available to us. This is a two-step process. The first step involves installing the relevant package on your computer using the install.packages() function; you only need to do this once. The second step involves loading the relevant package using the library() function; you need to do that every time you start a new session.

install.packages("psych")  # you only need to use this once in your computer
library(psych)             # making the package available in the current session
## Warning: package 'psych' was built under R version 4.0.2
describe(new_vector)        # look, now we can use the describe() function 
##    vars n mean  sd median trimmed  mad min max range skew kurtosis   se
## X1    1 8 3.12 2.9    2.5    3.12 2.22   0   9     9 0.81    -0.63 1.03

The describe() function is useful because it provides us with a wider set of useful statistics than the summary() function, such as the number of observations, standard deviation, range, skew and kurtosis. Note that R ignores everything that comes after the #. This is extremely useful to make comments throughout our code.
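
As promised above, here is a minimal sketch of a user-defined function. The name my_mean is our own invention; it simply replicates what the built-in mean() function does, using sum() and length():

my_mean <- function(x) {
  sum(x) / length(x)  # total of all elements divided by how many there are
}
my_mean(new_vector)
## [1] 3.125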

data.frames

Data frames are the workhorse of data analysis with R. A data.frame object is the R equivalent of a spreadsheet: each row represents a unit, each column represents a variable. Nearly every time we conduct data analysis with R, we will be working with data.frames. In most cases (including the second part of this seminar), we will load a data set from a spreadsheet-based external file (.csv, .xls, .dta, .sav, among others) into R; for now, however, we will use a data set that comes pre-installed with R just to see how it works. Let’s use the data() function to load the USArrests data set, which contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973.

data("USArrests")

We can use the help() function to read more about this data set, which contains 50 observations (i.e., rows) on 4 variables (i.e., columns).

help(USArrests)

The data.frame is listed as a new object in the environment panel. We can click on it to see it as a spreadsheet; we can also type in the name of the data set to see what it looks like. Because data sets are often very long, instead of seeing all of it we can opt to look at just the first few rows using the head() function:

head(USArrests, 10) # the second argument specifies the number of rows we want to see
##             Murder Assault UrbanPop Rape
## Alabama       13.2     236       58 21.2
## Alaska        10.0     263       48 44.5
## Arizona        8.1     294       80 31.0
## Arkansas       8.8     190       50 19.5
## California     9.0     276       91 40.6
## Colorado       7.9     204       78 38.7
## Connecticut    3.3     110       77 11.1
## Delaware       5.9     238       72 15.8
## Florida       15.4     335       80 31.9
## Georgia       17.4     211       60 25.8

Subsetting with $ and [,]

The easiest way to access a single variable (i.e., a column) of a data.frame is using the dollar sign $. For instance, to access the murder rate in US states in 1973:

USArrests$Murder
##  [1] 13.2 10.0  8.1  8.8  9.0  7.9  3.3  5.9 15.4 17.4  5.3  2.6 10.4  7.2  2.2
## [16]  6.0  9.7 15.4  2.1 11.3  4.4 12.1  2.7 16.1  9.0  6.0  4.3 12.2  2.1  7.4
## [31] 11.4 11.1 13.0  0.8  7.3  6.6  4.9  6.3  3.4 14.4  3.8 13.2 12.7  3.2  2.2
## [46]  8.5  4.0  5.7  2.6  6.8

R returns all the observations for the column Murder. What is this? It is a vector! A list of information (here, numbers) in one specific order. We can therefore apply everything we learned about vectors here. For example, we can access the third element of this vector:

USArrests$Murder[3]
## [1] 8.1

Which corresponds to the murder rate in Arizona. Let’s practice using the dollar sign to access the Assault variable. What are the first, tenth, and fifteenth elements?

See results!

USArrests$Assault[1]
## [1] 236
USArrests$Assault[10]
## [1] 211
USArrests$Assault[15]
## [1] 56


We saw earlier that we can subset a vector by using square brackets: [ ]. When dealing with data.frames, we often want to access certain observations (rows) or certain columns (variables) or a combination of the two without looking at the entire data set all at once. We can also use square brackets ([,]) to subset data.frames.

Inside the square brackets we put row and column coordinates separated by a comma. The row coordinate goes first and the column coordinate second. So USArrests[23, 3] returns the element in the 23rd row and third column of the data frame. If we leave the column coordinate empty, we get all columns: USArrests[10,] returns the 10th row of the data set. If we leave the row coordinate empty, R returns the entire column: USArrests[,4] returns the fourth column of the data set.

USArrests[23, 3]  # element in 23rd row, 3rd column
## [1] 66
USArrests[10,]    # entire 10th row
##         Murder Assault UrbanPop Rape
## Georgia   17.4     211       60 25.8
USArrests[,4]     # entire fourth column
##  [1] 21.2 44.5 31.0 19.5 40.6 38.7 11.1 15.8 31.9 25.8 20.2 14.2 24.0 21.0 11.3
## [16] 18.0 16.3 22.2  7.8 27.8 16.3 35.1 14.9 17.1 28.2 16.4 16.5 46.0  9.5 18.8
## [31] 32.1 26.1 16.1  7.3 21.4 20.0 29.3 14.9  8.3 22.5 12.8 26.9 25.5 22.9 11.2
## [46] 20.7 26.2  9.3 10.8 15.6

We can look at a selected range of rows of a data set using the colon operator inside the brackets: USArrests[1:7,] returns the first seven rows and all columns of the data.frame USArrests. We can display the second and fourth columns of the data set by using the c() function inside the brackets, like so: USArrests[, c(2,4)].
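
As a further aside (not covered above), columns of a data.frame can also be selected by their names rather than their positions, which often makes code easier to read:

USArrests[, "Murder"]               # same as USArrests$Murder or USArrests[, 1]
USArrests[1:3, c("Murder", "Rape")] # first three rows of two named columns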

Display all columns of the USArrests dataset and show rows 10 to 15. Next display all columns of the dataset but only for rows 10 and 15.

See results!

USArrests[10:15,]
##          Murder Assault UrbanPop Rape
## Georgia    17.4     211       60 25.8
## Hawaii      5.3      46       83 20.2
## Idaho       2.6     120       54 14.2
## Illinois   10.4     249       83 24.0
## Indiana     7.2     113       65 21.0
## Iowa        2.2      56       57 11.3
USArrests[c(10, 15),]
##         Murder Assault UrbanPop Rape
## Georgia   17.4     211       60 25.8
## Iowa       2.2      56       57 11.3


Logical operators

We can also subset by using logical values and logical operators. R has two special representations for logical values: TRUE and FALSE. R also has many logical operators, such as greater than (>), less than (<), and equal to (==).

When we apply a logical operator to an object, the value returned is a logical value (i.e., TRUE or FALSE). For instance:

5 > 3
## [1] TRUE
7 < 4
## [1] FALSE
2 == 1
## [1] FALSE

Here, when we ask R whether 5 is greater than 3, R returns the logical value TRUE. When we ask if 7 is less than 4, R returns the logical value FALSE. When we ask R whether 2 is equal to 1, R returns the logical value FALSE.
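
R has a few more comparison operators that follow the same pattern and will come in handy later in this seminar: greater than or equal to (>=), less than or equal to (<=), and not equal to (!=). For instance:

5 >= 5
## [1] TRUE
4 <= 3
## [1] FALSE
2 != 1
## [1] TRUE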

For the purposes of subsetting, logical operations are useful because they can be used to specify which elements of a vector or data.frame we would like returned. For instance, let’s subset the USArrests data.frame and keep only states with a murder rate of less than 5 per 100,000:

USArrests[USArrests$Murder < 5, ]
##               Murder Assault UrbanPop Rape
## Connecticut      3.3     110       77 11.1
## Idaho            2.6     120       54 14.2
## Iowa             2.2      56       57 11.3
## Maine            2.1      83       51  7.8
## Massachusetts    4.4     149       85 16.3
## Minnesota        2.7      72       66 14.9
## Nebraska         4.3     102       62 16.5
## New Hampshire    2.1      57       56  9.5
## North Dakota     0.8      45       44  7.3
## Oregon           4.9     159       67 29.3
## Rhode Island     3.4     174       87  8.3
## South Dakota     3.8      86       45 12.8
## Utah             3.2     120       80 22.9
## Vermont          2.2      48       32 11.2
## Washington       4.0     145       73 26.2
## Wisconsin        2.6      53       66 10.8

Let’s go through this code slowly to see what is going on. First, we are asking R to display the USArrests data.frame. But not all of it: we are using square brackets [ ], so only a subset of the data set is displayed. There is some information before, but nothing after, the comma inside the square brackets, which means that only a fraction of the rows but all of the columns should be displayed. Which rows? Let’s take a closer look at the code before the comma: R should only display the rows for which the expression USArrests$Murder < 5 is TRUE, i.e. states with a murder rate of less than 5 (per 100,000).
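
To see why this works, note that the expression inside the brackets is itself just a logical vector. A small sketch using the vector we created earlier:

new_vector < 5               # one logical value per element
## [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
new_vector[new_vector < 5]   # keeps only the elements for which the condition is TRUE
## [1] 0 3 1 4 1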

A few questions about data analysis with R

  1. Calculate the mean and median of each of the variables included in the data set. Assign each of the results of these calculations to objects (choose sensible names!).

See results!

mean_murder <- mean(USArrests$Murder)
median_murder <- median(USArrests$Murder)

mean_assault <- mean(USArrests$Assault)
median_assault <- median(USArrests$Assault)

mean_urban <- mean(USArrests$UrbanPop)
median_urban <- median(USArrests$UrbanPop)

mean_rape <- mean(USArrests$Rape)
median_rape <- median(USArrests$Rape)


  2. Is there a difference in the assault rate for urban and rural states? Define an urban state as one for which the urban population is greater than or equal to the median across all states. Define a rural state as one for which the urban population is less than the median.

See results!

urban_states <- USArrests[USArrests$UrbanPop >= median_urban, ]
rural_states <- USArrests[USArrests$UrbanPop < median_urban, ]

mean_assault_urban <- mean(urban_states$Assault)
mean_assault_rural <- mean(rural_states$Assault)

mean_assault_urban
## [1] 187.4643
mean_assault_rural
## [1] 149.5

The average assault rate in urban states is 187.46 (per 100,000), considerably larger than the average assault rate in rural states of 149.5.


2nd part: Analysing experimental data

Can transphobia be reduced through in-person conversations and perspective-taking exercises? To address this question, two researchers conducted a field experiment on door-to-door canvassing in South Florida. Targeting anti-transgender prejudice, the intervention involved canvassers holding single, approximately 10-minute conversations with voters that encouraged actively taking the perspective of others, to see whether these conversations could affect prejudicial attitudes towards transgender people.

In the experiment, the authors first recruited registered voters via mail for an online baseline survey. They then randomly assigned respondents of this baseline survey (\(n=1825\)) to either a treatment group targeted with the intervention (\(n=913\)) or a placebo group targeted with a conversation about recycling (\(n=912\)). For the intervention, 56 canvassers first knocked on voters’ doors unannounced. Then, canvassers asked to speak with the subject on their list and confirmed the person’s identity if the person came to the door. A total of several hundred individuals (\(n=501\)) came to their doors in the two conditions. For logistical reasons unrelated to the original study, we further reduce this data set to \(n=488\), which is the full sample that appears in the transphobia.csv data (available on ILIAS).

Under the treatment condition, the canvassers then engaged in a series of strategies previously shown to facilitate active processing: canvassers informed voters that they might face a decision about the issue (whether to vote to repeal the law protecting transgender people); canvassers asked voters to explain their views; and canvassers showed a video that presented arguments on both sides. Canvassers defined the term “transgender” at this point and, if they were transgender themselves, noted this. The canvassers next attempted to encourage “analogic perspective-taking”. Canvassers first asked each voter to talk about a time when they themselves were judged negatively for being different. The canvassers then encouraged voters to see how their own experience offered a window into transgender people’s experiences, hoping to facilitate voters’ ability to take transgender people’s perspectives. The intervention ended with another attempt to encourage active processing by asking voters to describe if and how the exercise changed their mind. All of the former steps constitute the “treatment”.

The placebo group was reminded that recycling is most effective when everyone participates. The canvassers talked about how they were working on ways to decrease environmental waste and asked the voters who came to the door about their support for a new law that would require supermarkets to charge for bags instead of giving them away for free. This was meant to mimic the effect of canvassers interacting with the voters in a face-to-face conversation on a topic different from transphobia.

The authors then asked respondents (\(n=488\)) to complete follow-up online surveys via email, presented as a continuation of the baseline survey. These follow-up surveys were conducted 3 days, 3 weeks, 6 weeks, and 3 months after the intervention. The authors then created an index of tolerance towards transgender people: higher values indicate higher tolerance, lower values indicate lower tolerance. The data set includes the following variables:

Name                 Description
vf_age               Age
vf_party             Party: D=Democrats, R=Republicans, N=Independents
vf_racename          Race: African American, Caucasian, Hispanic
vf_female            Gender: 1 if female, 0 if male
treat_ind            Treatment assignment: 1=treatment, 0=placebo
treatment.delivered  Intervention was actually delivered (=TRUE) vs. was not (=FALSE)
tolerance.t0         Tolerance variable at baseline
tolerance.t1         Tolerance captured 3 days after baseline
tolerance.t2         Tolerance captured 3 weeks after baseline
tolerance.t3         Tolerance captured 6 weeks after baseline
tolerance.t4         Tolerance captured 3 months after baseline

Preliminaries

It is sensible, when you start any data analysis project, to make sure your computer is set up in an efficient way. Our suggestion is that you create a folder on your computer where you can save all your scripts throughout the course (i.e., seminar1.R, seminar2.R, etc.). We also recommend you create a subfolder inside your main folder and name it data: this is where you should save all your data sets.

When we work with RStudio, the first thing we should do is set the working directory. This is essentially the folder on your computer where R will operate (e.g., when looking for data and other scripts). There are two ways to do this. The easiest (and recommended) is to set the folder you want R to work from as an R Project. You can do that by clicking on “Project: (none)” at the top-right corner of RStudio, then clicking “New Project” and assigning it to your folder of choice. The second way to set the working directory involves knowing the location of the relevant folder on your computer. Say you have created a folder named “Causal Inference in OS” inside a folder named “GESIS 2021” on your desktop. Then you can use the setwd() function to set the working directory:

setwd("~/Desktop/GESIS 2021/Causal Inference in OS")   # if you are working on a Mac
setwd("C:/Desktop/GESIS 2021/Causal Inference in OS")  # if you are working on a Windows PC

Loading data

Once you have downloaded the data, put the transphobia.csv file into the data folder that you created earlier in the seminar. Now load the data into the current R session using the read.csv() function:

transphobia <- read.csv('data/transphobia.csv')

You can now check the environment panel and see a data.frame object named transphobia with 488 rows (observations) and 11 columns (variables).
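
As an optional sanity check (output omitted here), we can also inspect the first rows and the structure of the data set directly in the console:

head(transphobia)  # first six rows of the data set
str(transphobia)   # name, type, and first values of each variable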

Some exercises

Question 1 – Describing variables

  1. Let’s start describing the data. Use the table() function and the treat_ind variable to find how many respondents were randomly assigned to the treatment and the control groups.

See results!

table(transphobia$treat_ind)
## 
##   0   1 
## 252 236

236 were randomly assigned to the treatment group, whereas 252 were randomly assigned to the control group.


  2. Simply counting how many respondents were assigned to each treatment might not be informative. A better approach is to calculate the proportions, which we can do using the prop.table() function. What percentage of respondents were randomly assigned to the treatment and control groups?

Code hint: the prop.table() function requires a table() as an argument!

See results!

prop.table(table(transphobia$treat_ind))
## 
##         0         1 
## 0.5163934 0.4836066
48.36% of the respondents were randomly assigned to the treatment group, whereas 51.64% were assigned to the control group.


  3. What about the response variable – how is it distributed across all respondents? Use the describe() function and the variable tolerance.t1, which measures tolerance levels towards transgender people three days after the intervention.

See results!

describe(transphobia$tolerance.t1)
##    vars   n mean   sd median trimmed  mad   min  max range  skew kurtosis   se
## X1    1 418 0.08 1.07   0.07     0.1 1.01 -2.26 2.07  4.32 -0.15    -0.46 0.05
We can see that this index of tolerance levels towards transgender people ranges from -2.26 to 2.07. The mean is 0.08, very similar to the median of 0.07, suggesting a symmetrical distribution. The standard deviation is 1.07.


Question 2 – Covariate balance

  1. In order to make causal claims, we need to be confident that our treatment groups are balanced. Do respondents in the treatment and control groups have similar characteristics in terms of their age?

See results!

To assess balance in the variable vf_age, we can calculate the average age for each treatment group:

mean(transphobia$vf_age[transphobia$treat_ind == T])  # average age in the treatment group
## [1] 50.07627
mean(transphobia$vf_age[transphobia$treat_ind == F])  # average age in the control group
## [1] 48.60317

Respondents in the treatment group are on average 50.08 years old, whereas respondents in the control group are on average 48.6 years old. To assess whether the estimated difference of 1.47 years between the two groups is due to sampling uncertainty, we can conduct a t test using the t.test() function:

t.test(x = transphobia$vf_age[transphobia$treat_ind == 1], 
       y = transphobia$vf_age[transphobia$treat_ind == 0], 
       conf.level = 0.95,
       var.equal = T)
## 
##  Two Sample t-test
## 
## data:  transphobia$vf_age[transphobia$treat_ind == 1] and transphobia$vf_age[transphobia$treat_ind == 0]
## t = 0.92987, df = 486, p-value = 0.3529
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.639634  4.585827
## sample estimates:
## mean of x mean of y 
##  50.07627  48.60317
Given that (i) the t statistic of 0.93 is lower than 1.96, (ii) the p-value of 0.35 is larger than 0.05, and (iii) the 95% confidence interval of \([-1.64; 4.59]\) includes zero as a plausible value (all three statements are equivalent), we can safely conclude that there is no significant difference in the average age of respondents assigned to the treatment and the control groups.


  2. Conduct the same analysis for the variables vf_female, vf_racename, and vf_party. Do respondents in the treatment and control groups have similar characteristics in terms of their gender, race, and party affiliation?

See results!

To assess the association between gender (as measured in the study) and treatment assignment, we can use the prop.table() and table() functions to visualize the cross-tabulation, and then conduct a Chi-squared test.

table_gender <- table(transphobia$vf_female,  # first argument is represented by rows
                      transphobia$treat_ind)  # second argument is represented by columns
prop.table(table_gender, 1) # the argument 1 indicates we want conditional proportions by rows
##    
##             0         1
##   0 0.4903846 0.5096154
##   1 0.5357143 0.4642857

As we can see, 49% of all male respondents were assigned to the control group, whereas 51% were assigned to the treatment group. Among female respondents, 54% were assigned to the control group and 46% were assigned to the treatment group. To handle sampling uncertainty, let’s use the chisq.test() function.

chisq.test(table_gender)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table_gender
## X-squared = 0.80883, df = 1, p-value = 0.3685

Given that \(p = 0.37\), we have little evidence to reject the null hypothesis that there is no association between gender and treatment assignment.

Let us now adopt the same strategies and check balance for vf_racename and vf_party:

table_race <- table(transphobia$vf_racename,
                    transphobia$treat_ind)
prop.table(table_race, 1)
##                   
##                            0         1
##   African American 0.5037037 0.4962963
##   Caucasian        0.5161290 0.4838710
##   Hispanic         0.5240175 0.4759825
table_party <- table(transphobia$vf_party,
                     transphobia$treat_ind)
prop.table(table_party, 1)
##    
##             0         1
##   D 0.4834711 0.5165289
##   N 0.5333333 0.4666667
##   R 0.5634921 0.4365079

Overall, respondents of all three racial groups (African American, Caucasian, and Hispanic) and all three political groups (Democrats, Republicans, and Independents) seem relatively well balanced across the treatment and control groups, with close to 50% of respondents of each profile in either treatment group.

chisq.test(table_race)
## 
##  Pearson's Chi-squared test
## 
## data:  table_race
## X-squared = 0.14038, df = 2, p-value = 0.9322
chisq.test(table_party)
## 
##  Pearson's Chi-squared test
## 
## data:  table_party
## X-squared = 2.3074, df = 2, p-value = 0.3155

Given that both p-values are larger than 0.05, we fail to reject both null hypotheses of no association between either race or party affiliation and treatment assignment.


  3. In particular, it is crucial to find balance in the response variable prior to the intervention. That is, respondents from both treatment groups should have, on average, the same levels of tolerance towards transgender people. This is the variable tolerance.t0. Is it the case here?

See details!

We can check whether respondents in the treatment and the control groups had the same levels of tolerance before the intervention by conducting a t test:

t.test(x = transphobia$tolerance.t0[transphobia$treat_ind == T],
       y = transphobia$tolerance.t0[transphobia$treat_ind == F],
       conf.level = 0.95,
       var.equal = T)
## 
##  Two Sample t-test
## 
## data:  transphobia$tolerance.t0[transphobia$treat_ind == T] and transphobia$tolerance.t0[transphobia$treat_ind == F]
## t = -0.41789, df = 486, p-value = 0.6762
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.2282952  0.1482183
## sample estimates:
##    mean of x    mean of y 
## -0.030558356  0.009480114
Based on the t statistic of -0.42, the p-value of 0.68, and the 95% confidence interval of \([-0.23; 0.15]\), we can safely conclude that respondents from both groups had, on average, the same levels of tolerance before the intervention.


  4. What do you conclude in relation to covariate balance? What does it imply in terms of our ability to make causal claims?

See results!

We seem to have covariate balance, as none of the covariates (i.e., age, gender, race, and party affiliation) is associated with the treatment groups. Tolerance levels at the baseline (i.e., before the intervention) are also independent of the treatment groups. This is, of course, expected, considering that respondents were randomly assigned to receive the intervention. This implies that any differences we find between the two groups can be attributed to the treatment. That is, we are able to make causal claims. For instance, if we compare the levels of tolerance towards transgender people between the two groups, we can identify the causal effect of in-person conversations and perspective-taking exercises on anti-transgender prejudice.


Question 3 – Estimating an ATE

  1. What is the average tolerance level 3 days after the intervention among those respondents who were randomly assigned to the treatment group? What about those in the control group? Can you interpret this mean difference causally?

See details!

# Average in the treatment group
average_treatment_t1 <- mean(transphobia$tolerance.t1[transphobia$treat_ind == T], na.rm = T)

# Average in the control group
average_control_t1 <- mean(transphobia$tolerance.t1[transphobia$treat_ind == F], na.rm = T)

# ATE
average_treatment_t1 - average_control_t1
## [1] 0.1443226

The average tolerance level 3 days after the intervention among those respondents who were randomly assigned to the treatment group is 0.15, whereas among those in the control group it is 0.01. The mean difference of 0.14 is the average treatment effect, given that respondents were randomly assigned to the treatment groups: being randomly assigned to the treatment group led to an increase of 0.14 points on the tolerance scale.
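
As an aside, an equivalent and more compact way to obtain both group means at once is the tapply() function, which applies a function (here, mean()) to a variable within each group:

tapply(transphobia$tolerance.t1, transphobia$treat_ind, mean, na.rm = TRUE)
# returns the average tolerance for the control (0) and treatment (1) groups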


  2. When our goal is to make causal claims, we are mostly interested in unbiased estimates like the mean difference that we just calculated. But we still need to handle uncertainty. After all, outcomes could be due to sampling error. Using the t.test() function, what do you conclude about the statistical significance of the ATE?

See results!

t.test(x = transphobia$tolerance.t1[transphobia$treat_ind == T],
       y = transphobia$tolerance.t1[transphobia$treat_ind == F],
       conf.level = 0.95,
       var.equal = T)
## 
##  Two Sample t-test
## 
## data:  transphobia$tolerance.t1[transphobia$treat_ind == T] and transphobia$tolerance.t1[transphobia$treat_ind == F]
## t = 1.3757, df = 416, p-value = 0.1696
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.06189076  0.35053604
## sample estimates:
##   mean of x   mean of y 
## 0.153535857 0.009213218

Given the t statistic of 1.38, the p-value of 0.17, and the 95% confidence interval of \([-0.06; 0.35]\), we fail to reject the null hypothesis that the mean difference is zero. In other words, the average treatment effect is not statistically significant. We have little evidence to sustain that being randomly assigned to the treatment group leads to an increase in tolerance levels.


  3. Estimate the average treatment effect again, now using a linear regression model.

Code hint: you can use the lm() function: lm(dependent_variable ~ explanatory_variable, data)

See results!

reg1 <- lm(tolerance.t1 ~ treat_ind, transphobia)
summary(reg1)
## 
## Call:
## lm(formula = tolerance.t1 ~ treat_ind, data = transphobia)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.41152 -0.69412 -0.06658  0.74185  2.05647 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.009213   0.071836   0.128    0.898
## treat_ind   0.144323   0.104907   1.376    0.170
## 
## Residual standard error: 1.07 on 416 degrees of freedom
##   (70 observations deleted due to missingness)
## Multiple R-squared:  0.004529,   Adjusted R-squared:  0.002136 
## F-statistic: 1.893 on 1 and 416 DF,  p-value: 0.1696

As expected, results are exactly the same as before. The coefficient for a binary explanatory variable in a simple linear regression model represents the mean difference. This shows that we can use the regression framework to estimate ATEs.
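
If we only need the estimate itself, we can extract it directly from the fitted model with the coef() function:

coef(reg1)["treat_ind"]  # the estimated ATE from the regression
## treat_ind 
## 0.1443226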


  4. Estimate a new linear regression model, now regressing tolerance.t1 on treat_ind, vf_age, vf_racename, vf_female, and vf_party. What happens to the coefficient for treat_ind? Is this expected?

See results!

reg2 <- lm(tolerance.t1 ~ treat_ind + vf_age + vf_racename + vf_female + vf_party, transphobia)

# install.packages('texreg')  # install the R package 'texreg'. You only need to do this once on your computer
library(texreg)
## Warning: package 'texreg' was built under R version 4.0.2
## Version:  1.37.5
## Date:     2020-06-17
## Author:   Philip Leifeld (University of Essex)
## 
## Consider submitting praise using the praise or praise_interactive functions.
## Please cite the JSS article in your publications -- see citation("texreg").
screenreg(list(reg1, reg2))
## 
## =========================================
##                       Model 1  Model 2   
## -----------------------------------------
## (Intercept)             0.01     0.01    
##                        (0.07)   (0.19)   
## treat_ind               0.14     0.15    
##                        (0.10)   (0.10)   
## vf_age                          -0.01 ***
##                                 (0.00)   
## vf_racenameCaucasian             0.97 ***
##                                 (0.15)   
## vf_racenameHispanic              0.77 ***
##                                 (0.14)   
## vf_female                        0.38 ***
##                                 (0.10)   
## vf_partyN                       -0.32 *  
##                                 (0.14)   
## vf_partyR                       -0.62 ***
##                                 (0.13)   
## -----------------------------------------
## R^2                     0.00     0.16    
## Adj. R^2                0.00     0.14    
## Num. obs.             418      418       
## =========================================
## *** p < 0.001; ** p < 0.01; * p < 0.05

The coefficient for treat_ind remains virtually unaltered after we include four new covariates in the regression model. This is expected, given that treat_ind was randomly assigned and achieved covariate balance.


Question 4 – What went wrong?

  1. Results are not encouraging. We found a positive but not statistically significant ATE. One thing that could explain this is treatment delivery: canvassers might have made mistakes and ended up engaging in a conversation about transphobia with respondents assigned to the control group and about recycling with respondents assigned to the treatment group. Using the prop.table() function and the variable treatment.delivered, check whether this is the case.

See results!

table_delivery <- table(transphobia$treat_ind, transphobia$treatment.delivered)
prop.table(table_delivery, 1)
##    
##          FALSE       TRUE
##   0 0.95634921 0.04365079
##   1 0.21610169 0.78389831

Considering respondents who were randomly assigned to the control group, 96% correctly received a placebo intervention and 4% incorrectly received the treatment intervention. Among those who were randomly assigned to the treatment group, 78% correctly received the treatment intervention and 22% incorrectly received the placebo intervention.


  2. Estimate a linear regression model regressing tolerance.t1 on treatment.delivered. Does the coefficient represent the ATE?

See results!

reg3 <- lm(tolerance.t1 ~ treatment.delivered, transphobia)
screenreg(list(reg1, reg2, reg3))
## 
## ======================================================
##                          Model 1  Model 2     Model 3 
## ------------------------------------------------------
## (Intercept)                0.01     0.01       -0.01  
##                           (0.07)   (0.19)      (0.07) 
## treat_ind                  0.14     0.15              
##                           (0.10)   (0.10)             
## vf_age                             -0.01 ***          
##                                    (0.00)             
## vf_racenameCaucasian                0.97 ***          
##                                    (0.15)             
## vf_racenameHispanic                 0.77 ***          
##                                    (0.14)             
## vf_female                           0.38 ***          
##                                    (0.10)             
## vf_partyN                          -0.32 *            
##                                    (0.14)             
## vf_partyR                          -0.62 ***          
##                                    (0.13)             
## treatment.deliveredTRUE                         0.22 *
##                                                (0.11) 
## ------------------------------------------------------
## R^2                        0.00     0.16        0.01  
## Adj. R^2                   0.00     0.14        0.01  
## Num. obs.                418      418         418     
## ======================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05

Respondents who received the treatment intervention (an in-person conversation about transphobia with perspective-taking exercises) had tolerance levels 0.22 points higher three days later than respondents who received the placebo intervention (a conversation about recycling). This difference is statistically significant, suggesting a relationship between treatment delivery and prejudice. However, this is not the ATE: treatment delivery was not randomly assigned, and therefore there could be potential confounders.



  1. Well, they need to start with a letter. But otherwise they may contain numbers, upper and lower case letters (R distinguishes between them), and punctuation such as dots ( . ) and underscores ( _ ).