Throughout this course we will be using the statistical package `R`, and the friendly interface of *RStudio*, to conduct data analysis. The first part of today’s seminar is about getting familiar with this piece of software. If you have never used R before, worry not! No prior knowledge is assumed, and we will walk you through all the necessary steps to conduct analysis on the topics discussed in the lectures. Of course, learning a new statistical software is not simple, and we don’t aim to fully introduce you to the world of R – a wide and dynamic world that goes way beyond what we are covering here. This is just a gentle introduction to some specific R coding that corresponds to methods often used to infer causality from observational data (our main focus here). Our suggestion is that you build on this and keep practicing. Coding is like learning a foreign language: if you don’t use it, you lose it.

The first part of this assignment is a general introduction to R, starting from scratch. If you already have some experience with this software, you are welcome to join the second part of the assignment, where we analyse some experimental data.

R is a statistical software that allows us to manipulate data and estimate a wide variety of statistical models. It is one of the fastest growing statistical software packages, one of the most popular data science software packages, and, perhaps most importantly, it is open source (free!). We will also be using the RStudio user-interface, which makes operating R somewhat easier.

You should install *both* R and RStudio on your personal computer. You can download them from the following sources:

Let’s see what we have. After installing R and RStudio, we start RStudio and see three panels. A screen-long panel on the left-hand side called the *console*, a smaller panel on the top right-hand side called the *environment*, and a last one on the bottom right-hand side called *Plots & Help*. The console is the simplest way to interact with R: you can type in some code (after the arrow \(>\)) and press *Enter*, and R will run it and provide an output. It is easy to visualize this if we simply use R as a calculator: when we type mathematical operations into the console, R immediately returns the outcomes. Let’s see:

```
7 + 7
12 - 4
3 * 9
610 / 377
5^2
(0.31415 + 1.61803) / 3^3
```

`## [1] 14`

`## [1] 8`

`## [1] 27`

`## [1] 1.618037`

`## [1] 25`

`## [1] 0.07156222`

Directly typing code into the console is certainly easy, but often not the most efficient strategy. A better approach is to have a document where we can save all our code – that’s what *scripts* are for. R scripts are just plain text files that contain some R code, which we can edit the same way we would edit any other text file. We can open R scripts within RStudio: just go to *File –> New File –> R Script* (or press *Cmd/Ctrl + Shift + N* for a shortcut).

We can now see a new panel popping up taking up the space in the top left-hand side of Rstudio, just above the console. You can now type all your code into the script, save it, and open it again whenever you want. If you want to run a piece of code from the script, you can always just copy and paste it on the console, though this is of course very inefficient. Instead, you can ask R to run any piece of code from the script directly. There are a few different ways to do it.

- To run an entire line, place the cursor on the line you want to run and use the *Run* button (on the top right-hand side of the script), or just press *Ctrl/Cmd + Enter*
- To run multiple lines (or even just part of a single line), highlight the text you want to run and use the *Run* button, or press *Ctrl/Cmd + Enter*
- To run the entire script, use the *Source* button, or press *Ctrl/Cmd + Shift + S*

**You should always work from an R script!** Our suggestion is that you create a different script for each seminar, and save them using reasonable names such as “seminar1.R”, “seminar2.R”, and so on.

When working with R, we store information by creating “objects”. We create objects all the time; it is a simple way of labelling some piece of information so that we can use it in subsequent tasks. Say we want to know the outcome of \(3 * 5\), and then we want to divide that outcome by \(2\). We could, of course, simply ask R to do the calculations directly:

`(3 * 5) / 2`

But we could also first create an object that stores the result of \(3*5\) and then divide said object by \(2\). We can give objects any name we like. To create objects, we need to use the assignment operator `<-`. If we want to name the outcome of \(3*5\) *outcome*, then we simply need to use the assignment operator:

`outcome <- 3 * 5`

Notice that R does not provide any output after we create an object. That is because we are not asking for any output; we are simply creating an object. Note, however, that the environment panel lists all the objects we create in the current session. In case we want to confirm what piece of information is stored under a given label, apart from checking the environment panel, we can also just run the name of the object and see R’s output. For instance, when we run `outcome` in the console, R returns the number 15, which is the outcome of \(3*5\).

`outcome`

`## [1] 15`

This is useful because, once we have created objects, we can use them to perform subsequent calculations. For instance:

`outcome / 2`

`## [1] 7.5`

And we can even use previously created objects to create a new object:

```
my_new_object <- outcome ^ 2
my_new_object
```

`## [1] 225`

Both objects that we have just created (`outcome` and `my_new_object`) contain just single numbers. But we can create objects that contain more information as well. Oftentimes, we want to create a long list of numbers in one specific order. Think of a regular spreadsheet, where we can enter numbers in a single column, separated (and ordered) according to their rows. In R, those single columns are equivalent to **vectors**. A vector is simply a set of information held together in a specific order. To create a vector in R, we use the `c()` function: instead of entering new information at every row (as we would in a regular spreadsheet), we separate pieces of information by commas inside the parentheses. For instance, we could create a new vector by concatenating the following numbers:

```
new_vector <- c(0, 3, 1, 4, 1, 5, 9, 2)
new_vector
```

`## [1] 0 3 1 4 1 5 9 2`

Recall that vectors store information (in this case, numbers) in a specific order. This is important. We can use this order to access individual elements of a vector. We do this by *subsetting* the vector: we just need to use square brackets `[ ]` and include the number corresponding to the position we want to access. For instance, if we want to access the second element in our vector `new_vector`, we simply do the following:

`new_vector[2]`

`## [1] 3`

We can see that by using `new_vector[2]`, R returns the second element of the vector `new_vector`, which is the number 3. If we want to access the seventh element of the vector `new_vector`, we just use `new_vector[7]`, which returns the number 9, and so on.
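As a small extension of the above (not covered in the seminar itself), we can also access several elements at once by putting a vector of positions inside the square brackets:

```r
# new_vector as created above
new_vector <- c(0, 3, 1, 4, 1, 5, 9, 2)

# access the second and seventh elements in one go
new_vector[c(2, 7)]  # returns 3 and 9
```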

Now that we have our first vector, we can see how **functions** work. Functions are a set of instructions: you provide an input and R generates some output – the backbone of R programming. A function is a command followed by round brackets `( )`. Inputs are arguments that go inside the brackets; if a function requires more than one argument, these are separated by commas. For instance, we can add all the elements of the vector `new_vector` together by using the function `sum()`. Here the input is the name of the vector:

`sum(new_vector)`

`## [1] 25`

And 25 is the output. As always, we can save the output as an object using the assignment operator `<-`.

`sum_of_our_vector <- sum(new_vector)`

Here, `sum_of_our_vector` is also an object! So we have performed a calculation (`sum()`) on some data (`new_vector`), and stored the result (`sum_of_our_vector`). Let’s try some new functions such as `mean()`, `median()`, and `summary()`. What are they calculating?

`mean(new_vector)`

`## [1] 3.125`

`median(new_vector)`

`## [1] 2.5`

`summary(new_vector)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 2.500 3.125 4.250 9.000
```

`mean()` returns the average value of a vector, `median()` returns the median, and `summary()` returns a set of useful statistics, such as the minimum and maximum values of the vector, the interquartile range, the median, and the mean.
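These built-in functions are not magic: we could write our own. As a minimal sketch (the name `my_mean` is our own illustrative choice, and it assumes a numeric vector with no missing values), here is a home-made version of `mean()`:

```r
# a home-made version of mean(): add up all elements, divide by how many
# (illustrative sketch; assumes a numeric vector with no missing values)
my_mean <- function(x) {
  sum(x) / length(x)
}

my_mean(c(0, 3, 1, 4, 1, 5, 9, 2))  # 3.125, the same as mean()
```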
You could create your own functions using the `function()` function – e.g. you could come up with a new function to calculate the mean of a vector. Say you create a set of new functions. If they are useful, you might like to let anyone in the world use them as well by making them publicly available. This is the basic idea of *R packages*: people create new functions and make them publicly available. We will be using various packages throughout the course to help us conduct data analysis. An example of a useful package is `psych`, which contains the `describe()` function: like the `summary()` function, it describes the content of a variable or data set, but provides more detail. Let’s see how it looks.

`describe(new_vector)`

`## Error in describe(new_vector): could not find function "describe"`

We got an error message! Why? Well, that is because the `describe()` function does not exist in *base R*. We first need to load the `psych` package; only then will its new functions be available to us. This is a two-step process. The first step involves installing the relevant package on your computer using the `install.packages()` function; you only need to do this once. The second step involves loading the package using the `library()` function; you need to do that every time you start a new session.

`install.packages("psych") # you only need to use this once in your computer`

`library(psych) # making the package available in the current session`

`## Warning: package 'psych' was built under R version 4.0.2`

`describe(new_vector) # look, now we can use the describe() function `

```
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 8 3.12 2.9 2.5 3.12 2.22 0 9 9 0.81 -0.63 1.03
```

The `describe()` function is useful because it provides us with a wider set of statistics than the `summary()` function, such as the number of observations, standard deviation, range, skew, and kurtosis. Note that R ignores everything that comes after the `#`; this is extremely useful for making comments throughout our code.

Data frames are the workhorse of data analysis with R. A `data.frame` object is the R equivalent of a spreadsheet: each row represents a unit, each column represents a variable. Nearly every time we conduct data analysis with R, we will be working with `data.frames`. In most cases (including the second part of this seminar), we will load a data set from a spreadsheet-based external file (.csv, .xls, .dta, .sav, among others) into R; for now, however, we will use a data set that comes pre-installed with R just to see how it works. Let’s use the `data()` function to load the `USArrests` data set, which contains statistics on arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973.

`data("USArrests")`

We can use the `help()` function to read more about this data set, which contains 50 observations (i.e., rows) on 4 variables (i.e., columns).

`help(USArrests)`

The `data.frame` is listed as a new object in the environment panel. We can click on it to see it as a spreadsheet; we can also type in the name of the data set to see what it looks like. Because data sets are often very long, instead of seeing all of it we can opt to look at just the first few rows using the `head()` function:

`head(USArrests, 10) # the second argument specifies the number of rows we want to see`

```
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
## Connecticut 3.3 110 77 11.1
## Delaware 5.9 238 72 15.8
## Florida 15.4 335 80 31.9
## Georgia 17.4 211 60 25.8
```

**`$` and `[,]`**

The easiest way to access a single variable (i.e., a column) of a `data.frame` is using the dollar sign `$`. For instance, to access the murder rate in US states in 1973:

`USArrests$Murder`

```
## [1] 13.2 10.0 8.1 8.8 9.0 7.9 3.3 5.9 15.4 17.4 5.3 2.6 10.4 7.2 2.2
## [16] 6.0 9.7 15.4 2.1 11.3 4.4 12.1 2.7 16.1 9.0 6.0 4.3 12.2 2.1 7.4
## [31] 11.4 11.1 13.0 0.8 7.3 6.6 4.9 6.3 3.4 14.4 3.8 13.2 12.7 3.2 2.2
## [46] 8.5 4.0 5.7 2.6 6.8
```

R returns all the observations for the column `Murder`. What is this? It is a vector! A list of information (here, numbers) in one specific order. We can therefore apply everything we learned about vectors. For example, we can access the third element of this vector:

`USArrests$Murder[3]`

`## [1] 8.1`

This corresponds to the murder rate in Arizona. Let’s practice using the dollar sign to access the `Assault` variable. What are the first, tenth, and fifteenth elements?

`USArrests$Assault[1]`

`## [1] 236`

`USArrests$Assault[10]`

`## [1] 211`

`USArrests$Assault[15]`

`## [1] 56`

We saw earlier that we can subset a vector using square brackets: `[ ]`. When dealing with data.frames, we often want to access certain observations (rows), certain columns (variables), or a combination of the two without looking at the entire data set all at once. We can also use square brackets (`[,]`) to subset data.frames.

Inside the square brackets we put row and column coordinates separated by a comma. The row coordinate goes first and the column coordinate second. So `USArrests[23, 3]` returns the element in the 23rd row and third column of the data frame. If we leave the column coordinate empty, R returns all columns: `USArrests[10,]` returns the entire 10th row of the data set. If we leave the row coordinate empty, R returns the entire column: `USArrests[,4]` returns the fourth column of the data set.

`USArrests[23, 3] # element in 23rd row, 3rd column`

`## [1] 66`

`USArrests[10,] # entire 10th row`

```
## Murder Assault UrbanPop Rape
## Georgia 17.4 211 60 25.8
```

`USArrests[,4] # entire fourth column`

```
## [1] 21.2 44.5 31.0 19.5 40.6 38.7 11.1 15.8 31.9 25.8 20.2 14.2 24.0 21.0 11.3
## [16] 18.0 16.3 22.2 7.8 27.8 16.3 35.1 14.9 17.1 28.2 16.4 16.5 46.0 9.5 18.8
## [31] 32.1 26.1 16.1 7.3 21.4 20.0 29.3 14.9 8.3 22.5 12.8 26.9 25.5 22.9 11.2
## [46] 20.7 26.2 9.3 10.8 15.6
```

We can look at a selected range of rows of a data set with the colon operator in brackets: `USArrests[1:7,]` returns the first seven rows and all columns of the data.frame `USArrests`. We could display the second and fourth columns of the data set by using the `c()` function in brackets, like so: `USArrests[, c(2,4)]`.
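For instance, the first of these commands returns the first seven states of the data set (this output is not shown in the seminar itself, but matches the `head()` listing above):

```r
data("USArrests")  # the data set used throughout this section
USArrests[1:7, ]   # rows 1 to 7, all columns (Alabama through Connecticut)
```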

Display all columns of the `USArrests` data set for rows 10 to 15. Next, display all columns of the data set but only for rows 10 and 15.

`USArrests[10:15,]`

```
## Murder Assault UrbanPop Rape
## Georgia 17.4 211 60 25.8
## Hawaii 5.3 46 83 20.2
## Idaho 2.6 120 54 14.2
## Illinois 10.4 249 83 24.0
## Indiana 7.2 113 65 21.0
## Iowa 2.2 56 57 11.3
```

`USArrests[c(10, 15),]`

```
## Murder Assault UrbanPop Rape
## Georgia 17.4 211 60 25.8
## Iowa 2.2 56 57 11.3
```

We can also subset by using logical values and logical operators. R has two special representations for logical values: `TRUE` and `FALSE`. R also has many logical operators, such as greater than (`>`), less than (`<`), and equal to (`==`).

When we apply a logical operator to an object, the value returned is a logical value (i.e. `TRUE` or `FALSE`). For instance:

`5 > 3`

`## [1] TRUE`

`7 < 4`

`## [1] FALSE`

`2 == 1`

`## [1] FALSE`

Here, when we ask R whether 5 is greater than 3, R returns the logical value `TRUE`. When we ask whether 7 is less than 4, R returns `FALSE`. And when we ask whether 2 is equal to 1, R again returns `FALSE`.

For the purposes of subsetting, logical operations are useful because they can be used to specify which elements of a vector or data.frame we would like returned. For instance, let’s subset `USArrests` and keep only states with a murder rate less than 5 per 100,000:

`USArrests[USArrests$Murder < 5, ]`

```
## Murder Assault UrbanPop Rape
## Connecticut 3.3 110 77 11.1
## Idaho 2.6 120 54 14.2
## Iowa 2.2 56 57 11.3
## Maine 2.1 83 51 7.8
## Massachusetts 4.4 149 85 16.3
## Minnesota 2.7 72 66 14.9
## Nebraska 4.3 102 62 16.5
## New Hampshire 2.1 57 56 9.5
## North Dakota 0.8 45 44 7.3
## Oregon 4.9 159 67 29.3
## Rhode Island 3.4 174 87 8.3
## South Dakota 3.8 86 45 12.8
## Utah 3.2 120 80 22.9
## Vermont 2.2 48 32 11.2
## Washington 4.0 145 73 26.2
## Wisconsin 2.6 53 66 10.8
```

Let’s go through this code slowly to see what is going on. First, we are asking R to display the `USArrests` data.frame. But not all of it: we are using square brackets `[ ]`, so only a subset of the data set is displayed. There is some information before, but nothing after, the comma inside the square brackets, which means that only a fraction of the rows but all of the columns should be displayed. Which rows? Let’s take a closer look at the code before the comma. R should only display the rows for which the expression `USArrests$Murder < 5` is `TRUE`, i.e. states with a murder rate less than 5 (per 100,000).
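To see the mechanism at work, we can run the logical expression on its own – it returns one `TRUE` or `FALSE` per state, and the square brackets keep exactly the rows where it is `TRUE` (this intermediate step is our own illustration, not part of the seminar output):

```r
data("USArrests")
low_murder <- USArrests$Murder < 5   # one TRUE/FALSE per state

# since TRUE counts as 1 and FALSE as 0, sum() counts the qualifying states
sum(low_murder)  # 16 states have a murder rate below 5

# identical to the subsetting command above
nrow(USArrests[low_murder, ])  # also 16
```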

- Calculate the mean and median of each of the variables included in the data set. Assign each of the results of these calculations to objects (choose sensible names!).

```
mean_murder <- mean(USArrests$Murder)
median_murder <- median(USArrests$Murder)
mean_assault <- mean(USArrests$Assault)
median_assault <- median(USArrests$Assault)
mean_urban <- mean(USArrests$UrbanPop)
median_urban <- median(USArrests$UrbanPop)
mean_rape <- mean(USArrests$Rape)
median_rape <- median(USArrests$Rape)
```

- Is there a difference in the assault rate for urban and rural states? Define an urban state as one for which the urban population is greater than or equal to the median across all states. Define a rural state as one for which the urban population is less than the median.

```
urban_states <- USArrests[USArrests$UrbanPop >= median_urban, ]
rural_states <- USArrests[USArrests$UrbanPop < median_urban, ]
mean_assault_urban <- mean(urban_states$Assault)
mean_assault_rural <- mean(rural_states$Assault)
mean_assault_urban
```

`## [1] 187.4643`

`mean_assault_rural`

`## [1] 149.5`

The average assault rate in urban states is 187.46 (per 100,000), considerably larger than the average assault rate in rural states of 149.5.

Can transphobia be reduced through in-person conversations and perspective-taking exercises? To address this question, two researchers conducted a field experiment on door-to-door canvassing in South Florida. Targeting anti-transgender prejudice, the intervention involved canvassers holding single, approximately 10-minute conversations with voters that encouraged actively taking the perspective of others, to see whether these conversations could affect prejudicial attitudes towards transgender people.

- Broockman, David, and Joshua Kalla. 2016. “Durably reducing transphobia: A field experiment on door-to-door canvassing.” *Science* 352 (6282): 220-224.

In the experiment, the authors first recruited registered voters via mail for an online baseline survey. They then **randomly assigned** respondents of this baseline survey (\(n=1825\)) to either a treatment group targeted with the intervention (\(n=913\)) or a placebo group targeted with a conversation about recycling (\(n=912\)). For the intervention, 56 canvassers knocked on voters’ doors unannounced, asked to speak with the subject on their list, and confirmed the person’s identity if the person came to the door. A total of \(n=501\) individuals came to their doors in the two conditions. For logistical reasons unrelated to the original study, we further reduce this data set to \(n=488\), which is the full sample that appears in the `transphobia.csv` data (available on ILIAS).

The canvassers then engaged in a series of strategies previously shown to facilitate active processing under the treatment condition: canvassers informed voters that they might face a decision about the issue (whether to vote to repeal the law protecting transgender people); canvassers asked voters to explain their views; and canvassers showed a video that presented arguments on both sides. Canvassers defined the term “transgender” at this point and, if they were transgender themselves, noted this. The canvassers next attempted to encourage “analogic perspective-taking”. Canvassers first asked each voter to talk about a time when they themselves were judged negatively for being different. The canvassers then encouraged voters to see how their own experience offered a window into transgender people’s experiences, hoping to facilitate voters’ ability to take transgender people’s perspectives. The intervention ended with another attempt to encourage active processing by asking voters to describe if and how the exercise changed their mind. All of these steps together constitute the “treatment.”

The placebo group was reminded that recycling was most effective when everyone participates. The canvassers talked about how they were working on ways to decrease environmental waste and asked the voters who came to the door about their support for a new law that would require supermarkets to charge for bags instead of giving them away for free. This was meant to mimic the effect of canvassers interacting with the voters in face-to-face conversation on a topic different from transphobia.

The authors then asked respondents (\(n=488\)) to complete follow-up online surveys via email, presented as a continuation of the baseline survey. These follow-up surveys began 3 days, 3 weeks, 6 weeks, and 3 months after the intervention. The authors then created an index of tolerance towards transgender people: higher values indicate higher tolerance, lower values indicate lower tolerance. The data set includes the following variables:

Name | Description
---|---
`vf_age` | Age
`vf_party` | Party: `D` = Democrats, `R` = Republicans, `N` = Independents
`vf_racename` | Race: `African American`, `Caucasian`, `Hispanic`
`vf_female` | Gender: `1` if female, `0` if male
`treat_ind` | Treatment assignment: `1` = treatment, `0` = placebo
`treatment.delivered` | Intervention was actually delivered (`TRUE`) vs. was not (`FALSE`)
`tolerance.t0` | Tolerance variable at Baseline
`tolerance.t1` | Tolerance captured 3 days after Baseline
`tolerance.t2` | Tolerance captured 3 weeks after Baseline
`tolerance.t3` | Tolerance captured 6 weeks after Baseline
`tolerance.t4` | Tolerance captured 3 months after Baseline

It is sensible, when you start any data analysis project, to make sure your computer is set up in an efficient way. Our suggestion is that you create a folder on your computer where you can save all your scripts throughout the course (i.e., `seminar1.R`, `seminar2.R`, etc.). We also recommend that you create a *subfolder* inside your main folder and name it `data`: this is where you should save all your data sets.

When we work with RStudio, the first thing we should do is set the *working directory*. This is essentially the folder on your computer where R will operate (e.g., when looking for data and other scripts). There are two ways to do this. The easiest (and recommended) is to set the folder from which you want R to work as an *R Project*. You can do that by clicking on “*Project: (none)*” at the top-right corner of RStudio, then clicking “*New Project*” and assigning it to your folder of choice. The second way involves knowing the location of the relevant folder on your computer. Say you have created a folder named “Causal Inference in OS” inside a folder named “GESIS 2021” on your desktop. Then you can use the `setwd()` function to set the working directory:

```
setwd("~/Desktop/GESIS 2021/Causal Inference in OS") # if you are working on a Mac
setwd("C:/Desktop/GESIS 2021/Causal Inference in OS") # if you are working on a Windows PC
```

Once you have downloaded the data, put the `transphobia.csv` file into the `data` folder that you created earlier in the seminar. Now load the data into the current R session using the `read.csv()` function:

`transphobia <- read.csv('data/transphobia.csv')`

You can now check the environment panel and see a data.frame object named `transphobia` with 488 rows (observations) and 11 columns (variables).

**Question 1 – Describing variables**

- Let’s start by describing the data. Use the `table()` function and the `treat_ind` variable to find out how many respondents were randomly assigned to the treatment and control groups.

`table(transphobia$treat_ind)`

```
##
## 0 1
## 252 236
```

236 were randomly assigned to the treatment group, whereas 252 were randomly assigned to the control group.

- Simply counting how many respondents were assigned to each treatment might not be informative. A better approach is to calculate proportions, which we can do using the `prop.table()` function. What percentage of respondents were randomly assigned to the treatment and control groups?

**Code hint**: the `prop.table()` function requires a `table()` as its argument!

`prop.table(table(transphobia$treat_ind))`

```
##
## 0 1
## 0.5163934 0.4836066
```

48.36% of the respondents were randomly assigned to the treatment group, whereas 51.64% were assigned to the control group.

- What about the response variable: how is it distributed across all respondents? Use the `describe()` function and the variable `tolerance.t1`, which measures tolerance levels towards transgender people three days after the intervention.

`describe(transphobia$tolerance.t1)`

```
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 418 0.08 1.07 0.07 0.1 1.01 -2.26 2.07 4.32 -0.15 -0.46 0.05
```

We can see that this index of tolerance towards transgender people ranges from -2.26 to 2.07. The mean is 0.08, very similar to the median of 0.07, suggesting a symmetrical distribution. The standard deviation is 1.07.

**Question 2 – Covariate balance**

- In order to make causal claims, we need to be confident that our treatment groups are balanced. Do respondents in the treatment and control groups have similar characteristics in terms of their age?

To assess balance in the variable `vf_age`, we can calculate the average age for each treatment group:

`mean(transphobia$vf_age[transphobia$treat_ind == T]) # average age in the treatment group`

`## [1] 50.07627`

`mean(transphobia$vf_age[transphobia$treat_ind == F]) # average age in the control group`

`## [1] 48.60317`

Respondents in the treatment group are on average 50.08 years old, whereas respondents in the control group are on average 48.6 years old. To assess whether the estimated difference of 1.48 years between the two groups is due to sampling uncertainty, we can conduct a t test using the `t.test()` function:

```
t.test(x = transphobia$vf_age[transphobia$treat_ind == 1],
y = transphobia$vf_age[transphobia$treat_ind == 0],
conf.level = 0.95,
var.equal = T)
```

```
##
## Two Sample t-test
##
## data: transphobia$vf_age[transphobia$treat_ind == 1] and transphobia$vf_age[transphobia$treat_ind == 0]
## t = 0.92987, df = 486, p-value = 0.3529
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.639634 4.585827
## sample estimates:
## mean of x mean of y
## 50.07627 48.60317
```

Given that (i) the t statistic of 0.93 is lower than 1.96, (ii) the p-value of 0.35 is larger than 0.05, and (iii) the confidence interval of \([-1.64; 4.59]\) includes zero as a plausible value (all three statements are equivalent), we can safely conclude that there is no significant difference in the average age of respondents assigned to the treatment and control groups.

- Conduct the same analysis for the variables `vf_female`, `vf_racename`, and `vf_party`. Do respondents in the treatment and control groups have similar characteristics in terms of their gender, race, and party affiliation?

We can use the `prop.table()` and `table()` functions to visualize the cross-tabulation, and then conduct a Chi-squared test.
```
table_gender <- table(transphobia$vf_female, # first argument is represented by rows
transphobia$treat_ind) # second argument is represented by columns
prop.table(table_gender, 1) # the argument 1 indicates we want conditional proportions by rows
```

```
##
## 0 1
## 0 0.4903846 0.5096154
## 1 0.5357143 0.4642857
```

As we can see, 49% of all male respondents were assigned to the control group, whereas 51% were assigned to the treatment group. Among female respondents, 54% were assigned to the control group and 46% to the treatment group. To handle sampling uncertainty, let’s use the `chisq.test()` function.

`chisq.test(table_gender)`

```
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table_gender
## X-squared = 0.80883, df = 1, p-value = 0.3685
```

Given that \(p = 0.37\), we have little evidence to reject the null hypothesis that there is no association between gender and treatment assignment.

Let us now adopt the same strategy and check balance for `vf_racename` and `vf_party`:

```
table_race <- table(transphobia$vf_racename,
transphobia$treat_ind)
prop.table(table_race, 1)
```

```
##
## 0 1
## African American 0.5037037 0.4962963
## Caucasian 0.5161290 0.4838710
## Hispanic 0.5240175 0.4759825
```

```
table_party <- table(transphobia$vf_party,
transphobia$treat_ind)
prop.table(table_party, 1)
```

```
##
## 0 1
## D 0.4834711 0.5165289
## N 0.5333333 0.4666667
## R 0.5634921 0.4365079
```

Overall, respondents of all three racial groups (African American, Caucasian, and Hispanic) and all three political groups (Democrats, Republicans, and Independents) seem relatively well balanced across the treatment and control groups, with close to 50% of respondents of each profile in either treatment group.

`chisq.test(table_race)`

```
##
## Pearson's Chi-squared test
##
## data: table_race
## X-squared = 0.14038, df = 2, p-value = 0.9322
```

`chisq.test(table_party)`

```
##
## Pearson's Chi-squared test
##
## data: table_party
## X-squared = 2.3074, df = 2, p-value = 0.3155
```

Given that both p-values are larger than 0.05, we fail to reject both null hypotheses of no association between either race or party affiliation and treatment assignment.

- In particular, it is crucial to find balance in the response variable prior to the intervention. That is, respondents from both treatment groups should have, on average, the same levels of tolerance towards transgender people. This is the variable `tolerance.t0`. Is that the case here?

We can again use the `t.test()` function:

```
t.test(x = transphobia$tolerance.t0[transphobia$treat_ind == T],
y = transphobia$tolerance.t0[transphobia$treat_ind == F],
conf.level = 0.95,
var.equal = T)
```

```
##
## Two Sample t-test
##
## data: transphobia$tolerance.t0[transphobia$treat_ind == T] and transphobia$tolerance.t0[transphobia$treat_ind == F]
## t = -0.41789, df = 486, p-value = 0.6762
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.2282952 0.1482183
## sample estimates:
## mean of x mean of y
## -0.030558356 0.009480114
```

Based on the t statistic of -0.42, the p-value of 0.68, and the 95% confidence interval of \([-0.23; 0.15]\), we can safely conclude that respondents from both groups had, on average, the same levels of tolerance before the intervention.

- What do you conclude in relation to covariate balance? What does it imply in terms of our ability to make causal claims?

**Question 3 – Estimating an ATE**

- What is the average tolerance level 3 days after the intervention among those respondents who were randomly assigned to the treatment group? What about those in the control group? Can you interpret this mean difference causally?

```
# Average in the treatment group
average_treatment_t1 <- mean(transphobia$tolerance.t1[transphobia$treat_ind == T], na.rm = T)
# Average in the control group
average_control_t1 <- mean(transphobia$tolerance.t1[transphobia$treat_ind == F], na.rm = T)
# ATE
average_treatment_t1 - average_control_t1
```

`## [1] 0.1443226`

The average tolerance level 3 days after the intervention among respondents who were randomly assigned to the treatment group is 0.15, whereas among those in the control group it is 0.01. The mean difference of 0.14 is the *average treatment effect*, given that respondents were randomly assigned to the treatment groups: being randomly assigned to the treatment group led to an increase of 0.14 points on the tolerance scale.
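The same difference in means can also be computed in one step with `tapply()`, which computes the mean of the outcome within each level of the grouping variable. A sketch, assuming the same data:

```r
# Mean of tolerance.t1 within each level of treat_ind (0, then 1);
# diff() subtracts the first entry from the second, giving
# treated minus control -- the estimated ATE
with(transphobia,
     diff(tapply(tolerance.t1, treat_ind, mean, na.rm = TRUE)))
```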

- When our goal is to make causal claims, we are mostly interested in *unbiased estimates* like the mean difference that we just calculated. But we still need to handle uncertainty: after all, the difference could be due to sampling error. Using the `t.test()` function, what do you conclude about the statistical significance of the ATE?

```
t.test(x = transphobia$tolerance.t1[transphobia$treat_ind == T],
       y = transphobia$tolerance.t1[transphobia$treat_ind == F],
       conf.level = 0.95,
       var.equal = T)
```

```
##
## Two Sample t-test
##
## data: transphobia$tolerance.t1[transphobia$treat_ind == T] and transphobia$tolerance.t1[transphobia$treat_ind == F]
## t = 1.3757, df = 416, p-value = 0.1696
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.06189076 0.35053604
## sample estimates:
## mean of x mean of y
## 0.153535857 0.009213218
```

Given the t statistic of 1.38, the p-value of 0.17, and the 95% confidence interval of \([-0.06; 0.35]\), we *fail* to reject the null hypothesis that the mean difference is zero. In other words, the average treatment effect is *not* statistically significant: we have little evidence to support the claim that being randomly assigned to the treatment group leads to an increase in tolerance levels.

- Estimate the average treatment effect again, now using a linear regression model.

**Code hint**: you can use the `lm()` function: `lm(dependent_variable ~ explanatory_variable, data)`

```
reg1 <- lm(tolerance.t1 ~ treat_ind, transphobia)
summary(reg1)
```

```
##
## Call:
## lm(formula = tolerance.t1 ~ treat_ind, data = transphobia)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.41152 -0.69412 -0.06658 0.74185 2.05647
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.009213 0.071836 0.128 0.898
## treat_ind 0.144323 0.104907 1.376 0.170
##
## Residual standard error: 1.07 on 416 degrees of freedom
## (70 observations deleted due to missingness)
## Multiple R-squared: 0.004529, Adjusted R-squared: 0.002136
## F-statistic: 1.893 on 1 and 416 DF, p-value: 0.1696
```

As expected, results are exactly the same as before. The coefficient for a binary explanatory variable in a simple linear regression model represents the mean difference. This shows that we can use the regression framework to estimate ATEs.
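We can verify the equivalence directly by pulling the estimate out of the fitted model object, sketched as:

```r
# coef() returns the vector of estimated coefficients; the 'treat_ind'
# entry is the mean difference between treatment and control
coef(reg1)["treat_ind"]

# confint() gives the matching 95% confidence interval, which coincides
# with the t.test() interval above (both assume equal variances)
confint(reg1, "treat_ind", level = 0.95)
```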

- Estimate a new linear regression model, now regressing `tolerance.t1` on `treat_ind`, `vf_age`, `vf_racename`, `vf_female`, and `vf_party`. What happens with the coefficient for `treat_ind`? Is this expected?

```
reg2 <- lm(tolerance.t1 ~ treat_ind + vf_age + vf_racename + vf_female + vf_party, transphobia)
# install.packages('texreg') # install the R package 'texreg'. You only need to do this once on your computer
library(texreg)
```

`## Warning: package 'texreg' was built under R version 4.0.2`

```
## Version: 1.37.5
## Date: 2020-06-17
## Author: Philip Leifeld (University of Essex)
##
## Consider submitting praise using the praise or praise_interactive functions.
## Please cite the JSS article in your publications -- see citation("texreg").
```

```
screenreg(list(reg1, reg2))
```

```
##
## =========================================
## Model 1 Model 2
## -----------------------------------------
## (Intercept) 0.01 0.01
## (0.07) (0.19)
## treat_ind 0.14 0.15
## (0.10) (0.10)
## vf_age -0.01 ***
## (0.00)
## vf_racenameCaucasian 0.97 ***
## (0.15)
## vf_racenameHispanic 0.77 ***
## (0.14)
## vf_female 0.38 ***
## (0.10)
## vf_partyN -0.32 *
## (0.14)
## vf_partyR -0.62 ***
## (0.13)
## -----------------------------------------
## R^2 0.00 0.16
## Adj. R^2 0.00 0.14
## Num. obs. 418 418
## =========================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
```

The coefficient for `treat_ind` remains virtually unaltered after we include four new covariates in the regression model. This is expected, given that `treat_ind` was randomly assigned and achieved covariate balance.
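To see this in numbers, we can compare the treatment coefficient across the two specifications and inspect its confidence interval in the adjusted model, sketched as:

```r
# Treatment coefficient with and without covariate adjustment
c(unadjusted = coef(reg1)["treat_ind"],
  adjusted   = coef(reg2)["treat_ind"])

# The adjusted model's 95% CI for the treatment effect still includes zero
confint(reg2, "treat_ind", level = 0.95)
```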

**Question 4 – What went wrong?**

- Results are not encouraging. We found a positive but not statistically significant ATE. One thing that could explain this is treatment delivery: canvassers might have made mistakes and ended up engaging in a conversation about transphobia with respondents assigned to the control group and about recycling with respondents assigned to the treatment group. Using the `prop.table()` function and the variable `treatment.delivered`, check whether this is the case.

```
table_delivery <- table(transphobia$treat_ind, transphobia$treatment.delivered)
prop.table(table_delivery, 1)
```

```
##
## FALSE TRUE
## 0 0.95634921 0.04365079
## 1 0.21610169 0.78389831
```

Considering respondents who were randomly assigned to the control group, 96% correctly received a placebo intervention and 4% incorrectly received the treatment intervention. Among those who were randomly assigned to the treatment group, 78% correctly received the treatment intervention and **22%** incorrectly received the placebo intervention.

- Estimate a linear regression model regressing `tolerance.t1` on `treatment.delivered`. Does the coefficient represent the ATE?

```
reg3 <- lm(tolerance.t1 ~ treatment.delivered, transphobia)
screenreg(list(reg1, reg2, reg3))
```

```
##
## ======================================================
## Model 1 Model 2 Model 3
## ------------------------------------------------------
## (Intercept) 0.01 0.01 -0.01
## (0.07) (0.19) (0.07)
## treat_ind 0.14 0.15
## (0.10) (0.10)
## vf_age -0.01 ***
## (0.00)
## vf_racenameCaucasian 0.97 ***
## (0.15)
## vf_racenameHispanic 0.77 ***
## (0.14)
## vf_female 0.38 ***
## (0.10)
## vf_partyN -0.32 *
## (0.14)
## vf_partyR -0.62 ***
## (0.13)
## treatment.deliveredTRUE 0.22 *
## (0.11)
## ------------------------------------------------------
## R^2 0.00 0.16 0.01
## Adj. R^2 0.00 0.14 0.01
## Num. obs. 418 418 418
## ======================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
```

Respondents who received the treatment intervention (an in-person conversation about transphobia with perspective-taking exercises) had tolerance levels 0.22 points higher three days later than respondents who received the placebo intervention (a conversation about recycling). This difference is statistically significant, suggesting a relationship between treatment delivery and tolerance. However, this is **not** the ATE: treatment delivery was not randomly assigned, and therefore there could be potential confounders.

Well, they need to start with a letter. But otherwise they may contain numbers, upper- and lower-case letters (R distinguishes between them), and punctuation such as dots (`.`) and underscores (`_`).↩︎