Throughout this course we will be using the statistical package
R, and the friendly interface of Rstudio, to conduct data analysis. The first part of today’s seminar is about getting familiar with this piece of software. If you have never used R before, worry not! No prior knowledge is assumed, and we will walk you through all the necessary steps to conduct analysis on the topics discussed in the lectures. Of course, learning a new statistical software is not simple, and we don’t aim to fully introduce you to the world of R – a wide and dynamic world that goes way beyond what we are covering here. This is just a gentle introduction to some specific R coding that corresponds to some methods often used to infer causality from observational data (our main focus here). Our suggestion is that you build on this and keep practicing. Coding is like learning a foreign language: if you don’t use it, you lose it.
The first part of this assignment is a general introduction to R, starting from scratch. If you already have some experience with this software, you are welcome to join the second part of the assignment, where we analyse some experimental data.
R is a statistical software that allows us to manipulate data and estimate a wide variety of statistical models. It is one of the fastest growing statistical software packages, one of the most popular data science software packages, and, perhaps most importantly, it is open source (free!). We will also be using the RStudio user-interface, which makes operating R somewhat easier.
You should install both R and Rstudio onto your personal computers. You can download them from the following sources:
Let’s see what we have. After installing R and Rstudio, we start Rstudio and see three panels: a screen-long panel on the left-hand side called the console, a smaller panel on the top right-hand side called the environment, and a last one on the bottom right-hand side called Plots & Help. The console is the simplest way to interact with R: you can type in some code (after the arrow \(>\)) and press Enter, and R will run it and provide an output. It is easy to visualize this if we simply use R as a calculator: when we type mathematical operations into the console, R immediately returns the outcomes. Let’s see:
7 + 7
##  14
12 - 4
##  8
3 * 9
##  27
610 / 377
##  1.618037
5^2
##  25
(0.31415 + 1.61803) / 3^3
##  0.07156222
Directly typing code into the console is certainly easy, but often not the most efficient strategy. A better approach is to have a document where we can save all our code – that’s what scripts are for. R scripts are just plain text files that contain some R code, which we can edit the same way we would in Word or any other text file. We can open R scripts within Rstudio: just go to File –> New File –> R Script (or just press Cmd/Ctrl + shift + N for a shortcut).
We can now see a new panel popping up, taking up the space in the top left-hand side of Rstudio, just above the console. You can now type all your code into the script, save it, and open it again whenever you want. If you want to run a piece of code from the script, you could always copy and paste it into the console, though this is of course very inefficient. Instead, you can ask R to run any piece of code from the script directly: for instance, place the cursor on a line (or highlight a block of code) and press Cmd/Ctrl + Enter, or click the “Run” button at the top of the script panel.
You should always work from an R script! Our suggestion is that you create a different script for each seminar, and save them using reasonable names such as “seminar1.R”, “seminar2.R”, and so on.
When working with R, we store information by creating “objects”. We create objects all the time; it is a simple way of labelling some piece of information so that we can use it in subsequent tasks. Say we want to know the outcome of \(3 * 5\), and then we want to divide that outcome by \(2\). We could, of course, simply ask R to do the calculations directly:
(3 * 5) / 2
But we could also first create an object that stores the result of \(3*5\) and then divide said object by \(2\). We can give objects any name we like.1 To create objects, we need to use the assignment operator
<-. If we want to name the outcome of \(3*5\) outcome, then we simply need to use the assignment operator:
outcome <- 3 * 5
Notice that R does not provide any output after we create an object. That is because we are not asking for any output, we are simply creating an object. Note, however, that the environment panel lists all the objects we create in the current session. In case we want to confirm what piece of information is stored under a given label, apart from checking the environment panel, we can also just run the name of the object and see R’s output. For instance, when we run
outcome in the console, R returns the number 15, which is the outcome of \(3*5\).
##  15
This is useful because, once we have created objects, we can use them to perform subsequent calculations. For instance:
outcome / 2
##  7.5
And we can even use previously created objects to create a new object:
my_new_object <- outcome ^ 2
my_new_object
##  225
Both objects that we have just created (outcome and
my_new_object) contain just single numbers. But we can create objects that contain more information as well. Oftentimes, we want to create a long list of numbers in one specific order. Think of a regular spreadsheet, where we can enter numbers in a single column, separated (and ordered) according to their rows. In R language, those single columns are equivalent to vectors. A vector is simply a set of information contained together in a specific order. In order to create a vector in R, we use the
c() function: instead of including new information at every row (as we would, were we using a regular spreadsheet), we separate new information by commas inside the parentheses. For instance, we could create a new vector by concatenating the following numbers:
new_vector <- c(0, 3, 1, 4, 1, 5, 9, 2)
new_vector
##  0 3 1 4 1 5 9 2
Recall that vectors store information (in this case, numbers) in a specific order. This is important. We can use this order to access individual elements of a vector. We do this by subsetting a vector: we just need to use square brackets [ ] and include the number corresponding to the position we want to access. For instance, if we want to access the second element in our vector
new_vector, we simply do the following:
new_vector[2]
##  3
We can see that by using new_vector[2], R returns the second element of the vector new_vector, which is the number 3. If we want to access the seventh element of the vector new_vector, we just use new_vector[7], which returns the number 9, and so on.
Now that we have our first vector, we can see how functions work. Functions are a set of instructions: you provide an input and R generates some output – the backbone of R programming. A function is a command followed by round brackets
( ). Inputs are arguments that go inside the brackets; if a function requires more than one argument, these are separated by commas. For instance, we can add all elements of the vector
new_vector together by using the function
sum(). Here the input is the name of the vector:
sum(new_vector)
##  25
And 25 is the output. As always, we can save the result of the output as an object using the assignment operator <-:
sum_of_our_vector <- sum(new_vector)
sum_of_our_vector is also an object! So we have performed a calculation (
sum()) on some data (
new_vector), and stored the result (
sum_of_our_vector). Let’s try some new functions, such as mean(), median(), and summary(). What are they calculating?
mean(new_vector)
##  3.125
median(new_vector)
##  2.5
summary(new_vector)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   2.500   3.125   4.250   9.000
mean() returns the average value of a vector,
median() returns the median, and
summary() returns a set of useful statistics, such as the minimum and the maximum values of the vector, the interquartile range, the median, and the mean.
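Many functions also take more than one argument, separated by commas inside the parentheses. Here is a small generic illustration (these examples are not tied to the seminar data, just to the idea of multiple arguments):

```r
# The second argument of round() sets the number of decimal places
round(3.14159, digits = 2)       # returns 3.14

# The na.rm argument tells mean() to ignore missing values (NA)
mean(c(1, 2, NA), na.rm = TRUE)  # returns 1.5
```

Without na.rm = TRUE, the second call would return NA, because R refuses to average over missing values by default.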
You could create your own functions using the
function() function – e.g. you could come up with a new function to calculate the mean of a vector. Say you create a set of new functions. They turn out to be really useful, so you would like to let anyone in the world use them as well by making them publicly available. This is the basic idea of R packages: people create new functions and make them publicly available. We will be using various packages throughout the course that help us conduct data analysis. An example of a useful package is
psych, which contains the
describe() function: like the
summary() function, it describes the content of a variable or data set, but provides more detail. Let’s see how it looks:
describe(new_vector)
## Error in describe(new_vector): could not find function "describe"
We got an Error message! Why? Well, that is because the
describe() function does not exist in base R. We first need to load the
psych package; only then will its new functions be available to us. This is a two-step process. The first step involves installing the relevant package on your computer using the
install.packages() function; you only need to do this once. The second step involves loading the relevant package using the
library() function; you need to do that every time you start a new session.
install.packages("psych") # you only need to use this once in your computer
library(psych) # making the package available in the current session
## Warning: package 'psych' was built under R version 4.0.2
describe(new_vector) # look, now we can use the describe() function
##    vars n mean  sd median trimmed  mad min max range skew kurtosis   se
## X1    1 8 3.12 2.9    2.5    3.12 2.22   0   9     9 0.81    -0.63 1.03
The describe() function is useful because it provides us with a wider set of useful statistics than the
summary() function, such as the number of observations, standard deviation, range, skew and kurtosis. Note that R ignores everything that comes after the
#. This is extremely useful to make comments throughout our code.
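Before we move on to data frames, here is a quick sketch of the function() idea mentioned above. The name my_mean is our own invention for illustration, not a function from any package; it simply reproduces what the built-in mean() does for a numeric vector:

```r
# my_mean() is a made-up name for this example, not a package function
my_mean <- function(x) {
  sum(x) / length(x)   # add up all elements, divide by how many there are
}

my_mean(c(0, 3, 1, 4, 1, 5, 9, 2))   # returns 3.125, the same as mean()
```

Everything between the curly braces is the body of the function; whatever the last expression evaluates to is returned as the output.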
Data frames are the workhorse when conducting data analysis with R. A
data.frame object is the R equivalent to a spreadsheet: each row represents a unit, each column represents a variable. Nearly every time we conduct data analysis with R, we will be working with
data.frames. In most cases (including the second part of this seminar), we will load a data set from a spreadsheet-based external file (.csv, .xls, .dta, .sav, among others) onto R; for now however, we will use a dataset that comes pre-installed with R just to see how it works. Let’s use the
data() function to load the
USArrests data set, which contains statistics in arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973:
data("USArrests")
We can use the
help() function to read more about this data set, which contains 50 observations (i.e., rows) on 4 variables (i.e., columns):
help(USArrests)
The USArrests data.frame is now listed as a new object in the environment panel. We can click on it to see it as a spreadsheet; we can also type in the name of the data set to see what it looks like. Because data sets are often very long, instead of seeing all of it we can opt to look at just the first few rows using the head() function:
head(USArrests, 10) # the second argument specifies the number of rows we want to see
##             Murder Assault UrbanPop Rape
## Alabama       13.2     236       58 21.2
## Alaska        10.0     263       48 44.5
## Arizona        8.1     294       80 31.0
## Arkansas       8.8     190       50 19.5
## California     9.0     276       91 40.6
## Colorado       7.9     204       78 38.7
## Connecticut    3.3     110       77 11.1
## Delaware       5.9     238       72 15.8
## Florida       15.4     335       80 31.9
## Georgia       17.4     211       60 25.8
The easiest way to access a single variable (i.e., a column) of a
data.frame is using the dollar sign
$. For instance, to access the murder rate in US states in 1973:
USArrests$Murder
##  13.2 10.0 8.1 8.8 9.0 7.9 3.3 5.9 15.4 17.4 5.3 2.6 10.4 7.2 2.2 ##  6.0 9.7 15.4 2.1 11.3 4.4 12.1 2.7 16.1 9.0 6.0 4.3 12.2 2.1 7.4 ##  11.4 11.1 13.0 0.8 7.3 6.6 4.9 6.3 3.4 14.4 3.8 13.2 12.7 3.2 2.2 ##  8.5 4.0 5.7 2.6 6.8
R returns all the observations for the column
Murder. What is this? It is a vector! A list of information (here, numbers) in one specific order. We can therefore apply everything we learned about vectors here. For example, we can access the third element of this vector:
USArrests$Murder[3]
##  8.1
Which corresponds to the murder rate in Arizona. Let’s practice using the dollar sign to access the Assault variable. What are the first, tenth, and fifteenth elements?
USArrests$Assault[1]
##  236
USArrests$Assault[10]
##  211
USArrests$Assault[15]
##  56
We saw earlier that we can subset a vector by using square brackets:
[ ]. When dealing with data.frames, we often want to access certain observations (rows) or certain columns (variables) or a combination of the two without looking at the entire data set all at once. We can also use square brackets (
[,]) to subset data.frames.
In square brackets we put row and column coordinates separated by a comma. The row coordinate goes first and the column coordinate second. So
USArrests[23, 3] returns the 23rd row and third column of the data frame. If we leave the column coordinate empty this means we would like all columns. So,
USArrests[10,] returns the 10th row of the data set. If we leave the row coordinate empty, R returns the entire column. So,
USArrests[,4] returns the fourth column of the data set.
USArrests[23, 3] # element in 23rd row, 3rd column
##  66
USArrests[10,] # entire 10th row
##         Murder Assault UrbanPop Rape
## Georgia   17.4     211       60 25.8
USArrests[,4] # entire fourth column
##  21.2 44.5 31.0 19.5 40.6 38.7 11.1 15.8 31.9 25.8 20.2 14.2 24.0 21.0 11.3 ##  18.0 16.3 22.2 7.8 27.8 16.3 35.1 14.9 17.1 28.2 16.4 16.5 46.0 9.5 18.8 ##  32.1 26.1 16.1 7.3 21.4 20.0 29.3 14.9 8.3 22.5 12.8 26.9 25.5 22.9 11.2 ##  20.7 26.2 9.3 10.8 15.6
We can look at a selected number of rows of a dataset with the colon in brackets:
USArrests[1:7,] returns the first seven rows and all columns of the data.frame
USArrests. We could display the second and fourth columns of the dataset by using the
c() function in brackets, like so:
USArrests[, c(2, 4)]
Display all columns of the
USArrests dataset and show rows 10 to 15. Next display all columns of the dataset but only for rows 10 and 15.
USArrests[10:15, ]
##          Murder Assault UrbanPop Rape
## Georgia    17.4     211       60 25.8
## Hawaii      5.3      46       83 20.2
## Idaho       2.6     120       54 14.2
## Illinois   10.4     249       83 24.0
## Indiana     7.2     113       65 21.0
## Iowa        2.2      56       57 11.3
USArrests[c(10, 15), ]
##         Murder Assault UrbanPop Rape
## Georgia   17.4     211       60 25.8
## Iowa       2.2      56       57 11.3
We can also subset by using logical values and logical operators. R has two special representations for logical values:
TRUE and FALSE. R also has many logical operators, such as greater than (
>), less than (
<), or equal to (==). When we apply a logical operator to an object, the value returned should be a logical value (i.e., TRUE or FALSE). For instance:
5 > 3
##  TRUE
7 < 4
##  FALSE
2 == 1
##  FALSE
Here, when we ask R whether 5 is greater than 3, R returns the logical value
TRUE. When we ask if 7 is less than 4, R returns the logical value
FALSE. When we ask R whether 2 is equal to 1 (note the double equals sign ==), R again returns the logical value FALSE.
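Logical operators also work element-wise on vectors, which is exactly what makes them so handy for subsetting. A small illustration using the vector from earlier:

```r
new_vector <- c(0, 3, 1, 4, 1, 5, 9, 2)

# The comparison is applied to every element, returning one TRUE/FALSE each:
new_vector > 3
##  FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE

# Inside square brackets, the logical vector keeps only the TRUE positions:
new_vector[new_vector > 3]
##  4 5 9
```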
For the purposes of subsetting, logical operations are useful because they can be used to specify which elements of a vector or data.frame we would like returned. For instance, let’s subset the
USArrests and keep only states with a murder rate less than 5 per 100,000:
USArrests[USArrests$Murder < 5, ]
##               Murder Assault UrbanPop Rape
## Connecticut      3.3     110       77 11.1
## Idaho            2.6     120       54 14.2
## Iowa             2.2      56       57 11.3
## Maine            2.1      83       51  7.8
## Massachusetts    4.4     149       85 16.3
## Minnesota        2.7      72       66 14.9
## Nebraska         4.3     102       62 16.5
## New Hampshire    2.1      57       56  9.5
## North Dakota     0.8      45       44  7.3
## Oregon           4.9     159       67 29.3
## Rhode Island     3.4     174       87  8.3
## South Dakota     3.8      86       45 12.8
## Utah             3.2     120       80 22.9
## Vermont          2.2      48       32 11.2
## Washington       4.0     145       73 26.2
## Wisconsin        2.6      53       66 10.8
Let’s go through this code slowly to see what is going on here. First, we are asking R to display the
USArrests data.frame. But not all of it: we are using square brackets
[ ], so only a subset of the dataset is displayed. There is some information before but nothing after the comma inside the square brackets, which means that only a fraction of rows but all columns should be displayed. Which rows? Let’s take a closer look at the code before the comma inside the square brackets. R should only display the rows for which the expression
USArrests$Murder < 5 is
TRUE, i.e. states with a murder rate less than 5 (per 100,000).
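As an aside, base R also offers the subset() function, which performs the same filtering with slightly more readable syntax. This is just an equivalent alternative, a matter of taste rather than what the seminar code uses:

```r
data("USArrests")

# subset() keeps the rows satisfying the condition;
# equivalent to USArrests[USArrests$Murder < 5, ]
low_murder <- subset(USArrests, Murder < 5)

nrow(low_murder)   # 16 states have a murder rate below 5 per 100,000
```

Note that inside subset() we can write Murder directly, without the USArrests$ prefix.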
mean_murder <- mean(USArrests$Murder)
median_murder <- median(USArrests$Murder)
mean_assault <- mean(USArrests$Assault)
median_assault <- median(USArrests$Assault)
mean_urban <- mean(USArrests$UrbanPop)
median_urban <- median(USArrests$UrbanPop)
mean_rape <- mean(USArrests$Rape)
median_rape <- median(USArrests$Rape)
urban_states <- USArrests[USArrests$UrbanPop >= median_urban, ]
rural_states <- USArrests[USArrests$UrbanPop < median_urban, ]
mean_assault_urban <- mean(urban_states$Assault)
mean_assault_rural <- mean(rural_states$Assault)
mean_assault_urban
##  187.4643
mean_assault_rural
##  149.5
The average assault rate in urban states is 187.46 (per 100,000), considerably larger than the average assault rate in rural states of 149.5.
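As an aside, the same group means can be computed in a single step with tapply() from base R; this is just an alternative sketch, not the approach the seminar requires:

```r
data("USArrests")

# tapply() applies a function (here, mean) to subsets of a vector
# defined by a grouping variable
is_urban <- USArrests$UrbanPop >= median(USArrests$UrbanPop)
tapply(USArrests$Assault, is_urban, mean)   # FALSE = rural states, TRUE = urban states
##     FALSE     TRUE 
##  149.5000 187.4643
```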
Can transphobia be reduced through in-person conversations and perspective-taking exercises? To address this question, two researchers conducted a field experiment on door-to-door canvassing in South Florida. Targeting antitransgender prejudice, the intervention involved canvassers holding single, approximately 10-minute conversations with voters that encouraged actively taking the perspective of others, to see if these conversations could affect prejudicial attitudes towards transgender people.
In the experiment, the authors first recruited registered voters via mail for an online baseline survey. They then randomly assigned respondents of this baseline survey (\(n=1825\)) to either a treatment group targeted with the intervention (\(n=913\)) or a placebo group targeted with a conversation about recycling (\(n=912\)). For the intervention, 56 canvassers first knocked on voters’ doors unannounced. Then, canvassers asked to speak with the subject on their list and confirmed the person’s identity if the person came to the door. A total of several hundred individuals (\(n=501\)) came to their doors in the two conditions. For logistical reasons unrelated to the original study, we further reduce this dataset to \(n=488\), which is the full sample that appears in the
transphobia.csv data (available on ILIAS).
The canvassers then engaged in a series of strategies previously shown to facilitate active processing under the treatment condition: canvassers informed voters that they might face a decision about the issue (whether to vote to repeal the law protecting transgender people); canvassers asked voters to explain their views; and canvassers showed a video that presented arguments on both sides. Canvassers defined the term “transgender” at this point and, if they were transgender themselves, noted this. The canvassers next attempted to encourage “analogic perspective-taking”. Canvassers first asked each voter to talk about a time when they themselves were judged negatively for being different. The canvassers then encouraged voters to see how their own experience offered a window into transgender people’s experiences, hoping to facilitate voters’ ability to take transgender people’s perspectives. The intervention ended with another attempt to encourage active processing by asking voters to describe if and how the exercise changed their mind. All of these steps together constitute the “treatment”.
The placebo group was reminded that recycling was most effective when everyone participates. The canvassers talked about how they were working on ways to decrease environmental waste and asked the voters who came to the door about their support for a new law that would require supermarkets to charge for bags instead of giving them away for free. This was meant to mimic the effect of canvassers interacting with the voters in face-to-face conversation on a topic different from transphobia.
The authors then asked respondents (\(n=488\)) to complete follow-up online surveys via email, presented as a continuation of the baseline survey. These follow-up surveys began 3 days, 3 weeks, 6 weeks, and 3 months after the intervention. The authors then created an index of tolerance towards transgender people: higher values indicate higher tolerance, lower values indicate lower tolerance. The data set includes the following variables:
| treatment.delivered | Intervention was actually delivered (= TRUE) or not (= FALSE) |
| tolerance.t0 | Tolerance variable at Baseline |
| tolerance.t1 | Tolerance captured at 3 days after Baseline |
| tolerance.t2 | Tolerance captured at 3 weeks after Baseline |
| tolerance.t3 | Tolerance captured at 6 weeks after Baseline |
| tolerance.t4 | Tolerance captured at 3 months after Baseline |
It is sensible, when you start any data analysis project, to make sure your computer is set up in an efficient way. Our suggestion is that you create a folder on your computer where you can save all your scripts throughout the course (i.e.,
seminar1.R, seminar2.R, etc.). We also recommend that you create a subfolder inside your main folder, and give it the name
data: this is where you should save all your data sets.
When we work with RStudio, the first thing we should do is set the working directory. This is essentially the folder on your computer where R will operate (e.g., when looking for data and other scripts). There are two ways to do this. The easiest (and recommended) is to set the folder you want R to work from as an R Project. You can do that by clicking on “Project: (none)” at the top-right corner of RStudio, then clicking “New project” and assigning it to your folder of choice. The second way to set the working directory involves knowing the location of the relevant folder on your computer. Say you have created a folder named “Causal Inference in OS” inside a folder named “GESIS 2021” on your desktop. Then you can use the
setwd() function to set the working directory:
setwd("~/Desktop/GESIS 2021/Causal Inference in OS") # if you are working on a Mac
setwd("C:/Desktop/GESIS 2021/Causal Inference in OS") # if you are working on a Windows PC
Once you have downloaded the data, put the
transphobia.csv file into the
data folder that you created earlier in the seminar. Now load the data into the current R session using the read.csv() function:
transphobia <- read.csv('data/transphobia.csv')
You can now check the environment panel and see a data.frame object named
transphobia with 488 rows (observations) and 11 columns (variables).
Question 1 – Describing variables
Use the table() function and the treat_ind variable to find how many respondents were randomly assigned to the treatment and the control groups.
table(transphobia$treat_ind)
## 
##   0   1 
## 252 236
236 were randomly assigned to the treatment group, whereas 252 were randomly assigned to the control group.
Now use the prop.table() function. What percentage of respondents were randomly assigned to the treatment and control groups?
Code hint: the
prop.table() function requires a
table() as an argument!
48.36 % of the respondents were randomly assigned to the treatment group, whereas 51.64 % were assigned to the control group.
prop.table(table(transphobia$treat_ind))
## 
##         0         1 
## 0.5163934 0.4836066
Use the describe() function and the variable
tolerance.t1, which measures tolerance levels towards transgender people three days after the intervention.
We can see that this index of tolerance levels towards transgender people ranges from -2.26 to 2.07. The mean is 0.08, very similar to the median of 0.07, suggesting a symmetrical distribution. The standard deviation is 1.07.
describe(transphobia$tolerance.t1)
##    vars   n mean   sd median trimmed  mad   min  max range  skew kurtosis   se
## X1    1 418 0.08 1.07   0.07     0.1 1.01 -2.26 2.07  4.32 -0.15    -0.46 0.05
Question 2 – Covariate balance
To assess balance in the variable
vf_age, we can calculate the average age for each treatment group:
mean(transphobia$vf_age[transphobia$treat_ind == T]) # average age in the treatment group
##  50.07627
mean(transphobia$vf_age[transphobia$treat_ind == F]) # average age in the control group
##  48.60317
Respondents in the treatment group are on average 50.08 years old, whereas respondents in the control group are on average 48.6 years old. To assess whether the estimated difference of 1.48 years between the two groups is due to sampling uncertainty, we can conduct a
t test using the t.test() function:
t.test(x = transphobia$vf_age[transphobia$treat_ind == 1],
       y = transphobia$vf_age[transphobia$treat_ind == 0],
       conf.level = 0.95, var.equal = T)
Given that (i) the t statistic of 0.93 is lower than 1.96, (ii) the p-value of 0.35 is larger than 0.05, and (iii) the confidence interval of \([-1.65; 4.59]\) includes zero as a plausible value (all three statements are equivalent), we can safely conclude that there is no significant difference in the average age of respondents assigned to the treatment and the control groups.
## 
##  Two Sample t-test
## 
## data:  transphobia$vf_age[transphobia$treat_ind == 1] and transphobia$vf_age[transphobia$treat_ind == 0]
## t = 0.92987, df = 486, p-value = 0.3529
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.639634  4.585827
## sample estimates:
## mean of x mean of y 
##  50.07627  48.60317
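Incidentally, t.test() returns a list whose elements can be stored and accessed individually with the dollar sign, just like columns of a data.frame. A self-contained sketch with simulated data (the numbers below are made up for illustration, not from the transphobia study):

```r
set.seed(42)
# Two simulated groups of 50 observations each, with slightly different means
tt <- t.test(x = rnorm(50, mean = 0),
             y = rnorm(50, mean = 0.3),
             conf.level = 0.95, var.equal = TRUE)

tt$statistic   # the t statistic
tt$p.value     # the p-value
tt$conf.int    # the 95% confidence interval
```

Storing the result this way is handy when you need a p-value or confidence interval inside later calculations, rather than just printed on screen.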
Now check covariate balance for the variables vf_female, vf_racename, and vf_party. Do respondents in the treatment and control groups have similar characteristics in terms of their gender, race, and party affiliation?
See results! To assess the association between gender (as measured in the study) and treatment assignment, we can use the table() and prop.table() functions to visualize the cross-tabulation, and then conduct a Chi-squared test.
table_gender <- table(transphobia$vf_female,  # first argument is represented by rows
                      transphobia$treat_ind)  # second argument is represented by columns
prop.table(table_gender, 1)  # the argument 1 indicates we want conditional proportions by rows
## 
##             0         1
##   0 0.4903846 0.5096154
##   1 0.5357143 0.4642857
As we can see, 49% of all male respondents were assigned to the control group, whereas 51% were assigned to the treatment group. Among female respondents, 54% were assigned to the control group and 46% were assigned to the treatment group. To handle sampling uncertainty, let’s use the chisq.test() function:
chisq.test(table_gender)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table_gender
## X-squared = 0.80883, df = 1, p-value = 0.3685
Given that \(p = 0.37\), we have little evidence to reject the null hypothesis that there is no association between gender and treatment assignment.
Let us now adopt the same strategies and check balance for vf_racename and vf_party:
table_race <- table(transphobia$vf_racename, transphobia$treat_ind)
prop.table(table_race, 1)
## 
##                            0         1
##   African American 0.5037037 0.4962963
##   Caucasian        0.5161290 0.4838710
##   Hispanic         0.5240175 0.4759825
table_party <- table(transphobia$vf_party, transphobia$treat_ind)
prop.table(table_party, 1)
## 
##             0         1
##   D 0.4834711 0.5165289
##   N 0.5333333 0.4666667
##   R 0.5634921 0.4365079
Overall, respondents of all three racial groups (African American, Caucasian, and Hispanic) and all three political groups (Democrats, Republicans, and Independents) seem relatively well balanced across the treatment and control groups, with close to 50% of respondents of each profile in either treatment group.
chisq.test(table_race)
## 
##  Pearson's Chi-squared test
## 
## data:  table_race
## X-squared = 0.14038, df = 2, p-value = 0.9322
chisq.test(table_party)
## 
##  Pearson's Chi-squared test
## 
## data:  table_party
## X-squared = 2.3074, df = 2, p-value = 0.3155
Given that both p-values are larger than 0.05, we fail to reject both null hypotheses of no association between either race or party affiliation and treatment assignment.
In a well-randomised experiment, respondents in the treatment and control groups should also have, on average, the same outcome levels before the intervention, as captured here by the baseline variable tolerance.t0. Is it the case here?
See details! We can check whether respondents in the treatment and the control groups had the same levels of tolerance before the intervention by conducting a t test:
t.test(x = transphobia$tolerance.t0[transphobia$treat_ind == T],
       y = transphobia$tolerance.t0[transphobia$treat_ind == F],
       conf.level = 0.95, var.equal = T)
Based on the t statistic of -0.42, the p-value of 0.68, and the 95% confidence interval of \([-0.23; 0.15]\), we can safely conclude that respondents from both groups had, on average, the same levels of tolerance before the intervention.
## 
##  Two Sample t-test
## 
## data:  transphobia$tolerance.t0[transphobia$treat_ind == T] and transphobia$tolerance.t0[transphobia$treat_ind == F]
## t = -0.41789, df = 486, p-value = 0.6762
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.2282952  0.1482183
## sample estimates:
##    mean of x    mean of y 
## -0.030558356  0.009480114
We seem to have covariate balance, as none of the covariates (i.e., age, gender, race, and party affiliation) is associated with the treatment groups. Tolerance levels at the baseline (i.e. before the intervention) are also independent of the treatment groups. This is, of course, expected, considering that respondents were randomly assigned to receive the intervention. This implies that any differences we find between the two groups can be attributed to the treatment implementation. That is, we are able to make causal claims. For instance, if we compare the levels of tolerance towards transgender people between the two groups, we can identify the causal effect of in-person conversations and perspective-taking exercises on antitransgender prejudice.
Question 3 – Estimating an ATE
# Average in the treatment group
average_treatment_t1 <- mean(transphobia$tolerance.t1[transphobia$treat_ind == T], na.rm = T)
# Average in the control group
average_control_t1 <- mean(transphobia$tolerance.t1[transphobia$treat_ind == F], na.rm = T)
# ATE
average_treatment_t1 - average_control_t1
##  0.1443226
The average tolerance level 3 days after the intervention among those respondents who were randomly assigned to the treatment group is 0.15, whereas among those in the control group it is 0.01. The mean difference of 0.14 is the average treatment effect, given that respondents were randomly assigned to the treatment groups. Being randomly assigned to the treatment group led to an increase of 0.14 points in the tolerance scale.
Using the t.test() function, what do you conclude about the statistical significance of the ATE?
t.test(x = transphobia$tolerance.t1[transphobia$treat_ind == T],
       y = transphobia$tolerance.t1[transphobia$treat_ind == F],
       conf.level = 0.95, var.equal = T)
## 
##  Two Sample t-test
## 
## data:  transphobia$tolerance.t1[transphobia$treat_ind == T] and transphobia$tolerance.t1[transphobia$treat_ind == F]
## t = 1.3757, df = 416, p-value = 0.1696
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.06189076  0.35053604
## sample estimates:
##   mean of x   mean of y 
## 0.153535857 0.009213218
Given the t statistic of 1.38, the p-value of 0.17, and the 95% confidence interval of \([-0.06; 0.35]\), we fail to reject the null hypothesis that the mean difference is zero. In other words, the average treatment effect is not statistically significant. We have little evidence to sustain that being randomly assigned to the treatment group leads to an increase in tolerance levels.
Code hint: you can use the lm() function, with the following syntax:
lm(dependent_variable ~ explanatory_variable, data)
reg1 <- lm(tolerance.t1 ~ treat_ind, transphobia)
summary(reg1)
## 
## Call:
## lm(formula = tolerance.t1 ~ treat_ind, data = transphobia)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.41152 -0.69412 -0.06658  0.74185  2.05647 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.009213   0.071836   0.128    0.898
## treat_ind   0.144323   0.104907   1.376    0.170
## 
## Residual standard error: 1.07 on 416 degrees of freedom
##   (70 observations deleted due to missingness)
## Multiple R-squared:  0.004529, Adjusted R-squared:  0.002136 
## F-statistic: 1.893 on 1 and 416 DF,  p-value: 0.1696
As expected, results are exactly the same as before. The coefficient for a binary explanatory variable in a simple linear regression model represents the mean difference. This shows that we can use the regression framework to estimate ATEs.
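To see why this equivalence holds in general, here is a self-contained sketch with simulated data (the variables y and d below are invented for illustration, not from the transphobia study):

```r
set.seed(1)
d <- rep(c(0, 1), each = 50)    # a made-up binary treatment indicator
y <- 2 + 0.5 * d + rnorm(100)   # a made-up outcome with a true effect of 0.5
fit <- lm(y ~ d)

# The coefficient on d is numerically identical to the difference in group means
coef(fit)[["d"]]
mean(y[d == 1]) - mean(y[d == 0])
```

With a single binary regressor, ordinary least squares fits exactly two group means, so the slope must equal their difference.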
Now estimate the same model adding the covariates vf_age, vf_racename, vf_female, and vf_party. What happens with the coefficient for
treat_ind? Is this expected?
reg2 <- lm(tolerance.t1 ~ treat_ind + vf_age + vf_racename + vf_female + vf_party, transphobia)
# install.packages('texreg') # install the R package 'texreg'. You only need to do this once in your computer
library(texreg)
## Warning: package 'texreg' was built under R version 4.0.2
## Version: 1.37.5 ## Date: 2020-06-17 ## Author: Philip Leifeld (University of Essex) ## ## Consider submitting praise using the praise or praise_interactive functions. ## Please cite the JSS article in your publications -- see citation("texreg").
screenreg(list(reg1, reg2))
## 
## =========================================
##                       Model 1   Model 2  
## -----------------------------------------
## (Intercept)             0.01      0.01   
##                        (0.07)    (0.19)  
## treat_ind               0.14      0.15   
##                        (0.10)    (0.10)  
## vf_age                           -0.01 ***
##                                  (0.00)  
## vf_racenameCaucasian              0.97 ***
##                                  (0.15)  
## vf_racenameHispanic               0.77 ***
##                                  (0.14)  
## vf_female                         0.38 ***
##                                  (0.10)  
## vf_partyN                        -0.32 *  
##                                  (0.14)  
## vf_partyR                        -0.62 ***
##                                  (0.13)  
## -----------------------------------------
## R^2                     0.00      0.16   
## Adj. R^2                0.00      0.14   
## Num. obs.             418       418      
## =========================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
The coefficient for
treat_ind remains virtually unaltered after we include four new covariates in the regression model. This is expected, given that
treat_ind was randomly assigned and achieved covariate balance.
Question 4 – What went wrong?
Using the table() and prop.table() functions and the variable
treatment.delivered, check whether this is the case.
table_delivery <- table(transphobia$treat_ind, transphobia$treatment.delivered)
prop.table(table_delivery, 1)
## 
##          FALSE       TRUE
##   0 0.95634921 0.04365079
##   1 0.21610169 0.78389831
Considering respondents who were randomly assigned to the control group, 96% correctly received a placebo intervention and 4% incorrectly received the treatment intervention. Among those who were randomly assigned to the treatment group, 78% correctly received the treatment intervention and 22% incorrectly received the placebo intervention.
Now regress tolerance.t1 on treatment.delivered. Does the coefficient represent the ATE?
reg3 <- lm(tolerance.t1 ~ treatment.delivered, transphobia)
screenreg(list(reg1, reg2, reg3))
## 
## ======================================================
##                          Model 1   Model 2    Model 3 
## ------------------------------------------------------
## (Intercept)                0.01      0.01      -0.01  
##                           (0.07)    (0.19)     (0.07) 
## treat_ind                  0.14      0.15             
##                           (0.10)    (0.10)            
## vf_age                              -0.01 ***         
##                                     (0.00)            
## vf_racenameCaucasian                 0.97 ***         
##                                     (0.15)            
## vf_racenameHispanic                  0.77 ***         
##                                     (0.14)            
## vf_female                            0.38 ***         
##                                     (0.10)            
## vf_partyN                           -0.32 *           
##                                     (0.14)            
## vf_partyR                           -0.62 ***         
##                                     (0.13)            
## treatment.deliveredTRUE                        0.22 * 
##                                                (0.11) 
## ------------------------------------------------------
## R^2                        0.00      0.16      0.01   
## Adj. R^2                   0.00      0.14      0.01   
## Num. obs.                418       418       418      
## ======================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
Respondents who received the treatment intervention (an in-person conversation about transphobia with perspective-taking exercises) had tolerance levels 0.22 points higher three days later than respondents who received the placebo intervention (a conversation about recycling). This is statistically significant, suggesting a relationship between the treatment delivery and prejudice. However, this is not the ATE. This variable was not randomly assigned, and therefore there could be potential confounders.
Well, they need to start with a letter. But otherwise they may contain numbers, upper and lower case letters (R distinguishes between them), and punctuation such as dots ( . ) and underscores ( _ )↩︎