03/12/2019
Slides available at https://www.ThiagoROliveira.com/IntroR

Outline

  • Motivation to learn R
  • R basics
    • Using R as a calculator
    • Creating and manipulating objects in R
    • Loading data sets into R
  • Data manipulation and exploration
    • Linear models
    • Plots



Motivation to learn R

What is this course about?

Statistical packages for data analysis

Stuff that you can do using:

  • MS Excel
  • SPSS
  • Stata
  • SAS
  • Matlab

But using the R programming language

But why should I learn R?

Because:

  • R is free
  • R is open source
  • R is very versatile
  • R is very good for data visualisation
  • R is a programming language

On the other hand:

  • R has a steep(er) learning curve
  • R is a programming language

But why should I learn R?

With R, you can…

  • Do data analysis
  • Use multiple data frames simultaneously
  • Analyse different kinds of data (e.g., text, network, spatial)
  • Collect data from the internet (web scraping)
  • Write your own functions
  • Write your own packages

But why should I learn R?

R is getting popular…

Alright. Where should I start?

To obtain R, visit https://cran.r-project.org/

  • CRAN: The Comprehensive R Archive Network
  • Select the link that matches your operating system and then follow the installation instructions

What next?

Highly recommended: use RStudio

  • Open-source programme that facilitates the use of R
  • Visit http://www.rstudio.com/ and follow the download and installation instructions

Screenshot of RStudio

R basics

Using R as a calculator

In the console, we can type in arithmetic operations of any kind:

2 + 2
## [1] 4
5 / 3
## [1] 1.666667
4 * 4
## [1] 16

Using R as a calculator

5 * (10 - 3)
## [1] 35
12 / (5 * 2)
## [1] 1.2
sqrt(4)
## [1] 2
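
Beyond the four basic operations, a few other operators are worth trying at the console. This is a short sketch using base R only; the numbers are arbitrary illustrations:

```r
2 ^ 10     # exponentiation: 2 raised to the power of 10
## [1] 1024
17 %/% 5   # integer division: how many times 5 fits into 17
## [1] 3
17 %% 5    # remainder left over after integer division
## [1] 2
```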

Use the keyboard

You can use the cursor or arrow keys on your keyboard to edit your code at the console:

  • Use the UP and DOWN keys to re-run something without typing it again
  • Use the LEFT and RIGHT keys to edit

Take a few minutes to play around at the console and try different things out. Don’t worry if you make a mistake: you can’t break anything easily!

Scripts

Working directly with the console is fine, but…

  • sometimes we want to save our work in some sort of document
  • we often need to share our work with colleagues
  • it is generally good to keep track of what we are doing

Scripts are just plain text files, like the ones you would edit in Notepad or TextEdit, that you can open and edit within RStudio


It is highly recommended that you always work from a script file


To run your code from a script, select the lines you want to run and press Ctrl+Enter (Cmd+Enter on a Mac) or use the Run button


Creating and manipulating objects in R

R can store information as an object with a name of our choice

  • Once created, we just refer to it by name – objects are shortcuts to some piece of information or data
  • To create an object, we just use the assignment operator <-
result <- 2 + 2
print(result)
## [1] 4

We can use objects to perform subsequent calculations

result * 3
## [1] 12

Creating and manipulating objects in R

We can even use an object and assign the result to a new object:

new_result <- result / 8
new_result
## [1] 0.5

Take a look at the upper-right window. The Environment pane lists all R objects created in this session.
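
The same information is available from the console. The sketch below uses the base functions ls(), exists(), and rm(); the object name temp_value is a made-up example and is not used elsewhere in these slides:

```r
temp_value <- 10        # create a throwaway object

ls()                    # names of all objects in the current environment
exists("temp_value")    # TRUE while the object exists
rm(temp_value)          # remove the object
exists("temp_value")    # FALSE once it has been removed
```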

Creating and manipulating objects in R

Note that if we assign a different value to the same object name, the value of the object will be changed

result <- 7 - 2
print(result)
## [1] 5

And remember that object names are case sensitive.

  • result is not the same as Result or RESULT
print(Result)
## Error in print(Result): object 'Result' not found

Creating and manipulating objects in R

So far, we have only assigned numbers to an object. But R can represent various types of values as objects.

  • For instance, we can store a string of characters using quotation marks
my_name <- "thiago"
my_name
## [1] "thiago"
  • In character strings, spacing is allowed
my_name <- "thiago r. oliveira"
my_name
## [1] "thiago r. oliveira"

Creating and manipulating objects in R

Notice that we can treat numbers like characters if we want to

RESULT <- "4"
RESULT
## [1] "4"

However, arithmetic operations cannot be used for character strings.

sqrt(RESULT)
## Error in sqrt(RESULT): non-numeric argument to mathematical function
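
If a number has been stored as a character string, it can be converted back before doing arithmetic. A minimal sketch using the base function as.numeric():

```r
RESULT <- "4"                    # a character string, not a number
result_number <- as.numeric(RESULT)  # convert the string to a number

sqrt(result_number)              # arithmetic now works
## [1] 2
class(result_number)
## [1] "numeric"
```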

Creating and manipulating objects in R

Every object in R has a class, which tells R what kind of object it is. The Environment window shows the class of an object. We can also use the function class():

class(result)
## [1] "numeric"
class(RESULT)
## [1] "character"
class(sqrt)
## [1] "function"

Vectors

A vector is an ordered collection of values of the same type.

  • We use the function c(), which stands for concatenate
new_vector <- c(0, 3, 1, 4, 1, 5, 9, 2)
new_vector
## [1] 0 3 1 4 1 5 9 2
  • The order in which the numbers in the vector are stored is important, and we can access individual elements of a vector by using square brackets, which look like this: [ ] (we call this indexing). For instance, if we wish to access the 2nd element of the vector we just created, we can do the following:
new_vector[2]
## [1] 3

Vectors

We can also use indexing to subset a vector. For example, remember the new_vector we created was c(0, 3, 1, 4, 1, 5, 9, 2)

new_vector[c(1, 5, 6)]
## [1] 0 1 5
new_vector[c(6, 5, 1)]
## [1] 5 1 0
new_vector[-5]
## [1] 0 3 1 4 5 9 2
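
Indexing also works with conditions rather than positions. The sketch below repeats the definition of new_vector so that it runs on its own, and uses a comparison to keep only the elements larger than 3:

```r
new_vector <- c(0, 3, 1, 4, 1, 5, 9, 2)

new_vector > 3               # one TRUE or FALSE per element
## [1] FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE
new_vector[new_vector > 3]   # keep the elements where the condition is TRUE
## [1] 4 5 9
which(new_vector > 3)        # positions of those elements
## [1] 4 6 7
```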

Vectors

The c() function can be used to combine multiple vectors

x1 <- c(1, 3, 5, 7)
x1
## [1] 1 3 5 7
x2 <- c(4:7)
x2
## [1] 4 5 6 7
x1x2 <- c(x1, x2)
x1x2
## [1] 1 3 5 7 4 5 6 7

Vectors

In R, mathematical operations on vectors occur elementwise:

fib <- c(1, 1, 2, 3, 5, 8, 13, 21)
fib[1:7]
## [1]  1  1  2  3  5  8 13
fib[2:8]
## [1]  1  2  3  5  8 13 21
fib[1:7] + fib[2:8]
## [1]  2  3  5  8 13 21 34
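
Elementwise arithmetic also applies when one side is a single number: the number is reused for every element of the vector. A short sketch, with fib redefined so the block is self-contained:

```r
fib <- c(1, 1, 2, 3, 5, 8, 13, 21)

fib * 2                        # every element is doubled
## [1]  2  2  4  6 10 16 26 42
ratios <- fib[2:8] / fib[1:7]  # each term divided by the one before it
ratios
```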

Functions

Functions are the backbone of R operations

  • A function often takes multiple input objects and returns an output object
  • We have already seen several functions: sqrt(), print(), class(), and c()
  • Functions always have the following structure
funcname(input)

Where

  • funcname is the name of the function
  • input is the argument passed to the function
sqrt(49)
## [1] 7

Functions

A function call always requires parentheses (round brackets) ( ). Inputs to the function are called arguments and go inside the parentheses. Some basic functions useful for summarising data include:

  • length() for the length of a vector
  • min() for the minimum value
  • max() for the maximum value
  • range() for range of data
  • mean() for mean
  • sum() for the sum of data
length(new_vector)
## [1] 8

Functions

min(new_vector)
## [1] 0
max(new_vector)
## [1] 9
range(new_vector)
## [1] 0 9
sum(new_vector)
## [1] 25

Functions

We can be creative. Instead of running

mean(new_vector)
## [1] 3.125

We can also compute the same value by hand

sum(new_vector) / length(new_vector)
## [1] 3.125

Functions

We can also perform calculations on the output of a function:

mean(new_vector) * 3
## [1] 9.375

Which means that we can also have nested functions:

log(mean(new_vector))
## [1] 1.139434

We can also assign the output of any function to a new object for use later:

log_pie <- log(mean(new_vector))

Functions – some other examples

world.pop <- c(2525779, 3026003, 3691173, 4449049, 5320817, 6127700, 6916183)
year <- seq(to = 2010, by = 10, from = 1950)
year
## [1] 1950 1960 1970 1980 1990 2000 2010
names(world.pop) <- year
names(world.pop)
## [1] "1950" "1960" "1970" "1980" "1990" "2000" "2010"
world.pop
##    1950    1960    1970    1980    1990    2000    2010 
## 2525779 3026003 3691173 4449049 5320817 6127700 6916183
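
The seq() call above lists its arguments in an unusual order on purpose: when arguments are named, their order does not matter. When they are not named, R matches them by position. A brief sketch comparing the three styles (the object names are illustrative):

```r
named_shuffled <- seq(to = 2010, by = 10, from = 1950)  # named, any order
named_ordered  <- seq(from = 1950, to = 2010, by = 10)  # named, usual order
by_position    <- seq(1950, 2010, 10)                   # unnamed: from, to, by

identical(named_shuffled, named_ordered)
## [1] TRUE
identical(named_shuffled, by_position)
## [1] TRUE
```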

Functions

You can write your own functions!

my_new_mean <- function(x) {      # function takes one input
  new.mean <- sum(x) / length(x)  # object 'new.mean' is defined as this ratio
  return(new.mean)                # return output
}
my_new_mean(world.pop)            # Testing the new function
## [1] 4579529
mean(world.pop)
## [1] 4579529

Functions – another example

my.summary <- function(x) {
  s.out <- sum(x)
  l.out <- length(x)
  m.out <- mean(x)
  out <- c(s.out, l.out, m.out) # define the output
  names(out) <- c("sum", "length", "mean") # add labels
  return(out)
}
my.summary(world.pop)
##      sum   length     mean 
## 32056704        7  4579529
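
Functions can take more than one argument, and arguments can have default values. The sketch below is a hypothetical helper (pct.change is not a base R function) that computes the percentage change between consecutive elements of a vector, using the indexing tools from earlier:

```r
world.pop <- c(2525779, 3026003, 3691173, 4449049, 5320817, 6127700, 6916183)

# Hypothetical helper: percentage change between consecutive elements of x
pct.change <- function(x, digits = 1) {   # 'digits' defaults to 1
  previous <- x[-length(x)]               # all elements except the last
  current  <- x[-1]                       # all elements except the first
  change   <- (current - previous) / previous * 100
  return(round(change, digits))
}

growth <- pct.change(world.pop)     # uses the default, digits = 1
growth
pct.change(world.pop, digits = 3)   # overrides the default
```

Because the function compares each element with the previous one, the output has one element fewer than the input.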

Loading data sets into R

Loading data into R can be tricky sometimes.

  • First, we need to ensure the data files reside in the working directory
getwd()
## [1] "/Users/rodri147/Dropbox/LSE/Teaching/Intro to R"
  • To change the working directory, use the setwd() function
setwd("~/Dropbox/LSE/Teaching/Intro to R")
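
Before loading anything, it can help to check that R actually sees the file. A small sketch using the base functions file.exists() and list.files(); the file name matches the CSV used in these slides, and the result depends on your own working directory:

```r
getwd()                            # folder where R currently looks for files
file.exists("UNpop.csv")           # TRUE only if the file is in that folder
list.files(pattern = "\\.csv$")    # all CSV files in the working directory
```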

Now assuming the data files are in the working directory…

Loading data sets into R

If the data file is saved as a CSV file, we just use the read.csv function. Click here to download the data.

data_pop <- read.csv("UNpop.csv")
class(data_pop)
## [1] "data.frame"
  • You can use the View() function, which displays the data frame like a spreadsheet.

If the data file is saved as an RData file, we just use the load() function. Click here to download the data.

load("UNpop.RData")
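
To see how both formats behave without downloading anything, the sketch below writes a small made-up data frame to a temporary folder and reads it back. The object and file names (demo_df, demo_pop.csv, demo_pop.RData) are illustrative assumptions, not part of the course data:

```r
demo_df <- data.frame(year = c(1950, 1960),
                      world.pop = c(2525779, 3026003))

# CSV: write to a temporary folder, then read back into a new object
csv_path <- file.path(tempdir(), "demo_pop.csv")
write.csv(demo_df, csv_path, row.names = FALSE)
demo_back <- read.csv(csv_path)

# RData: save the object, remove it, then restore it with load()
rdata_path <- file.path(tempdir(), "demo_pop.RData")
save(demo_df, file = rdata_path)
rm(demo_df)
load(rdata_path)   # demo_df reappears under its original name
```

Note the difference: read.csv() returns a data frame that we assign to a name of our choice, whereas load() restores objects under the names they had when they were saved.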

Loading data sets into R

Data files saved by other statistical software (such as Stata or SPSS) cannot be read with the base functions shown so far. Fortunately, one of R’s strengths is its large community of users, who contribute by writing R packages.

  • The foreign package is useful to load data sets saved as .dta or .sav into R
#install.packages("foreign") # install package
library(foreign)            # load package
  • Click here to download the data.
mydata_stata <- read.dta("UNpop.dta")

Loading data sets into R – useful functions

names(data_pop) # names of the variables
## [1] "year"      "world.pop"
nrow(data_pop)  # number of rows (observations)
## [1] 7
ncol(data_pop)  # number of columns (variables)
## [1] 2
dim(data_pop)   # dimensions
## [1] 7 2
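
Two more base functions are handy for a first look at a data frame: head() and str(). The sketch below rebuilds data_pop from the values printed in these slides so that it runs without the CSV file; column types in the real file may differ slightly:

```r
# Reconstruction of data_pop from the values shown in the slides
data_pop <- data.frame(
  year = seq(from = 1950, to = 2010, by = 10),
  world.pop = c(2525779, 3026003, 3691173, 4449049,
                5320817, 6127700, 6916183)
)

head(data_pop, n = 3)   # first three rows
str(data_pop)           # compact overview: size, column names, column types
```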

Loading data sets into R – useful functions

The $ operator is very useful to access an individual variable.

summary(data_pop$world.pop)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 2525779 3358588 4449049 4579529 5724258 6916183
summary(data_pop$year)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1950    1965    1980    1980    1995    2010

Loading data sets into R – useful functions

Another way of accessing individual variables is to use square brackets [ ]. A data frame is two-dimensional, so we need two indexes: [rows, columns]. Leaving an index empty selects all rows or all columns.

mean(data_pop[, "world.pop"])
## [1] 4579529
data_pop[1:3, ]
##   year world.pop
## 1 1950   2525779
## 2 1960   3026003
## 3 1970   3691173
data_pop[c(1, 3, 5), "world.pop"]
## [1] 2525779 3691173 5320817
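
The row index can also be a condition, which is a common way of filtering observations. In this sketch data_pop is rebuilt from the values printed in these slides so the block runs on its own:

```r
# Reconstruction of data_pop from the values shown in the slides
data_pop <- data.frame(
  year = seq(from = 1950, to = 2010, by = 10),
  world.pop = c(2525779, 3026003, 3691173, 4449049,
                5320817, 6127700, 6916183)
)

recent <- data_pop[data_pop$year >= 1990, ]   # rows from 1990 onwards, all columns
recent
##   year world.pop
## 5 1990   5320817
## 6 2000   6127700
## 7 2010   6916183

subset(data_pop, year >= 1990)                # same rows, using subset()
```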

Data manipulation and exploration

Plots

  • Plots are one of the great strengths of R
  • There are two main frameworks for plotting
    • Base R graphics
    • ggplot
  • Which should you use? It is a matter of preference!

Base R plots

The basic plotting syntax is very simple. plot(x_var, y_var) will give you a scatter (click here to download data):

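# stringsAsFactors = TRUE reads text columns (such as g) as factors;
# this was the default before R 4.0 and is needed for the plots below
sim.df <- read.csv("file.csv", stringsAsFactors = TRUE)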
plot(sim.df$x, sim.df$y)

Base R plots

Hmm, let’s work on that.

Base R plots

The plot function takes a number of arguments (?plot for a full list). The fewer you specify, the uglier your plot:

plot(x = sim.df$x, y = sim.df$y, 
     xlab = "X variable", 
     ylab = "Y variable", 
     main = "Awesome plot title",
     pch = 19, # Solid points
     cex = 0.5, # Smaller points
     bty = "n", # Remove surrounding box
     col = sim.df$g # Colour by grouping variable
     )

Base R plots

The default behaviour of plot() depends on the type of input variables for the x and y arguments. If x is a factor variable, and y is numeric, then R will produce a boxplot:

plot(x = sim.df$g, y = sim.df$x)
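
If a grouping variable is stored as text rather than as a factor, it can be converted with factor() before plotting. The sketch below uses simulated data, since the contents of the course file are not shown here; toy.df and its columns are made-up stand-ins for sim.df:

```r
set.seed(1)   # make the simulated numbers reproducible

# Simulated stand-in for sim.df: one numeric variable and three groups
toy.df <- data.frame(
  x = rnorm(90),
  g = rep(c("a", "b", "c"), each = 30)
)

toy.df$g <- factor(toy.df$g)       # make sure g is a factor
levels(toy.df$g)                   # the three group labels

plot(x = toy.df$g, y = toy.df$x)   # factor on x, numeric on y: boxplots
```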

Base R plots

ggplot

A very popular alternative to base R plots is the ggplot2 package (the 2 in the name refers to the second iteration, which is the standard). It is a separate package (i.e. it is not part of base R, so it must be installed before use) but it is very widely used.

  • Based on the Grammar of Graphics data visualisation scheme:

Wilkinson, L. (2005). The Grammar of Graphics 2nd Ed. Heidelberg: Springer. https://doi.org/10.1007/0-387-28695-0

Wickham, H. (2010). A Layered Grammar of Graphics. Journal of Computational and Graphical Statistics, 19(1), 3–28. https://doi.org/10.1198/jcgs.2009.07098

  • Graphs are broken into multiple layers

  • Layers can be recycled across multiple plots

ggplot

Let’s recreate the previous scatter plot using ggplot:

library("ggplot2")
ggplot(data = sim.df, aes(x = x, y = y, col = g)) +
  # Add scatterplot
  geom_point() +
  # Change axes labels and plot title
  labs(x = "X variable",
       y = "Y variable",
       title = "Awesome plot title") +
  # Change default grey theme to black and white
  theme_bw()

ggplot

One nice feature of ggplot is that it is very easy to create facet plots:

library("ggplot2")
ggplot(data = sim.df, aes(x = x, y = y, col = g)) + 
  geom_point() +
  labs(x = "X variable",
       y = "Y variable",
       title = "Awesome plot title") +
  theme_bw() +
  # Separate plots by variable g
  facet_wrap(~ g)
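
Because a ggplot is built from layers, a set of layers can be stored once and reused across plots. The sketch below assumes ggplot2 is installed and uses simulated data; toy.df, shared.layers, p.scatter, and p.box are illustrative names:

```r
library("ggplot2")

set.seed(1)
# Simulated stand-in for sim.df
toy.df <- data.frame(x = rnorm(60), g = rep(c("a", "b"), each = 30))
toy.df$y <- 0.5 * toy.df$x + rnorm(60)

# Layers we want on every plot, stored in a list
shared.layers <- list(
  labs(title = "Toy data"),
  theme_bw()
)

# Two different plots that reuse the same layers
p.scatter <- ggplot(toy.df, aes(x = x, y = y)) + geom_point() + shared.layers
p.box     <- ggplot(toy.df, aes(x = g, y = y)) + geom_boxplot() + shared.layers

p.scatter   # printing a plot object draws it
p.box
```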

ggplot

Linear models

Linear regression models

Linear regression models in R are implemented using the lm function.

lm.fit <- lm(formula = y ~ x, data = sim.df)

The formula argument is the specification of the model, and the data argument is the data on which you would like the model to be estimated.

lm.fit
## 
## Call:
## lm(formula = y ~ x, data = sim.df)
## 
## Coefficients:
## (Intercept)            x  
##      0.4416       0.3402
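
To try lm() without the course file, we can simulate data where the true coefficients are known. This is only a sketch: toy.df and toy.fit are made-up names, and the true intercept (0.5) and slope (0.3) are arbitrary choices:

```r
set.seed(42)   # make the simulated numbers reproducible
n <- 200

# Simulate y = 0.5 + 0.3 * x + noise
toy.df <- data.frame(x = rnorm(n))
toy.df$y <- 0.5 + 0.3 * toy.df$x + rnorm(n)

toy.fit <- lm(formula = y ~ x, data = toy.df)
coef(toy.fit)   # estimates should be in the neighbourhood of 0.5 and 0.3
```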

lm

We can specify multivariate models:

lm.multi.fit <- lm(formula = y ~ x + z, data = sim.df)

Interaction models:

lm.inter.fit <- lm(formula = y ~ x * z, data = sim.df)

Note that the main effects of x and z are also included when the interaction is specified with *.
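
In formula notation, x * z is shorthand for x + z + x:z, where x:z is the interaction term on its own. A sketch with simulated data (all names and true coefficients are made up) showing that both spellings give the same model:

```r
set.seed(42)
toy.df <- data.frame(x = rnorm(100), z = rnorm(100))
toy.df$y <- 1 + 0.5 * toy.df$x - 0.2 * toy.df$z +
  0.3 * toy.df$x * toy.df$z + rnorm(100)

fit.star  <- lm(y ~ x * z, data = toy.df)         # shorthand
fit.colon <- lm(y ~ x + z + x:z, data = toy.df)   # written out in full

names(coef(fit.star))   # "(Intercept)", "x", "z", "x:z"
all.equal(coef(fit.star), coef(fit.colon))
## [1] TRUE
```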

Fixed-effect models:

lm.fe.fit <- lm(formula = y ~ x + g, data = sim.df)

And many more!

lm

The lm function returns a list containing many useful components.

When we call the fitted object (e.g. lm.fit), we are presented only with the estimated coefficients.

For more information about the estimated model, use summary(fitted.model):

lm.fit.summary <- summary(lm.fit)
lm.fit.summary

lm

## 
## Call:
## lm(formula = y ~ x, data = sim.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1618 -0.6669  0.0217  0.6872  3.2098 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.44164    0.03215   13.74   <2e-16 ***
## x            0.34023    0.03192   10.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.016 on 998 degrees of freedom
## Multiple R-squared:  0.1022, Adjusted R-squared:  0.1013 
## F-statistic: 113.6 on 1 and 998 DF,  p-value: < 2.2e-16

lm

As with any other function, summary(fitted.model) returns an object. Here, it is a list. What is saved as the output of this function?

names(lm.fit.summary)
##  [1] "call"          "terms"         "residuals"     "coefficients" 
##  [5] "aliased"       "sigma"         "df"            "r.squared"    
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled"

If we want to extract other information of interest from the summary object, we can use the $ operator to do so:

lm.fit.summary$r.squared
## [1] 0.1022217
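
The coefficients element is a matrix, so single values can be pulled out with [row, column] indexing. A sketch on simulated data (toy.df, toy.fit, and toy.summary are illustrative names):

```r
set.seed(42)
toy.df <- data.frame(x = rnorm(200))
toy.df$y <- 0.5 + 0.3 * toy.df$x + rnorm(200)

toy.fit <- lm(y ~ x, data = toy.df)
toy.summary <- summary(toy.fit)

toy.summary$coefficients   # matrix: estimates, standard errors, t and p values

slope    <- toy.summary$coefficients["x", "Estimate"]
slope.se <- toy.summary$coefficients["x", "Std. Error"]
slope.p  <- toy.summary$coefficients["x", "Pr(>|t|)"]
```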

lm

Accessing elements from saved models can be very helpful in making comparisons across models.

Suppose we want to extract and compare \(R^2\) across different models.

lm.r2 <- summary(lm.fit)$r.squared
lm.multi.r2 <- summary(lm.multi.fit)$r.squared
lm.inter.r2 <- summary(lm.inter.fit)$r.squared

r2.compare <- data.frame(
  model = c("bivariate", "multivariate", "interaction"), 
  r.squared = c(lm.r2, 
                lm.multi.r2, 
                lm.inter.r2)
)
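
With more than a handful of models, the same comparison can be built by storing the fits in a named list and applying one function to each. This is an alternative sketch on simulated data, not the approach used above; toy.df, models, and r2.toy are made-up names:

```r
set.seed(42)
toy.df <- data.frame(x = rnorm(100), z = rnorm(100))
toy.df$y <- 1 + 0.5 * toy.df$x - 0.2 * toy.df$z +
  0.3 * toy.df$x * toy.df$z + rnorm(100)

# Store the fitted models in a named list
models <- list(
  bivariate    = lm(y ~ x,     data = toy.df),
  multivariate = lm(y ~ x + z, data = toy.df),
  interaction  = lm(y ~ x * z, data = toy.df)
)

# Apply the same extraction to every model in the list
r2 <- sapply(models, function(m) summary(m)$r.squared)

r2.toy <- data.frame(model = names(r2), r.squared = unname(r2))
r2.toy
```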

lm

We can print the data frame containing values of \(R^2\):

r2.compare
##          model r.squared
## 1    bivariate 0.1022217
## 2 multivariate 0.1101408
## 3  interaction 0.1205508

Or we can plot them:

ggplot(r2.compare, aes(x = model, y = r.squared)) +
  geom_point(size = 4) +
  # Use `expression` to add 2 as a superscript to R
  ggtitle(expression(paste(R^{2}, " ", "Comparison"))) +
  theme_bw()

lm

lm diagnostics

There are a number of functions that are helpful in producing model diagnostics:

  • residuals(fitted.model) extracts the residuals from a fitted model
  • coefficients(fitted.model) extracts coefficients
  • fitted(fitted.model) extracts fitted values
  • plot(fitted.model) is a convenience function for producing a number of useful diagnostic plots
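
A short sketch of these functions in use, on simulated data (toy.df and toy.fit are illustrative names). It also checks two properties of least-squares fits with an intercept: fitted values plus residuals reproduce the outcome, and the residuals sum to zero up to rounding error:

```r
set.seed(42)
toy.df <- data.frame(x = rnorm(200))
toy.df$y <- 0.5 + 0.3 * toy.df$x + rnorm(200)
toy.fit <- lm(y ~ x, data = toy.df)

res <- residuals(toy.fit)   # observed minus fitted values
fit <- fitted(toy.fit)      # values predicted by the model

all.equal(unname(fit + res), toy.df$y)   # fitted + residual gives back y
## [1] TRUE
sum(res)                                 # essentially zero

old.par <- par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a grid
plot(toy.fit)
par(old.par)                      # restore the previous plotting layout
```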

Thank you!

Any questions?