Longitudinal Data Analysis

Today, we will explore data from children of female respondents to the US National Longitudinal Survey of Youth. The data are a subsample from the National Longitudinal Survey of Youth (NLSY) of Labor Market Experience in Youth. Starting in 1986, children of the female respondents of the original NLSY Youth sample were assessed in 1986, and again in 1988, 1990 and 1992. For inclusion in the subsample considered here, children had to be between 6 and 8 years old at the first measurement. We also restrict the analysis to 221 children who were assessed at all four occasions.

As usual, let’s start by installing and/or loading the packages we will need for the session. In this session, we will use three packages: haven, tidyverse, and lme4.

# Install the lme4 package; you only need to do that once
## install.packages("lme4") # (uncomment if you need to install)

# Load the lme4 package
library(lme4)

## We are also going to use the tidyverse package, so let's load it now

# Install the tidyverse package; you only need to do that once
## install.packages("tidyverse") # (uncomment if you need to install)

# Load the tidyverse package
library(tidyverse)

## Finally, we also need the haven package to import datasets into R
## install.packages("haven") # (uncomment if you need to install)
library(haven)

Now, let’s download the read_long.dta data file from the website. Download the file and store it in a sensible folder in your computer (preferably the folder of your R Project). Then, we can use the read_stata() function to import the dataset into R.

Because the data is available online, we can also use the following function to import data into R, which is more efficient. (Though do note that most datasets are not readily available online).

# import data into R
reading <- read_stata("https://thiagoroliveira.com/LDA-2025/data/read_long.dta")

The variables available in the reading dataset are the following:

Name	Description
`childid`	Identifier for each respondent
`t`	Occasion (0 = 1986, 1 = 1988, 2 = 1990, 3 = 1992)
`male`	Gender (1 = male)
`homecog`	Amount of cognitive support at home
`read`	Reading scores
`year`	Year

First part

Let’s reproduce some of the analyses shown in the lecture slides. Considering read as your main dependent variable:

Obtain summary statistics for each time point

Show me how

# We can use the group_by() and summarise() functions
reading %>%
  group_by(year) %>%
  summarise(average_read = mean(read))

## # A tibble: 4 × 2
##    year average_read
##   <dbl>        <dbl>
## 1  1986         2.52
## 2  1988         4.04
## 3  1990         5.02
## 4  1992         5.80

# We can also plot the overall trajectory
ggplot(data = reading %>% group_by(year) %>% summarise(average_read = mean(read)),
       aes(x = year, y = average_read)) +
  geom_line() + ylim(0,10) +
  theme_minimal()

Plot trajectories for a subsample of children

Show me how

# We can use the filter() function and then ggplot()
ggplot(data = reading %>% filter(childid %in% c(1:10)) %>% mutate(childid = factor(childid)),
       aes(x = year, y = read, group = childid, colour = childid)) +
  geom_line() + ylim(0,10) +
  theme_minimal()

Fit a linear random intercept growth model

Show me how

# We can use the lmer() function
random_intercepts <- lmer(read ~ t + (1 | childid), data = reading)

# see results
summary(random_intercepts)

## Linear mixed model fit by REML ['lmerMod']
## Formula: read ~ t + (1 | childid)
##    Data: reading
## 
## REML criterion at convergence: 2212.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.5054 -0.5301  0.0371  0.5530  3.8621 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  childid  (Intercept) 0.7323   0.8557  
##  Residual             0.4225   0.6500  
## Number of obs: 884, groups:  childid, 221
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  2.71932    0.06820   39.87
## t            1.08403    0.01955   55.44
## 
## Correlation of Fixed Effects:
##   (Intr)
## t -0.430

Fit a linear random slope growth model

Show me how

# We can use the lmer() function
random_slope <- lmer(read ~ t + (t | childid), data = reading)

# see results
summary(random_slope)

## Linear mixed model fit by REML ['lmerMod']
## Formula: read ~ t + (t | childid)
##    Data: reading
## 
## REML criterion at convergence: 2128.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.3579 -0.5406 -0.0376  0.4967  4.1818 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  childid  (Intercept) 0.51927  0.7206       
##           t           0.06981  0.2642   0.15
##  Residual             0.30645  0.5536       
## Number of obs: 884, groups:  childid, 221
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  2.71932    0.05762   47.19
## t            1.08403    0.02436   44.51
## 
## Correlation of Fixed Effects:
##   (Intr)
## t -0.205

Fit a quadratic growth model

Show me how

# We can use the lmer() function
quadratic_simple <- lmer(read ~ t + t_squared + (t | childid), data = reading %>% mutate(t_squared = t^2))

# see results
summary(quadratic_simple)

## Linear mixed model fit by REML ['lmerMod']
## Formula: read ~ t + t_squared + (t | childid)
##    Data: reading %>% mutate(t_squared = t^2)
## 
## REML criterion at convergence: 2022.3
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.9572 -0.4935 -0.0371  0.4502  4.5190 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  childid  (Intercept) 0.56713  0.7531       
##           t           0.08349  0.2889   0.04
##  Residual             0.23807  0.4879       
## Number of obs: 884, groups:  childid, 221
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  2.53369    0.05991   42.29
## t            1.64093    0.05493   29.88
## t_squared   -0.18563    0.01641  -11.31
## 
## Correlation of Fixed Effects:
##           (Intr) t     
## t         -0.333       
## t_squared  0.274 -0.896

# We can also include the quadratic term as a growth parameter
quadratic_growth <- lmer(read ~ t + t_squared + (t + t_squared | childid), data = reading %>% mutate(t_squared = t^2))

## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.00866492 (tol = 0.002, component 1)

# see results
summary(quadratic_growth)

## Linear mixed model fit by REML ['lmerMod']
## Formula: read ~ t + t_squared + (t + t_squared | childid)
##    Data: reading %>% mutate(t_squared = t^2)
## 
## REML criterion at convergence: 2003.9
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.2667 -0.4781 -0.0332  0.4251  3.9510 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr       
##  childid  (Intercept) 0.57085  0.7555              
##           t           0.36464  0.6039   -0.04      
##           t_squared   0.01733  0.1316   -0.02 -0.90
##  Residual             0.20349  0.4511              
## Number of obs: 884, groups:  childid, 221
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  2.53369    0.05880   43.09
## t            1.64093    0.06250   26.26
## t_squared   -0.18563    0.01757  -10.57
## 
## Correlation of Fixed Effects:
##           (Intr) t     
## t         -0.285       
## t_squared  0.216 -0.925
## optimizer (nloptwrap) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.00866492 (tol = 0.002, component 1)

Let’s compare all four models. We can conduct a series of likelihood ratio tests, or we can simply compare the BIC of all models. Which model specification is the preferred solution?

Show me how

# We can use the texreg package for such comparisons

# Install the texreg package; you only need to do that once
## install.packages("texreg") # (uncomment if you need to install)

# Load the texreg package
library(texreg)

# We can then use the screenreg() function to make direct comparisons between models
list(random_intercepts, random_slope, quadratic_simple, quadratic_growth) %>% screenreg()

## 
## ==========================================================================================
##                                     Model 1       Model 2       Model 3       Model 4     
## ------------------------------------------------------------------------------------------
## (Intercept)                             2.72 ***      2.72 ***      2.53 ***      2.53 ***
##                                        (0.07)        (0.06)        (0.06)        (0.06)   
## t                                       1.08 ***      1.08 ***      1.64 ***      1.64 ***
##                                        (0.02)        (0.02)        (0.05)        (0.06)   
## t_squared                                                          -0.19 ***     -0.19 ***
##                                                                    (0.02)        (0.02)   
## ------------------------------------------------------------------------------------------
## AIC                                  2220.45       2140.56       2036.34       2023.92    
## BIC                                  2239.59       2169.27       2069.83       2071.76    
## Log Likelihood                      -1106.23      -1064.28      -1011.17      -1001.96    
## Num. obs.                             884           884           884           884       
## Num. groups: childid                  221           221           221           221       
## Var: childid (Intercept)                0.73          0.52          0.57          0.57    
## Var: Residual                           0.42          0.31          0.24          0.20    
## Var: childid t                                        0.07          0.08          0.36    
## Cov: childid (Intercept) t                            0.03          0.01         -0.02    
## Var: childid t_squared                                                            0.02    
## Cov: childid (Intercept) t_squared                                               -0.00    
## Cov: childid t t_squared                                                         -0.07    
## ==========================================================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05

The quadratic growth model with a fixed coefficient for age squared is the preferred solution.

Finally, using the model specification with the preferred solution, fit a growth curve model adjusting for gender and amount of cognitive support at home. How can we interpret this growth curve model with covariates?

Show me how

# fit growth curve model with covariates
quadratic_simple_covs <- lmer(read ~ t + t_squared + male + homecog + (t | childid), data = reading %>% mutate(t_squared = t^2))

# see results
summary(quadratic_simple_covs)

## Linear mixed model fit by REML ['lmerMod']
## Formula: read ~ t + t_squared + male + homecog + (t | childid)
##    Data: reading %>% mutate(t_squared = t^2)
## 
## REML criterion at convergence: 2024.2
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.9753 -0.5063 -0.0452  0.4480  4.5622 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  childid  (Intercept) 0.55944  0.7480       
##           t           0.08349  0.2889   0.01
##  Residual             0.23807  0.4879       
## Number of obs: 884, groups:  childid, 221
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  2.12315    0.22398   9.479
## t            1.64093    0.05493  29.875
## t_squared   -0.18563    0.01641 -11.312
## male        -0.12537    0.11181  -1.121
## homecog      0.05235    0.02280   2.296
## 
## Correlation of Fixed Effects:
##           (Intr) t      t_sqrd male  
## t         -0.091                     
## t_squared  0.073 -0.896              
## male      -0.267  0.000  0.000       
## homecog   -0.928  0.000  0.000  0.006

Second part

Download the read_long.dta data file from the website. Download the file and store it in a sensible folder in your computer (preferably the folder of your R Project). Then, we can use the read_stata() function to import the dataset into R.

Again, because the data is available online, we can also use the following function to import data into R, which is more efficient.

# import data into R
physfunc <- read_stata("https://thiagoroliveira.com/LDA-2025/data/physfunc.dta")

The data file physfunc.dta contains repeated measurements (for up to six occasions) on health functioning from a study of British civil servants, the Whitehall II study. Health functioning was assessed by the SF-36, a 36 item instrument that comprises eight subscales covering physical, psychological and social functioning. These eight scales can be summarised into physical and mental health components. These are scaled using general US population norms to have mean values of 50 and low scores imply poor functioning. We will analyse physical health functioning (phf).

The variables available in the physfunc dataset are the following:

Name	Description
`id`	Identifier for each respondent
`occ`	Occasion (1, 2, 3, 4, 5, 6)
`female`	Gender (1 = female)
`grade`	Employment grade at baseline (1 = high grade, 2 = intermediate grade, 3 = low grade)
`age50`	Mean age at baseline
`phf`	Physical health functioning

The aim of this exercise is to investigate the following questions:

How does women’s physical functioning change as they get older?
To what extent does this vary between women?

Considering the subset of female respondents only, carry out the following analysis and interpret the results:

Obtain summary statistics for phf and age50 at each occasion
Plot physical health functioning trajectories by age for any five individuals in the dataset.
Fit a simple random intercept growth model for phf with no explanatory variables.
Fit a random slope growth model for phf which allows the rate of change in physical health functioning to vary among individuals.
Fit a quadratic growth model for phf which allows for non-linear trajectories.
Which model is preferred?

Longitudinal Data Analysis

Day 2: Growth Curve Models—a multilevel approach

Thiago R Oliveira

First part

Second part