Today, we will explore data from children of female respondents to the US National Longitudinal Survey of Youth. The data are a subsample from the National Longitudinal Survey of Youth (NLSY) of Labor Market Experience in Youth. Starting in 1986, children of the female respondents of the original NLSY Youth sample were assessed in 1986, and again in 1988, 1990 and 1992. For inclusion in the subsample considered here, children had to be between 6 and 8 years old at the first measurement. We also restrict the analysis to 221 children who were assessed at all four occasions.
As usual, let’s start by installing and/or loading the packages we
will need for the session. In this session, we will use three packages:
haven
, tidyverse
, and lme4
.
# Install the lme4 package; you only need to do that once
## install.packages("lme4") # (uncomment if you need to install)
# Load the lme4 package
library(lme4)
## We are also going to use the tidyverse package, so let's load it now
# Install the tidyverse package; you only need to do that once
## install.packages("tidyverse") # (uncomment if you need to install)
# Load the tidyverse package
library(tidyverse)
## Finally, we also need the haven package to import datasets into R
## install.packages("haven") # (uncomment if you need to install)
library(haven)
Now, let’s download the read_long.dta
data file from the
website. Download the
file and store it in a sensible folder in your computer (preferably the
folder of your R Project). Then, we can use the
read_stata()
function to import the dataset into
R
.
Because the data is available online, we can also use the following
function to import data into R
, which is more efficient.
(Though do note that most datasets are not readily available
online).
# import data into R
reading <- read_stata("https://thiagoroliveira.com/LDA-2025/data/read_long.dta")
The variables available in the reading
dataset are the
following:
Name | Description |
---|---|
childid |
Identifier for each respondent |
t |
Occasion (0 = 1986, 1 = 1988, 2 = 1990, 3 = 1992) |
male |
Gender (1 = male) |
homecog |
Amount of cognitive support at home |
read |
Reading scores |
year |
Year |
Let’s reproduce some of the analyses shown in the lecture slides.
Considering read
as your main dependent variable:
# We can use the group_by() and summarise() functions
reading %>%
group_by(year) %>%
summarise(average_read = mean(read))
## # A tibble: 4 × 2
## year average_read
## <dbl> <dbl>
## 1 1986 2.52
## 2 1988 4.04
## 3 1990 5.02
## 4 1992 5.80
# We can also plot the overall trajectory
ggplot(data = reading %>% group_by(year) %>% summarise(average_read = mean(read)),
aes(x = year, y = average_read)) +
geom_line() + ylim(0,10) +
theme_minimal()
# We can use the filter() function and then ggplot()
ggplot(data = reading %>% filter(childid %in% c(1:10)) %>% mutate(childid = factor(childid)),
aes(x = year, y = read, group = childid, colour = childid)) +
geom_line() + ylim(0,10) +
theme_minimal()
# We can use the lmer() function
random_intercepts <- lmer(read ~ t + (1 | childid), data = reading)
# see results
summary(random_intercepts)
## Linear mixed model fit by REML ['lmerMod']
## Formula: read ~ t + (1 | childid)
## Data: reading
##
## REML criterion at convergence: 2212.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.5054 -0.5301 0.0371 0.5530 3.8621
##
## Random effects:
## Groups Name Variance Std.Dev.
## childid (Intercept) 0.7323 0.8557
## Residual 0.4225 0.6500
## Number of obs: 884, groups: childid, 221
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 2.71932 0.06820 39.87
## t 1.08403 0.01955 55.44
##
## Correlation of Fixed Effects:
## (Intr)
## t -0.430
# We can use the lmer() function
random_slope <- lmer(read ~ t + (t | childid), data = reading)
# see results
summary(random_slope)
## Linear mixed model fit by REML ['lmerMod']
## Formula: read ~ t + (t | childid)
## Data: reading
##
## REML criterion at convergence: 2128.6
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.3579 -0.5406 -0.0376 0.4967 4.1818
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## childid (Intercept) 0.51927 0.7206
## t 0.06981 0.2642 0.15
## Residual 0.30645 0.5536
## Number of obs: 884, groups: childid, 221
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 2.71932 0.05762 47.19
## t 1.08403 0.02436 44.51
##
## Correlation of Fixed Effects:
## (Intr)
## t -0.205
# We can use the lmer() function
quadratic_simple <- lmer(read ~ t + t_squared + (t | childid), data = reading %>% mutate(t_squared = t^2))
# see results
summary(quadratic_simple)
## Linear mixed model fit by REML ['lmerMod']
## Formula: read ~ t + t_squared + (t | childid)
## Data: reading %>% mutate(t_squared = t^2)
##
## REML criterion at convergence: 2022.3
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.9572 -0.4935 -0.0371 0.4502 4.5190
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## childid (Intercept) 0.56713 0.7531
## t 0.08349 0.2889 0.04
## Residual 0.23807 0.4879
## Number of obs: 884, groups: childid, 221
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 2.53369 0.05991 42.29
## t 1.64093 0.05493 29.88
## t_squared -0.18563 0.01641 -11.31
##
## Correlation of Fixed Effects:
## (Intr) t
## t -0.333
## t_squared 0.274 -0.896
# We can also include the quadratic term as a growth parameter
quadratic_growth <- lmer(read ~ t + t_squared + (t + t_squared | childid), data = reading %>% mutate(t_squared = t^2))
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.00866492 (tol = 0.002, component 1)
# see results
summary(quadratic_growth)
## Linear mixed model fit by REML ['lmerMod']
## Formula: read ~ t + t_squared + (t + t_squared | childid)
## Data: reading %>% mutate(t_squared = t^2)
##
## REML criterion at convergence: 2003.9
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.2667 -0.4781 -0.0332 0.4251 3.9510
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## childid (Intercept) 0.57085 0.7555
## t 0.36464 0.6039 -0.04
## t_squared 0.01733 0.1316 -0.02 -0.90
## Residual 0.20349 0.4511
## Number of obs: 884, groups: childid, 221
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 2.53369 0.05880 43.09
## t 1.64093 0.06250 26.26
## t_squared -0.18563 0.01757 -10.57
##
## Correlation of Fixed Effects:
## (Intr) t
## t -0.285
## t_squared 0.216 -0.925
## optimizer (nloptwrap) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.00866492 (tol = 0.002, component 1)
# We can use the texreg package for such comparisons
# Install the texreg package; you only need to do that once
## install.packages("texreg") # (uncomment if you need to install)
# Load the texreg package
library(texreg)
# We can then use the screenreg() function to make direct comparisons between models
list(random_intercepts, random_slope, quadratic_simple, quadratic_growth) %>% screenreg()
##
## ==========================================================================================
## Model 1 Model 2 Model 3 Model 4
## ------------------------------------------------------------------------------------------
## (Intercept) 2.72 *** 2.72 *** 2.53 *** 2.53 ***
## (0.07) (0.06) (0.06) (0.06)
## t 1.08 *** 1.08 *** 1.64 *** 1.64 ***
## (0.02) (0.02) (0.05) (0.06)
## t_squared -0.19 *** -0.19 ***
## (0.02) (0.02)
## ------------------------------------------------------------------------------------------
## AIC 2220.45 2140.56 2036.34 2023.92
## BIC 2239.59 2169.27 2069.83 2071.76
## Log Likelihood -1106.23 -1064.28 -1011.17 -1001.96
## Num. obs. 884 884 884 884
## Num. groups: childid 221 221 221 221
## Var: childid (Intercept) 0.73 0.52 0.57 0.57
## Var: Residual 0.42 0.31 0.24 0.20
## Var: childid t 0.07 0.08 0.36
## Cov: childid (Intercept) t 0.03 0.01 -0.02
## Var: childid t_squared 0.02
## Cov: childid (Intercept) t_squared -0.00
## Cov: childid t t_squared -0.07
## ==========================================================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
The quadratic growth model with a fixed coefficient for age squared is
the preferred solution.
# fit growth curve model with covariates
quadratic_simple_covs <- lmer(read ~ t + t_squared + male + homecog + (t | childid), data = reading %>% mutate(t_squared = t^2))
# see results
summary(quadratic_simple_covs)
## Linear mixed model fit by REML ['lmerMod']
## Formula: read ~ t + t_squared + male + homecog + (t | childid)
## Data: reading %>% mutate(t_squared = t^2)
##
## REML criterion at convergence: 2024.2
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.9753 -0.5063 -0.0452 0.4480 4.5622
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## childid (Intercept) 0.55944 0.7480
## t 0.08349 0.2889 0.01
## Residual 0.23807 0.4879
## Number of obs: 884, groups: childid, 221
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 2.12315 0.22398 9.479
## t 1.64093 0.05493 29.875
## t_squared -0.18563 0.01641 -11.312
## male -0.12537 0.11181 -1.121
## homecog 0.05235 0.02280 2.296
##
## Correlation of Fixed Effects:
## (Intr) t t_sqrd male
## t -0.091
## t_squared 0.073 -0.896
## male -0.267 0.000 0.000
## homecog -0.928 0.000 0.000 0.006
Download the read_long.dta
data file from the website. Download the
file and store it in a sensible folder in your computer (preferably the
folder of your R Project). Then, we can use the
read_stata()
function to import the dataset into
R
.
Again, because the data is available online, we can also use the
following function to import data into R
, which is more
efficient.
# import data into R
physfunc <- read_stata("https://thiagoroliveira.com/LDA-2025/data/physfunc.dta")
The data file physfunc.dta
contains repeated
measurements (for up to six occasions) on health functioning from a
study of British civil servants, the Whitehall II study. Health
functioning was assessed by the SF-36, a 36 item instrument that
comprises eight subscales covering physical, psychological and social
functioning. These eight scales can be summarised into physical and
mental health components. These are scaled using general US population
norms to have mean values of 50 and low scores imply poor functioning.
We will analyse physical health functioning (phf
).
The variables available in the physfunc
dataset are the
following:
Name | Description |
---|---|
id |
Identifier for each respondent |
occ |
Occasion (1, 2, 3, 4, 5, 6) |
female |
Gender (1 = female) |
grade |
Employment grade at baseline (1 = high grade, 2 = intermediate grade, 3 = low grade) |
age50 |
Mean age at baseline |
phf |
Physical health functioning |
The aim of this exercise is to investigate the following questions:
Considering the subset of female respondents only, carry out the following analysis and interpret the results:
Obtain summary statistics for phf
and
age50
at each occasion
Plot physical health functioning trajectories by age for any five individuals in the dataset.
Fit a simple random intercept growth model for phf
with no explanatory variables.
Fit a random slope growth model for phf
which allows
the rate of change in physical health functioning to vary among
individuals.
Fit a quadratic growth model for phf
which allows
for non-linear trajectories.
Which model is preferred?