Diagnosing multilevel models


I’m designing an experiment in which I will randomise an education intervention, the outcome will be test scores (out of 100) over a three month period. Individual students will receive a number of tests, and some students will have more tests than others.

My instinct is to just average over these tests for each student and make the outcome “average test score over three months”. However, a statistician suggested I should use a multilevel model to avoid throwing away data and estimate a random intercept for student.

I’m actually quite reluctant to introduce the kind of assumptions multilevel models require into my RCT so I’m wondering if it’s possible to use lme4 with DeclareDesign so I can compare the aggregated approach to a multilevel approach and check for bias/precision? Unfortunately I’m not sure where to start with declaring the estimator, any help/existing resources you have would be appreciated.

Thanks in advance!

1 Like

Hi Simsco, thanks for the great question! This is exactly the sort of thing DeclareDesign was built to answer.

So, luckily the folk in charge of broom.mixed have made some tidy methods for lmer objects – anything that works with tidy() is fairly easy to integrate into the DD workflow.

Let’s load the packages and build a multilevel DGP with some assumptions baked in:


N_schools <- 1
N_classes <- 10
N_students_per_class <- 20
N_tests <- 3

ml_design <- 
  # Tests are clustered within students, students within classes, 
  # classes within schools. Each level has a level specific effect, e.g. each i
  # student in class j gets shock alpha_j ~ N(0,.001), etc.
    school = add_level(N = N_schools, school_effect = rnorm(N, 0, .001)), 
    class = add_level(N = N_classes, class_effect = rnorm(N, 0, .001), 
                      N_students = rpois(N,N_students_per_class) + 5), 
    student = add_level(N = N_students, 
                        test_ability = runif(N, -.01, .01),
                        student_baseline = runif(N, min = .4,max = .8)), 
    test = add_level(N = 3, test_effect = rnorm(N, test_ability, .001))) +
  # Each student's score on a given test is a function of their baseline, 
  # plus the school, class and test-specific effects -- AND the treatment effect
  # is heterogeneous by all three (random slopes and intercepts). Note here that
  # the mean of the k'th test effect is specific to the i'th student:
  # test_ability_i ~ Uniform(-.01, .01); test_effect_ik ~ N(test_ability_i, .001)
    score ~ student_baseline + (1 + treatment) * (school_effect + class_effect + test_effect), 
    assignment_variable = "treatment") +
  # We want the average treatment effect across tests (but we could reparameterize this
  # as the average effect on the student's average score)
  declare_estimand(ate = mean(score_treatment_1 - score_treatment_0)) +
  # I'm assuming that the treatment is clustered at the class-level (easy to change)
  declare_assignment(prob = .5, clusters = class, assignment_variable = "treatment") +
  # Reveal the score potential outcome as a function of the treatment
  declare_reveal(score, treatment) 

You can look at the data using draw_data(ml_design) %>% head().

Now, we build a little helper function for the lmer() call, to ensure it returns estimates as a data.frame:

# We will use the fact that broom.mixed has a tidy method for lmer:
tidy_lmer <- function(data, formula){
  lmer_fit <- lmer(formula = formula, data = data)
  tidy_fit <- broom.mixed::tidy(lmer_fit)
  subset(tidy_fit, term == "treatment")

And we add it to the design. Notice that we are also using dplyr in the second-to-last step to aggregate test scores to the student-level average, before running our regression estimator.

ml_design <- 
  ml_design +
  # lmer
  declare_estimator(formula = score ~ treatment + (1 | student), 
                    handler = tidy_estimator(tidy_lmer),
                    estimand = "ate",
                    label = "Random Intercept") +
  # Aggregate the data to student level using dplyr
  declare_step(handler = function(data){
    data %>% group_by(school,class,student,treatment) %>%
      summarize(av_score = mean(score))},
    label = "Aggregate to student level") +
  # Run a regression of average scores on treatment
  declare_estimator(formula = av_score ~ treatment, 
                    model = lm_robust, 
                    clusters = class, 
                    estimand = "ate",
                    label = "Average Score")

Unfortunately, it seems as though lmer() does not automatically include p-values and confidence intervals, so you’re not going to get power and coverage from the diagnosis. If you need these, you could use the standard errors in the tidy_fit object that is created inside tidy_lmer to add a column of p-values and confidence intervals to the output. Then DD will also calculate power and coverage. But this will at least get you bias, RMSE, and things like the average SE versus the standard deviation of the estimates across simulations.

You’ll want to run a large number of simulations (e.g. 1000), but here’s what it looks like when you run 5:

simulate_design(ml_design,sims =  5)

diagnose_design(ml_design,sims =  5)


1 Like

@Jasper_Cooper - there’s a long running argument about dfs for t-stats for mixed models, and the lmer authors choose not to include it in their output - so I won’t go there, but would refer to the lmer FAQs. However, if you are using glmer, you get z-stats and normal diagnosands work fine.

You can also use glmer with the default declare_estimator handler, so you can do

declare_estimator(formula = score ~ treatment + (1 | student), 
                    model=glmer, family="binomial",
                    estimand = "ate",
                    label = "glmer") 

and it should work fine. Assumming broom.mixed is already loaded, it will get called automagically via multiple dispatch.

You would need to write custom handlers if you are interested in anything involving the random effects components of the mixed model, though.

I also really recommend cranking up the # of sims with multilevel models, 1000 is likely way too few, 50k seems to be fairly stable and not too slow.

Thanks @Jasper_Cooper and @nfultz this got me off to a great start.

On p-values, the package Lmrtest adds p-values (calculated using satterthwaite’s degrees of freedom method) to summary output for lme4 models. I was able to get this working with broom and output a tidy version that Declare Design can work with. Honestly, the lack of consensus around calculating p-values is another reason why I’m skeptical about using multilevel models when analysing RCTs.

Thanks again,