DeclareDesign Community

Subsetting errors in a glm.nb multi-arm experiment


#1

I have been using DeclareDesign to set up a multi-armed design where:

  • the DV has a negative binomial distribution
  • glm.nb from MASS is used for the estimator.

My main question: what is the best way to make comparisons between arms in my estimators?

Following the patterns in the example for multi-armed experiments, I have been using the subset argument to glm.nb to compare specific arms to other arms.

On my OSX machine with the latest R and DeclareDesign, I am able to define the following estimator (for example, to compare the third arm to the first):

declare_estimator(formula = Y ~ Z == 2, estimand = "ate_Y_2_0",
                  subset = data$Z %in% c(0, 2),
                  model = glm.nb, label = "Y_2_0.negbin")

However, when I run this code on an Ubuntu 14.04 machine with exactly the same R version, DeclareDesign version, and dplyr version, I get the following error:

ERROR while rich displaying an object: Error in data$Z: object of type ‘closure’ is not subsettable

According to this StackOverflow thread, the error could be a consequence of DeclareDesign’s convention of using data as the variable name for the dataset passed to each step. I’ve tried several different ways of defining the subset criterion, and all of them produce errors on the Ubuntu machine.

The purpose of this question: my main goal is to specify a multi-armed experiment with a negative binomial estimator, using code that runs reliably on different machines. I’m still learning DeclareDesign, and there may be a better way to set this up; if a particular approach gets me to my goal without substantial debugging, I would be very happy. I have included full details in case a solution does require debugging.

Full code example (This code runs on OSX but not on Ubuntu)
#Note: I know the estimand that uses effect.a.b.irr is incorrectly calculated. I was working to test some potential corrections on a faster machine when I encountered this problem
#Note: I have reduced the n.size to speed up the debugging

library(MASS)
library(DeclareDesign)

# Return the difference in mu associated with an incidence rate ratio                                                                       
# from a negative binomial model. This difference can then be used to 
# simulate a negative binomial distribution for the effect of a given irr
#
#' @param mu The baseline mu in question
#' @param irr The incidence rate ratio in question
mu.diff.from.mu.irr <- function(mu, irr){
    mu*(irr-1)
}

effect.a.irr <- 1.15
effect.b.irr <- 1.1
baseline.mu <- 100
baseline.theta <- 0.155

effect.a.b.irr <- (baseline.mu + mu.diff.from.mu.irr(baseline.mu, effect.b.irr)) /
    (baseline.mu + mu.diff.from.mu.irr(baseline.mu, effect.a.irr))

n.size <- 10000

design <-  
    declare_population(
        N = n.size,
        group = draw_binary(N, prob=0.5),
        u = baseline.mu,
        theta = baseline.theta
    ) +
    declare_potential_outcomes(
        Y_Z_0 = rnegbin(n = N,mu=u ,theta=theta),
        Y_Z_1 = rnegbin(n=N,mu= u + mu.diff.from.mu.irr(u,effect.a.irr), theta=theta),
        Y_Z_2 = rnegbin(n=N,mu= u + mu.diff.from.mu.irr(u,effect.b.irr), theta=theta)
    ) +
    declare_assignment(num_arms = 3, 
                        conditions = c("0", "1", "2"), 
                        assignment_variable = Z) +
    declare_estimand(ate_Y_1_0 = log(effect.a.irr)) +
    declare_estimand(ate_Y_2_0 = log(effect.b.irr)) +
    declare_estimand(ate_Y_2_1 = log(effect.a.b.irr)) +
    declare_reveal(assignment_variables = Z) +
    declare_estimator(formula = Y ~ Z, estimand = "ate_Y_1_0", subset = data$Z %in% c(0, 1), model = glm.nb, label = "Y_1_0.negbin") +
    declare_estimator(formula = Y ~ Z == 2, estimand = "ate_Y_2_0", subset = data$Z %in% c(0, 2), model = glm.nb, label = "Y_2_0.negbin") +
    declare_estimator(formula = Y ~ Z == 2, estimand = "ate_Y_2_1", subset = data$Z %in% c(1, 2), model = glm.nb, label = "Y_2_1.negbin")
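As a quick standalone sanity check of the helper and the effect.a.b.irr ratio above (a sketch using the same constants; the ratio itself is the quantity the first code comment flags as a work in progress):

```r
# Difference in mu implied by an incidence rate ratio (same helper as above)
mu.diff.from.mu.irr <- function(mu, irr) {
  mu * (irr - 1)
}

baseline.mu <- 100

# A baseline mu of 100 with an IRR of 1.15 implies a shift of +15
mu.diff.from.mu.irr(baseline.mu, 1.15)

# The arm-b vs arm-a ratio used for effect.a.b.irr:
# (100 + 10) / (100 + 15) = 110 / 115
(baseline.mu + mu.diff.from.mu.irr(baseline.mu, 1.1)) /
  (baseline.mu + mu.diff.from.mu.irr(baseline.mu, 1.15))
```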

Appendix: About My System Configurations
The only package version that differs is repr: 0.17 on OSX vs 0.18 on Ubuntu.

Ubuntu Machine:
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] repr_0.18 DeclareDesign_0.12.0 estimatr_0.14
[4] fabricatr_0.6.0 randomizr_0.16.1 skimr_1.0.3
[7] forcats_0.3.0 stringr_1.3.1 purrr_0.2.5
[10] readr_1.2.1 tidyr_0.8.2 tibble_1.4.2
[13] tidyverse_1.2.1 rlang_0.3.0.1 ggplot2_3.1.0
[16] MASS_7.3-51.1 dplyr_0.7.8

OSX Machine:
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS 10.14.1

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] repr_0.17 DeclareDesign_0.12.0 estimatr_0.14
[4] fabricatr_0.6.0 randomizr_0.16.1 skimr_1.0.3
[7] forcats_0.3.0 stringr_1.3.1 purrr_0.2.5
[10] readr_1.2.1 tidyr_0.8.2 tibble_1.4.2
[13] tidyverse_1.2.1 rlang_0.3.0.1 ggplot2_3.1.0
[16] MASS_7.3-51.1 dplyr_0.7.8


#2

Thanks for sharing this.

In general, subset is applied within the data already, so you should write Z %in% 0:1 without the data$ prefix. That is also why you got an error about data being a function: with no data variable in scope, data$Z resolves to base R’s data() function, which is a closure.
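For intuition, this is the same convention as base R’s subset(), which evaluates its condition inside the data frame so that bare column names resolve correctly (a minimal base-R sketch, independent of DeclareDesign):

```r
df <- data.frame(Z = c(0, 1, 2, 0), Y = 1:4)

# The condition is evaluated inside df, so bare Z resolves to df$Z
subset(df, Z %in% c(0, 1))   # keeps rows 1, 2, and 4

# Inside declare_estimator there is no `data` object in your own scope when
# the subset expression is written, so data$Z falls through to base R's
# data() function -- a closure -- hence
# "object of type 'closure' is not subsettable".
```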

It’s very strange that it would work on OSX at all. Did you perhaps have something like data <- draw_data(design) in your workspace ahead of time? If so, the estimator would subset based on the Z in your global environment’s data instead of the current simulation, leading to incorrect calculations.

At least that’s how it’s supposed to work.

Sadly, you’ve run into a compatibility issue. declare_estimator was written to work with estimatr, which in turn uses rlang quosures to control non-standard evaluation (NSE). This is different from the NSE in the stats and MASS packages, which is implemented with substitute().
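To make the difference concrete, here is a toy sketch of the substitute()-style NSE that lm() and glm.nb() use for their subset argument (fit_toy is a hypothetical function, not DeclareDesign or MASS code):

```r
# Capture `subset` unevaluated, then evaluate it inside the data --
# this is the mechanism that breaks when it receives a quosure instead
# of a plain expression.
fit_toy <- function(data, subset) {
  keep <- eval(substitute(subset), data, parent.frame())
  data[keep, , drop = FALSE]
}

df <- data.frame(Z = c(0, 1, 2), Y = c(10, 20, 30))
fit_toy(df, Z %in% c(0, 1))   # bare Z works: it is evaluated inside df
```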

The easiest way to work around this would be to write a custom estimator:

declare_estimator(label = "1/0", estimand = "ate_Y_1_0", handler = tidy_estimator(function(data){
  subset(tidy(glm.nb(formula = Y ~ Z, data, subset = (Z %in% c(0, 1)))), term == "Z1")
})) +
declare_estimator(label = "2/0", estimand = "ate_Y_2_0", handler = tidy_estimator(function(data){
  subset(tidy(glm.nb(formula = Y ~ Z, data, subset = (Z %in% c(0, 2)))), term == "Z2")
})) +
declare_estimator(label = "2/1", estimand = "ate_Y_2_1", handler = tidy_estimator(function(data){
  subset(tidy(glm.nb(formula = Y ~ Z, data, subset = (Z %in% c(2, 1)))), term == "Z2")
}))

However, these seem to be very slow, so I’m not going to try to run more than 2-3 sims, but it does seem to work.

@graemeblair :

I’m able to repro the error on Ubuntu 18.04.

The offending line in DeclareDesign is:

    results <- eval_tidy(quo(model(!!!args, data = data)))

in the estimator handler.

We have

model <- MASS::glm.nb

and

> args
$formula
<quosure>
  expr: ^Y ~ Z
  env:  0x3a205bc0

$subset
<quosure>
  expr: ^(Z %in% c(0, 1))
  env:  0x3a1fc7d0

So:

> (e <- quo(model(!!!args, data = data)))
<quosure>
  expr: ^model(formula = ^Y ~ Z, subset = ^(Z %in% c(0, 1)), data = data)
  env:  global
> eval_tidy(e)
Error in Z %in% c(0, 1) : object 'Z' not found

We can fix this by stripping the expression out of the quosure -

> args2 <- args
> args2$subset <- f_rhs(args2$subset)
> (e2 <- quo(model(!!!args2, data = data)))
<quosure>
  expr: ^model(formula = ^Y ~ Z, subset = (Z %in% c(0, 1)), data = data)
  env:  global

Note the absence of the ^ quosure marker in front of the subset argument.

> eval_tidy(e2)

Call:
model(formula = ~(Y ~ Z), data = data, subset = (Z %in% c(0, 
    1)))

Coefficients:
(Intercept)           Z1  
   94.36064     16.62437  

Works. This also affects lm, glm and others that use substitute() to do the subsetting. I see two ways to address this:

  • Current behavior + special case stats:: and MASS:: models
  • Always strip subset + special case estimatr:: models

I would lean towards the first option.

Can you verify whether it works as written on OSX? If so, I would be concerned.


#3

Also, is there any reason you are estimating (0 vs 1) and (0 vs 2) separately, instead of fitting one model Y ~ Z for Z = 0, 1, 2 and pulling out the Z1 and Z2 coefficients?

That’s an easy 33% speed up.
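A hedged sketch of the single-model idea, using a Gaussian lm() as a dependency-free stand-in for glm.nb (the coefficient structure is the same): fit Y ~ Z once with a three-level factor and read off both contrasts against arm 0.

```r
set.seed(42)
d <- data.frame(Z = factor(rep(0:2, each = 50)))
d$Y <- c(10, 12, 11)[as.integer(d$Z)] + rnorm(nrow(d))

# One model covers all three arms; Z1 and Z2 are each arm's contrast vs arm 0
m <- lm(Y ~ Z, data = d)
coef(m)[c("Z1", "Z2")]

# The 2-vs-1 comparison is just the difference of the two coefficients
unname(coef(m)["Z2"] - coef(m)["Z1"])
```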


#4

Hi @nfultz, thanks for this very fast and helpful response, this is great!

I’m going to respond to you in two parts, one related to reproducing the issue and a second related to meeting my design diagnosis goals.

did you perhaps have something like data <- draw_data(design) in your workspace ahead of time?

My code does not define a variable called data anywhere. And I remember re-running the code on OSX from a restarted kernel several times when preparing this post. However, when I run the code (updated to fix a bug in the example above, which I have amended in the post), I now get the error on OSX as well:

`Error in data$Z: object of type 'closure' is not subsettable`

So I think it’s fair to conclude that there must have been something in my environment that I didn’t realize, and that this contributed to the difference in behavior between OSX and Ubuntu. I now see no difference in behavior between OSes.


#5

Thanks again for a wonderfully thoughtful, constructive response, @nfultz!

Here is my response to your comments more related to the research goals.

Z %in% 0:1

I tried this initially after reading through the unit tests, but I was getting the “object not found” error you described later in your comment, so I searched for another approach. As you can see above, the “solution” I found was not a solution, and was probably related to some variable I unknowingly had in my environment. It looks like my question may still have helped you discover a related issue, which makes me feel better about it [grin].

instead of just doing one model Y~Z for Z=0,1,2 and pulling out the Z1 and Z2 coefficients?

Statistically this would work. I wasn’t aware it was possible to return more than one term from a single declare_estimator call. This is my first DeclareDesign project and the multi-arm experiment example does individual comparisons. What would this look like in code, if you don’t mind providing an example?

Since I’m working with maximum likelihood models, even linear improvements in performance are very helpful!


#6

Oh, one final question: using the helpful code you suggested, the Coverage Mean is NA. Is there a way with this approach to estimate coverage? Thanks @nfultz!


#7

If you are willing to use the multcomp package, this is a bit simpler to understand:

estimator <- declare_estimator(handler = tidy_estimator(function(data){
  m <- glm.nb(Y ~ Z, data)
  tidy(summary(glht(m, c(Z1_Z0 = "Z1 = 0", Z2_Z0 = "Z2 = 0", Z2_Z1 = "Z2 - Z1 = 0"))))
}), estimand = c("ate_Y_1_0", "ate_Y_2_0", "ate_Y_2_1"))

It should also run a good deal faster, since it fits the model only once and uses glht to get the differences after the fact.

EDIT:

To get coverage working, you’ll have to calculate the CI yourself, e.g.:

estimator <- declare_estimator(handler = tidy_estimator(function(data){
  m <- glm.nb(Y ~ Z, data)
  out <- tidy(summary(glht(m, c(Z1_Z0 = "Z1 = 0", Z2_Z0 = "Z2 = 0", Z2_Z1 = "Z2 - Z1 = 0"))))
  transform(out,
            conf.low  = estimate - 1.96 * std.error,
            conf.high = estimate + 1.96 * std.error)
}), estimand = c("ate_Y_1_0", "ate_Y_2_0", "ate_Y_2_1"))
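The 1.96 here is the normal 97.5% quantile; the same Wald intervals can be computed with qnorm() instead of a hard-coded constant. A standalone sketch on a tidy-style data frame with made-up numbers:

```r
out <- data.frame(term      = c("Z1_Z0", "Z2_Z0", "Z2_Z1"),
                  estimate  = c(0.140, 0.095, -0.045),
                  std.error = c(0.020, 0.020, 0.021))

z <- qnorm(0.975)   # approximately 1.96
out <- transform(out,
                 conf.low  = estimate - z * std.error,
                 conf.high = estimate + z * std.error)
out
```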

#8

Hi Neal, thanks for this extremely helpful response. Thanks to you, I now understand key ways to customize DeclareDesign for my research needs. In case you missed it, I have published a single-arm version of the code I was writing on GitHub here. I also expect that I’ll be using multi-arm versions with students in my class this semester.

Thanks again!