DeclareDesign Community

How should I think about the relationship between `declare_sampling` & `declare_assignment`?

#1

I am trying to develop intuition for the difference between declare_assignment and declare_sampling. I think I understand declare_assignment fairly well, so I won’t ask too much about that:

  • Can I crudely think of declare_sampling as akin to sample, subsetting or filtering? I.e., anything that reduces the number of participants?

  • So far, I’ve been noticing declare_assignment is a lot more commonly used. Are there any DesignLibrary vignettes that use declare_sampling?

I have one last question, about blocking and clustering:

For example, once I’ve used add_level in the population, I can randomize at the block/cluster level, e.g.:

declare_assignment(block_prob = .5, 
                   blocks = blocks, 
                   clusters = clusters)

On the declare_sampling webpage, there is an example: declare_sampling(strata = female). Is it safe to assume that the name of the variable is female(?) and that it was defined in the population using the add_level function? If not, is this a workaround to not using the add_level function?

I am assuming strata and blocks are synonymous, is that correct?


#2
  1. You are spot on: sampling literally subsets a population down to a sample.

  2. For examples in DesignLibrary, there’s the cluster_sampling designer, and the regression_discontinuity designer is an example with a custom handler (implemented using subset).

  3. It looks like the manual for this step is a little undercooked. For that particular example, female needs to be a column on the data frame, but it need not be generated by add_level - something like this should work just fine:

 p <- declare_population(N=30, 
          female=rbinom(N, 1, .5), 
          pet=sample(c('Dog', 'Cat', 'Turtle'), N, replace=TRUE))
  4. The words have different connotations. Strata/stratification/stratifying connotes sampling (and is used on the sampling functions in randomizr); blocks/blocking connotes a designed experiment (and is used on the assignment functions).
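
To make the "sampling is subsetting" point concrete, here is a minimal sketch that calls randomizr directly rather than going through declare_sampling (whose argument conventions differ between DeclareDesign versions). The data frame `pop`, its columns, and the 14/16 split are made up for illustration.

```r
library(randomizr)

set.seed(1)

# Hypothetical population of 30 units: 14 with female = 0, 16 with female = 1.
pop <- data.frame(id = 1:30, female = rep(0:1, times = c(14, 16)))

# Stratified sampling: draw half of each stratum.
# S is 1 for sampled units and 0 for everyone else.
pop$S <- strata_rs(strata = pop$female, prob = 0.5)

# Sampling as subsetting: keep only the rows that were drawn.
smp <- pop[pop$S == 1, ]

nrow(pop)          # size of the population
nrow(smp)          # size of the sample
table(smp$female)  # sampled units per stratum
```

The strata variable here is just a column in the data frame; nothing about it requires add_level.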

#3

Hi John-Henry,

Neal gave a great reply – just wanted to add one point on the correspondence between stratification, blocking, and clustering in sampling and random assignment.

declare_assignment and declare_sampling use randomizr::block_and_cluster_ra() and randomizr::strata_and_cluster_rs(), respectively, as their default handlers.

It turns out that block random assignment is stratified random sampling: you condition on some covariate (pet), and “sample” equal numbers of people with dogs, cats, and turtles into control and into treatment. Similarly, cluster random assignment is clustered sampling: you sample whole groups of people into treatment or into control. The difference between assignment and sampling is simply that, in the case of sampling, when you get a 0 you drop out of the data, whereas when you get a 0 in assignment you drop out of the treatment, into control. This provides an intuition for why we use inverse probability weighting (IPW) when we estimate treatment effects under heterogeneous assignment propensities: it’s the same as saying “when we estimate the treatment (control) average, let’s take account of the fact that units with these attributes are overrepresented in the treatment (control) group,” in exactly the same way you would when using sampling weights to get a population estimate under random sampling.
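
A small sketch of the weighting intuition, under made-up numbers: two blocks with different assignment probabilities, a constant treatment effect of 1, and baseline outcomes that differ by block. The block labels, probabilities, and outcome values are all assumptions chosen so the arithmetic is easy to follow, and the weights are computed by hand rather than with any package helper.

```r
library(randomizr)

set.seed(2)

# Hypothetical setup: block A is treated with probability 0.5, block B with 0.25.
blocks <- rep(c("A", "B"), each = 20)
p      <- ifelse(blocks == "A", 0.5, 0.25)

# block_prob is given in the order of sort(unique(blocks)), i.e. A then B.
Z <- block_ra(blocks = blocks, block_prob = c(0.5, 0.25))

# Baseline outcome differs by block; the true treatment effect is 1 for everyone.
Y0 <- ifelse(blocks == "A", 0, 10)
Y  <- Y0 + 1 * Z

# The unweighted difference in means mixes the effect with block composition.
naive <- mean(Y[Z == 1]) - mean(Y[Z == 0])

# Inverse probability weights: 1/p for treated units, 1/(1 - p) for controls.
w   <- ifelse(Z == 1, 1 / p, 1 / (1 - p))
ipw <- weighted.mean(Y[Z == 1], w[Z == 1]) -
       weighted.mean(Y[Z == 0], w[Z == 0])

naive
ipw
```

Because block B units are underrepresented among the treated, the unweighted comparison is pulled away from the true effect, while the weighted one recovers it, just as sampling weights would for a population mean.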

So, at a fundamental level randomizr::block_ra() and randomizr::strata_rs() are doing essentially the same thing (and, for the most part, have parallel arguments); same for randomizr::cluster_ra() and randomizr::cluster_rs().
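
As a rough illustration of that parallel, the sketch below calls the assignment and sampling functions side by side on the same made-up covariates. The vectors `pet` and `village` are hypothetical.

```r
library(randomizr)

set.seed(3)

# Hypothetical covariate: 10 owners each of dogs, cats, and turtles.
pet <- rep(c("Dog", "Cat", "Turtle"), each = 10)

# Block assignment and stratified sampling take parallel arguments.
Z_block  <- block_ra(blocks = pet, prob = 0.5)   # 0 = control, 1 = treatment
S_strata <- strata_rs(strata = pet, prob = 0.5)  # 0 = dropped, 1 = sampled

table(pet, Z_block)
table(pet, S_strata)

# Hypothetical clusters: 6 villages of 5 people each.
village <- rep(1:6, each = 5)

# Cluster assignment and cluster sampling both act on whole villages.
Z_cluster <- cluster_ra(clusters = village, m = 3)  # treat 3 villages
S_cluster <- cluster_rs(clusters = village, n = 3)  # sample 3 villages

table(village, Z_cluster)
table(village, S_cluster)
```

The two tables in each pair have the same shape; only the interpretation of a 0 differs.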


#4

Thanks Neal & Jasper. Bringing this back around to my add_level question, what is the main advantage of using add_level when defining my population as compared to just blocking on a variable (e.g., declare_assignment(blocks = female))?

My impression is that the advantage of add_level is that it allows me to:

  1. precisely decide the size of my blocks
  2. block on multiple vars
  3. Define certain block level characteristics e.g., u_b = rnorm(N) * sd_block

Is this all correct? Am I missing something? Does it ever make sense to just do something like declare_assignment(blocks = female)?


#5

Hi JH,

Typically, declare_population() helps determine the data-generating process that takes place prior to any researcher intervention like random assignment of a treatment. add_level is used to give you more control over making your data hierarchical, and really for that purpose only. So, both of these give you a dataset with four groups and forty individuals:
declare_population(groups = add_level(N = 4, letter = LETTERS[1:4]), individuals = add_level(N = 10, noise = rnorm(N)))
declare_population(N = 40, letter = sample(LETTERS[1:4], 40, TRUE), noise = rnorm(N))
but the second version, as you point out, won’t have precisely controlled group sizes.
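
To see the difference in group sizes, here is a minimal sketch that builds both datasets with fabricatr::fabricate() directly, using the same arguments as above. Going through fabricate() rather than declare_population() is a choice made here to keep the example short; the object names `hier` and `flat` are made up.

```r
library(fabricatr)

set.seed(4)

# Hierarchical version: 4 groups, each containing exactly 10 individuals.
hier <- fabricate(
  groups      = add_level(N = 4, letter = LETTERS[1:4]),
  individuals = add_level(N = 10, noise = rnorm(N))
)

# Flat version: 40 individuals, each given a letter at random.
flat <- fabricate(
  N      = 40,
  letter = sample(LETTERS[1:4], 40, TRUE),
  noise  = rnorm(N)
)

table(hier$letter)  # exactly 10 per letter
table(flat$letter)  # sizes vary from draw to draw
```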

This is separate from the question of blocked random assignment. When you say declare_assignment(blocks = female) you’re telling DD to randomize once among men and once among women. Not every level of groups added with add_level() is a block, and not every blocking scheme conditions on variables added to your data using add_level. So, the following design is completely permissible:
declare_population(N = 10, female = rbinom(N, 1, .5)) + declare_assignment(blocks = female).

My impression is that the advantage of add_level is that it allows me to:

  1. precisely decide the size of my blocks

Yes. But you could also precisely determine block sizes without add_level() (e.g. declare_population(N = 10, blocks = rep(c(1,2), c(5,5))) + declare_assignment(blocks = blocks)). It’s also helpful for determining the size of groups in general, not just ones you plan to block on.
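
A quick sketch of the same idea using randomizr directly: the block variable is an ordinary vector built with rep(), and block_m (here an arbitrary choice of two treated units per block) fixes how many units are treated within each block.

```r
library(randomizr)

set.seed(5)

# Two blocks of exactly five units each, built without add_level.
blocks <- rep(c(1, 2), c(5, 5))

# Assign exactly 2 units to treatment within each block.
Z <- block_ra(blocks = blocks, block_m = c(2, 2))

table(blocks)     # block sizes are fixed by construction
table(blocks, Z)  # treated and control counts per block
```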

  2. block on multiple vars

I don’t think this is an advantage of add_level – this sounds like a job for some other function or package, such as blockTools, which forms blocks for the random assignment based on multiple variables. add_level won’t do this – it just adds a hierarchical level to your data (e.g. students within classes within schools).
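
For the simplest case, where you just want one block per combination of a few discrete covariates, a sketch like the following may be enough. This is not what blockTools does (it forms blocks from multiple variables in a more sophisticated way); it only crosses two made-up variables, `female` and `pet`, by pasting their values together.

```r
library(randomizr)

set.seed(6)

# Hypothetical fully crossed covariates: 2 x 3 = 6 cells with 4 units each.
female <- rep(0:1, each = 12)
pet    <- rep(c("Dog", "Cat", "Turtle"), times = 8)

# One block per (female, pet) combination.
blocks <- paste(female, pet, sep = "_")

# Treat half of each cell.
Z <- block_ra(blocks = blocks, prob = 0.5)

table(blocks, Z)
```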

  3. Define certain block level characteristics e.g., u_b = rnorm(N) * sd_block

Yes! Block or group level characteristics. This is probably the most helpful thing about add_level. If you want to say that students in class j get a common shock because they share the same teacher, add_level makes parameterizing this so much easier.
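
Here is a minimal sketch of such a class-level shock, again using fabricatr::fabricate() directly. The level names, the standard deviation of 2, and the outcome `score` are assumptions for illustration only.

```r
library(fabricatr)

set.seed(7)

# 5 classes of 20 students. u_class is drawn once per class, so every
# student in a class shares it; u_student is drawn once per student.
dat <- fabricate(
  classes  = add_level(N = 5, u_class = rnorm(N, sd = 2)),
  students = add_level(
    N         = 20,
    u_student = rnorm(N),
    score     = u_class + u_student
  )
)

head(dat)

# One distinct class-level shock per class.
tapply(dat$u_class, dat$classes, function(x) length(unique(x)))
```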

Is this all correct? Am I missing something? Does it ever make sense to just do something like declare_assignment(blocks = female) ?

Yes, when you want to ensure that the same proportion of men and of women is assigned to treatment – which, as I hope is clear from the above, is a separate question from that of adding hierarchy to the data.

