Block_RA 'Wrapping'

Hi All,

I’m looking for some guidance on (a) whether a technique is possible in randomizr and/or (b) what is the most efficient way to achieve the technique. I have schools nested within districts, and i want to randomly assign schools to intervention or control within district as a block. Assume i have 5 schools each in 5 districts, and the odd number of units in a block is where the problem potentially arises.

I generated a scenario where 3 schools in each district were assigned to intervention (n=15) and only 2 schools were assigned to control (n=10; see attached where 3-5 mfa represents treatment). Is there a way to ‘wrap’ the assignment mechanism in such a way so that if block 1 was 3-2 the next block would be 2-3 so that the greatest balance of assignment for the overall study could be achieved?

In Excel, let’s say, i might assign random numbers to all schools, then sort the file by district (block) and the random number and begin assignment at unit number 1. So assignment would proceed as T, C, T, C, T for block 1. Block 2 would then pick up where block 1 left off: C, T, C, T, C and so on.

I think this has been on the roadmap for a while:

Macartan’s writeup should still apply conceptually, but I think much of the randomizr syntax has shifted. I definitely feel like load balancing is not quite clear given how it could interact with the rest of the randomizr / DeclareDesign feature set.

I was thinking about this last night, may have come up with a workaround for you.

You have five districts (A-E) each with five schools:

A1, A2, A3, A4, A5
B1, B2, B3, B4, B5
C1, C2, C3, C4, C5
D1, D2, D3, D4, D5
E1, E2, E3, E4, E5

Create a new variable, with the remainder schools set to a new unit:

A1, A2, A3, A4
B1, B2, B3, B4
C1, C2, C3, C4
D1, D2, D3, D4
E1, E2, E3, E4
F1(A5), F2(B5), F3(C5), F4(D5), F5(E5)

Then do the assignment, blocking on that new variable.

The code is even easier if instead of the final school per group, we use the first:

A2, A3, A4, A5
B2, B3, B4, B5
C2, C3, C4, C5
D2, D3, D4, D5
E2, E3, E4, E5
F1(A1), F2(B1), F3(C1), F4(D1), F5(E1)

In code, that would be something like:

df$assn_district <- ifelse(df$school_id ==1, 'F', df$real_district)

fwiw below is the code and description of one approach to this:

Description of a randomization function that handles heterogeneous probabilities and awkward integer blocks sizes. The function is a lot slower than randomizr for routine tasks but can handle less routine tasks that randomizr cannot.

Two illustrations of functionality made possible with this function.

1: Jack and Jill have a race. Jill is faster than Jack and has a higher probability of winning. You want to simulate a distribution of wins. This is a situation where probabilites are heterogeneous and in which there is a target number of units to be selected. This problem is neither simple nor complete, as understood by randomizr.

2: You have 2 districts with 3 villages each. You want to assign 3 villages to treatment, blocking by district, and with equal probabilities for all units. This randomization requires an allocation both across and within blocks whereas randomizr only allocates within blocks. More generally, the issues here is that the target number to be assigned in a given block is not an integer.

These problems can both occur in a given problem and indeed you would expect them to whenever there are generic probabilities and blocks. They are not convoluted examples and it would be nice to have functionality that can handle them.

The function

The basic function works by doing systematic sampling over a random (but block preserving) order.

.prob_ra <- function(p = .5, 
                  b = NULL, 
                  n = NULL,
                  tol = 10){
  # Housekeeping
  if(is.null(n)) {if(!is.null(b)) n <- length(b)
                  if( is.null(b) & length(p)>1)  n <- length(p)}
  if(length(p) == 1) p <- rep(p, n)
  if(is.null(b))     b <- rep(1, n)
  p   <- round(p, tol)  
  m   <- ceiling(sum(p))
  if(m == 0) return(rep(0, length(p)))
  # Figure out if we have to deal with a random total
  tag <- m > floor(sum(p))
    p <- c(p, ceiling(sum(p)) - sum(p))
    n <- n+1
    b <- c(b, ".dummy")

  base <- p - p%%1
  p <- p - base 
  # randomly order blocks then reorder within blocks
  b_names   <- unique(b)
  k         <- length(b_names)
  seq1      <- rep(NA, length(b))
  b_shuffle <- sample(1:k)
  for(j in 1:k) seq1[b==b_names[j]] <- b_shuffle[j]
  seq2      <- rank(seq1 + runif(n))
  p[seq2]   <- p
  # Now  do systematic assignment
  s   <- (cumsum(p) +m*runif(1))%%m
  e   <- s - floor(s)
  out <- 1*(e < c(e[n], e[-n]))
  out <- out[seq2]
  out <- out + base
  if(tag) out <- out[-n]

The more general function applies this for each treatment:

prob_ra <- function(p = .5, 
                  b = NULL, 
                  n = NULL){

if(is.null(ncol(p))) {Z <- .prob_ra(p, b, n)
} else {
Z <- matrix(NA, nrow(p), ncol(p))
Z[,1] <- .prob_ra(p[,1],b,n)

for(j in 2:ncol(p)){
  q <- p[,j]
  q[apply(Z, 1, sum, na.rm = TRUE)==1] <- 0
  q <- q/(1-apply(as.matrix(p[,1:(j-1)]), 1, sum, na.rm = TRUE))  
  q[is.nan(q)] <- 0
  Z[,j] <- .prob_ra(as.vector(q), b, n)
Z <-Z%*%matrix(1:ncol(p),ncol(p))

Illustration: One arm

With random data:

s    <- 100
p    <- runif(s)
b    <- sample(1:5, s, replace = TRUE, prob = 1:5)
sims <- 10000
runs <- replicate(sims, prob_ra(p, b))

Total selected is as tight as possible

There should only be a unit difference between the totals assigned in any set of runs:

table(apply(runs, 2, sum))

Total selected in each block is also as tight as possible

Should be only max 1 unit between min and max

bin_dist <- apply(runs, 2, function(j) table(b, j)[,2])
table_check <- t(rbind(apply(bin_dist, 1, function(j)  c(mean(j), min(j), max(j)))))
colnames(table_check) <- c("sim_p", "min", "max")
kable(round(cbind(size = table(b), true_p = aggregate(p, by = list(b), FUN = sum)[,2], table_check), 2))

True assignment probabilities are respected at the lowest level

plot(p, apply(runs, 1, mean), xlim = c(0,1), ylim = c(0,1))

Illustration: Multiple arms

The function can also be used sequentially for multiple treatment. In this case it implements the based treatment in a hierarchical manner, which preserves individual probabilities, but prioritizes balancing by order.

Multiple Treatments Illustration

s  <- 100
b  <- sample(1:5, s, replace = TRUE, prob = 1:5)
p1 <- runif(s)
p2 <- runif(s)*(1-p1)
p <- cbind(p1,p2)

Note that t2 will be systematic, like t1, given t1, but not unconditionally systematic

We do two step allocation: first allocate t1 optimally and then given this allocation we allocate t2. We do this many times to check that the probability of assignments are all correct for p2.

runs2 <- replicate(sims, 1*(as.vector(prob_ra(p))==2))

The result is much tighter than independent, but not as tight as possible as possible

indep <- replicate(sims,  sum(rbinom(length(p2), 1, p2)))
hist(indep, main = "Total t2 allocation | indep") 
hist(apply(runs2, 2, sum), main = "Total t2 allocation | scheme", xlim = range(indep))

(Aside: would be useful to compare with distribution given random independent block targets.)

Similarly total selected in each bin is tight but not as tight as possible

Ideally max 1 unit between min and max

bin_dist <- apply(runs2, 2, function(j) table(b, j)[,2])
table_check <- t(rbind(apply(bin_dist, 1, function(j)  c(mean(j), min(j), max(j)))))
colnames(table_check) <- c("sim_p", "min", "max")

kable(round(cbind(size = table(b), true_p = aggregate(p2, by = list(b), FUN = sum)[,2], table_check), 2))

But again the true probabilities preserved at unit level (and so also at block levels)

plot(p2, apply(runs2, 1, mean), xlim = c(0,1), ylim = c(0,1))

In your case this would be:

prob_ra(p = .5, b = rep(1:3, each = 5))

which should produce always either 7 or 8 in treatment and always 2 or 3 in each block

We have found that the issues around this become quite hard when there are multiple treatments

Thanks @nfultz and @Macartan for devoting time to thinking about this and providing a potential solution. I’m going to read through this and digest. In the meantime i went down a rabbit hole with my scenario and ‘hacked’ a solution:

I have my data frame with schools nested within their blocks. I created random numbers and then repeated the pattern of assignment across all units (and blocks) based on the first unit’s random number. Forgive the rookie code…still learning R and all that comes with it :crazy_face:

#engage in block randomization using 'wrapping' when counts are odd within block
#create two random numbers: first to order units, second to assign first sorted unit to tx (< .5) or ct
pre_rand_data$ranord <- runif(nrow(pre_rand_data))
pre_rand_data$ranassign <- runif(nrow(pre_rand_data))
pre_rand_data <- pre_rand_data[order(pre_rand_data$dis_reg,pre_rand_data$ranord),]

#make initial assignment to tx or ct with first unit in first block
post_blk_rand <- pre_rand_data
#post_blk_rand$mfa_tx <- " "
#post_blk_rand[1,"mfa_tx"] <- ifelse(post_blk_rand[1,"ranassign"] < .5,"k-2 mfa","3-5 mfa")

#create wrap vector that will repeat the assignment pattern
post_blk_rand$ra_lt5 <-"k-2 mfa","3-5 mfa"), length.out = nrow(post_blk_rand)))
colnames(post_blk_rand$ra_lt5) <- "ra_lt5"
post_blk_rand$ra_gt5 <-"3-5 mfa","k-2 mfa"), length.out = nrow(post_blk_rand)))
colnames(post_blk_rand$ra_gt5) <- "ra_gt5"

#set mfa_tx = appropriate column based on ranassign
post_blk_rand$mfa_tx_temp <-[1,"ranassign"] < .5, 1, 0), length.out = nrow(post_blk_rand)))
colnames(post_blk_rand$mfa_tx_temp) <- "mfa_tx_temp"
post_blk_rand$mfa_tx <-[1,"mfa_tx_temp"] == 1,post_blk_rand$ra_gt5, post_blk_rand$ra_lt5))
colnames(post_blk_rand$mfa_tx) <- "mfa_tx"