FWIW, below is the code and a description of one approach to this:

A description of a randomization function that handles heterogeneous probabilities and awkward integer block sizes. The function is a lot slower than `randomizr` for routine tasks but can handle less routine tasks that `randomizr` cannot.

Two illustrations of functionality made possible by this function:

1: Jack and Jill have a race. Jill is faster than Jack and has a higher probability of winning. You want to simulate a distribution of wins. This is a situation in which probabilities are heterogeneous and there is a target number of units to be selected. This problem is neither simple nor complete, as understood by `randomizr`.
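The core trick can be previewed in a minimal sketch of Illustration 1. This is my own condensed version of the systematic-sampling step used in the function below, not the function itself; `u`, `e`, and `win` are illustrative names.

```r
# Illustration 1: Jill wins with probability 0.7, Jack with 0.3.
# Systematic sampling selects exactly ceiling(sum(p)) = 1 unit while
# respecting the individual probabilities.
p <- c(Jill = 0.7, Jack = 0.3)
u <- runif(1)                   # a single uniform draw drives the assignment
e <- (cumsum(p) + u) %% 1       # fractional positions around a unit circle
win <- 1 * (e < c(e[2], e[1]))  # a unit wins where the running sum wraps past an integer
sum(win)                        # always exactly 1 winner
```

Averaged over many draws of `u`, Jill wins 70% of the time, but every single draw produces exactly one winner, which independent Bernoulli draws would not guarantee.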

2: You have 2 districts with 3 villages each. You want to assign 3 villages to treatment, blocking by district, with equal probabilities for all units. This randomization requires an allocation both across and within blocks, whereas `randomizr` only allocates within blocks. More generally, the issue here is that the target number to be assigned in a given block is not an integer.

Both issues can arise in a single application, and indeed you would expect them to whenever there are generic probabilities and blocks. These are not convoluted examples, and it would be nice to have functionality that can handle them.

# The function

The basic function works by doing systematic sampling over a random (but block preserving) order.

```
.prob_ra <- function(p = .5,
                     b = NULL,
                     n = NULL,
                     tol = 10) {
  # Housekeeping
  if (is.null(n)) {
    if (!is.null(b)) n <- length(b)
    if (is.null(b) & length(p) > 1) n <- length(p)
  }
  if (length(p) == 1) p <- rep(p, n)
  if (is.null(b)) b <- rep(1, n)
  p <- round(p, tol)
  m <- ceiling(sum(p))
  if (m == 0) return(rep(0, length(p)))

  # Figure out if we have to deal with a random total: when sum(p) is not an
  # integer, add a dummy unit that absorbs the fractional remainder
  tag <- m > floor(sum(p))
  if (tag) {
    p <- c(p, ceiling(sum(p)) - sum(p))
    n <- n + 1
    b <- c(b, ".dummy")
  }

  # Integer parts of p are assigned deterministically; randomize the rest
  base <- p - p %% 1
  p <- p - base

  # Randomly order blocks, then reorder units within blocks
  b_names <- unique(b)
  k <- length(b_names)
  seq1 <- rep(NA, length(b))
  b_shuffle <- sample(1:k)
  for (j in 1:k) seq1[b == b_names[j]] <- b_shuffle[j]
  seq2 <- rank(seq1 + runif(n))
  p[seq2] <- p

  # Now do systematic assignment
  s <- (cumsum(p) + m * runif(1)) %% m
  e <- s - floor(s)
  out <- 1 * (e < c(e[n], e[-n]))
  out <- out[seq2]
  out <- out + base
  if (tag) out <- out[-n]
  return(out)
}
```
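As a sanity check on Illustration 2, the same steps can be replayed inline for the 2-districts-by-3-villages case. This sketch repeats the core of `.prob_ra` by hand so that it stands alone; since `sum(p)` is an integer here the dummy-unit step is skipped, and names like `ord` and `sel` are my own.

```r
# Illustration 2: 2 districts x 3 villages, p = 1/2 everywhere.
# The overall target is sum(p) = 3 units, but the per-block target is 1.5.
p <- rep(0.5, 6)
b <- rep(c("district1", "district2"), each = 3)
n <- length(p)
m <- sum(p)  # 3 units to assign in total

# Random order that keeps blocks contiguous: shuffle blocks, then units within
ord <- order(match(b, sample(unique(b))), runif(n))
q <- p[ord]

# Systematic assignment around a circle of circumference m
s <- (cumsum(q) + m * runif(1)) %% m
e <- s - floor(s)
sel <- as.integer(e < c(e[n], e[-n]))  # selected where the running sum wraps

out <- integer(n)
out[ord] <- sel      # map back to the original unit order
sum(out)             # always exactly 3
tapply(out, b, sum)  # each district gets either 1 or 2 villages
```

Every draw assigns exactly 3 villages in total, with each district's count landing on one side or the other of its non-integer target of 1.5.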

The more general function applies this sequentially for each treatment arm:

```
prob_ra <- function(p = .5,
                    b = NULL,
                    n = NULL) {
  if (is.null(ncol(p))) {
    Z <- .prob_ra(p, b, n)
  } else {
    # One column of p per arm: assign arm 1, then assign each later arm among
    # still-unassigned units using renormalized conditional probabilities
    Z <- matrix(NA, nrow(p), ncol(p))
    Z[, 1] <- .prob_ra(p[, 1], b, n)
    for (j in 2:ncol(p)) {
      q <- p[, j]
      q[apply(Z, 1, sum, na.rm = TRUE) == 1] <- 0
      q <- q / (1 - apply(as.matrix(p[, 1:(j - 1)]), 1, sum, na.rm = TRUE))
      q[is.nan(q)] <- 0
      Z[, j] <- .prob_ra(as.vector(q), b, n)
    }
    # Collapse the indicator matrix into a single vector of arm numbers
    Z <- Z %*% matrix(1:ncol(p), ncol(p))
  }
  Z
}
```
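To make the renormalization step concrete, here is a small worked example of the conditional probabilities `prob_ra` constructs for a second arm; the numbers are invented for illustration.

```r
# Two units, two arms.
# Unit 1: P(arm1) = 0.2, P(arm2) = 0.3; unit 2: P(arm1) = 0.5, P(arm2) = 0.25.
p1 <- c(0.2, 0.5)
p2 <- c(0.3, 0.25)

# Suppose the first pass assigned arm 1 to unit 2 only:
Z1 <- c(0, 1)

# Arm 2 is then assigned with the conditional probability P(arm2 | not arm1)
q <- p2
q[Z1 == 1] <- 0    # a unit already in arm 1 cannot also get arm 2
q <- q / (1 - p1)  # renormalize among units that escaped arm 1
q[is.nan(q)] <- 0
q                  # 0.375 for unit 1 (= 0.3 / 0.8), 0 for unit 2
```

Averaging over arm-1 draws recovers the unconditional probability: unit 1 escapes arm 1 with probability 0.8 and then gets arm 2 with probability 0.375, so 0.8 * 0.375 = 0.3 as required.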

# Illustration: One arm

With random data:

```
library(knitr)  # for the kable() tables below

s <- 100
p <- runif(s)
b <- sample(1:5, s, replace = TRUE, prob = 1:5)
sims <- 10000
runs <- replicate(sims, prob_ra(p, b))
```

## Total selected is as tight as possible

The totals assigned should differ by at most one unit across runs:

```
table(apply(runs, 2, sum))
```

## Total selected *in each block* is also as tight as possible

There should be at most 1 unit between the min and max within each block:

```
bin_dist <- apply(runs, 2, function(j) table(b, j)[,2])
table_check <- t(rbind(apply(bin_dist, 1, function(j) c(mean(j), min(j), max(j)))))
colnames(table_check) <- c("sim_p", "min", "max")
```

```
kable(round(cbind(size = table(b), true_p = aggregate(p, by = list(b), FUN = sum)[,2], table_check), 2))
```

## True assignment probabilities are respected at the lowest level

```
plot(p, apply(runs, 1, mean), xlim = c(0,1), ylim = c(0,1))
abline(0,1)
```

# Illustration: Multiple arms

The function can also be used sequentially for multiple treatments. In this case it implements the assignment in a *hierarchical* manner, which preserves individual probabilities but prioritizes balancing by arm order.

## Multiple Treatments Illustration

```
s <- 100
b <- sample(1:5, s, replace = TRUE, prob = 1:5)
p1 <- runif(s)
p2 <- runif(s)*(1-p1)
p <- cbind(p1,p2)
```

Note that t2 will be systematic, like t1, *given* t1, but not unconditionally systematic.

We do a two-step allocation: first allocate t1 optimally, then allocate t2 given that allocation. We do this many times to check that the assignment probabilities are all correct for p2.

```
runs2 <- replicate(sims, 1*(as.vector(prob_ra(p))==2))
```

The result is much tighter than under independent assignment, but not as tight as possible:

```
par(mfrow=c(1,2))
indep <- replicate(sims, sum(rbinom(length(p2), 1, p2)))
hist(indep, main = "Total t2 allocation | indep")
hist(apply(runs2, 2, sum), main = "Total t2 allocation | scheme", xlim = range(indep))
```

(Aside: it would be useful to compare with the distribution given random independent block targets.)

## Similarly, the total selected in each block is tight but not as tight as possible

Ideally there would be at most 1 unit between min and max:

```
bin_dist <- apply(runs2, 2, function(j) table(b, j)[,2])
table_check <- t(rbind(apply(bin_dist, 1, function(j) c(mean(j), min(j), max(j)))))
colnames(table_check) <- c("sim_p", "min", "max")
kable(round(cbind(size = table(b), true_p = aggregate(p2, by = list(b), FUN = sum)[,2], table_check), 2))
```

## But again, true probabilities are preserved at the unit level (and so also at the block level)

```
plot(p2, apply(runs2, 1, mean), xlim = c(0,1), ylim = c(0,1))
abline(0,1)
```