DeclareDesign Community

Best practices for very large populations

Are any examples available for using DeclareDesign with very large populations? For example, imagine I am working with a population of 10,000 firms over 180 days, where the firm-day is the unit of observation. I am interested in diagnosing different assignment strategies, such as clustering assignment on firm, firm-week, firm-month, etc.

Is a reproducible example available about how to run the within-design simulations in parallel using plan()? I need to make this kind of diagnosis faster.

I note this description in the documentation:
If the packages future and future.apply are installed, you can set plan to run multiple simulations in parallel.

A few comments:

  • Any example can be simulated in parallel by setting the plan first. I recommend plan(multicore) except on Windows. Once that is set, multiple cores will be used under the hood for simulation. This can eat up a lot of memory, though, so you may not want to use all cores.

  • Installing OpenBLAS or ATLAS can be a big improvement, depending on which estimator you are using and whether you have extra cores to use, and it generally won’t increase memory usage.

  • Depending on what you are doing, you might be able to reap substantial performance improvements with a more complicated execution plan that reuses earlier steps in a design. There are some allusions to passing sims as a vector, but that feature hasn’t been used very much in practice. I think Macartan has used this feature to diagnose biases between SATE and PATE. In your case, you can probably reuse the declared population and estimand steps across all simulations and only rerun sampling, assignment and estimation, for example. Further examples are in the tests/ folder of the package.

  • You can have multiple assignments and estimators in a single design, which is likely an order of magnitude more efficient than creating a set of designs and simulating them individually - I think you already know this though since you talked about within-design comparisons.

  • The exception to the above is sampling, which subsets the data frame in transit. There’s a new feature for declare_sampling (keep=0:1) which would let you generate multiple samples from one population draw and different sampling strategies - but if you do this, you will have to match up the estimator to the sample using the subset argument on the estimators.
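To make the points above concrete, here is a minimal sketch rather than a tested recipe. It assumes the DeclareDesign 1.0 or later syntax (declare_model, declare_inquiry), that future and future.apply are installed, and a toy panel of 50 firms over 28 days instead of 10,000 firms over 180 days. The variable names, labels, effect size and worker count are all made up for illustration. It puts two clustered assignments and their estimators in one design, simulates it under a parallel plan, and then passes sims as a vector so that the population and estimand are drawn once and only the assignment is redrawn.

```r
library(DeclareDesign)  # also attaches randomizr, fabricatr and estimatr
library(future)

# Toy firm-day panel: 50 firms x 28 days = 1400 rows (hypothetical sizes).
design <-
  declare_model(
    N = 1400,
    firm = rep(1:50, each = 28),
    day = rep(1:28, times = 50),
    week = ceiling(day / 7),
    firm_week = paste(firm, week, sep = "_"),
    firm_shock = rnorm(50)[firm],        # one shock per firm, repeated over days
    Y_Z_0 = firm_shock + rnorm(N),       # untreated potential outcome
    Y_Z_1 = Y_Z_0 + 0.2                  # made-up constant treatment effect
  ) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  # Two assignment strategies in the same design: cluster on firm or firm-week.
  declare_assignment(
    Z_firm = randomizr::cluster_ra(clusters = firm),
    Z_firm_week = randomizr::cluster_ra(clusters = firm_week)
  ) +
  # Reveal one observed outcome per assignment strategy.
  declare_measurement(
    Y_firm = ifelse(Z_firm == 1, Y_Z_1, Y_Z_0),
    Y_firm_week = ifelse(Z_firm_week == 1, Y_Z_1, Y_Z_0)
  ) +
  # The package's default estimation method is used; both accept clusters.
  declare_estimator(Y_firm ~ Z_firm, clusters = firm,
                    inquiry = "ATE", label = "by_firm") +
  declare_estimator(Y_firm_week ~ Z_firm_week, clusters = firm_week,
                    inquiry = "ATE", label = "by_firm_week")

# Set the plan first; multicore where forking is supported, multisession otherwise.
# Two workers is an arbitrary choice to keep memory use modest.
if (supportsMulticore()) {
  plan(multicore, workers = 2)
} else {
  plan(multisession, workers = 2)
}

# Full simulation: every step is rerun in each of the 20 simulations.
simulations <- simulate_design(design, sims = 20)
diagnosis <- diagnose_design(simulations)

# sims as a vector, one entry per step: model, inquiry, assignment,
# measurement, estimator, estimator. Only the assignment is redrawn 20 times.
fanned <- simulate_design(design, sims = c(1, 1, 20, 1, 1, 1))

plan(sequential)  # return to the default plan
```

Printing diagnosis shows both clustering strategies side by side. Because the vector version holds a single population draw fixed, its results are conditional on that draw, which may or may not be what you want for your diagnosis.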

Also see https://github.com/DeclareDesign/DeclareDesign/issues/417 for a caveat about using future inside RStudio.

Thanks very much for the pointers! Let me try some of these things out and circle back about anything that I learned along the way.