DeclareDesign Community

Simulating school attendance data

I have a tricky question I’m not sure how to handle on my own. Let’s say I have a dataset of students with their ID and school.

library(DeclareDesign)
library(lubridate)  # provides year()

draw_date <- function(n, lower, upper){
  sample(seq(as.Date(lower), as.Date(upper), by="day"), n)
}

all_students <- declare_population(N = 100,
                   birthday = draw_date(N, "1990/01/01", "1994/01/01"),
                   grade_level = as.numeric(as.factor(year(birthday))),
                   school = draw_categorical(N = N, 
                                             category_labels = LETTERS[1:4], 
                                             prob = c(.25, .25, .25, .25)))()

How could I use this to make a dataset that simulates “attendance” for a year? The attendance dataset would record attendance every day for each student (one row per student per day, so N times the number of dates) and would have the variables student_ID, school, date, and attendance.

Ideally I’d like to make the probability of attendance conditional on the school, and then try to make some of the data “missing”, again conditional on school.

Thank you for your help!

I think the gist of the solution is to do something like:

one_year <- seq(as.Date("2009/01/01"), as.Date("2010/01/01"), by="day")

declare_population(data = all_students,
                   dates = add_level(N = length(one_year), dates = one_year)) +
declare_potential_outcomes(attendance = case_when(school == "A" ~ rbinom(N, 1, .7),
                                                     school == "B" ~ rbinom(N, 1, .8),
                                                     school == "C" ~ rbinom(N, 1, .9),
                                                     school == "D" ~ rbinom(N, 1, .9)))

Is this the right idea?

Hi johnhenry!

I think this is great. The only modification I’d offer is to make the probability of attendance a function of both individual and school level details. From there, you can elaborate the model to let attendance also be a function of time, for example, if you wanted. I’ve also included one way to make missingness. HTH!

library(tidyverse)
library(lubridate)
library(DeclareDesign)



draw_date <- function(n, lower, upper) {
  sample(seq(as.Date(lower), as.Date(upper), by = "day"), n)
}

all_students <-
  fabricate(
    N = 100,
    birthday = draw_date(N, "1990/01/01", "1994/01/01"),
    grade_level = as.numeric(as.factor(year(birthday))),
    school = draw_categorical(
      N = N,
      category_labels = LETTERS[1:4],
      prob = c(.25, .25, .25, .25)
    ),
    latent_attendance = rnorm(N, mean = 1.5)
  )


one_year <-
  seq(as.Date("2009/01/01"), as.Date("2010/01/01"), by = "day")

design <-
  declare_population(data = all_students,
                     dates = add_level(N = length(one_year), dates = one_year)) +
  declare_potential_outcomes(
    attendance_prob = pnorm(
      latent_attendance +
        0.1 * (school == "A") +
        0.2 * (school == "B") +
        0.3 * (school == "C") +
        0.4 * (school == "D")
    ),
    attendance = rbinom(N, 1, attendance_prob),
    missing_data = rbinom(N, 1, 0.02),
    attendance_observed = if_else(missing_data == 1, NA_integer_, attendance)
  )



dat <- draw_data(design)
head(dat)

dat %>% group_by(school) %>% summarise(mean(attendance))
dat %>% group_by(school) %>% summarise(mean(missing_data))
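
The missingness above uses a single rate (0.02) for every school. Here is a minimal sketch of making it depend on school, written in base R on a toy data frame. The per-school rates and the names `p_missing` and `dat_sketch` are made up for illustration.

```r
set.seed(42)

# Toy stand-in for the student-by-day data (hypothetical values)
n_rows <- 10000
dat_sketch <- data.frame(
  school = sample(LETTERS[1:4], size = n_rows, replace = TRUE),
  attendance = rbinom(n_rows, 1, 0.9)
)

# Assumed missingness rate for each school (illustrative numbers only)
p_missing <- c(A = 0.01, B = 0.02, C = 0.05, D = 0.10)

# Look up each row's rate by school name, then draw one Bernoulli per row
row_rate <- unname(p_missing[as.character(dat_sketch$school)])
dat_sketch$missing_data <- rbinom(n_rows, 1, row_rate)

# Blank out attendance wherever the missingness indicator is 1
dat_sketch$attendance_observed <- ifelse(
  dat_sketch$missing_data == 1, NA_integer_, dat_sketch$attendance
)

# Share of missing rows by school
aggregate(missing_data ~ school, data = dat_sketch, FUN = mean)
```

Inside the design, the same lookup could probably replace the constant, along the lines of `missing_data = rbinom(N, 1, p_missing[as.character(school)])`.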


love this, thank you!

@Alex_Coppock That looks correct to me; the only thing I’d add would be to filter out Sat/Sun from `one_year`.
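
For reference, a minimal base R sketch of that filter. It relies on the ISO weekday numbers from `format(x, "%u")` (6 = Saturday, 7 = Sunday), which avoids locale-dependent day names. `is_weekend` and `school_days` are illustrative names.

```r
one_year <- seq(as.Date("2009/01/01"), as.Date("2010/01/01"), by = "day")

# "%u" gives the ISO weekday as a string: "1" (Monday) through "7" (Sunday)
is_weekend <- format(one_year, "%u") %in% c("6", "7")

# Keep Monday through Friday only
school_days <- one_year[!is_weekend]
length(school_days)
```

`school_days` could then stand in for `one_year` in `add_level()`. Public holidays would still have to be removed separately.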

@johnhenry The above approach should work fine, but it’s usually easier in fabricatr to start from the top and go down - in your case, schools, then grades within schools, then students within grades.

This gives some extra flexibility - for example, each school could have a different calendar, which you could generate and then join back on, which might allow you to model the fact that UCLA is on quarters and Princeton is on semesters. When you looked at attendance rates by school the denominators of the proportions could differ in a realistic fashion.


@nfultz Can you help walk me through this a little? Is the idea to add a conditional to `attendence_day` based on `calendar_type`?

library(DeclareDesign)
library(bizdays)
create.calendar(name="mycal", weekdays=c('saturday', 'sunday'))
one_year <- bizseq(from = "2009/01/01", to = "2010/01/01", "mycal") # exclude weekends

fabricate(school = add_level(4, school = c("Princeton", "UCLA", "C", "D"), calendar_type = rep(c("Semester", "Quarter"), 2)),
          grades = add_level(4, grades = 1:4),
          student_id = add_level(N = sample(1:10, 16, replace = TRUE)),
          attendence_day = add_level(N = length(one_year), attendence_day = one_year))

Exactly, I was thinking something like this (TERM representing either a quarter or semester):


                           +-------------+
                           |  SCHOOL     |
                           |             |
                           +-------------+
                                 |  |
                                 |  |
                                 |  |
                                 |  |
        +------------------+     |  |    +---------------+
        |   GRADE          +<----+  +--->+   CALENDAR    |
        |                  |             |               |
        +------------------+             +---------------+
                |                               |
                |                               |
                |                               v
                |                        +---------------+
                |                        |   TERM        |
                |                        |               |
                |                        +---------------+
                |                               |
                |                               |
                v                               v
        +------------------+             +---------------+
        |   STUDENT        |             |   DAYS        |
        |                  |             |               |
        +------------------+             +---------------+



If there are known differences in calendars, your code will be much simpler by generating the days using the calendar, rather than trying to generate a superset and marking days missing after the fact.

Right now, fabricatr only supports outer joins and (correlated) random joins. Since both Students and Days are descendants of School, you would want to join on the school ID as well, which we don’t support yet.

You can get the same result by nesting day inside of student (like you had), but then you have to write some conditional logic:

library(DeclareDesign)
library(bizdays)
create.calendar(name="mycal", weekdays=c('saturday', 'sunday'))  # weekends are non-working days

cal_s <- bizseq(from = "2019/08/17", to = "2019/12/06", "mycal")  # semester days
cal_q <- bizseq(from = "2019/09/23", to = "2019/12/13", "mycal")  # quarter days

fabricate(school = add_level(4, school = c("Princeton", "UCLA", "C", "D"), calendar_type = rep(c("Semester", "Quarter"), 2)),
          grades = add_level(4, grades = 1:4),
          student_id = add_level(N = sample(1:10, 16, replace = TRUE)),
          attendence_day = add_level(N = ifelse(calendar_type == "Semester", length(cal_s), length(cal_q)),
                                     # position of each day within its student, so the calendar restarts for every student
                                     day_index = ave(seq_along(student_id), student_id, FUN = seq_along),
                                     # ifelse() drops the Date class, so select on numeric values and convert back
                                     attendence_day = as.Date(ifelse(calendar_type == "Semester",
                                                                     as.numeric(cal_s)[day_index],
                                                                     as.numeric(cal_q)[day_index]),
                                                              origin = "1970-01-01")
                                    )
          )
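
A caveat when choosing between two date vectors with `ifelse()` in the last level: base R's `ifelse()` returns plain numbers instead of `Date` values, and it recycles the shorter vectors by overall row position, not by position within each student. The small base R check below shows both effects and one way around them. The toy calendars and the names `student`, `naive`, `day_index` and `fixed` are illustrative only.

```r
# Two tiny stand-in calendars (hypothetical dates)
cal_s <- seq(as.Date("2019-08-19"), by = "day", length.out = 5)
cal_q <- seq(as.Date("2019-09-23"), by = "day", length.out = 3)

# One quarter student followed by one semester student, one row per day
calendar_type <- rep(c("Quarter", "Semester"), times = c(length(cal_q), length(cal_s)))
student <- rep(c("s1", "s2"), times = c(length(cal_q), length(cal_s)))

# Naive version: the result is numeric, and the semester dates come out rotated
naive <- ifelse(calendar_type == "Semester", cal_s, cal_q)
class(naive)

# Index each day within its student, select on numeric values, convert back
day_index <- ave(seq_along(student), student, FUN = seq_along)
fixed <- as.Date(
  ifelse(calendar_type == "Semester",
         as.numeric(cal_s)[day_index],
         as.numeric(cal_q)[day_index]),
  origin = "1970-01-01"
)
fixed
```

With the within-student index, each student gets their own calendar in order, regardless of where their rows fall in the full dataset.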

If this is more than a one-shot study, we might consider adding inner joins to fabricatr, but that would require some thought around how to specify join conditions.
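
In the meantime, the join on school ID can be done outside fabricatr. Below is a rough base R sketch of the diagram's two branches (students on one side, calendar days on the other) joined on `school` with `merge()`. The term dates are reused from the snippet above, and the helper `weekdays_between()` and the tiny student table are assumptions made for illustration.

```r
# Weekdays between two dates (ISO weekday 6 = Saturday, 7 = Sunday)
weekdays_between <- function(from, to) {
  days <- seq(as.Date(from), as.Date(to), by = "day")
  days[!format(days, "%u") %in% c("6", "7")]
}

# Calendar branch: one row per calendar type per school day
calendars <- rbind(
  data.frame(calendar_type = "Semester",
             day = weekdays_between("2019-08-17", "2019-12-06")),
  data.frame(calendar_type = "Quarter",
             day = weekdays_between("2019-09-23", "2019-12-13"))
)

# School level, with the calendar type each school follows
schools <- data.frame(
  school = c("Princeton", "UCLA", "C", "D"),
  calendar_type = rep(c("Semester", "Quarter"), 2)
)

# Student branch: two hypothetical students per school
students <- data.frame(
  student_id = 1:8,
  school = rep(schools$school, each = 2)
)

# School -> days via calendar type, then students -> days via school
school_days <- merge(schools, calendars, by = "calendar_type")
attendance_frame <- merge(students, school_days, by = "school")

# Rows per student differ by calendar, so the denominators differ by school
table(attendance_frame$school) / 2
```

Attendance and missingness could then be drawn row by row on `attendance_frame`, with each school's denominator following its own calendar.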
