Estimation with non-linearly scaled interval responses in surveys

Surveys often present respondents with answer options formatted as intervals/bins when asking sensitive questions, to preserve privacy or to reduce cognitive load. Instead of asking for an exact age, the survey designer may ask for an age group; instead of asking for an exact income, the designer may ask for a range; instead of asking exactly how much time a task took, the options may be “5-15 minutes” and “1-2 days”, which makes the question easier for the survey taker to answer and the responses easier to analyze. In this post we present a simple model for inferring a latent, potentially skewed continuous distribution that can only be observed indirectly through discretized realizations.

Demetri Pananos (https://dpananos.github.io/), Western University (https://www.uwo.ca/)
Mikhail Popov (https://mpopov.com/), Wikimedia Foundation (https://wikimediafoundation.org/)
2021-01-26

Background

Suppose you are interested in the distribution of a continuous quantity measured through a survey, but you cannot record the quantity directly – only through bins. For example, to protect the respondent’s privacy you may not want to design a question that asks for their exact age or income, or to lessen the cognitive burden on the respondent you may not want to ask them exactly how much time they took to complete a task. To that end, you might present the respondent with a set of pre-defined intervals from which they pick the one that contains the true quantity – thus hiding it from the survey analyst.

We are then interested in performing inference on this latent quantity – specifically, on its distribution and on summaries of that distribution. We might be interested in the average age, the median income, or the median time to task completion – but the data are only observed through bins.

Furthermore, those bins may not be evenly spaced or sized. For example, income is almost always right-skewed – with more people earning less money and fewer people earning more. Task completion times may be right-skewed as well – with most tasks being short and quick to complete, and only a few tasks taking a very long time. So the methodology for analyzing this binned data must enable us to infer a distribution that may be skewed or symmetric.

In this post, we present a model¹ and demonstrate its application by inferring the median task completion time from an imaginary survey in which the non-linearly scaled options were: “0-5 minutes”, “5-60 minutes”, “1-6 hours”, “6-12 hours”, and “12-14 hours”.

Model Derivation

Data from the instruments described above are, in essence, realizations from a multinomial distribution. Let \(y_i\) be the count of respondents in bin \(i = 1, \dots, n\), and let \(N = \sum_{i} y_i\) be the total number of respondents. Then we can model the vector of counts \(\mathbf{y}\) as

\[ \mathbf{y} \sim \operatorname{Multinomial}(\boldsymbol{\theta}, N) \]

where \(\boldsymbol{\theta}_i\) is the probability of observing a response in bin \(i\). We assume that the measured phenomenon (in this case, time to task completion) is distributed according to density \(f\) parameterized by \(\boldsymbol{\phi}\), \(f_{\boldsymbol{\phi}}\). The elements of \(\boldsymbol{\theta}\) can be computed using the cumulative distribution function (CDF) of \(f_{\boldsymbol{\phi}}\), \(F_{\boldsymbol{\phi}}\). Let \(t_i\) be bin edge \(i\). Then the elements of \(\boldsymbol{\theta}\) are

\[\begin{align} \boldsymbol{\theta}_1 &= F_{\boldsymbol{\phi}}(t_2)\>, \\ \boldsymbol{\theta}_2 &= F_{\boldsymbol{\phi}}(t_3) - F_{\boldsymbol{\phi}}( t_2) \>, \\ &\vdots \\ \boldsymbol{\theta}_{n-1} &= F_{\boldsymbol{\phi}}(t_{n}) - F_{\boldsymbol{\phi}}( t_{n-1}) \>, \\ \boldsymbol{\theta}_{n} &= 1 - F_{\boldsymbol{\phi}}( t_{n}) \>. \end{align}\]

The first and last elements of \(\boldsymbol{\theta}\) are constructed to soak up any observations that fall below the smallest or above the largest bin edge, in case the bins chosen by the survey designer do not span the support of \(f_{\boldsymbol{\phi}}\). The relation between \(\boldsymbol{\theta}\) and the latent density \(f_{\boldsymbol{\phi}}\) allows for estimation of \(\boldsymbol{\phi}\) – and of our latent quantity of interest, the median time to task completion – using Stan.
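
To make this concrete: under the Lognormal(8, 2) latent distribution simulated later in this post, with times recorded in seconds, the probability of a response landing in the “5-60 minutes” bin is just a difference of two CDF evaluations in R:

plnorm(60 * 60, meanlog = 8, sdlog = 2) - plnorm(60 * 5, meanlog = 8, sdlog = 2)
# 0.412: about 41% of responses are expected to land in this bin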

Setup

library(tidyverse) # dplyr, tidyr, stringr, ggplot2; lubridate/scales/posterior also used
library(ggdist)
library(cmdstanr)

time_scale_labeller <- function(x) {
  # Format a duration in seconds as a compact label, dropping zero-valued parts:
  gsub("\\s0[dhms]", "", tolower(lubridate::seconds_to_period(x)))
}
# time_scale_labeller(c(3600, 3660, 3661)) # "1h" "1h 1m" "1h 1m 1s"

Example

This example is motivated by a real problem in which teams of data analysts and data scientists were surveyed on how much time it took them to complete requests. The survey designers were measuring time to delivery of requested data and/or actionable insights to stakeholders as part of an organizational goal to reduce the median time through improvements to tools and processes.

Suppose we are interested in the median task completion time based on an imaginary survey in which the non-linearly, unevenly scaled options were: “0-5 minutes”, “5-60 minutes”, “1-6 hours”, “6-12 hours”, and “12-14 hours.”

Data Simulation

set.seed(3600)
samples_from_latent_dist <- tibble(
    x1 = pmin(pmax(rlnorm(100, 8, 2), 60 * 2), 14 * 3600) # min: 2min, max: 14hr
) %>%
    mutate(
        x2 = cut(
            x1,
            breaks = c(0, 60 * 5, 3600, 3600 * 6, 3600 * 12, 3600 * 14),
            dig.lab = 5
        ),
        x3 = cut(
            x1, # in seconds
            breaks = c(0, 60 * 5, 3600, 3600 * 6, 3600 * 12, 3600 * 14),
            labels = c("(0,5]", "(5,60]", "(60,360]", "[360,720]", "(720,840]") # in minutes
        ),
        x4 = cut(
            x1, # in seconds
            breaks = c(0, 60 * 5, 3600, 3600 * 6, 3600 * 12, 3600 * 14),
            labels = c("0-5 minutes", "5-60 minutes", "1-6 hours", "6-12 hours", "12-14 hours")
        )
    )
ggplot(samples_from_latent_dist) +
    geom_histogram(aes(x = x1), bins = 30) +
    scale_x_continuous(
        "Time spent on completing task",
        labels = function(x) time_scale_labeller(round(x)),
        breaks = 60 * c(30, 60 * 2, 60 * 6, 60 * 12),
        minor_breaks = NULL
    ) +
    labs(
        y = NULL,
        title = "Distribution of simulated task completion times"
    )

Figure 1: Distribution of simulated task completion times, which will only be observed as bins.

ggplot(samples_from_latent_dist) +
    geom_bar(aes(x = x4, y = after_stat(count / sum(count)))) +
    scale_y_continuous("Proportion of responses", labels = scales::percent_format(1)) +
    labs(
        x = "Time to complete task",
        title = "Simulated responses to binned survey question"
    )

Figure 2: Simulated responses to binned survey question, by binning the hidden-to-analyst simulated task completion times.

lognormal_bins <- samples_from_latent_dist %>%
    count(x2, .drop = FALSE) %>% # keep empty bins so counts align with the edges
    mutate(
      # Strip the interval notation, e.g. "(0,300]" -> "0,300":
      binned_samples = str_replace_all(as.character(x2), "[\\[\\]()]", "")
    ) %>%
    separate(binned_samples, c("left", "right"), sep = ",")
model_data <- list(
    n = nrow(lognormal_bins),
    edges = sort(
      as.numeric(unique(c(lognormal_bins$left, lognormal_bins$right)))
    ),
    counts = lognormal_bins$n,
    prior_mu_mean = 4.5,
    prior_mu_sigma = 2
)
theta_true <- with(model_data, {
    e <- edges
    e[1] <- 0            # extend the lowest edge down to 0
    e[length(e)] <- Inf  # extend the highest edge up to infinity
    plnorm(e[-1], 8, 2) - plnorm(e[-length(e)], 8, 2)
})

Model Code

data {
  int<lower=1> n;
  vector[n+1] edges; // always one more edge than counts
  int counts[n];

  real prior_mu_mean;
  real prior_mu_sigma;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
transformed parameters {
  simplex[n] theta;
  real exp_mu; // median

  theta[1] = lognormal_cdf(edges[2], mu, sigma);
  for (i in 2:(n-1)) {
    theta[i] = lognormal_cdf(edges[i+1], mu, sigma) - lognormal_cdf(edges[i], mu, sigma);
  }

  theta[n] = 1 - sum(theta[1:(n-1)]); // soaks up mass above the highest finite edge
  exp_mu = exp(mu);
}
model {
  mu ~ normal(prior_mu_mean, prior_mu_sigma);
  sigma ~ cauchy(0, 5); // half-Cauchy, given the lower bound on sigma
  counts ~ multinomial(theta);
}

Results

# Compile the Stan model shown above (the file name here is illustrative):
multinomial_model <- cmdstan_model("multinomial-model.stan")
multinomial_samples <- multinomial_model$sample(
  data = model_data, chains = 12,
  refresh = 0, show_messages = FALSE
)
posterior_summary <- multinomial_samples$summary(
  variables = c("mu", "sigma", "theta", "exp_mu")
)
model_draws_tidy <- posterior::as_draws_df(
  multinomial_samples$draws(variables = c("mu", "sigma", "theta", "exp_mu"))
)
parameter   truth     median    95% CI
mu          8.00      7.97      (7.62, 8.31)
sigma       2.00      1.99      (1.73, 2.31)
theta[1]    0.13      0.13      (0.08, 0.19)
theta[2]    0.41      0.41      (0.37, 0.46)
theta[3]    0.30      0.30      (0.26, 0.34)
theta[4]    0.07      0.07      (0.05, 0.08)
theta[5]    0.09      0.09      (0.05, 0.13)
exp_mu      49m 41s   48m 4s    (33m 50s, 1h 7m 32s)
set.seed(42)
model_draws_tidy %>%
    sample_n(200) %>%
    ggplot(aes(y = 0, color = "draw")) +
    geom_vline(xintercept = exp(8), linetype = "dashed") +
    stat_dist_slab(
      aes(dist = "lnorm", arg1 = mu, arg2 = sigma),
      fill = NA, slab_type = "cdf", scale = 1
    ) +
    stat_dist_slab(
      aes(dist = "lnorm", arg1 = 8, arg2 = 2),
      fill = NA, slab_type = "cdf", color = "red", data = NULL, scale = 1
    ) +
    scale_color_manual(
      name = NULL, values = c("draw" = rgb(0, 0, 0, 0.025)), guide = "none"
    ) +
    scale_x_log10(
        "T: time spent on completing task",
        labels = function(x) time_scale_labeller(round(x)),
        breaks = 60 * c(1, 5, 30, 60, 60 * 8, 60 * 24, 60 * 24 * 7),
        minor_breaks = NULL
    ) +
    scale_y_continuous(
        "Probability of a task completed in T time or faster",
        labels = scales::percent_format(1),
        breaks = c(0, 0.25, 0.5, 0.75, 1.0), minor_breaks = NULL
    ) +
    ggtitle(
      "Cumulative distribution of task completion times",
      "Based on posterior draws of lognormal model parameters"
    )

Figure 3: Cumulative distribution functions of latent distribution of task completion times, based on posterior draws of the parameters. Red curve represents the true latent lognormal distribution.

Further Work

If you’re interested in extending this model or expanding on this work, we suggest looking into the following:

Regression

A natural extension of this methodology is regression: using predictors to model \(\boldsymbol{\phi}\), the parameterization of \(f\). In the example, if more is known about the types of tasks being worked on, that data can be used in a regression model to assess the effect of the type of task on the median time to completion, as sketched below. If the counts are separated by teams within departments, this model can be extended to a hierarchical regression model.
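
As a minimal sketch of what this could look like in Stan – not the fitted model from above – one can move from aggregated counts to one observed bin per respondent and place a linear predictor on \(\mu\). The respondent-level inputs (bin, X) and the prior scales here are hypothetical:

data {
  int<lower=1> n;                // number of bins
  vector[n+1] edges;             // always one more edge than bins
  int<lower=1> N;                // number of respondents
  int<lower=1, upper=n> bin[N];  // bin selected by each respondent
  int<lower=1> K;                // number of predictors
  matrix[N, K] X;                // respondent-level predictors, e.g. task type
}
parameters {
  real alpha;
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  vector[N] mu = alpha + X * beta;
  alpha ~ normal(4.5, 2);
  beta ~ normal(0, 1);
  sigma ~ cauchy(0, 5);
  for (j in 1:N) {
    vector[n] theta;
    theta[1] = lognormal_cdf(edges[2], mu[j], sigma);
    for (i in 2:(n-1)) {
      theta[i] = lognormal_cdf(edges[i+1], mu[j], sigma) - lognormal_cdf(edges[i], mu[j], sigma);
    }
    theta[n] = 1 - sum(theta[1:(n-1)]);
    bin[j] ~ categorical(theta);
  }
}

Because the median of the lognormal is \(e^{\mu}\), exponentiated coefficients act multiplicatively on each respondent’s median time to completion.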

Mixtures

Suppose that in a survey of incomes we only let respondents pick the interval containing their actual income, rather than asking for the exact amount – to preserve their privacy, since this is a sensitive topic. As in the previous example, the options are not evenly or linearly scaled: “0-8k”, “8k-15k”, “15k-30k”, “30k-60k”, “60k-100k”, “100k-200k”, “200k-500k”, “More than 500k.”

Now suppose that the sample comprises two groups: one that tends to earn less on average and one that tends to earn more. If no additional information was collected to inform group membership (otherwise we could use regression), we must infer it through a mixture model, as sketched below.
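
Since each component’s CDF induces its own vector of bin probabilities, the multinomial cell probabilities become a convex combination of the two. A minimal sketch of a two-component extension of the model above – the added pieces are the mixing weight lambda and the ordered vector mu, which breaks label switching:

data {
  int<lower=1> n;
  vector[n+1] edges;
  int counts[n];
  real prior_mu_mean;
  real prior_mu_sigma;
}
parameters {
  real<lower=0, upper=1> lambda; // weight of the first (lower-earning) component
  ordered[2] mu;                 // ordered to make the components identifiable
  vector<lower=0>[2] sigma;
}
transformed parameters {
  simplex[n] theta;
  {
    vector[n] p1;
    vector[n] p2;
    p1[1] = lognormal_cdf(edges[2], mu[1], sigma[1]);
    p2[1] = lognormal_cdf(edges[2], mu[2], sigma[2]);
    for (i in 2:(n-1)) {
      p1[i] = lognormal_cdf(edges[i+1], mu[1], sigma[1]) - lognormal_cdf(edges[i], mu[1], sigma[1]);
      p2[i] = lognormal_cdf(edges[i+1], mu[2], sigma[2]) - lognormal_cdf(edges[i], mu[2], sigma[2]);
    }
    p1[n] = 1 - sum(p1[1:(n-1)]);
    p2[n] = 1 - sum(p2[1:(n-1)]);
    theta = lambda * p1 + (1 - lambda) * p2; // each p sums to 1, so theta does too
  }
}
model {
  lambda ~ beta(2, 2);
  mu ~ normal(prior_mu_mean, prior_mu_sigma);
  sigma ~ cauchy(0, 5);
  counts ~ multinomial(theta);
}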

Simulation-Based Calibration

The models in these examples (and any future extensions of this general approach) should be evaluated with Simulation-Based Calibration (SBC) to check whether they are well calibrated with respect to the specified priors. One SBC iteration might look like the sketch below.
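
A minimal sketch of a single SBC iteration in R, assuming the compiled multinomial_model from above; a full SBC run repeats this many times with thinned (roughly independent) posterior draws and checks that the resulting ranks are uniformly distributed:

sbc_rank <- function(edges, N = 100) {
  # Draw parameters from the priors:
  mu_sim <- rnorm(1, 4.5, 2)
  sigma_sim <- abs(rcauchy(1, 0, 5))
  # Bin probabilities implied by the simulated parameters:
  e <- edges
  e[1] <- 0
  e[length(e)] <- Inf
  theta_sim <- plnorm(e[-1], mu_sim, sigma_sim) - plnorm(e[-length(e)], mu_sim, sigma_sim)
  # Simulate one survey's worth of responses and re-fit the model:
  counts_sim <- as.vector(rmultinom(1, N, theta_sim))
  fit <- multinomial_model$sample(
    data = list(
      n = length(counts_sim), edges = edges, counts = counts_sim,
      prior_mu_mean = 4.5, prior_mu_sigma = 2
    ),
    chains = 4, refresh = 0, show_messages = FALSE
  )
  # Rank of the simulated mu among the posterior draws:
  sum(posterior::as_draws_df(fit$draws("mu"))$mu < mu_sim)
}
# e.g. ranks <- replicate(200, sbc_rank(model_data$edges)); hist(ranks)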

Sensitivity Analysis

We were able to recover the truth in both examples – which is great and promising – but in the real world the truth isn’t known and the assumptions in the model may not be correct. How well would the method work when the model is misspecified, e.g. if the data-generating process uses a Gamma distribution but we still model with the log-normal? A quick sketch of such a probe follows.
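
One could, for example, simulate binned responses from a Gamma distribution (the shape and rate below are illustrative) and re-fit the lognormal model to see how far the estimated median drifts from the true one:

set.seed(3600)
x_gamma <- rgamma(100, shape = 0.5, rate = 1 / 7200) # right-skewed, mean = 1 hour
counts_gamma <- as.vector(table(cut(
  x_gamma,
  # Same survey bins; the top bin collects everything above 12 hours,
  # matching how theta[n] is constructed in the Stan model:
  breaks = c(0, 60 * 5, 3600, 3600 * 6, 3600 * 12, Inf)
)))
# Re-fit the lognormal model with counts_gamma and compare its exp_mu to the
# true Gamma median, qgamma(0.5, shape = 0.5, rate = 1 / 7200) (about 27 minutes).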

One potential approach is to perform graphical posterior predictive checks, comparing each response interval’s observed proportion with many proportions simulated from the posterior predictive distribution, as in the sketch below.
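
A minimal sketch of such a check, assuming the model_draws_tidy and model_data objects from above (the theta[.] columns hold posterior draws of the bin probabilities):

N_total <- sum(model_data$counts)
theta_draws <- as.matrix(model_draws_tidy[, paste0("theta[", 1:5, "]")])
set.seed(830)
# For each of 500 posterior draws, simulate a survey of the same size and
# record the simulated proportion of responses in each bin:
sim_props <- apply(theta_draws[sample(nrow(theta_draws), 500), ], 1, function(theta) {
  rmultinom(1, N_total, theta) / N_total
})
observed_props <- model_data$counts / N_total
# Each row of sim_props is a bin: plot observed_props for that bin against the
# spread of its 500 simulated proportions to spot systematic misfit.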


  1. Though the authors arrived at the model independently, we were made aware that this approach is in fact the standard approach to modelling such data! Thank you to Corey Yanofsky for pointing us towards Heitjan’s 1989 paper and for letting us know that a problem similar to the one we describe appears as a homework question in Bayesian Data Analysis, 3rd Edition (BDA3): see p. 80, question 5.↩︎

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/wikimedia-research/survey-interval-responses, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".