Surveys often present respondents with answer options formatted as intervals or bins – to preserve privacy when asking sensitive questions, or to reduce cognitive load. Instead of asking for an exact age, a survey designer may ask for an age group; instead of asking for an exact income, the designer may ask for a range; instead of asking how much time a task took, the options may be intervals such as “5-15 minutes” and “1-2 days”, making the question easier to answer and the responses easier to analyze. In this post we present a simple model which may be used to infer a latent, potentially skewed continuous distribution that can only be observed indirectly through discretized realizations.
Suppose you are interested in the distribution of a continuous quantity through a survey, but you cannot record the quantity directly – only through bins. For example, to protect the respondent’s privacy you may not want to ask for their exact age or income, or to reduce the burden on the respondent you may not want to ask exactly how long a task took them to complete. To that end, you might present the respondent with a set of pre-defined intervals from which they can pick the one that contains the true quantity – thus hiding it from the survey analyst.
We are then interested in performing inference on this latent quantity – specifically on its distribution and on summaries of that distribution. We might want the average age, the median income, or the median time to task completion – but the data are only observed through bins.
Furthermore, those bins may not be evenly spaced or sized. For example, income is almost always right-skewed – with more people earning less money and fewer people earning more. Task completion times may be right-skewed as well – with most tasks being short and quick to complete, and only a few tasks taking a very long time. So the methodology for analyzing this binned data must let us infer a distribution which may be skewed or symmetric.
In this post, we present a model¹ and demonstrate its application by inferring the median task completion time from an imaginary survey in which the non-linearly scaled options were: “0-5 minutes”, “5-60 minutes”, “1-6 hours”, “6-12 hours”, and “12-14 hours.”
Data from the instruments described above are, in essence, realizations from a multinomial distribution. Let \(\mathbf{y}_i\) be the count of respondents in bin \(i = 1 \dots n\), and let \(N = \sum \mathbf{y}_i\) be the total number of respondents. Then, we can model the counts in each bin as
\[ \mathbf{y} \sim \operatorname{Multinomial}(\boldsymbol{\theta}, N) \]
where \(\boldsymbol{\theta}_i\) is the probability of observing a response in bin \(i\). We assume that the measured phenomenon (in this case, time to task completion) is distributed according to density \(f\) parameterized by \(\boldsymbol{\phi}\), \(f_{\boldsymbol{\phi}}\). The elements of \(\boldsymbol{\theta}\) can be computed using the cumulative distribution function (CDF) of \(f_{\boldsymbol{\phi}}\), \(F_{\boldsymbol{\phi}}\). Let \(t_i\) be bin edge \(i\). Then the elements of \(\boldsymbol{\theta}\) are
\[\begin{align} \boldsymbol{\theta}_1 &= F_{\boldsymbol{\phi}}(t_2)\>, \\ \boldsymbol{\theta}_2 &= F_{\boldsymbol{\phi}}(t_3) - F_{\boldsymbol{\phi}}( t_2) \>, \\ &\vdots \\ \boldsymbol{\theta}_{n-1} &= F_{\boldsymbol{\phi}}(t_{n}) - F_{\boldsymbol{\phi}}( t_{n-1}) \>, \\ \boldsymbol{\theta}_{n} &= 1 - F_{\boldsymbol{\phi}}( t_{n}) \>. \end{align}\]
The first and last elements of \(\boldsymbol{\theta}\) are constructed to absorb any observations that fall below the smallest or above the largest bin, in case the survey designer has chosen limits which do not span the full support of \(f_{\boldsymbol{\phi}}\). The relation between \(\boldsymbol{\theta}\) and the latent density \(f_{\boldsymbol{\phi}}\) allows us to estimate \(\boldsymbol{\phi}\) – and, through it, our latent quantity of interest, the median time to task completion – using Stan.
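To make the construction concrete, the bin probabilities \(\boldsymbol{\theta}\) implied by a lognormal \(f_{\boldsymbol{\phi}}\) for the five example bins can be computed directly in R – a quick illustration using the same parameters as the simulation further below (\(\mu = 8\), \(\sigma = 2\)):

edges <- c(0, 5 * 60, 3600, 6 * 3600, 12 * 3600, 14 * 3600) # bin edges, in seconds
mu <- 8; sigma <- 2 # lognormal parameters used for the simulated data below
theta <- c(
  plnorm(edges[2], meanlog = mu, sdlog = sigma),         # everything up to the 2nd edge
  diff(plnorm(edges[2:5], meanlog = mu, sdlog = sigma)),  # interior bins
  1 - plnorm(edges[5], meanlog = mu, sdlog = sigma)       # everything above the 2nd-to-last edge
)
sum(theta) # sums to 1

Throughout, we use a small helper to format seconds as human-readable durations in plot axis labels: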
time_scale_labeller <- function(x) {
gsub("\\s0[dhms]", "", tolower(lubridate::seconds_to_period(x)))
}
# time_scale_labeller(c(3600, 3660, 3661)) # "1h" "1h 1m" "1h 1m 1s"
This example is motivated by a real problem in which teams of data analysts and data scientists were surveyed on how much time it took them to complete requests. The survey designers were measuring time to delivery of requested data and/or actionable insights to stakeholders as part of an organizational goal to reduce the median time through improvements to tools and processes.
Suppose we are interested in the median task completion time based on an imaginary survey in which the non-linearly, unevenly scaled options were: “0-5 minutes”, “5-60 minutes”, “1-6 hours”, “6-12 hours”, and “12-14 hours.”
set.seed(3600)
samples_from_latent_dist <- tibble(
x1 = pmin(pmax(rlnorm(100, 8, 2), 60 * 2), 14 * 3600) # min: 2min, max: 14hr
) %>%
mutate(
x2 = cut(
x1,
breaks = c(0, 60 * 5, 3600, 3600 * 6, 3600 * 12, 3600 * 14),
dig.lab = 5
),
x3 = cut(
x1, # in seconds
breaks = c(0, 60 * 5, 3600, 3600 * 6, 3600 * 12, 3600 * 14),
labels = c("(0,5]", "(5,60]", "(60,360]", "(360,720]", "(720,840]") # in minutes
),
x4 = cut(
x1, # in seconds
breaks = c(0, 60 * 5, 3600, 3600 * 6, 3600 * 12, 3600 * 14),
labels = c("0-5 minutes", "5-60 minutes", "1-6 hours", "6-12 hours", "12-14 hours")
)
)
ggplot(samples_from_latent_dist) +
geom_histogram(aes(x = x1), bins = 30) +
scale_x_continuous(
"Time spent on completing task",
labels = function(x) time_scale_labeller(round(x)),
breaks = 60 * c(30, 60 * 2, 60 * 6, 60 * 12),
minor_breaks = NULL
) +
labs(
y = NULL,
title = "Distribution of simulated task completion times"
)
ggplot(samples_from_latent_dist) +
geom_bar(aes(x = x4, y = ..count../sum(..count..))) +
scale_y_continuous("Proportion of responses", labels = scales::percent_format(1)) +
labs(
x = "Time to complete task",
title = "Simulated responses to binned survey question"
)
lognormal_bins <- samples_from_latent_dist %>%
count(x2) %>%
mutate(
binned_samples = str_replace_all(x2, "\\[|\\]", "") %>%
str_replace_all("\\(|\\)", '')
) %>%
separate(binned_samples, c('left', 'right'), sep = ",")
data {
int<lower=1> n;
vector[n+1] edges; // always one more edge than counts
int counts[n];
real prior_mu_mean;
real prior_mu_sigma;
}
parameters {
real mu;
real<lower=0> sigma;
}
transformed parameters {
simplex[n] theta;
real exp_mu; // median
theta[1] = lognormal_cdf(edges[2], mu, sigma);
for (i in 2:(n-1)) {
theta[i] = lognormal_cdf(edges[i+1], mu, sigma) - lognormal_cdf(edges[i], mu, sigma);
}
theta[n] = 1 - sum(theta[1:(n-1)]);
exp_mu = exp(mu);
}
model {
mu ~ normal(prior_mu_mean, prior_mu_sigma);
sigma ~ cauchy(0, 5);
counts ~ multinomial(theta);
}
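Compiling this program with cmdstanr and assembling the data list it expects could look roughly like the following sketch; the file name multinomial_model.stan and the prior values are assumptions chosen for illustration:

library(cmdstanr)

# compile the Stan program above (assumed to be saved as multinomial_model.stan)
multinomial_model <- cmdstan_model("multinomial_model.stan")

# data list matching the Stan data block: n bins, n+1 edges (in seconds),
# n counts, and a weakly informative prior on mu (values assumed)
model_data <- list(
  n = nrow(lognormal_bins),
  edges = c(as.numeric(lognormal_bins$left), as.numeric(tail(lognormal_bins$right, 1))),
  counts = lognormal_bins$n,
  prior_mu_mean = log(3600), # prior centered on a one-hour median
  prior_mu_sigma = 2
)

In practice the prior on \(\mu\) would be chosen to reflect domain knowledge about plausible completion times.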
multinomial_samples <- multinomial_model$sample(
model_data, chains = 12,
refresh = 0, show_messages = FALSE
)
posterior_summary <- multinomial_samples$summary(
variables = c("mu", "sigma", "theta", "exp_mu")
)
model_draws_tidy <- posterior::as_draws_df(
multinomial_samples$draws(variables = c("mu", "sigma", "theta", "exp_mu"))
)
| parameter | truth | median | 95% CI |
|---|---|---|---|
| mu | 8.00 | 7.97 | (7.62, 8.31) |
| sigma | 2.00 | 1.99 | (1.73, 2.31) |
| theta[1] | 0.13 | 0.13 | (0.08, 0.19) |
| theta[2] | 0.41 | 0.41 | (0.37, 0.46) |
| theta[3] | 0.30 | 0.30 | (0.26, 0.34) |
| theta[4] | 0.07 | 0.07 | (0.05, 0.08) |
| theta[5] | 0.09 | 0.09 | (0.05, 0.13) |
| exp_mu | 49m 41s | 48m 4s | (33m 50s, 1h 7m 32s) |
set.seed(42)
model_draws_tidy %>%
sample_n(200) %>%
ggplot(aes(y = 0, color = "draw")) +
geom_vline(xintercept = exp(8), linetype = "dashed") +
stat_dist_slab(
aes(dist = "lnorm", arg1 = mu, arg2 = sigma),
fill = NA, slab_type = "cdf", scale = 1
) +
stat_dist_slab(
aes(dist = "lnorm", arg1 = 8, arg2 = 2),
fill = NA, slab_type = "cdf", color = "red", data = NULL, scale = 1
) +
scale_color_manual(
name = NULL, values = c("draw" = rgb(0, 0, 0, 0.025)), guide = FALSE
) +
scale_x_log10(
"T: time spent on completing task",
labels = function(x) time_scale_labeller(round(x)),
breaks = 60 * c(1, 5, 30, 60, 60 * 8, 60 * 24, 60 * 24 * 7),
minor_breaks = NULL
) +
scale_y_continuous(
"Probability of a task completed in T time or faster",
labels = scales::percent_format(1),
breaks = c(0, 0.25, 0.5, 0.75, 1.0), minor_breaks = NULL
) +
ggtitle(
"Cumulative distribution of task completion times",
"Based on posterior draws of lognormal model parameters"
)
If you’re interested in extending this model or expanding on this work, we suggest looking into the following:
Regression
A natural extension of this methodology is regression: using predictors to model \(\boldsymbol{\phi}\), the parameterization of \(f\). In the example, if more is known about the types of tasks being worked on, that information can be used in a regression model to assess the effect of task type on the median time to completion. If the counts are separated by teams within departments, the model can be extended to a hierarchical regression.
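For example, with a lognormal \(f\) the log-scale location could depend on task-level predictors \(\mathbf{x}_j\) – one possible parameterization, sketched here only to illustrate:

\[ \mu_j = \alpha + \mathbf{x}_j^\top \boldsymbol{\beta} \>, \qquad \Pr(\text{respondent } j \text{ picks bin } i) = F_{\mu_j, \sigma}(t_{i+1}) - F_{\mu_j, \sigma}(t_i) \>, \]

with the first and last bins handled as before, and the median completion time for a task with predictors \(\mathbf{x}_j\) given by \(\exp(\mu_j)\).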
Mixtures
Suppose that in a survey of incomes, to preserve privacy around a sensitive topic, we only let respondents pick the interval containing their actual income rather than asking for the exact figure. As in the previous example, the options are not evenly or linearly scaled: “0-8k”, “8k-15k”, “15k-30k”, “30k-60k”, “60k-100k”, “100k-200k”, “200k-500k”, “More than 500k.”
Now suppose that when such a survey is sent out, the sample is made up of two groups: one which tends to earn less on average and one which tends to earn more. If no additional information was collected that could inform group membership (otherwise we could use regression), we must infer this additional structure through a mixture model.
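One way to write this – a sketch, with \(\lambda\) denoting the unknown proportion of the sample in the first group – is to let the latent density be a two-component mixture, whose CDF is the corresponding mixture of CDFs:

\[ f(x) = \lambda \, f_{\boldsymbol{\phi}_1}(x) + (1 - \lambda) \, f_{\boldsymbol{\phi}_2}(x) \>, \qquad \boldsymbol{\theta}_i = \lambda \left[ F_{\boldsymbol{\phi}_1}(t_{i+1}) - F_{\boldsymbol{\phi}_1}(t_i) \right] + (1 - \lambda) \left[ F_{\boldsymbol{\phi}_2}(t_{i+1}) - F_{\boldsymbol{\phi}_2}(t_i) \right] \>, \]

with the first and last bins again absorbing the tails, so that the multinomial likelihood is unchanged and only the construction of \(\boldsymbol{\theta}\) differs.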
Simulation-Based Calibration
The models in these examples (and any future extensions of this general approach) should be evaluated with Simulation-Based Calibration (SBC) to check whether they are well-calibrated with respect to the specified priors.
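A very rough sketch of one SBC iteration, assuming the multinomial_model and model_data objects from the compilation sketch above; a full SBC run would repeat this many times, thin the posterior draws, and check that the resulting ranks are approximately uniform:

sbc_iteration <- function(model, data) {
  # 1. draw parameters from the priors used in the Stan program
  mu_true <- rnorm(1, data$prior_mu_mean, data$prior_mu_sigma)
  sigma_true <- abs(rcauchy(1, 0, 5)) # half-Cauchy(0, 5)
  # 2. simulate binned counts implied by these parameters
  cdf <- plnorm(data$edges, meanlog = mu_true, sdlog = sigma_true)
  theta_true <- c(cdf[2], diff(cdf[2:data$n]), 1 - cdf[data$n])
  sim_data <- data
  sim_data$counts <- as.vector(rmultinom(1, sum(data$counts), theta_true))
  # 3. refit the model and record the rank statistic for mu
  fit <- model$sample(sim_data, chains = 2, refresh = 0, show_messages = FALSE)
  sum(posterior::as_draws_df(fit$draws("mu"))$mu < mu_true)
}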
Sensitivity Analysis
We were able to recover the truth in both examples – which is great and promising – but in the real world the truth isn’t known and the assumptions in the model may not be correct. How well would the method work when the model is misspecified, e.g. if the data generating process uses a Gamma distribution but we still model with the log-normal?
One potential approach is to perform graphical posterior predictive checks and compare each response interval’s observed proportion with many proportions simulated from the posterior predictive distribution.
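A minimal sketch of simulating those proportions, assuming the theta draws in model_draws_tidy and the observed counts in lognormal_bins from above:

N <- sum(lognormal_bins$n) # total number of respondents
theta_draws <- model_draws_tidy %>%
  select(starts_with("theta")) %>%
  as.matrix()
# one simulated set of bin proportions per posterior draw of theta
pp_props <- t(apply(theta_draws, 1, function(p) rmultinom(1, N, p) / N))
observed_props <- lognormal_bins$n / N

Each column of pp_props can then be plotted against the corresponding element of observed_props; systematic discrepancies would point to misspecification.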
Though the authors arrived at the model independently, we were made aware that this approach is actually the standard approach to modelling such data! Thank you to Corey Yanofsky for pointing us towards the Heitjan (1989) paper and for letting us know that a problem similar to the one we describe is a homework question in Bayesian Data Analysis, 3rd Edition (BDA3). See p. 80, question 5 in BDA3.↩︎