#StackBounty: #regression #sampling #mean #population #temporal-difference Compare means between samples, while controlling for samplin…

Bounty: 50

There are two independent samples of people, drawn from the population of a city at times $t_1$ and $t_2$, a decade apart*. The people were asked to rate their preference regarding some question $Q$ on a scale of 1…10, and to report their age and sex. The research question is whether the average preference towards $Q$ has changed in the city over time.

However, the sample composition varies somewhat between the two measurement time points, in terms of both the sex ratio and the ages represented in the sample. The issue is therefore how to compare means of $Q$ between the two temporal samples in a way that controls for sampling differences. For example, if old men tend to give a higher $Q$ rating, and there are slightly more older people and slightly more men in the $t_2$ sample, then the mean might look higher than in $t_1$, but that might really just be an effect of having more old dudes in there. So it seems I can't just run a simple t-test.
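To make the concern concrete, here is a small decomposition (the notation is introduced only for illustration and is not part of the dataset description). Assume each sample is split into age-by-sex groups $g$, with $w_{g,t}$ the share of group $g$ in the sample at time $t$ and $\bar{Q}_{g,t}$ the mean rating in that group. The raw sample mean is then $\bar{Q}_t = \sum_g w_{g,t}\,\bar{Q}_{g,t}$, and the raw difference splits into two parts:

$$
\bar{Q}_{t_2} - \bar{Q}_{t_1}
= \underbrace{\sum_g w_{g,t_2}\left(\bar{Q}_{g,t_2} - \bar{Q}_{g,t_1}\right)}_{\text{change within groups}}
+ \underbrace{\sum_g \left(w_{g,t_2} - w_{g,t_1}\right)\bar{Q}_{g,t_1}}_{\text{change in sample composition}}
$$

Only the first term reflects a change in preference; the second can be non-zero even if no group changed its rating at all, which is why the raw difference in means can mislead.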

Here is an attempt: concatenate the data from the two samples, record time $\{t_1, t_2\}$ as a variable, and build two regression models predicting the $Q$ rating with age and sex as control variables. One model has time as a categorical predictor and the other doesn't:

model a: $Q \sim age ~+~ sex$

model b: $Q \sim age ~+~ sex ~+~ time$

…and then compare the models to see if time adds descriptive power to the model (could also run cross-validation, compare prediction error, etc.). The idea is, if model b performs significantly better, then the differences between the two temporal samples must be significant too, after controlling for differences in sampling. So, for example, if mean $Q$ rating is a point higher in $t_2$ – but model b is not an improvement over model a – then I would conclude that the apparent “increase” is just a sampling effect.
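A minimal sketch of this comparison, on simulated data rather than the real samples. Everything here is an assumption made for illustration: the sample size, the coefficients, the 0/1 coding of sex and time, the helper names `rss` and `nested_f`, and treating the 1…10 rating as a continuous outcome in ordinary least squares. The simulation has no true time effect, but the $t_2$ sample is made older and more male, so the raw difference in means and the age- and sex-adjusted difference can be compared.

```python
import numpy as np


def rss(X, y):
    """Residual sum of squares of an ordinary least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


def nested_f(X_small, X_big, y):
    """F statistic for a smaller model nested inside a bigger one."""
    df_diff = X_big.shape[1] - X_small.shape[1]   # number of extra parameters
    df_resid = len(y) - X_big.shape[1]            # residual df of the bigger model
    rss_small, rss_big = rss(X_small, y), rss(X_big, y)
    f_stat = ((rss_small - rss_big) / df_diff) / (rss_big / df_resid)
    return f_stat, df_diff, df_resid


# Simulated data (hypothetical numbers, not the real survey).
rng = np.random.default_rng(0)
n = 400
time = np.repeat([0.0, 1.0], n // 2)              # 0 = t1, 1 = t2
age = rng.normal(40 + 5 * time, 12)               # t2 sample skews older
sex = (rng.random(n) < 0.45 + 0.10 * time).astype(float)  # 1 = male, more men at t2
# Rating depends on age and sex only: there is NO true time effect.
q = 3.0 + 0.04 * age + 0.8 * sex + rng.normal(0, 1.5, n)

ones = np.ones(n)
X_a = np.column_stack([ones, age, sex])           # model a: Q ~ age + sex
X_b = np.column_stack([ones, age, sex, time])     # model b: Q ~ age + sex + time

f_stat, df_diff, df_resid = nested_f(X_a, X_b, q)
raw_diff = q[time == 1].mean() - q[time == 0].mean()
adjusted_diff = np.linalg.lstsq(X_b, q, rcond=None)[0][3]  # time coefficient in model b

print(f"raw difference in means (t2 - t1): {raw_diff:.3f}")
print(f"adjusted difference (time coef.):  {adjusted_diff:.3f}")
print(f"F = {f_stat:.3f} on ({df_diff}, {df_resid}) degrees of freedom")
```

The F statistic would be referred to an F distribution with `df_diff` and `df_resid` degrees of freedom (for instance with `scipy.stats.f.sf(f_stat, df_diff, df_resid)`, if SciPy is available). Because model b adds a single parameter, this comparison is equivalent to the t-test on the time coefficient within model b, so the coefficient itself can be read as the adjusted difference between $t_1$ and $t_2$.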

Bottom line question: does this make sense? If not, why, and is there a better way to answer this sort of a research question?


* So technically, two samples from two different populations. There are only these two samples in the dataset, not a longer time series. There is no overlap in participants, so this is not a question about overlapping samples, and the sample sizes are almost the same, so it is not a question about unequal sample sizes either. The differences in sex and age ratios are assumed to stem from sampling issues, not from major changes in the city population's sex ratio or age composition.

