
I am reviewing a sampling design devised by a colleague and completely fail to understand it, although I am not a novice in statistics (but not a huge expert either). The colleague is no longer available to clarify, so I would appreciate any help.

**Situation and aim**

The following situation is made up but reflects the setup of the real one. Suppose we are interested in the amount of money people have spent on candy in a particular shopping mall over some fixed period of time (e.g. a year). Also, suppose we know the total amount spent on everything by each person, but not the breakdown (people who didn’t spend anything can be disregarded). So we need to survey a random sample of people who visited the mall in the target period and spent non-zero amounts, and find out how much they spent on candy. This estimate can then be extrapolated to the whole population of people who visited the mall in the same period, which is a known number.

The crux is that we have to aim for a desired level of accuracy for this estimate. The 95% CI needs to be within 5% of the total expenditure across all spenders in that mall in the focal year.

**Colleague’s method**

Colleague devised a method involving stratification, claiming that it reduces the required sample size. However, his method becomes unintelligible to me way before stratification is introduced. The method is described generally for $k$ strata. For example, if 3 strata are used, these could be low spenders, medium spenders and high spenders.

Given that the known quantities are:

$T$: total \$ spent by people in the mall over the year

$N$: number of spenders over the year

$N_i$: number of spenders in each stratum

$W_i = N_i/N$: proportion of spenders in each stratum

$S_i$: standard deviation of \$ spent per person in each stratum

$z = 1.96$: Z-score corresponding to the 95% CI

The workings are:

As we aim for a 95% CI of the mean $\pm~d$, where $d$ is 5% of the total expenditure, then

$d = 0.05T$

Then, the target variance is worked out as

$\text{Var}_\text{t} = \frac{d^2}{z^2N^2}$ (I do not see where this comes from at all.)

(There is another version of the document which has

$\text{Var}_\text{t} = \frac{z^2}{d^2N^2}$. I know it looks shambolic. Neither formula makes sense to me.)

The first approximation to the sample size is then calculated as:

$n_o = \sum_{i=1}^k \frac{(W_iS_i)^2}{\text{Var}_\text{t}}$

And the final required sample size as:

$n = \frac{n_o}{\frac{\sum_{i=1}^k W_iS_i^2}{N\text{Var}_\text{t}}} = \frac{n_oN\text{Var}_\text{t}}{\sum_{i=1}^k W_iS_i^2}$
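To make the arithmetic concrete, here is a minimal Python sketch that evaluates the formulas exactly as they are written above, using the first version of $\text{Var}_\text{t}$. The function names and all input numbers are made up by me for illustration; none of them come from the colleague’s document.

```python
import math

def target_variance(T, N, z=1.96, rel_error=0.05):
    """Var_t = d^2 / (z^2 * N^2), with d = rel_error * T (first version above)."""
    d = rel_error * T
    return d**2 / (z**2 * N**2)

def sample_size(W, S, N, var_t):
    """Return (n_o, n) from the two sample-size formulas as written above."""
    n_o = sum((w * s) ** 2 for w, s in zip(W, S)) / var_t
    n = n_o * N * var_t / sum(w * s**2 for w, s in zip(W, S))
    return n_o, n

# Hypothetical inputs: three strata (low, medium, high spenders).
N_i = [6000, 3000, 1000]      # spenders per stratum
S_i = [20.0, 80.0, 300.0]     # SD of $ spent per person, per stratum
T = 1_000_000.0               # total $ spent in the mall
N = sum(N_i)
W_i = [n_stratum / N for n_stratum in N_i]

var_t = target_variance(T, N)
n_o, n = sample_size(W_i, S_i, N, var_t)
print(f"Var_t = {var_t:.4f}, n_o = {n_o:.1f}, n = {n:.1f}")
```

One thing the sketch makes visible: with the formulas as written, $\text{Var}_\text{t}$ cancels in the second step, so $n = N\sum_i (W_iS_i)^2 / \sum_i W_iS_i^2$ does not depend on $d$ or $z$ at all.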

Colleague then describes extrapolation of the overall proportion spent on candy as follows. After the questionnaires are in, we have the values of $C_{ip}$ – the proportions each person $p$ spent on candy in each stratum $i$. We then use these values to get $C_{i*}$ – the (presumably weighted by person) proportions spent on candy in each stratum. And so the amount spent on candy in each stratum is worked out by extrapolation:

$A_i = N_i \times C_{i*}$ – this makes sense.

But the uncertainty around this estimate is calculated in a weird way. Suppose $S_{i*}$ is the standard deviation in the amount spent **on candy** per person in each stratum.

Colleague first calculates a “sampling factor” for each stratum:

$\text{fact}_i = 1 - \frac{n_i}{N_i}$

and then uses it to calculate the stratum population SD:

$N_i\sqrt{\frac{\text{fact}_i S_{i*}^2}{n_i}}$ (this step is particularly unclear to me)

then squares it to calculate the stratum population variance:

$\left(N_i\sqrt{\frac{\text{fact}_i S_{i*}^2}{n_i}}\right)^2$

then adds up across all strata to get the overall population variance:

$\sum\left(N_i\sqrt{\frac{\text{fact}_i S_{i*}^2}{n_i}}\right)^2$

then takes a square root to get the overall population SD.

$\sqrt{\sum\left(N_i\sqrt{\frac{\text{fact}_i S_{i*}^2}{n_i}}\right)^2}$

This is then used in the calculation of a confidence interval…
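For anyone who wants to trace these steps numerically, here is a minimal Python sketch of the chain above up to the overall SD (the confidence-interval step is not spelled out in the document, so I leave it out). It assumes $n_i$ is the number of people sampled in stratum $i$; the function names and all numbers are hypothetical.

```python
import math

def stratum_sd(N_i, n_i, s_i):
    """N_i * sqrt(fact_i * s_i^2 / n_i), with fact_i = 1 - n_i / N_i."""
    fact_i = 1 - n_i / N_i
    return N_i * math.sqrt(fact_i * s_i**2 / n_i)

def overall_sd(N, n, s):
    """Square each stratum value, sum across strata, take the square root."""
    total_var = sum(stratum_sd(N_i, n_i, s_i) ** 2
                    for N_i, n_i, s_i in zip(N, n, s))
    return math.sqrt(total_var)

# Hypothetical inputs for three strata.
N_strata = [6000, 3000, 1000]   # spenders per stratum
n_strata = [60, 30, 10]         # sampled people per stratum (assumed meaning of n_i)
s_candy = [2.0, 5.0, 12.0]      # SD of $ spent on candy per person, per stratum

sd_total = overall_sd(N_strata, n_strata, s_candy)
print(f"overall SD = {sd_total:.2f}")
```

Note that squaring each stratum value simply undoes the inner square root, so the result is the same as $\sqrt{\sum N_i^2 \,\text{fact}_i\, S_{i*}^2 / n_i}$.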

**My thinking**

I fail to follow the colleague’s method after $d = 0.05T$.

This is a deviation (a half-width) in a confidence interval, which is equal to:

$d = z\frac{s}{\sqrt{n}}$, where $s$ is the sample standard deviation (ignoring stratification for the moment).

Rearranging this gives:

$s^2 = \frac{d^2n}{z^2}$, which is the variance, but is different from both versions of the $\text{Var}_\text{t}$ formula in the colleague’s method.

We can also express the sample size:

$n = \frac{z^2s^2}{d^2}$

$z^2$ and $d^2$ are known, but the variance, $s^2$, is unknown. A further glaring fact is that this needs to be the variance in the \$ spent per person **on candy**, not the \$ spent per person on everything. This leads me to the conclusion that it is impossible to calculate the required sample size prior to sampling, given that the variance in the \$ spent per person **on candy** is unknown (overall or per stratum). That said, I hesitate to label my colleague’s method as wrong, as I don’t understand it; there must be some rationale behind it.
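To illustrate how strongly the answer depends on the unknown $s$, here is a small Python sketch of $n = z^2s^2/d^2$ (unstratified, no finite-population correction). The half-width `d` and the candidate values of `s` are invented purely for illustration, and `d` is taken here as a half-width on the per-person mean.

```python
import math

def required_n(s, d, z=1.96):
    """Smallest integer n with z * s / sqrt(n) <= d, i.e. n >= z^2 s^2 / d^2."""
    return math.ceil(z**2 * s**2 / d**2)

d = 2.0  # hypothetical half-width, in $ per person
for s in (5.0, 10.0, 20.0):  # hypothetical guesses at the SD of candy spend
    print(f"s = {s:5.1f}  ->  n = {required_n(s, d)}")
```

Doubling the guess for $s$ roughly quadruples the required $n$, which is why I cannot see how the sample size can be fixed without some prior information on the candy-spend variance.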

Any ideas?
