#StackBounty: #neural-networks #dataset #sample Is it better to split sequences into overlapping or non-overlapping training samples?

Bounty: 50

I have $N$ (time) sequences of data with length $2048$. Each of these sequences corresponds to a different target output. However, I know that only a small part of the sequence is needed to actually predict this target output, say a sub-sequence of length $128$.

I could split up each of the sequences into $16$ partitions of $128$, so that I end up with $16N$ training samples. However, I could drastically increase the number of training samples if I use a sliding window instead: there are $2048-128 = 1920$ unique sub-sequences of length $128$ that preserve the time series. That means I could in fact generate $1920N$ unique training samples, even though most of the input is overlapping.

I could also use a larger increment (stride) between consecutive windows, which would reduce the number of sub-sequences but could also remove the autocorrelation between them.

Is it better to split my data into $16N$ non-overlapping sub-sequences or $1920N$ partially overlapping sub-sequences?
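The trade-off between the two extremes can be made concrete with a small sketch. It assumes a plain Python list as a stand-in for one sequence, and the helper names `window_starts` and `make_windows` are hypothetical. Note that with an increment of 1 the exact count of full windows is $2048 - 128 + 1 = 1921$, one more than the figure quoted above.

```python
def window_starts(seq_len, window, stride):
    """Start indices of all full windows of length `window`, taken every `stride` steps."""
    return list(range(0, seq_len - window + 1, stride))


def make_windows(sequence, window, stride):
    """Slice one sequence into sub-sequences that keep the original time order."""
    return [sequence[s:s + window] for s in window_starts(len(sequence), window, stride)]


# Hypothetical stand-in for one real sequence of length 2048
sequence = list(range(2048))

non_overlapping = make_windows(sequence, window=128, stride=128)  # disjoint partitions
half_overlap = make_windows(sequence, window=128, stride=64)      # 50% overlap
fully_sliding = make_windows(sequence, window=128, stride=1)      # maximal overlap

print(len(non_overlapping), len(half_overlap), len(fully_sliding))
```

The stride is the only knob here: it moves smoothly between the $16N$ case and the fully sliding case, so intermediate settings can be compared on the same data.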


#StackBounty: #hypothesis-testing #t-test #survey #sample Representativeness test: how?

Bounty: 50

A survey was sent to a representative sample of the population, consisting of 4,400 individuals. However, only 2,003 of them completed the survey.
I want to check whether this subsample (2,003 individuals) is representative of the population with regard to some key variables. I have two questions:

  1. What is the most suitable method? I was thinking of a t-test for continuous variables and a chi-square goodness-of-fit test for categorical variables (or is a z-test more appropriate for the latter?).
  2. Should I compare those who completed the survey with those who did not (2,003 vs. 2,397), or those who completed the survey with the whole sample (2,003 vs. 4,400)?
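For the categorical case, the goodness-of-fit idea can be sketched as follows. This is only an illustration under stated assumptions: the category counts and reference proportions are invented, the reference proportions are treated as fixed (for example taken from the whole sample of 4,400), and the helper name `chi_square_gof` is hypothetical.

```python
def chi_square_gof(observed, reference_props):
    """Pearson chi-square statistic of observed counts against reference proportions."""
    total = sum(observed)
    expected = [total * p for p in reference_props]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, expected


# Hypothetical numbers, NOT taken from the survey in the question
observed = [700, 800, 503]            # respondents per category, sums to 2003
reference_props = [0.30, 0.40, 0.30]  # assumed shares in the reference group

statistic, expected = chi_square_gof(observed, reference_props)

# 0.95 quantile of the chi-square distribution with 3 - 1 = 2 degrees of freedom
CRITICAL_5PCT_DF2 = 5.991
print(round(statistic, 2), statistic > CRITICAL_5PCT_DF2)
```

With real data, a statistic above the critical value would suggest that the respondents differ from the reference group on that variable.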

Thank you


#StackBounty: #variance #sampling #sample-size #sample #stratification Calculating a sample size based on the target width of a confide…

Bounty: 50

I am reviewing a sampling design devised by a colleague and completely fail to understand it, although I am not a novice in statistics (but not a huge expert either). The said colleague is no longer available for providing clarifications, so I would appreciate any help.

Situation and aim

The following situation is made up but mirrors the setup of the real one. Suppose we are interested in the amount of money people have spent on candy in a particular shopping mall over some fixed period of time (e.g. a year). Also, suppose we know the total amount spent on everything by each person, but not the breakdown (people who didn’t spend anything can be disregarded). So we need to survey a random sample of people who have spent non-zero amounts and visited the mall in the target period, and find out the amounts they have spent on candy. This estimate can then be extrapolated to the whole population of people who visited the mall in the same period, whose size is known.

The crux is that we have to aim for a desired level of accuracy for this estimate. The 95% CI needs to be within 5% of the total expenditure across all spenders in that mall in the focal year.

Colleague’s method

Colleague devised a method involving stratification, claiming that it reduces the required sample size. However, his method becomes unintelligible to me way before stratification is introduced. The method is described generally for $k$ strata. For example, if 3 strata are used, these could be low spenders, medium spenders and high spenders.

The known quantities are:

$T$: total amount (in dollars) spent by people in the mall over the year

$N$: number of spenders over the year

$N_i$: number of spenders in each stratum

$W_i = N_i/N$: proportion of spenders in each stratum

$S_i$: standard deviation of the amount (in dollars) spent per person in each stratum

$z = 1.96$: Z-score corresponding to the 95% CI

The workings are:

As we aim for a 95% CI of the mean $\pm d$, where $d$ is 5% of the total expenditure, then

$d = 0.05T$

Then, the target variance is worked out as

$\text{Var}_\text{t} = \frac{d^2}{z^2N^2}$ (I do not see where this comes from at all.)

(There is another version of the document which has
$\text{Var}_\text{t} = \frac{z^2}{d^2N^2}$. I know it looks shambolic. Neither formula makes sense to me.)
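One possible reading of the first version, offered as an assumption rather than as the colleague's confirmed reasoning: if $d$ is the half-width of a CI for the population total $\hat{T} = N\bar{y}$ (with $\bar{y}$ the per-person sample mean, a symbol not used in the original document), then the standard error of the total is $N$ times that of the mean, and solving for the variance of the mean gives the same expression:

$d = z\sqrt{\text{Var}(\hat{T})} = zN\sqrt{\text{Var}(\bar{y})} \;\Rightarrow\; \text{Var}(\bar{y}) = \frac{d^2}{z^2N^2}$

Under this reading $\text{Var}_\text{t}$ would be the target variance of the per-person mean. The second version, $\frac{z^2}{d^2N^2}$, has units of inverse squared dollars, so it cannot be the variance of a dollar amount.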

The first approximation to the sample size is then calculated as:

$n_o = \sum_{i=1}^k \frac{(W_iS_i)^2}{\text{Var}_\text{t}}$

And the final required sample size as:

$n = \frac{n_o}{\frac{\sum_{i=1}^k W_iS_i^2}{N\text{Var}_\text{t}}} = \frac{n_oN\text{Var}_\text{t}}{\sum_{i=1}^k W_iS_i^2}$
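To make the three steps easier to inspect, here is a literal transcription of the formulas exactly as posted (first version of $\text{Var}_\text{t}$). It is not a validated sample size procedure, and all input numbers as well as the function name `sample_size_as_posted` are hypothetical.

```python
Z = 1.96  # Z-score for a 95% CI


def sample_size_as_posted(T, N_i, S_i, z=Z):
    """Evaluate Var_t, n_o and n exactly as the formulas are written in the post."""
    N = sum(N_i)
    W = [count / N for count in N_i]       # stratum proportions W_i
    d = 0.05 * T                           # target half-width
    var_t = d ** 2 / (z ** 2 * N ** 2)     # target variance, first version
    n_o = sum((w * s) ** 2 for w, s in zip(W, S_i)) / var_t
    n = n_o * N * var_t / sum(w * s ** 2 for w, s in zip(W, S_i))
    return var_t, n_o, n


# Hypothetical inputs: three strata (low, medium, high spenders)
var_t, n_o, n = sample_size_as_posted(
    T=1_000_000, N_i=[6000, 3000, 1000], S_i=[20.0, 50.0, 150.0]
)
print(round(var_t, 3), round(n_o, 1), round(n, 1))
```

One thing the transcription makes visible: in the second form of the final formula, $\text{Var}_\text{t}$ cancels algebraically, so $n$ as posted does not depend on $d$ at all.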

Colleague then describes extrapolation of the overall proportion spent on candy as follows. After the questionnaires are in, we have the values of $C_{ip}$ – the proportions each person $p$ spent on candy in each stratum $i$. We then use these values to get $C_{i*}$ – the (presumably weighted by person) proportions spent on candy in each stratum. And so the amount spent on candy in each stratum is worked out by extrapolation:

$A_i = N_i \times C_{i*}$ – this makes sense.

But the uncertainty around this estimate is calculated in a weird way. Suppose $S_{i*}$ is the standard deviation of the amount spent on candy per person in each stratum.

Colleague first calculates a “sampling factor” for each stratum:

$\text{fact}_i = 1 - \frac{n_i}{N_i}$

and then uses it to calculate the stratum population SD:

$N_i\sqrt{\frac{\text{fact}_i S_{i*}^2}{n_i}}$ (this step is particularly unclear to me)

then squares it to calculate the stratum population variance:

$\left(N_i\sqrt{\frac{\text{fact}_i S_{i*}^2}{n_i}}\right)^2$

then adds up across all strata to get the overall population variance:

$\sum_{i=1}^k\left(N_i\sqrt{\frac{\text{fact}_i S_{i*}^2}{n_i}}\right)^2$

then takes a square root to get the overall population SD.

$\sqrt{\sum_{i=1}^k\left(N_i\sqrt{\frac{\text{fact}_i S_{i*}^2}{n_i}}\right)^2}$

this is then used in the calculation of a confidence interval…
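The chain of steps above can be sketched in a few lines. The interpretation in the comments is an assumption: each stratum term is read as the standard error of an estimated stratum total under simple random sampling without replacement, with $1 - n_i/N_i$ acting as a finite population correction. The function name `stratified_total_se` and all numbers are hypothetical.

```python
import math


def stratified_total_se(N_i, n_i, S_star):
    """Combine per-stratum terms into one overall SD, following the steps in the post."""
    variance = 0.0
    for N, n, S in zip(N_i, n_i, S_star):
        fact = 1 - n / N                                # "sampling factor" per stratum
        stratum_sd = N * math.sqrt(fact * S ** 2 / n)   # stratum-level SD term
        variance += stratum_sd ** 2                     # square and add across strata
    return math.sqrt(variance)                          # square root of the sum


# Hypothetical strata: population sizes, sample sizes, SD of candy spend per person
N_i = [6000, 3000, 1000]
n_i = [60, 30, 10]
S_star = [5.0, 12.0, 40.0]

se = stratified_total_se(N_i, n_i, S_star)
half_width = 1.96 * se  # half-width of a 95% CI, assuming approximate normality
print(round(se, 1), round(half_width, 1))
```

Squaring the square root in each stratum simply gives $N_i^2 \, \text{fact}_i \, S_{i*}^2 / n_i$, so the intermediate "SD" step is not strictly needed to reach the final value.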

My thinking

I fail to follow the colleague’s method after $d = 0.05T$.

This is a deviation (a half-width) in a confidence interval, which is equal to:

$d = z\frac{s}{\sqrt{n}}$, where $s$ is the sample standard deviation (ignoring stratification for the moment).

Rearranging this gives:

$s^2 = \frac{d^2n}{z^2}$, which is the variance, but is different from both versions of the $\text{Var}_\text{t}$ formula in my colleague’s method.

We can also express the sample size:

$n = \frac{z^2s^2}{d^2}$

$z^2$ and $d^2$ are known, but the variance, $s^2$, is unknown. A further glaring issue is that this needs to be the variance in the amount spent per person on candy, not the amount spent per person on everything. This leads me to conclude that it is impossible to calculate the required sample size prior to sampling, given that the variance in the amount spent per person on candy is unknown (overall or per stratum). That said, I hesitate to label my colleague’s method as wrong, as I don’t understand it. There must be some rationale behind it.
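How strongly the unknown $s$ drives the result can be seen in a small sketch of $n = z^2s^2/d^2$. The values of $s$ below are pure guesses for illustration, $d$ is taken on the per-person scale, and the helper name `required_n` is hypothetical.

```python
import math


def required_n(s, d, z=1.96):
    """Smallest integer n with z * s / sqrt(n) <= d (no stratification, no finite population correction)."""
    return math.ceil((z * s / d) ** 2)


d_per_person = 5.0  # hypothetical half-width on the per-person scale

# Two guessed values for the unknown SD of candy spend per person
n_low = required_n(s=40.0, d=d_per_person)
n_high = required_n(s=80.0, d=d_per_person)
print(n_low, n_high)
```

Doubling the guessed $s$ roughly quadruples the required $n$, which is why the calculation hinges on having some prior estimate of the candy-spend variance.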

Any ideas?
