#StackBounty: #r #mathematical-statistics #variance #sampling #mean Right way to compute mean and variance

Bounty: 50

1. Suppose each $a_{\ell m}$ follows a normal distribution with mean zero and variance $C_\ell=\langle a_{\ell m}^2 \rangle=\text{Var}(a_{\ell m})$, and consider the random variable $Z$ defined by:

$$\begin{aligned}
Z = \sum_{\ell=\ell_{min}}^{\ell_{max}} \sum_{m=-\ell}^{\ell} a_{\ell m}^{2}
\end{aligned}$$

The goal is then to compute $\langle Z\rangle$:

If I consider the random variable $Y=\sum_{m=-\ell}^{\ell} C_\ell \bigg(\dfrac{a_{\ell m}}{\sqrt{C_\ell}}\bigg)^{2}$, each term of $Y$ follows a $\chi^2(1)$ distribution weighted by $C_\ell$.

  1. Can I conclude from this that the mean of $Y$ is:

$$\langle Y\rangle =\bigg\langle\sum_{m=-\ell}^{\ell} a_{\ell m}^{2}\bigg\rangle = (2\ell+1)\,C_\ell$$

and so we would have:

$$\langle Z\rangle = \sum_{\ell=\ell_{min}}^{\ell_{max}}\,C_\ell\,(2\ell+1)$$

I have serious doubts, since the $a_{\ell m}$ do not follow a standard normal distribution $\mathcal{N}(0,1)$.

Shouldn’t it rather be:

$$\begin{align}
Z &\equiv \sum_{\ell=\ell_{min}}^{\ell_{max}} \sum_{m=-\ell}^{\ell} a_{\ell m}^2 \\[6pt]
&= \sum_{\ell=\ell_{min}}^{\ell_{max}} \sum_{m=-\ell}^{\ell} C_\ell \cdot \bigg( \frac{a_{\ell m}}{\sqrt{C_\ell}} \bigg)^2 \\[6pt]
&\sim \sum_{\ell=\ell_{min}}^{\ell_{max}} \sum_{m=-\ell}^{\ell} C_\ell \cdot \text{ChiSq}(1) \\[6pt]
&= \sum_{\ell=\ell_{min}}^{\ell_{max}} C_\ell \sum_{m=-\ell}^{\ell} \text{ChiSq}(1) \\[6pt]
&= \sum_{\ell=\ell_{min}}^{\ell_{max}} C_\ell \cdot \text{ChiSq}(2 \ell + 1).
\end{align}$$

  1. Now, I want to calculate the mean $\langle Z\rangle$ of $Z$:

Do you agree that my case here is the computation of the mean of a weighted sum of $\chi^2$ variables?

So the computation is not trivial, is it? Maybe I could compute the mean by starting from the analytical expression:

$$\langle Z\rangle=\sum_{\ell=\ell_{min}}^{\ell_{max}} C_\ell (2\ell + 1)\quad(1)$$

and directly doing the numerical computation:

$$\langle Z\rangle=\sum_{i=1}^{N} C_{\ell_{i}} (2\ell_{i} + 1)\quad(2)$$

  1. What do you think about this direct computation? Is it correct?

I get confused between $(1)$ and $(2)$ above, since each $C_\ell$ corresponds to one $\ell$ (I mean, from a numerical point of view, each $C_{\ell_{i}}$ value is associated with an $\ell_{i}$ value).
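For reference, here is a short derivation sketch of both moments. It assumes the $a_{\ell m}$ are mutually independent, which the post does not state explicitly, and it uses the standard facts $E[\chi^2(k)] = k$ and $\text{Var}[\chi^2(k)] = 2k$:

$$\begin{aligned}
\langle Z\rangle &= \sum_{\ell=\ell_{min}}^{\ell_{max}} C_\ell \, E\big[\chi^2(2\ell+1)\big] = \sum_{\ell=\ell_{min}}^{\ell_{max}} C_\ell\,(2\ell+1) \\[6pt]
\text{Var}(Z) &= \sum_{\ell=\ell_{min}}^{\ell_{max}} C_\ell^2 \, \text{Var}\big[\chi^2(2\ell+1)\big] = 2\sum_{\ell=\ell_{min}}^{\ell_{max}} C_\ell^2\,(2\ell+1)
\end{aligned}$$

The mean only needs linearity of expectation, so the weights $C_\ell$ cause no difficulty; independence is used only for the variance. Expression $(2)$ would be the same sum as $(1)$ provided the tabulated $\ell_i$ cover every integer multipole from $\ell_{min}$ to $\ell_{max}$.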

  1. If the direct computation $\langle Z\rangle=\sum_{i=1}^{N} C_{\ell_{i}} (2\ell_{i} + 1)$ is not correct, then I have to treat the random variable $Z$ as a weighted sum of different chi-squared distributions:

I have tried the following R script, where nRed is the number of bins considered (5), nRow is the number of values of $\ell$ (from $\ell_{min}$ to $\ell_{max}$), and Cl_sp[,i] is the vector of nRow values of $C_\ell$ for each bin $i$ taken into account.

    # Number of bins
    nRed <- 5

    # Number of rows (values of l)
    nRow <- 36

    # Size of sample
    nSample_var <- 1000

    # Degrees of freedom 2*l + 1, one per multipole l
    # (array_2D[,1] is assumed to hold the nRow values of l)
    L <- 2*(array_2D[,1])+1

    # Weighted sum of chi-squared distributions
    y3_1 <- array(0, dim=c(nSample_var, nRed))
    for (i in 1:nRed) {
      for (j in 1:nRow) {
        y3_1[,i] <- y3_1[,i] + Cl_sp[j,i] * rchisq(nSample_var, df=L[j])
      }
    }

    # Print the mean of Z for each bin (y3_1 is the array filled above)
    for (i in 1:nRed) {
      print(paste0('red=', i, ' mean_exp = ', mean(y3_1[,i])))
    }

  1. Is this the right thing to implement to compute the mean of $Z$ if I can’t compute it analytically (see expression $(2)$ above)?

I would also like to compute the variance of $Z$. Maybe a simple addition to my R script, like:

    # Print the variance of Z for each bin
    for (i in 1:nRed) {
      print(paste0('red=', i, ' var_exp = ', var(y3_1[,i])))
    }

should be enough. What do you think about this?
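One way to check a simulation of this kind is to compare it with the closed-form moments of a weighted sum of independent chi-squared variables. The sketch below is only an illustration: the multipole range `ell` and the spectrum `Cl` are made-up values, not the `array_2D` or `Cl_sp` data from the script above, and independence between multipoles is assumed.

```r
set.seed(42)

# Hypothetical inputs (not from the post): multipoles and a made-up spectrum
ell <- 2:37                      # 36 multipoles, matching nRow = 36
Cl  <- 1 / (ell * (ell + 1))     # illustrative C_ell values only
df  <- 2 * ell + 1               # degrees of freedom per multipole

# Closed-form moments of Z = sum_l C_l * ChiSq(2l + 1), independent terms
mean_exact <- sum(Cl * df)
var_exact  <- sum(2 * Cl^2 * df)

# Monte Carlo estimate: one column of chi-squared draws per multipole
n_sample <- 20000
draws <- sapply(df, function(k) rchisq(n_sample, df = k))
z <- as.vector(draws %*% Cl)     # weighted sum across multipoles

mean_mc <- mean(z)
var_mc  <- var(z)

print(c(mean_exact = mean_exact, mean_mc = mean_mc))
print(c(var_exact = var_exact, var_mc = var_mc))
```

If the simulated and closed-form values agree to within sampling noise, the simulation is not needed for the mean or the variance. It would still be useful for quantiles or other features of the distribution of $Z$.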



#StackBounty: #multiple-regression #variance #econometrics #standard-error #sphericity Formula for variance of an individual regressor …

Bounty: 50

If I run the regression:

$Y_i= \sum_{j=0}^{k} \beta_j x_{i,j} + \epsilon_i$, where $x_{i,0} = 1$, i.e. the intercept term.

My understanding is that, from the Frisch-Waugh-Lovell theorem, the formula for $\beta_j$ is:

$\beta_j = \frac{\sum{(\eta_i-\bar{\eta})y_i}}{\sum{(\eta_i-\bar{\eta})^2}}$

where $\eta$ is the residual of $x_j$ projected onto all the other covariates.

With spherical errors, I believe one can derive the variance as:

$\text{var}(\beta_j)=\frac{\sigma^2}{\sum{(\eta_i-\bar{\eta})^2}}$ (Is this correct? And could the $\bar{\eta}$ be dropped if a constant was included as a covariate?)

If I wished to derive the formula for heteroscedastic and clustered standard errors, would this be derived analogously to the simple linear regression case, just replacing $x$ with the residual $\eta$? I.e., is:

$\text{var}(\beta_j) = \frac{\sum{(\eta_i-\bar{\eta})^2 E[\epsilon_i^2]}}{(\sum{(\eta_i-\bar{\eta})^2})^2}$

under heteroscedasticity?
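The homoscedastic part of this can be checked numerically. The following is a minimal sketch on simulated data (the variables `x1`, `x2`, `y` and all coefficients are made up for illustration). It compares the residual-based formulas with what `lm()` and `vcov()` report, and prints the mean of the residualized regressor.

```r
set.seed(1)

# Simulated data (hypothetical; names and coefficients are illustrative only)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
y  <- 1 + 2 * x1 - 1 * x2 + rnorm(n)

# Full regression with intercept
fit <- lm(y ~ x1 + x2)

# Residual of x1 after projecting on the other covariates (intercept and x2)
eta <- resid(lm(x1 ~ x2))

# Residual-based coefficient and its variance under spherical errors
beta_fwl <- sum(eta * y) / sum(eta^2)
s2       <- sum(resid(fit)^2) / fit$df.residual   # estimate of sigma^2
var_fwl  <- s2 / sum(eta^2)

print(c(lm = unname(coef(fit)["x1"]), fwl = beta_fwl))
print(c(lm = vcov(fit)["x1", "x1"], fwl = var_fwl))
print(mean(eta))   # mean of the residualized regressor
```

Because the auxiliary regression includes a constant, its residuals are orthogonal to it, so `mean(eta)` is zero up to floating-point error and $\bar{\eta}$ drops out of the formulas.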



#StackBounty: #anova #variance #experiment-design #heteroscedasticity #bias Ensuring that sequence of repeated trials does not bias res…

Bounty: 100

Imagine the following experimental design:

You have four conditions, A,B,C and D, regarding an online platform. Conditions A,B,C have a different feature enabled that you want to compare and D has none enabled (control).

You have a sample of 18 users interacting with this platform, each posting something there. The activity is executed one user after another, so incoming users see what the previous users have posted. You repeat this process 3 times (3×18=54 users for each condition), starting from the same starting point and each time with a completely new sample (this is not a longitudinal study).

At the end of each activity, you take measurements on several psychometric variables that you will evaluate your features against.

Before you proceed to the analysis (with ANOVA or Kruskal-Wallis), how can we verify that the users’ activities and the interactions between the 18 users within the same group (sequence of messaging, type of content, users’ comments, etc.) do not bias the results?
For example, that a "troll" user who posted first did not negatively affect the behaviour of the rest of the users.

What statistical test would check this?

Would Levene’s test to check homogeneity of variance across the groups of 18 be sufficient? How to proceed if it does pass the test?

Would the simplification, of treating the total of 54 as one big pot, be wrong?



#StackBounty: #distributions #variance #interpretation #standard-deviation #reporting Variance of a derived magnitude

Bounty: 50

I’m wondering how to present results in a report (and how to interpret them).

Let $Y = f(\mathbf{X})$ be a random variable. Of course, if we derive its PDF $f_{Y}(y)$, we could present it in the report and the reader would have all the information about that variable. But suppose we compute its variance approximately (without going through its PDF) by using the formula
\begin{equation*}
\text{Var}\left(Y\right)
= \sum_{i=1}^{n}
\left(
\left. \frac{\partial y}{\partial x_i} \right\vert_{\mu_i} \sigma_{X_{i}}
\right)^2
\end{equation*}

When we present the result as $(\mu_{Y} - \sigma_{Y}, \mu_{Y} + \sigma_{Y})$, we aren’t really giving any information about how it’s distributed.

When the standard deviation of a variable is given without its PDF, are we supposed to interpret it with some inequality (Chebyshev’s, for example) to get a bound for a confidence interval?
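To make the comparison concrete, here is a minimal sketch for one made-up case, $Y = X_1 X_2$ with independent Gaussian inputs (the function, means and standard deviations are assumptions chosen only for illustration). It compares the first-order propagation formula with a Monte Carlo estimate, then sets the simulated coverage of $\mu_Y \pm k\sigma_Y$ next to the distribution-free Chebyshev lower bound $1 - 1/k^2$.

```r
set.seed(7)

# Hypothetical example: Y = X1 * X2 with independent Gaussian inputs
mu    <- c(10, 5)       # assumed means of X1, X2
sigma <- c(0.5, 0.4)    # assumed standard deviations

# First-order propagation: partial derivatives evaluated at the means
grad    <- c(mu[2], mu[1])             # dY/dX1 = X2, dY/dX2 = X1
sd_prop <- sqrt(sum((grad * sigma)^2))

# Monte Carlo reference
n  <- 100000
x1 <- rnorm(n, mu[1], sigma[1])
x2 <- rnorm(n, mu[2], sigma[2])
y  <- x1 * x2
sd_mc <- sd(y)

# Fraction of draws within k standard deviations of the mean,
# next to the Chebyshev lower bound 1 - 1/k^2
k <- c(1, 2)
coverage  <- sapply(k, function(kk) mean(abs(y - mean(y)) <= kk * sd_mc))
chebyshev <- 1 - 1 / k^2

print(c(sd_prop = sd_prop, sd_mc = sd_mc))
print(rbind(k, coverage, chebyshev))
```

The Chebyshev bound holds for any distribution with finite variance, which is why it is so loose; the simulated coverage shows how much tighter the interval is for this particular $f$.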


I’m asking this because I took two laboratory courses that reported magnitudes as described above, and now, in a probability course, I’ve learned that a function of Gaussian-distributed variables does not, in general, follow a Gaussian. So I want to know what the point is of reporting the standard deviation for an unknown distribution: is it to use an upper bound for the confidence interval (Chebyshev, the only one that works for any distribution), or are there other reasons?

The question implicitly asks if what I’m saying is correct. If it’s not clear what I mean, please leave a comment so I can make an attempt to clarify.



#StackBounty: #time-series #normal-distribution #variance #mean Are power law relations between means and standard deviations inherent …

Bounty: 100

In a recent paper I submitted for publication I document a power law relation between the means and standard deviations of several time series. That is, when plotting the log of the means of each of these (stationary) series against the log of their respective standard deviations, you get a straight, positively sloped line (with non-zero y axis intercept).
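To fix notation, the relation described above can be written as follows. The symbols $a$, $b$ and the series index $k$ are introduced here only for illustration and do not appear in the original post:

$$\sigma_k = a\,\mu_k^{\,b} \quad\Longleftrightarrow\quad \log \sigma_k = \log a + b \log \mu_k$$

On log-log axes this is a straight line with slope $b$ and intercept $\log a$, so a non-zero intercept corresponds to $a \neq 1$.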

When researching for this paper I scoured the internet for any possible statistical or mathematical explanation for this behavior, but found none, and could recall nothing from my own training in statistics that would explain this either. I discovered variance functions and Bartlett’s identities along the way, but this still fell far short of explaining the relation I was documenting. The data I am dealing with are all normally distributed.

My paper was rejected, and one of the main grounds for rejection given by the editor was that the power law relation between means and standard deviations I had observed is "inherently true of more or less normally distributed sets of data".

Can someone please explain to me what the editor is talking about? Do power law relations trivially exist between the means and standard deviations of different normally distributed sets of data?

Edit: Some details on the data – Each data set is a stationary yearly time series. Number of observations in each series is the same. My logged plot of the means against their respective standard deviations follows below. In this graphic, the different shapes and colors of the points correspond to different commodity groups.

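[Figure: log of the means plotted against log of the respective standard deviations; point shapes and colors denote commodity groups]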



#StackBounty: #time-series #normal-distribution #variance #mean Are log-linear relations between means and standard deviations inherent…

Bounty: 100

In a recent paper I submitted for publication I document a log linear relation between the means and standard deviations of several time series. That is, when plotting the log of the means of each of these (stationary) series against the log of their respective standard deviations, you get a straight, positively sloped line (with non-zero y axis intercept).

When researching for this paper I scoured the internet for any possible statistical or mathematical explanation for this behavior, but found none, and could recall nothing from my own training in statistics that would explain this either. I discovered variance functions and Bartlett’s identities along the way, but this still fell far short of explaining the relation I was documenting. The data I am dealing with are all normally distributed.

My paper was rejected, and one of the main grounds for rejection given by the editor was that the log-linear relation between means and standard deviations I had observed is "inherently true of more or less normally distributed sets of data".

Can someone please explain to me what the editor is talking about? Does a positive log-linear relation trivially exist between the means and standard deviations of different normally distributed sets of data?



#StackBounty: #hypothesis-testing #variance #sample-size #ab-test A/B test sample size formula confusion

Bounty: 50

I was trying to understand the math behind some commonly used calculators and formulas for A/B tests and it seems that there might be some variations. Ideally, I would like to understand how each of them is derived and under what assumptions.

$$\left(\frac{\Phi_{1-\alpha/2}^{-1}\sqrt{2\bar p(1-\bar p)}+\Phi_{1-\beta}^{-1}\sqrt{p_1(1-p_1)+p_2(1-p_2)}}{|p_2-p_1|}\right)^2$$ where $$\bar p = \frac{p_1 + p_2}{2}$$ as far as I understand.

This is the formula I found here, for example:
https://jeffshow.com/caculate-abtest-required-sample-size.html

It’s in Chinese, but from what I read after using Google Translate, this is exactly that type of calculation.

Honestly speaking, I have no idea where this formula comes from.

In the general case, the formula for the sample size is:

$$n \ge 2 \Big(\frac{\Phi^{-1}(1-\alpha/2)+\Phi^{-1}(1-\beta)}{\Delta/\sigma}\Big)^2$$

Under the assumption of equal variances, we can calculate the pooled variance as:
$$\sigma_p=\sqrt{\dfrac{(n_1-1)s_1^2+(n_2-1)s^2_2}{n_1+n_2-2}}$$, which can be simplified given that $n_1 = n_2$ as $\sigma_p = \sqrt{\frac{p_1\cdot(1-p_1) + p_2 \cdot (1-p_2)}{2}}$.

Plugging that into the formula for the general case, we get:

$$n \geq \frac{p_1\cdot(1-p_1) + p_2 \cdot (1-p_2)}{(p_2-p_1)^2} \cdot (z_{\frac{\alpha}{2}} + z_{\beta})^2$$

The results, however, are not the same when using these two formulas. As far as I understand, the famous Evan Miller calculator is based on the first formula.

For $p_1=0.5$ and $p_2=0.6$, my formula gives $n=384.16$ whereas this calculator reports $390$ as the answer.
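To see the two formulas side by side, here is a minimal sketch that evaluates both for $p_1=0.5$ and $p_2=0.6$. The significance level and power are assumptions ($\alpha = 0.05$, $\beta = 0.20$); the post does not state them, but they are consistent with the quoted $384.16$. As far as I can tell, the first formula uses the averaged proportion $\bar p$ for the critical-value term and the two separate variances for the power term, while the second uses the separate variances in both terms.

```r
# Assumed settings (not stated in the post): alpha = 0.05, power = 0.80
alpha <- 0.05
beta  <- 0.20
p1 <- 0.5
p2 <- 0.6

z_a   <- qnorm(1 - alpha / 2)
z_b   <- qnorm(1 - beta)
p_bar <- (p1 + p2) / 2

# First formula: averaged proportion in the z_alpha term,
# separate variances in the z_beta term
n_first <- ((z_a * sqrt(2 * p_bar * (1 - p_bar)) +
             z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) / abs(p2 - p1))^2

# Second formula: separate variances in both terms
n_second <- (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1)^2 * (z_a + z_b)^2

print(c(n_first = n_first, n_second = n_second))
```

With the z-values rounded to $1.96$ and $0.84$, the second formula gives $49 \times 2.8^2 = 384.16$, the figure quoted above; unrounded quantiles move it slightly.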

Can you please clarify it a bit? I believe there might be some assumptions or, even worse, something really trivial I fail to take into consideration.

Thanks!



#StackBounty: #confidence-interval #variance #relative-risk Confidence interval for relative risk with uncertain incidences

Bounty: 50

The usual suggestion for computing a confidence interval for a relative risk estimate is to start from the variance:

$$
\text{CI}_R = R \pm z \cdot \exp \sqrt{\sigma^2_{\log R}}
$$

where $R$ is the risk estimate, $z$ is the z-score for the desired confidence level, and
$$
\sigma^2_{\log R} = \frac{|I| - A_I}{|I| \cdot A_I} + \frac{|C| - A_C}{|C| \cdot A_C}
$$

where $|I|$ and $|C|$ are the sizes of the intervention and control groups, and $A_I$ and $A_C$ are the number of affected individuals in each group.
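For comparison, a commonly used construction builds the interval on the log scale and exponentiates the endpoints, i.e. $\exp(\log R \pm z\,\sigma_{\log R})$, which is not the same expression as the one written above. A minimal sketch with made-up counts (the group sizes and affected counts are hypothetical):

```r
# Hypothetical counts, for illustration only
n_I <- 1000; a_I <- 30    # intervention group size and affected count
n_C <- 1000; a_C <- 50    # control group size and affected count

# Relative risk and the variance of its logarithm
R         <- (a_I / n_I) / (a_C / n_C)
var_log_R <- (n_I - a_I) / (n_I * a_I) + (n_C - a_C) / (n_C * a_C)

# Interval built on the log scale, then exponentiated
z  <- qnorm(0.975)                      # 95% confidence level
ci <- exp(log(R) + c(-1, 1) * z * sqrt(var_log_R))

print(c(R = R, lower = ci[1], upper = ci[2]))
```

The resulting interval is asymmetric around $R$ on the original scale, which is expected for a ratio.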

But suppose that $A_I$ and $A_C$ themselves are uncertain (e.g. because they are predictions of a model). We have values for $\sigma^2_{A_I}$ and $\sigma^2_{A_C}$ and we want to incorporate those uncertainties into the confidence interval. Assuming incidence rates in each group are independent, we can compute the variance of the variance as

$$
\sigma^2[\sigma^2_{\log R}] = \frac{\partial^2 \sigma^2_{\log R}}{\partial\, A_I^{\,2}} \sigma^2_{A_I} + \frac{\partial^2 \sigma^2_{\log R}}{\partial\, A_C^{\,2}} \sigma^2_{A_C} = \frac{2 \sigma^2_{A_I}}{A_I^{\,3}} + \frac{2 \sigma^2_{A_C}}{A_C^{\,3}}
$$

And this is where I get stuck: I don’t know how to combine $\sigma^2_{\log R}$ with $\sigma^2[\sigma^2_{\log R}]$ to get something to put into the original equation.

Any advice?



#StackBounty: #variance #effect-size #threshold #subset Is there a name for the increase in variance upon remeasurement after subsettin…

Bounty: 50

Context: My problem relates to estimating effect sizes, such as Cohen’s d, when looking at a subset of the population defined by a cut-off threshold. This effect size is the difference in two population means divided by the (assumed equal) population standard deviation.

Suppose there is a sample from a population with a variable $Y$ with "true" values $Y_{i0}$ that will be measured with error at two time points, $t_1$ and $t_2$, giving measurements $Y_{i1} = Y_{i0} + \epsilon_{i1}$ and $Y_{i2} = Y_{i0} + \epsilon_{i2}$. At time $t_1$ we define a subset $J$ of the population by "$i \in J$ if $Y_{i1} > a$" for some fixed $a$. The objective is to estimate the variance of the subset at $t_2$, $V[Y_{j2} \mid j \in J]$ (or equivalently, the variance of $Y$ in the subset measured at any time other than $t_1$). We cannot use the subset’s estimated variance at $t_1$ because the variance at $t_2$ will be larger.

Example code showing that the standard deviation of the subset at $t_2$ is greater than the standard deviation at $t_1$.

set.seed(1)
N <- 1000
Y0 <- rnorm(N,mean=0,sd=1)
Y1 <- Y0 + rnorm(N,mean=0,sd=0.5)
Y2 <- Y0 + rnorm(N,mean=0,sd=0.5)
indx <- Y1 > 1
sd(Y1[indx])
# [1] 0.6007802
sd(Y2[indx])
# [1] 0.8145581
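One way to see where the two standard deviations differ is to decompose both conditional variances. This is a sketch under the assumption, as in the simulation above, that $\epsilon_{i2}$ is independent of both $Y_{i0}$ and $\epsilon_{i1}$, so that selecting on $Y_{i1}$ tells us nothing about $\epsilon_{i2}$:

$$\begin{aligned}
V[Y_{j2} \mid j \in J] &= V[Y_{j0} \mid j \in J] + V[\epsilon_{j2}] \\[6pt]
V[Y_{j1} \mid j \in J] &= V[Y_{j0} \mid j \in J] + V[\epsilon_{j1} \mid j \in J] + 2\,\text{Cov}[Y_{j0}, \epsilon_{j1} \mid j \in J]
\end{aligned}$$

The re-measurement carries the full, unselected error variance, while at $t_1$ the error term is itself affected by the selection, through both its conditional variance and its conditional covariance with the true value.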

Does this phenomenon, the variance of a thresholded subset increasing upon re-measurement, have a name? Can anyone share any references to help understand it either generally or in the specific context of effect sizes?



#StackBounty: #variance #random-variable How does the sample variance change if you now take sets of five observations from the origina…

Bounty: 50

Suppose we have measured the weight of 338 silver pennies. The mean weight is 15.722 grams and the variance is 1.999 squared grams.

What would happen to the mean and variance if each observation was now a set of five randomly selected observations from the original dataset? That is: $y_i = \sum_{j=1}^{5} x_j$.

In theory, $\sum_{i=1}^{n}x_i$ does not change. The only difference is that you now have one observation for every five observations in the original calculation. Thus, $\sum x_i$ is now divided by $\frac{338}{5}$, which is equivalent to $5 \bar{x}$.

What would happen to the variance though? At first, I thought it would be $5^2 V(X)$, but the correct answer is $5V(X)$. Why is this so?
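The two candidate answers correspond to two different random variables. As a sketch, assuming the five observations in each set are independent draws with common variance $V(X)$:

$$\begin{aligned}
E[Y] &= \sum_{j=1}^{5} E[X_j] = 5 \times 15.722 = 78.61 \\[6pt]
V(Y) &= \sum_{j=1}^{5} V(X_j) = 5\,V(X) = 5 \times 1.999 = 9.995 \\[6pt]
V(5X) &= 5^2\,V(X) = 25\,V(X)
\end{aligned}$$

The factor $5^2$ applies to a single observation multiplied by five, where the five "copies" are perfectly correlated. For a sum of independent observations the covariance terms vanish and only the five variances add.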

PS: This is exercise 5.2 from Principles of Statistics by M.G. Bulmer.

