#StackBounty: #regression #mathematical-statistics #multivariate-analysis #least-squares #covariance Robust Covariance in Multivariate …

Bounty: 50

Assume we are in the OLS setting with $y = X\beta + \epsilon$, where $y$ is a response vector and $X$ is a matrix of covariates. We can get two types of covariance estimates:

The homoskedastic covariance
$\operatorname{cov}(\hat{\beta}) = (X'X)^{-1} (e'e)$, and the robust covariance
$\operatorname{cov}(\hat{\beta}) = (X'X)^{-1} X' \operatorname{diag}(e^2) X (X'X)^{-1}$.

I’m looking for help on how to derive these covariances when $Y$ is a response matrix, and $E$ is a residual matrix. There is a fairly detailed derivation on slide 49 here, but I think there are some steps missing.

For the homoskedastic case, each column of $E$ is assumed to have a covariance structure of $\sigma_{kk} I$, which is the usual structure for a single vector response. Each row of $E$ is also assumed to be i.i.d. with covariance $\Sigma$.

The derivation starts with collapsing the $Y$ and $E$ matrices back into vectors. In this structure, $\operatorname{Var}(\operatorname{vec}(E)) = \Sigma \otimes I$.

First question: I understand that the Kronecker product produces a block-diagonal matrix with $\Sigma$ on the block diagonal, but where did $\sigma_{kk}$ go? Is it intentional that the $\sigma_{kk}$ values are pooled together so that the covariance is constant on the diagonal, similar to the vector response case?

Using $\Sigma \otimes I$, the author gives a derivation for $\operatorname{cov}(\hat{\beta})$ on slide 66.

$$
\begin{aligned}
\operatorname{cov}(\hat{\beta}) &= ((X'X)^{-1} X' \otimes I)\,(I \otimes \Sigma)\,(X (X'X)^{-1} \otimes I) \\
&= (X'X)^{-1} \otimes \Sigma.
\end{aligned}
$$

The first line looks like a standard sandwich estimator. The second line is an elegant reduction that follows from the identity matrix and the mixed-product property of the Kronecker product.
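
For what it's worth, the reduction on slide 66 can be checked numerically. Below is a minimal R sketch (my own, not from the slides) that builds the sandwich with the meat taken as $I \otimes \Sigma$, exactly as displayed above; the Kronecker ordering depends on how $\operatorname{vec}$ stacks $E$, which is also part of the first question.

set.seed(1)
n <- 50; p <- 3; m <- 2                      # observations, predictors, responses
X <- cbind(1, matrix(rnorm(n * (p - 1)), n))
Sigma <- matrix(c(1, 0.4, 0.4, 2), m, m)     # error covariance across responses
E <- matrix(rnorm(n * m), n) %*% chol(Sigma) # rows i.i.d. with covariance Sigma
Y <- X %*% matrix(1, p, m) + E

XtXinv <- solve(crossprod(X))
A <- kronecker(XtXinv %*% t(X), diag(m))     # ((X'X)^{-1} X') (x) I
meat <- kronecker(diag(n), Sigma)            # I (x) Sigma, as on slide 66
lhs <- A %*% meat %*% t(A)
rhs <- kronecker(XtXinv, Sigma)              # (X'X)^{-1} (x) Sigma
max(abs(lhs - rhs))                          # ~ 0: the reduction holds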

Second question: What is the extension for robust covariances?
I imagine we need to revisit the meat of the sandwich estimator, $(I \otimes \Sigma)$, which comes from the homoskedastic assumption for each response in the $Y$ matrix. If we use robust covariances, we should say that each column of $E$ has variance $\operatorname{diag}(e_k^2)$. We can retain the second assumption that the rows of $E$ are i.i.d. Since the different columns of $E$ no longer have covariance of the form $\text{scalar} \times I$, I don't believe $\operatorname{Var}(\operatorname{vec}(E))$ factors into a Kronecker product as it did before. Perhaps $\operatorname{Var}(\operatorname{vec}(E))$ is some diagonal matrix, $D$?

Revisiting the sandwich-like estimator, is the extension for robust covariance

$$
\begin{aligned}
\operatorname{cov}(\hat{\beta}) &= ((X'X)^{-1} X' \otimes I)\, D \,(X (X'X)^{-1} \otimes I) \\
&= \ ?
\end{aligned}
$$

This product doesn't seem to reduce; we cannot invoke the mixed-product property because $D$ does not take the form of a scalar multiple of $I$.

The first question is connected to this second question. In the first question, on homoskedastic variances, $\sigma_{kk}$ disappeared, allowing $\operatorname{Var}(\operatorname{vec}(E))$ to take the form $\Sigma \otimes I$. But if the diagonal of $\operatorname{Var}(\operatorname{vec}(E))$ were not constant, it would actually have the same structure as in the robust covariance case ($\operatorname{Var}(\operatorname{vec}(E))$ is some diagonal matrix $D$). So, what allowed $\sigma_{kk}$ to disappear, and is there a similar trick for the robust case that would allow the $D$ matrix to factor?
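
For the robust case, here is a continuation of the sketch above (reusing X, Y, A, XtXinv, n and m from it) that builds one candidate for the meat: a block-diagonal matrix with the residual outer product $e_i e_i'$ of each row of $E$ on the diagonal, which is what independent but not identically distributed rows would suggest. This is my own sketch, not the slides' derivation; it only illustrates that the sandwich can be evaluated directly even though it does not collapse into a single Kronecker product.

Ehat <- Y - X %*% (XtXinv %*% crossprod(X, Y))   # residual matrix E-hat
D <- matrix(0, n * m, n * m)                     # block-diagonal "robust" meat
for (i in 1:n) {
  idx <- ((i - 1) * m + 1):(i * m)
  D[idx, idx] <- tcrossprod(Ehat[i, ])           # e_i e_i' for row i
}
robust1 <- A %*% D %*% t(A)                      # sandwich evaluated directly

# Equivalent sum form: sum_i [ (X'X)^{-1} x_i x_i' (X'X)^{-1} ] (x) [ e_i e_i' ]
robust2 <- Reduce(`+`, lapply(1:n, function(i) {
  kronecker(XtXinv %*% tcrossprod(X[i, ]) %*% XtXinv, tcrossprod(Ehat[i, ]))
}))
max(abs(robust1 - robust2))                      # ~ 0

So while this $D$ does not factor as a single Kronecker product, the sandwich still reduces to a sum of per-observation Kronecker terms.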

Thank you for your help.



#StackBounty: #regression #bayesian #covariance Interpretation of multiple regressions posterior distribution

Bounty: 50

I'm interested in how we evaluate the performance of Bayesian regression (linear, multiple, logistic, etc.). The posterior distribution will capture the relative likelihood of any parameter combination, so a 2D heatmap, for example of B1 and B2 (coefficients), might give us some insight into their relationship.

Recently, a colleague of mine mentioned that the posterior's covariance matrix is effectively "all you need." I want to ask: is this oversimplifying the matter, and (even if so) what does the posterior covariance matrix tell you?

My guesses are:

(1) Along the diagonal you get each parameter's variance. The lower the number, the more confidence we have in the estimate, whereas high variance might indicate that we're less confident in our estimate.

(2) Covariance between parameters might be trickier to interpret. The direction (+/-) of the covariance might give an indication of the nature of the relationship (is an increase in one parameter associated with an increase, decrease or neither in the other.)

(3) The magnitude of the covariance gives me pause. Does a small value imply high confidence in the relationship or little to no association? (Very different meanings!)

(4) I can imagine a situation where the variance of B1 is quite small, so perhaps we’re confident in the estimate, whereas the variance of B2 might be rather large, so less confident. I’m not sure how this would affect our understanding of covariance direction and magnitude.

*All the above assumes proper analysis, no multicollinearity, collider bias, etc.
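
On point (3) in particular, here is a tiny illustration (my own, with made-up numbers): the raw magnitude of a posterior covariance mixes the scales of the two parameters with their association, and converting it to a correlation separates the two.

Sigma_post <- matrix(c(0.04, 0.018,
                       0.018, 0.09), 2, 2,
                     dimnames = list(c("B1", "B2"), c("B1", "B2")))
sqrt(diag(Sigma_post))   # marginal posterior sds (0.2 and 0.3): relates to point (1)
cov2cor(Sigma_post)      # correlation 0.3: direction and strength, free of scale

Here the covariance 0.018 corresponds to a correlation of 0.3 once the parameter scales are removed.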

Any thoughts?



#StackBounty: #time-series #autocorrelation #covariance #stochastic-processes #brownian Time-series Auto-Covariance vs. Stochastic Proc…

Bounty: 50

My background is more on the stochastic processes side, and I am new to time series analysis. I would like to ask about estimating a time-series auto-covariance:

$$ \lambda(u) := \frac{1}{T}\sum_{t}(Y_{t+u}-\bar{Y})(Y_{t}-\bar{Y}), $$ where $T$ is the number of observations.

When I think of the covariance of standard Brownian motion $W(t)$ with itself, i.e. $\operatorname{Cov}(W_s,W_t)=\min(s,t)$, the way I interpret the covariance is as follows: since $\mathbb{E}[W_s \mid W_0]=\mathbb{E}[W_t \mid W_0]=0$, the covariance is a measure of how "often" one would "expect" a specific Brownian motion path at time $s$ to be on the same side of the x-axis as the same Brownian motion path at time $t$.

It's perhaps easier to think of correlation rather than covariance, since $\operatorname{Corr}(W_s,W_t)=\frac{\min(s,t)}{\sqrt{s}\,\sqrt{t}}$: with the correlation, one can see that the closer $s$ and $t$ are together, the closer the correlation should get to 1, as indeed one would expect intuitively.

The main point here is that at each time $s$ and $t$, the Brownian motion will have a distribution of paths: so if I were to "estimate" the covariance from sampling, I’d want to simulate many paths (or observe many paths), and then I would fix $t$ and $s=t-h$ ($h$ can be negative), and I would compute:

$$ \lambda(s,t) := \frac{1}{N}\sum_{i=1}^{N}(W_{i,t}-\bar{W}_i)(W_{i,t-h}-\bar{W}_i), $$

where $i$ indexes the $N$ simulated (or observed) Brownian paths.

With the time-series approach, it seems to be the case that we "generate" just one path (or observe just one path) and then estimate the auto-covariance from just that one path by shifting through time.

Hopefully I am making my point clear: my question is on the intuitive interpretation of the estimation methods.
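
To make the contrast concrete, here is a small R sketch (my own) of the two estimation schemes as described above: an ensemble estimate across many simulated Brownian paths at fixed times, versus a single-path estimate obtained by shifting through time.

set.seed(1)
N <- 5000; steps <- 1000; dt <- 1 / steps
incr <- matrix(rnorm(N * steps, sd = sqrt(dt)), N, steps)   # Brownian increments
W <- t(apply(incr, 1, cumsum))                              # N paths, one per row
s <- 0.3; t_ <- 0.7
cov(W[, round(s * steps)], W[, round(t_ * steps)])   # across paths: ~ min(s, t) = 0.3

# Time-series scheme: a single path, auto-covariance estimated by lagging through time
acf(W[1, ], lag.max = 50, type = "covariance", plot = FALSE)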



#StackBounty: #regression #econometrics #covariance #residuals #covariance-matrix Covariance matrix of the residuals in the linear regr…

Bounty: 50

I estimate the linear regression model:

$Y = X\beta + \varepsilon$

where $Y$ is an ($n \times 1$) dependent variable vector, $X$ is an ($n \times p$) matrix of independent variables, $\beta$ is a ($p \times 1$) vector of regression coefficients, and $\varepsilon$ is an ($n \times 1$) vector of random errors.

I want to estimate the covariance matrix of the residuals. To do so I use the following formula:

$\operatorname{Cov}(\varepsilon) = \sigma^2 (I-H)$

where I estimate $\sigma^2$ with $\hat{\sigma}^2 = \frac{e'e}{n-p}$, and where $I$ is an identity matrix and $H = X(X'X)^{-1}X'$ is the hat matrix.

However, in some sources I have seen the covariance matrix of the residuals estimated in another way.
The residuals are assumed to follow an $AR(1)$ process:

$\varepsilon_t = \rho \varepsilon_{t-1} + \eta_t$

where $E(\eta) = 0$ and $\operatorname{Var}(\eta) = \sigma^2_{0}I$.

The covariance matrix is estimated as follows

$\operatorname{Cov}(\varepsilon) = \sigma^2 \begin{bmatrix}
1 & \rho & \rho^2 & \dots & \rho^{n-1}\\
\rho & 1 & \rho & \dots & \rho^{n-2} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\rho^{n-1} & \rho^{n-2} & \dots & \dots & 1
\end{bmatrix}$

where $\sigma^2 = \frac{\sigma^2_0}{1-\rho^2}$.

My question is: are these two different specifications of the covariance matrix of the residuals, or are they somehow connected with each other?
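
To make the two objects concrete, here is a small R sketch (my own, on simulated data and with an arbitrary $\rho$) that builds both matrices side by side:

set.seed(1)
n <- 20; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n))
y <- X %*% c(1, 2, -1) + rnorm(n)

H <- X %*% solve(crossprod(X)) %*% t(X)          # hat matrix X (X'X)^{-1} X'
e <- (diag(n) - H) %*% y                         # residuals
sigma2_hat <- sum(e^2) / (n - p)
cov_resid <- sigma2_hat * (diag(n) - H)          # first specification: sigma^2 (I - H)

rho <- 0.5; sigma2_0 <- 1
cov_ar1 <- sigma2_0 / (1 - rho^2) * toeplitz(rho^(0:(n - 1)))   # second: AR(1) errors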



#StackBounty: #covariance #covariance-matrix #multivariate-normal #fisher-information #geometry What is the geometric relationship betw…

Bounty: 50

The covariance matrix represents the dispersion of the data points, while the inverse of the covariance matrix represents the tightness of the data points. How are dispersion and tightness related geometrically?

For example, the determinant of the covariance matrix represents the volume of the dispersion of data points. What does the determinant of the inverse of the covariance matrix represent? The determinant is related to volume, but I don’t understand how to interpret the volume of the inverse of the covariance matrix (or the volume of the information matrix).

Similarly, the trace roughly represents the mean squared error of the data points, but what does the trace of the inverse of the covariance matrix represent?

I don’t quite understand how to interpret the inverse of the covariance matrix geometrically, or how it is related to the covariance matrix.
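
As a small numerical illustration (my own) of how the two are tied together: the covariance matrix and its inverse share eigenvectors, their eigenvalues are reciprocals (directions of large dispersion are directions of low "tightness"), and therefore their determinants are reciprocals as well.

Sigma <- matrix(c(4, 1.5,
                  1.5, 1), 2, 2)    # covariance of some 2D data
P <- solve(Sigma)                   # inverse covariance (information/precision)

eigen(Sigma)$values                 # squared semi-axes of the 1-sd dispersion ellipse
rev(1 / eigen(P)$values)            # the same values, recovered from the inverse
det(Sigma) * det(P)                 # exactly 1: the "volumes" are reciprocal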



#StackBounty: #correlation #mixed-model #covariance #multilevel-analysis #non-independent Mixed Effects Model (3 level model?)

Bounty: 100

Consider the following problem. The dataset that I am considering has $n=1800$ units (high-end copying machines). Label the units $i = 1,\dots,n$. Unit $i$ has $n_i$ recordings. It is of interest to model the use-rate for these copying machines. All machines are in the same building.

The following linear mixed effects model is used:

\begin{equation}
\begin{aligned}
X_i(t_{ij}) &= m_i(t) + \varepsilon_{ij} \\
&= \eta + z_i(t_{ij})w_i + \varepsilon_{ij},
\end{aligned}
\end{equation}

where $\eta$ is the mean, $z_i(t_{ij}) = [1, \log(t_{ij})]$, $w_i = (w_{0i}, w_{1i})^\top \sim N(0,\Sigma_w)$, $\varepsilon_{ij} \sim N(0, \sigma^2)$, and

\begin{equation}
\Sigma_w =
\begin{pmatrix}
\sigma^2_1 & \rho\sigma_1\sigma_2 \\
\rho\sigma_1\sigma_2 & \sigma^2_2
\end{pmatrix}.
\end{equation}

I can write this model in matrix form. More specifically, I have the model (I write this out for a reason)

\begin{equation}
X = 1\eta + Zw + \varepsilon,
\end{equation}

where

\begin{equation}
X =
\begin{pmatrix}
X_1\\
\vdots \\
X_n
\end{pmatrix} \in \mathbb{R}^N,
\quad
\varepsilon =
\begin{pmatrix}
\varepsilon_1\\
\vdots \\
\varepsilon_n
\end{pmatrix} \in \mathbb{R}^N,
\quad
1 =
\begin{pmatrix}
1_{n_1}\\
\vdots \\
1_{n_n}
\end{pmatrix} \in \mathbb{R}^{N \times p},
\quad
w =
\begin{pmatrix}
w_1\\
\vdots \\
w_n
\end{pmatrix} \in \mathbb{R}^{2n},
\end{equation}

where $N = \sum_{i=1}^n n_i$. In addition,

\begin{equation}
Z =
\begin{pmatrix}
Z_1 & 0_{n_1 \times 2} & \dots & 0_{n_1 \times 2} \\
0_{n_2 \times 2} & Z_2 & \dots & 0_{n_2 \times 2} \\
\vdots & & \ddots & \vdots \\
0_{n_n \times 2} & \dots & & Z_n
\end{pmatrix} \in \mathbb{R}^{N \times 2n},
\quad
0_{n_i \times 2} =
\begin{pmatrix}
0 & 0 \\
\vdots & \vdots \\
0 & 0
\end{pmatrix} \in \mathbb{R}^{2 n_i}.
\end{equation}

Furthermore, we have that

\begin{equation}
\begin{bmatrix}
w\\
\varepsilon
\end{bmatrix} \sim
N
\begin{bmatrix}
\begin{pmatrix}
0\\
0
\end{pmatrix}, &
\sigma^2
\begin{pmatrix}
G(\gamma) & 0 \\
0 & R(\rho)
\end{pmatrix}
\end{bmatrix},
\end{equation}

where $\gamma$ and $\rho$ are $r \times 1$ and $s \times 1$ vectors of unknown variance parameters corresponding to $w$ and $\varepsilon$, respectively. Mathematically,

\begin{equation}
G = \frac{1}{\sigma^2}
\begin{pmatrix}
\Sigma_w & \dots & 0 \\
\vdots & \ddots & \vdots \\
0 & \dots & \Sigma_w
\end{pmatrix} \in \mathbb{R}^{2n \times 2n},
\quad
R =
\begin{pmatrix}
I_{n_1} & \dots & 0 \\
\vdots & \ddots & \vdots \\
0 & \dots & I_{n_n}
\end{pmatrix} \in \mathbb{R}^{N \times N},
\end{equation}

where $w_i \sim N(0, \Sigma_w)$ and $\varepsilon_i \sim N(0, \sigma^2 I_{n_i})$. Here $\gamma = (\sigma_1, \sigma_2, \rho)^\top$ and $\rho = \sigma^2$.
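
As an aside, this specification corresponds to a random intercept and a random $\log(t)$ slope per unit, with an unstructured $2 \times 2$ covariance and i.i.d. errors. A minimal sketch of how it might be fitted (my own, assuming a hypothetical long-format data frame `d` with columns `use` for $X_i(t_{ij})$, `t` for $t_{ij}$, and `unit` for $i$; a fixed $\log(t)$ term is often added as well):

library(lme4)
fit2 <- lmer(use ~ 1 + (1 + log(t) | unit), data = d)   # d is hypothetical
VarCorr(fit2)   # sigma_1, sigma_2, their correlation rho, and the residual sigma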

Imagine I now obtain a dataset for a new building with $n$ units. But now, unit $i$ is in the same room as unit $i+1$ for $i = 1,3,5,\dots, n-1$. How would I model the additional dependence between units in the same room? At first I thought to use the exact same model as above but changing $G$ to

\begin{equation}
G = \frac{1}{\sigma^2}
\begin{pmatrix}
\Sigma_w & \Sigma_{1,2} & \dots & 0 & 0 \\
\Sigma_{1,2} & \Sigma_w & \dots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \dots & \Sigma_w & \Sigma_{1799,1800} \\
0 & 0 & \dots & \Sigma_{1799,1800} & \Sigma_w
\end{pmatrix} \in \mathbb{R}^{2n \times 2n},
\end{equation}

where $\Sigma_{i, i+1}$ is the covariance matrix which models the dependence between units $i$ and $i+1$ for $i = 1,3, \dots, 1799$.

Is this a possible way to model the problem? I guess it would not be possible to use nlm in R to do it but it would be possible using an analytic solution.

What else could be done? I think a three-level hierarchical model (instead of a two-level model) could also work, but I am not sure how to formulate a three-level model.

Any advice from past modelling experience, and on how to write down the three-level model, would be appreciated.
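
For what it's worth, one common way (not the only one) to write down a three-level version is to add a room-level random effect on top of the unit-level one, e.g. $X_i(t_{ij}) = \eta + z_i(t_{ij})\,(v_{r(i)} + w_i) + \varepsilon_{ij}$, where $r(i)$ is the room containing unit $i$ and $v_{r(i)} \sim N(0, \Sigma_v)$ independently of $w_i \sim N(0, \Sigma_w)$; two units in the same room then share $v_{r(i)}$, which induces exactly the kind of within-room dependence described above. A sketch of the corresponding fit, again assuming the hypothetical data frame `d`, which now also carries a `room` column:

library(lme4)
# Random (intercept, log(t)) for each room and for each unit nested within room;
# units sharing a room are correlated through the shared room-level effects.
fit3 <- lmer(use ~ 1 + (1 + log(t) | room) + (1 + log(t) | room:unit), data = d)
VarCorr(fit3)   # room-level and unit-level variance components, plus residual sigma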



#StackBounty: #covariance #covariance-matrix #quantiles #quantile-regression What is the quantile covariance?

Bounty: 50

Suppose that $X$ is a $p$-dimensional random vector and $Y$ is a random scalar. Then, Dodge and Whittaker (2009) indicate that the covariance of these two variables can be formulated as a minimization problem:

\begin{equation}
\text{Cov}(Y,X)^T=\arg\inf_{\alpha, \beta}\,\{\mathbb{E}(Y-\alpha-\beta^T\text{Var}(X)^{-1}[X-\mathbb{E}(X)])^2\}
\end{equation}

And based on this definition of the covariance, they propose a quantile covariance defined for the $\tau$-th quantile as:

\begin{equation}
\text{Cov}_\tau(Y,X)^T=\arg\inf_{\alpha, \beta}\,\{\mathbb{E}\,\rho_\tau(Y-\alpha-\beta^T\text{Var}(X)^{-1}[X-\mathbb{E}(X)])\}
\end{equation}

where $\rho_\tau(\cdot)$ is the check function for quantile regression defined by Koenker and Bassett (1978).

I am trying to understand the way this quantile covariance works, but I am having problems from the very beginning, since it is based on a definition for the covariance that I have never seen before. So my questions are:

  1. How is the covariance between a random scalar and a random vector calculated if the dimensions do not match?

  2. Where is this definition as an optimization problem for the covariance coming from?

  3. Any insights that help understanding the quantile covariance.
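
Regarding question 2, here is a small numerical check (my own, not from the paper): regressing $Y$ on the rescaled predictor $\text{Var}(X)^{-1}[X-\mathbb{E}(X)]$ under squared loss returns the ordinary covariances as the slope coefficients, and swapping the squared loss for the check loss (via the quantreg package, assumed installed) gives the quantile version.

library(quantreg)                       # rq() fits under the check loss
set.seed(1)
n <- 1e4
X <- matrix(rnorm(n * 3), n, 3)
X[, 2] <- X[, 2] + 0.5 * X[, 1]         # make the components correlated
Y <- drop(X %*% c(1, -2, 0.5) + rnorm(n))

Z <- sweep(X, 2, colMeans(X)) %*% solve(cov(X))   # Var(X)^{-1} [X - E(X)], estimated
coef(lm(Y ~ Z))[-1]                     # slope coefficients ...
cov(Y, X)                               # ... equal the sample covariances

coef(rq(Y ~ Z, tau = 0.5))[-1]          # the tau = 0.5 "quantile covariance"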

References:

  • Dodge, Y. and Whittaker, J. (2009). Partial quantile regression. Metrika, 70:35–57.
  • Koenker, R. and Bassett, G. (1978). Regression Quantiles. Econometrica, 46(1):33–50.



#StackBounty: #covariance #gaussian-process #conditional-expectation #prediction-interval Are the conditional expectation values of y a…

Bounty: 50

Suppose $y$ is a Gaussian process given by $y \sim f + \epsilon$, where $\epsilon$ is a Gaussian noise model with zero mean, and $f$ is a deterministic yet unknown mean function (or a Gaussian process independent of $\epsilon$). Therefore, one would find that $\mathbb{E}[y] = \mathbb{E}[f]$ since $\mathbb{E}[\epsilon] = 0$. But my question is: does $\mathbb{E}[\mathbf{y}_b \mid \mathbf{y}_a] = \mathbb{E}[\mathbf{f}_b \mid \mathbf{y}_a]$? Namely, are the conditional means of $\mathbf{f}_b$ and $\mathbf{y}_b$ equivalent?

The reason I ask is because we know $\text{Var}[y] \neq \text{Var}[f]$ and $\text{Var}[\mathbf{y}_b \mid \mathbf{y}_a] \neq \text{Var}[\mathbf{f}_b \mid \mathbf{y}_a]$. Additionally, the covariance matrix of $y$ is given by: $$\Sigma_y(x_1,x_2) = k(x_1,x_2) + \sigma^2(x_1)\,\delta(x_1 - x_2),$$ while the covariance matrix of $f$ is given by (cf. the lines below equations 5.8 or below 2.30): $$\Sigma_f(x_1,x_2) = k(x_1,x_2),$$ i.e. $y$ has an additional (possibly) heteroscedastic noise model, $\sigma$, added along the diagonal of the covariance matrix to represent the variance of the noise, $\epsilon$. But after observing a set of measurements, $\boldsymbol y_a$, at inputs $\boldsymbol x_a$, the conditional mean of $\boldsymbol y_b$ is given by:

$$\mathbb{E}[\mathbf{y}_b \mid \mathbf{y}_a] = \boldsymbol\mu_b+\Sigma_y(\boldsymbol x_b,\boldsymbol x_a)\,\Sigma_y(\boldsymbol x_a,\boldsymbol x_a)^{-1}(\boldsymbol y_a-\boldsymbol\mu_a)$$

but the conditional mean of $\boldsymbol f_b$ is given by:

$$\mathbb{E}[\mathbf{f}_b \mid \mathbf{y}_a] = \boldsymbol\mu_b+\Sigma_f(\boldsymbol x_b,\boldsymbol x_a)\,\Sigma_f(\boldsymbol x_a,\boldsymbol x_a)^{-1}(\boldsymbol y_a-\boldsymbol\mu_a)$$

Therefore, since $\Sigma_y(x_1,x_2)$ does not necessarily equal $\Sigma_f(x_1,x_2)$, is it accurate to state that $\mathbb{E}[\mathbf{y}_b \mid \mathbf{y}_a]$ does not necessarily equal $\mathbb{E}[\mathbf{f}_b \mid \mathbf{y}_a]$?
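
Purely to make the comparison easy to inspect, here is a toy R sketch (my own, assuming a squared-exponential kernel $k$, homoscedastic noise, and zero prior mean) that evaluates the two conditional-mean expressions exactly as they are written above:

set.seed(1)
k <- function(a, b) outer(a, b, function(u, v) exp(-(u - v)^2 / 2))  # SE kernel
xa <- seq(0, 5, by = 1); xb <- c(1.3, 2.7, 4.1)       # training and test inputs
sigma2 <- 0.1
ya <- sin(xa) + rnorm(length(xa), sd = sqrt(sigma2))  # toy noisy observations

Sy_aa <- k(xa, xa) + sigma2 * diag(length(xa))        # Sigma_y(x_a, x_a)
Sf_aa <- k(xa, xa)                                    # Sigma_f(x_a, x_a)
K_ba  <- k(xb, xa)            # cross-covariance; same for y and f when xb is not in xa

mean_y_b <- K_ba %*% solve(Sy_aa, ya)   # E[y_b | y_a] as written above
mean_f_b <- K_ba %*% solve(Sf_aa, ya)   # E[f_b | y_a] as written above
cbind(mean_y_b, mean_f_b)               # compare the two expressions numerically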



#StackBounty: #r #regression #covariance How does R function summary.glm calculate the covariance matrix for glm model?

Bounty: 50

I would like to know how the covariance matrix of estimated coefficients is actually calculated. The code uses QR-decomposition and inversion of some sort. I have an idea that it would go something like this:

$(X'X)^{-1}=[(QR)'QR]^{-1}=(R'R)^{-1}=\Sigma$

Could someone explain the code?

p <- object$rank    
p1 <- 1L:p
Qr <- qr.lm(object)
covmat.unscaled <- chol2inv(Qr$qr[p1, p1, drop = FALSE])
covmat <- dispersion * covmat.unscaled
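
As a side note, the guessed identity can be checked numerically. A small sketch (mine, using a plain model matrix rather than the internal `qr.lm(object)` result): `chol2inv` applied to the $R$ factor of $X = QR$ reproduces $(X'X)^{-1}$, which the code above then scales by the dispersion.

set.seed(1)
X <- cbind(1, rnorm(30), rnorm(30))         # toy model matrix
R <- qr.R(qr(X))                            # upper-triangular R factor of X = QR
max(abs(chol2inv(R) - solve(t(X) %*% X)))   # ~ 0: chol2inv(R) is (R'R)^{-1} = (X'X)^{-1}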



#StackBounty: #variance #covariance #pooling Pooled Covariance Matrix with very different amount of samples per class

Bounty: 50

I have a dataset with 10 classes and want to estimate the covariance. It turns out that, due to numerical stability, it is much better to use a pooled covariance matrix. Suppose I have $N$ samples per class. Then

$$S_{\mathrm{pooled}} = \frac{1}{10 N} \sum_{i=1}^{10} \sum_{j=1}^{N} (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)'$$

I would also like to perform LDA (for dimensionality reduction and later classification) on the dataset, and the computation of the within-class scatter matrix $S_W$ is almost the same up to a scaling factor.

I have very unbalanced classes, with quite different numbers of samples per class. For example, for class 1 I have only $100$ samples, whereas for class 4 I have $5000$ samples!

Let $N_1, N_2, \dots, N_{10}$ denote the number of samples in each class. According to https://en.wikipedia.org/wiki/Pooled_variance, one computes

$$S_{\mathrm{pooled}} = \frac{1}{N_1 + N_2 + \dots + N_{10} - 10} \sum_{i=1}^{10} \sum_{j=1}^{N_i} (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)'$$

For my dataset this does not give a good estimate of the covariance matrix. What did work was to artificially equalize the number of samples per class by re-using data, and then to estimate the covariance with the first equation. For example, for class 1 I artificially increased the number of samples by re-using the 100 samples 50 times, to get 5000 samples for that class, i.e. adding the same data again and again. This seems to remove an apparent bias.

It works quite well, but I have no mathematical explanation or intuition as to why!
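
For what it's worth, the duplication trick can be reproduced on toy data. The sketch below (my own) shows that duplicating each class up to a common size and then pooling with the first formula gives exactly the same matrix as giving every class equal weight, i.e. averaging the per-class covariance estimates; it does not say which weighting is preferable, only that the two procedures coincide.

set.seed(1)
sizes <- c(100, 5000, 250, 1000)             # unbalanced class sizes (toy example)
classes <- lapply(sizes, function(Ni) matrix(rnorm(Ni * 2), Ni, 2))

cov_ml <- function(x) crossprod(sweep(x, 2, colMeans(x))) / nrow(x)

# Equal-weight pooling: the plain average of the per-class covariance estimates
S_equal <- Reduce(`+`, lapply(classes, cov_ml)) / length(classes)

# "Duplicate until balanced", then pool with the first formula
dup <- lapply(classes, function(x) x[rep(1:nrow(x), length.out = max(sizes)), ])
S_dup <- Reduce(`+`, lapply(dup, cov_ml)) / length(dup)

max(abs(S_equal - S_dup))   # ~ 0 when max(sizes) is a multiple of every class size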

