#StackBounty: #maximum-likelihood #least-squares #covariance #uncertainty #hessian Parameter uncertainity in least squares optimization…

Bounty: 50

Given a least squares optimization problem of the form:

$$ C(\lambda) = \sum_i \|y_i - f(x_i, \lambda)\|^2 $$

I have found in multiple questions/answers (e.g. here) that an estimate of the covariance of the parameters can be computed from the rescaled inverse Hessian at the minimum:

$$ \mathrm{cov}(\hat\lambda) = \hat H^{-1} \hat\sigma_r^2 = \hat H^{-1} \frac{\sum_i \|y_i - f(x_i, \hat\lambda)\|^2}{N_{DOF}} $$

While I understand why the covariance is related to the inverse Hessian (Fisher information), I haven't found a derivation or explanation anywhere for the $\hat\sigma_r^2$ term, although it appears reasonable to me on intuitive grounds.

Could anybody explain the need for the rescaling by the residual variance and/or provide a reference?
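For concreteness, here is a minimal numeric sketch of the recipe I mean (the exponential model and data are made up, not my real problem). I write the cost as one half of the sum of squares so that its Hessian at the minimum approximates $J^\top J$ and the formula above applies without an extra factor of two:

# minimal sketch: parameter covariance from the numerical Hessian of a
# least-squares cost, rescaled by the residual variance
# toy model: f(x, lambda) = lambda[1] * exp(lambda[2] * x)
set.seed(1)
x <- seq(0, 2, length.out = 50)
y <- 2.5 * exp(-1.3 * x) + rnorm(length(x), sd = 0.05)

f    <- function(x, lambda) lambda[1] * exp(lambda[2] * x)
cost <- function(lambda) 0.5 * sum((y - f(x, lambda))^2)  # one half of the sum of squares

fit <- optim(c(1, -1), cost, method = "BFGS", hessian = TRUE)

n_dof      <- length(y) - length(fit$par)            # N_DOF = N - number of parameters
sigma2_hat <- sum((y - f(x, fit$par))^2) / n_dof     # residual variance estimate
cov_hat    <- sigma2_hat * solve(fit$hessian)        # cov(lambda_hat) ~ sigma_r^2 * H^{-1}
sqrt(diag(cov_hat))                                  # approximate standard errors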


Get this bounty!!!

#StackBounty: #r #multivariate-analysis #covariance #vegan Covariance and independent variance in multivariate datasets – R

Bounty: 50

I have a dataset which describes field reference sites with different groups of environmental indicators (continuous quantitative data). I'm interested in understanding how a group of parameters describes the total statistical heterogeneity (variance, inertia, …) of the dataset. More specifically, I would like to tackle 2 questions:

  • Question 1: How similarly do the different groups of indicators describe the total heterogeneity of the entire dataset?

  • Question 2: How differently do these (groups of) indicators contribute to the total heterogeneity within my dataset? I.e., for one group of parameters, how much of the total heterogeneity of the entire dataset does its unshared variance explain?

Below I built a reproducible example in R with the environmental part of the "doubs" dataset from the ade4 package as a toy dataset. I think I found solutions to address Question 1 (see the example below), but I'm looking for statistical tools to address Question 2.

library(ade4)
# This data set gives environmental variables, fish species and spatial coordinates for 30 sites.
data("doubs")

# extracting the environmental variables
env_heterogeneity <- doubs$env
head(env_heterogeneity)

# selecting 2 groups of environmental parameters
multivariate_dataset_1 <- env_heterogeneity[,1:4] # physical/morphology parameters
multivariate_dataset_2 <- env_heterogeneity[,5:11] # chemical parameters

# how similar are the two multivariate datasets to each other?
RV.rtest(multivariate_dataset_1,multivariate_dataset_2) 

# RV.rtest(multivariate_dataset_1,multivariate_dataset_2) 
# Monte-Carlo test
# Call: RV.rtest(df1 = multivariate_dataset_1, df2 = multivariate_dataset_2)
# 
# Observation: 0.3940863 
# 
# Based on 99 replicates
# Simulated p-value: 0.01 
# Alternative hypothesis: greater 
# 
# Std.Obs Expectation    Variance 
# 6.578328982 0.043988310 0.002832357 

The RV test from the ade4 package is a multivariate generalization of the Pearson correlation coefficient. It provides a good estimate of the variance shared between multivariate_dataset_1 and multivariate_dataset_2.
I could also use a variance partitioning approach based on redundancy analysis in the vegan package, which below tells me that 71% of the variance of multivariate_dataset_1 can be explained by multivariate_dataset_2:

# how much of the variance of multivariate_dataset_1 can multivariate_dataset_2 explain?
library(vegan)
RDA_1 <- rda(X = multivariate_dataset_1 , Y = multivariate_dataset_2)
summary(RDA_1)

# summary(RDA_1)
# 
# Call:
#   rda(X = multivariate_dataset_1, Y = multivariate_dataset_2) 
# 
# Partitioning of variance:
#   Inertia Proportion
# Total         5300660     1.0000
# Constrained   3786549     0.7144
# Unconstrained 1514111     0.2856

I think I have satisfactory solutions for Q1, but I'm completely in the dark about Q2. As my wording may not be helping, I also drew the Venn diagram below representing the variances and covariance of the datasets. For me, Q1 is about the light grey (shared) area, and Q2 is rather about the pure white or pure grey areas (the variance of dataset 1 minus the covariance between datasets 1 and 2). Don't hesitate to help me improve my wording through comments.

[Venn diagram: the variances of dataset 1 and dataset 2 as two overlapping areas, with their shared covariance as the overlap]
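One direction I have considered for Q2 (only a sketch, and I am not sure it is a statistically sound way to define "unshared" variance) is a partial RDA in vegan, with the other group of parameters supplied as a Condition() term, so that the unconstrained fraction left over is the part of one group's inertia that is not shared with the other group:

library(vegan)

ds1 <- as.matrix(multivariate_dataset_1)
ds2 <- as.matrix(multivariate_dataset_2)

# partial RDA of dataset 1 with dataset 2 partialled out
# (scale = TRUE because the two groups are measured on very different scales)
pRDA_1 <- rda(ds1 ~ Condition(ds2), scale = TRUE)
pRDA_1$pCCA$tot.chi / pRDA_1$tot.chi  # fraction of dataset 1's inertia shared with dataset 2
pRDA_1$CA$tot.chi / pRDA_1$tot.chi    # fraction unique ("unshared") to dataset 1

Comments on whether this kind of partialling makes sense for Q2 would be welcome.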


Get this bounty!!!

#StackBounty: #covariance #intuition Population covariance, are these two formulas equivalent?

Bounty: 50

For the population covariance, you can write it as:

$\sigma_{x,y} = \frac{\sum_i(x_i-\bar{x})(y_i-\bar{y})}{N}$

where $N$ is the population size, or in terms of expected values:

$E[(x-\mu_x)(y-\mu_y)]$

Are these two formulations actually equivalent? If you had the total population of size $N$ and plugged it into equation 1, does that recover the true expectation value?

I am just confused about why the former equation is used to denote the population value if this is so; is it just an intuitive way to formulate the population covariance?
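Here is how far I get when I try to check the equivalence myself, treating the population as a uniform distribution over the $N$ pairs $(x_i, y_i)$ (that is my own assumption about what "population" means here):

$$
\begin{aligned}
E[(x-\mu_x)(y-\mu_y)]
&= \sum_{i=1}^{N} \Pr\{(x,y)=(x_i,y_i)\}\,(x_i-\mu_x)(y_i-\mu_y) \\
&= \frac{1}{N}\sum_{i=1}^{N} (x_i-\bar{x})(y_i-\bar{y}),
\end{aligned}
$$

since over the whole population $\mu_x = \bar{x}$ and $\mu_y = \bar{y}$, so the two expressions seem to agree exactly under that reading.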


Get this bounty!!!

#StackBounty: #anova #covariance #effective-rank ANOVA-like analysis of more complicated covariance-based metrics

Bounty: 50

There is a set of variables that is observed over multiple trials and conditions. In order to explore signal changes across variables and across conditions, one typically uses ANOVA/MANOVA. However, frequently the target of the study is not the changes in the signal itself, but changes in some derived quantity describing the behaviour of the whole set of variables. For example, one commonly used metric is effective rank (eRank). It estimates the dimensionality of the system by counting the number of principal components above a certain threshold. For example, if we have 50 variables, we could compute eRank for some given condition. If that eRank evaluates to, e.g., 20, we could (very loosely speaking) conclude that only 20 of those 50 variables are independent, and the rest are linear combinations of the former to a good approximation.
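For concreteness, here is a small simulated sketch of the kind of metric I mean (note it uses the exponential-of-entropy definition of effective rank rather than the threshold-count variant described above, and the data are made up):

# effective rank as exp of the Shannon entropy of the normalised singular values
effective_rank <- function(X) {
  s <- svd(scale(X, center = TRUE, scale = FALSE))$d
  p <- s / sum(s)
  p <- p[p > 0]
  exp(-sum(p * log(p)))
}

set.seed(1)
# 50 variables driven by one shared component plus small independent noise
X_shared <- matrix(rnorm(200), 200, 1) %*% t(rnorm(50)) +
            0.1 * matrix(rnorm(200 * 50), 200, 50)
# 50 independent variables
X_indep  <- matrix(rnorm(200 * 50), 200, 50)

effective_rank(X_shared)  # small: one dominant principal component
effective_rank(X_indep)   # close to 50: variables are nearly independent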

The question is: how do we evaluate which conditions have the most effect on a metric like eRank? For example, we have different individuals and different treatments.

The naive approach would be to evaluate the metric for each condition combination separately, and then compare the resulting numbers. But this does not work. For example, let's say we evaluate eRank separately for treatment A and treatment B. Imagine a situation where all variables have the same strong reaction to the treatment, but also have a variable-specific random component. Then eRank would be high if evaluated separately for each treatment, but low if evaluated on the concatenation of both conditions, since there will then be one strong principal component due to the reaction to treatment.

Hence, I am looking for an ANOVA-like protocol for an arbitrary covariance-based metric. I imagine it would approximately work as follows:

  1. Evaluate the metric for each combination of conditions and each combination of concatenations of conditions
  2. Use results to estimate the marginal effect of each condition on the metric value

Note: eRank is not the only metric of this kind used in my field. Different functional connectivity estimators, such as correlation, transfer entropy and partial information decomposition, are also common. Thus, I need a protocol that treats the metric as a black-box estimator that can be applied to different partitions of the data.


Get this bounty!!!

#StackBounty: #regression #mathematical-statistics #multivariate-analysis #least-squares #covariance Robust Covariance in Multivariate …

Bounty: 50

Assume we are in the OLS setting with $y = X\beta + \epsilon$. When $y$ is a response vector and $X$ is a matrix of covariates, we can get two types of covariance estimates:

The homoskedastic covariance
$\mathrm{cov}(\hat{\beta}) = (X'X)^{-1} \, e'e/(n-p)$, and the robust (sandwich) covariance
$\mathrm{cov}(\hat{\beta}) = (X'X)^{-1} X' \mathrm{diag}(e^2) X (X'X)^{-1}$.

I’m looking for help on how to derive these covariances when $Y$ is a response matrix, and $E$ is a residual matrix. There is a fairly detailed derivation on slide 49 here, but I think there are some steps missing.

For the homoskedastic case, each column of $E$ is assumed to have a covariance structure of $\sigma_{kk} I$, which is the usual structure for a single vector response. Each row of $E$ is also assumed to be i.i.d. with covariance $\Sigma$.

The derivation starts with collapsing the $Y$ and $E$ matrices back into vectors. In this structure $\mathrm{Var}(\mathrm{vec}(E)) = \Sigma \otimes I$.

First question: I understand that the Kronecker product produces a block-diagonal matrix with $\Sigma$ on the block diagonal, but where did $\sigma_{kk}$ go? Is it intentional that the $\sigma_{kk}$ values are pooled together so that the covariance is constant on the diagonal, similar to the vector response case?

Using $\Sigma \otimes I$, the author gives a derivation for $\mathrm{cov}(\hat{\beta})$ on slide 66.

$$
\begin{aligned}
\mathrm{cov}(\hat{\beta}) &= ((X'X)^{-1} X' \otimes I) (I \otimes \Sigma) (X (X'X)^{-1} \otimes I) \\
&= (X'X)^{-1} \otimes \Sigma.
\end{aligned}
$$

The first line looks like a standard sandwich estimator. The second line is an elegant reduction due to the identity matrices and the mixed-product property of the Kronecker product.
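As a sanity check on my own understanding (this is my simulation, not from the slides), the homoskedastic reduction can be verified numerically. Note that R's c() stacks $\hat{\beta}$ by columns, which puts the factors in the order $\Sigma \otimes (X'X)^{-1}$; the slides' ordering $(X'X)^{-1} \otimes \Sigma$ corresponds to stacking by rows instead.

set.seed(42)
n <- 200; p <- 3; m <- 2
X     <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
B     <- matrix(c(1, 2, -1, 0.5, -0.5, 1), p, m)
Sigma <- matrix(c(1, 0.6, 0.6, 2), m, m)
L     <- chol(Sigma)

reps <- 5000
Bhat_vec <- replicate(reps, {
  E <- matrix(rnorm(n * m), n, m) %*% L      # rows of E are i.i.d. N(0, Sigma)
  Y <- X %*% B + E
  c(solve(crossprod(X), crossprod(X, Y)))    # vec(B_hat), columns stacked
})

emp    <- cov(t(Bhat_vec))                   # Monte-Carlo covariance of vec(B_hat)
theory <- Sigma %x% solve(crossprod(X))      # Sigma kronecker (X'X)^{-1}
max(abs(emp - theory))                       # small (Monte-Carlo error only)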

Second question: What is the extension for robust covariances?
I imagine we need to revisit the meat of the sandwich estimator, $(I \otimes \Sigma)$, which comes from the homoskedastic assumption per response in the $Y$ matrix. If we use robust covariances, we should say that each column of $E$ has variance $\mathrm{diag}(e_k^2)$. We can retain the second assumption that the rows of $E$ are i.i.d. Since the different columns of $E$ no longer follow the pattern $\text{scalar} \cdot I$, I don't believe $\mathrm{Var}(\mathrm{vec}(E))$ factors into a Kronecker product as it did before. Perhaps $\mathrm{Var}(\mathrm{vec}(E))$ is some diagonal matrix, $D$?

Revisiting the sandwich-like estimator, is the extension for robust covariance

$$
\begin{aligned}
\mathrm{cov}(\hat{\beta}) &= ((X'X)^{-1} X' \otimes I)\, D \,(X (X'X)^{-1} \otimes I) \\
&= \;?
\end{aligned}
$$

This product doesn't seem to reduce; we cannot invoke the mixed-product property because $D$ does not take the form of a scalar multiple of $I$.

The first question is connected to this second question. In the first question on homoskedastic variances, $\sigma_{kk}$ disappeared, allowing $\mathrm{Var}(\mathrm{vec}(E))$ to take the form $\Sigma \otimes I$. But if the diagonal of $\mathrm{Var}(\mathrm{vec}(E))$ was not constant, it would actually have the same structure as the robust covariance case ($\mathrm{Var}(\mathrm{vec}(E))$ is some diagonal matrix $D$). So, what allowed $\sigma_{kk}$ to disappear, and is there a similar trick for the robust case that would allow the $D$ matrix to factor?

Thank you for your help.


Get this bounty!!!

#StackBounty: #regression #bayesian #covariance Interpretation of multiple regressions posterior distribution

Bounty: 50

I'm interested in how we evaluate the performance of Bayesian regression (linear, multiple, logistic, etc.). The posterior distribution captures the relative plausibility of any parameter combination, so a 2D heatmap, for example of the coefficients B1 and B2, might give us some insight into their relationship.

Recently, a colleague of mine mentioned that the posterior's covariance matrix is effectively "all you need." I want to ask: is this oversimplifying the matter, and (even if so) what does the posterior covariance matrix tell you?

My guesses are:

(1) Along the diagonal you get each parameter's marginal variance. The lower the number, the more confident we are in the estimate, whereas a high variance might indicate that we're less confident in our estimate.

(2) Covariance between parameters might be trickier to interpret. The direction (+/-) of the covariance might give an indication of the nature of the relationship (is an increase in one parameter associated with an increase, a decrease, or neither in the other?).

(3) The magnitude of the covariance gives me pause. Does a small value imply high confidence in the relationship or little to no association? (Very different meanings!)

(4) I can imagine a situation where the variance of B1 is quite small, so perhaps we’re confident in the estimate, whereas the variance of B2 might be rather large, so less confident. I’m not sure how this would affect our understanding of covariance direction and magnitude.

*All the above assumes proper analysis, no multicollinearity, collider bias, etc.
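Regarding guess (2), here is a small sketch I put together (it assumes a Gaussian likelihood with a known noise standard deviation and a flat prior, so the coefficient posterior is exactly $N(\hat\beta, \sigma^2 (X'X)^{-1})$; the data are simulated):

set.seed(7)
n  <- 100
x1 <- rnorm(n)
x2 <- 0.8 * x1 + sqrt(1 - 0.8^2) * rnorm(n)        # x2 moderately correlated with x1
X  <- cbind(1, x1, x2)
y  <- 1 + 2 * x1 - 1 * x2 + rnorm(n)

sigma2    <- 1                                      # noise variance treated as known
post_mean <- solve(crossprod(X), crossprod(X, y))   # posterior mean (= OLS estimate here)
post_cov  <- sigma2 * solve(crossprod(X))           # posterior covariance of (B0, B1, B2)
cov2cor(post_cov)[2, 3]                             # posterior correlation of B1 and B2: negative

The negative posterior correlation here reflects that the data cannot fully separate the two correlated predictors, which is part of what makes me unsure how to read covariance magnitude on its own.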

Any thoughts?


Get this bounty!!!

#StackBounty: #time-series #autocorrelation #covariance #stochastic-processes #brownian Time-series Auto-Covariance vs. Stochastic Proc…

Bounty: 50

My background is more on the stochastic processes side, and I am new to time series analysis. I would like to ask about estimating the auto-covariance of a time series:

$$ \lambda(u) := \frac{1}{T}\sum_{t}(Y_{t+u}-\bar{Y})(Y_{t}-\bar{Y}) $$

When I think of the covariance of standard Brownian motion $W(t)$ with itself, i.e. $\mathrm{Cov}(W_s,W_t)=\min(s,t)$, the way I interpret the covariance is as follows: since $\mathbb{E}[W_s|W_0]=\mathbb{E}[W_t|W_0]=0$, the covariance is a measure of how "often" one would "expect" a specific Brownian motion path at time $s$ to be on the same side of the x-axis as the same Brownian motion path at time $t$.

It's perhaps easier to think of correlation rather than covariance, since $\mathrm{Corr}(W_s,W_t)=\frac{\min(s,t)}{\sqrt{s}\sqrt{t}}$: with the correlation, one can see that the closer $s$ and $t$ are together, the closer the correlation should get to 1, as indeed one would expect intuitively.

The main point here is that at each time $s$ and $t$, the Brownian motion will have a distribution of paths: so if I were to "estimate" the covariance from sampling, I’d want to simulate many paths (or observe many paths), and then I would fix $t$ and $s=t-h$ ($h$ can be negative), and I would compute:

$$ \lambda(s,t) := \frac{1}{N}\sum_{i=1}^{N}(W_{i,t}-\bar{W}_i)(W_{i,t-h}-\bar{W}_i) $$

where the sum runs over the $N$ simulated (or observed) Brownian paths $i$.

With the time-series approach, it seems to be the case that we "generate" just one path (or observe just one path) and then estimate the auto-covariance from just that one path by shifting through time.

Hopefully I am making my point clear: my question is about the intuitive interpretation of these two estimation methods.
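To make the contrast concrete, this is the small simulation I have in mind (my own sketch): the ensemble estimate uses many simulated paths at two fixed times, while the time-series estimate uses a single path and averages over time.

set.seed(1)
n_steps <- 1000; dt <- 1 / n_steps
n_paths <- 5000

# many Brownian paths as cumulative sums of Gaussian increments
W <- t(apply(matrix(rnorm(n_paths * n_steps, sd = sqrt(dt)), n_paths, n_steps), 1, cumsum))

i_s <- 300; i_t <- 700                      # grid indices for s = 0.3 and t = 0.7
cov(W[, i_s], W[, i_t])                     # ensemble estimate, close to min(s, t) = 0.3

# single-path sample autocovariance at small lags (what acf() computes);
# Brownian motion is not stationary, so this time average is not an estimate
# of Cov(W_s, W_t), which is exactly the contrast I am asking about
acf(W[1, ], lag.max = 5, type = "covariance", plot = FALSE)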


Get this bounty!!!

#StackBounty: #regression #econometrics #covariance #residuals #covariance-matrix Covariance matrix of the residuals in the linear regr…

Bounty: 50

I estimate the linear regression model:

$Y = X\beta + \varepsilon$

where $Y$ is an ($n \times 1$) vector of the dependent variable, $X$ is an ($n \times p$) matrix of independent variables, $\beta$ is a ($p \times 1$) vector of regression coefficients, and $\varepsilon$ is an ($n \times 1$) vector of random errors.

I want to estimate the covariance matrix of the residuals. To do so I use the following formula:

$\mathrm{Cov}(\varepsilon) = \sigma^2 (I-H)$

where I estimate $\sigma^2$ with $\hat{\sigma}^2 = \frac{e'e}{n-p}$, $I$ is the identity matrix, and $H = X(X'X)^{-1}X'$ is the hat matrix.

However, in some sources I have seen the covariance matrix of the residuals estimated in another way.
The residuals are assumed to follow an $AR(1)$ process:

$\varepsilon_t = \rho \varepsilon_{t-1} + \eta_t$

where $E(\eta) = 0$ and $\mathrm{Var}(\eta) = \sigma^2_{0}I$.

The covariance matrix is estimated as follows

$$\mathrm{Cov}(\varepsilon) = \sigma^2 \begin{bmatrix}
1 & \rho & \rho^2 & \dots & \rho^{n-1}\\
\rho & 1 & \rho & \dots & \rho^{n-2} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\rho^{n-1} & \rho^{n-2} & \dots & \rho & 1
\end{bmatrix}$$

where $\sigma^2 = \frac{\sigma^2_0}{1-\rho^2}$ is the stationary variance of the $AR(1)$ process.

My question is: are these two different specifications of the covariance matrix of the residuals, or are they somehow connected to each other?
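To make my confusion concrete, here is a small sketch of the two objects side by side (my own illustration): $\sigma^2(I-H)$ is the covariance of the fitted residuals $e = (I-H)Y$ under spherical errors, while the $AR(1)$ matrix is a model for the covariance of the error terms $\varepsilon$ themselves.

set.seed(1)
n <- 50
X <- cbind(1, rnorm(n))
H <- X %*% solve(crossprod(X)) %*% t(X)                # hat matrix

sigma2 <- 1.5
cov_resid_spherical <- sigma2 * (diag(n) - H)          # Cov(e) under i.i.d. errors

rho <- 0.6
cov_eps_ar1 <- sigma2 * rho^abs(outer(1:n, 1:n, "-"))  # AR(1) Toeplitz covariance of epsilon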


Get this bounty!!!

#StackBounty: #covariance #covariance-matrix #multivariate-normal #fisher-information #geometry What is the geometric relationship betw…

Bounty: 50

The covariance matrix represents the dispersion of the data points, while the inverse of the covariance matrix represents their tightness. How are dispersion and tightness related geometrically?

For example, the determinant of the covariance matrix represents the volume of the dispersion of data points. What does the determinant of the inverse of the covariance matrix represent? The determinant is related to volume, but I don’t understand how to interpret the volume of the inverse of the covariance matrix (or the volume of the information matrix).

Similarly, the trace of the covariance matrix represents something like the mean squared deviation of the data points, but what does the trace of the inverse of the covariance matrix represent?

I don’t quite understand how to interpret the inverse of the covariance matrix geometrically, or how it is related to the covariance matrix.
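Here is a small numeric sketch of the algebraic relationship I do understand (my own toy example), in case it helps frame the geometric question: the inverse shares the eigenvectors of the covariance matrix, with reciprocal eigenvalues, so its determinant and trace are built from the reciprocals of the principal-axis variances.

Sigma <- matrix(c(4, 1.5, 1.5, 1), 2, 2)             # a toy covariance matrix
P     <- solve(Sigma)                                # precision / information matrix

eigen(Sigma)$values                                  # variances along the principal axes
eigen(P)$values                                      # their reciprocals, in reverse order
det(P) * det(Sigma)                                  # equals 1 (up to rounding)
sum(eigen(P)$values) - sum(1 / eigen(Sigma)$values)  # ~ 0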


Get this bounty!!!

#StackBounty: #correlation #mixed-model #covariance #multilevel-analysis #non-independent Mixed Effects Model (3 level model?)

Bounty: 100

Consider the following problem. The dataset that I am considering has $n=1800$ units (high-end copying machines). Label the units $i = 1,\dots,n$. Unit $i$ has $n_i$ recordings. It is of interest to model the use-rate for these copying machines. All machines are in the same building.

The following linear mixed effects model is used:

\begin{equation}
\begin{aligned}
X_i(t_{ij}) &= m_i(t_{ij}) + \varepsilon_{ij} \\
&= \eta + z_i(t_{ij})w_i + \varepsilon_{ij},
\end{aligned}
\end{equation}

where $\eta$ is the mean, $z_i(t_{ij}) = [1, \log(t_{ij})]$, $w_i = (w_{0i}, w_{1i})^\top \sim N(0,\Sigma_w)$, $\varepsilon_{ij} \sim N(0, \sigma^2)$, and

\begin{equation}
\Sigma_w =
\begin{pmatrix}
\sigma^2_1 & \rho\sigma_1\sigma_2 \\
\rho\sigma_1\sigma_2 & \sigma^2_2
\end{pmatrix}.
\end{equation}
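(For concreteness: a model of this form can be fitted with, for example, lme4; the snippet below is only a sketch with made-up column names, not my actual data.)

# assumed long-format data frame "dat" with one row per recording and columns
#   X = recorded use-rate, logt = log(t_ij), unit = machine identifier
library(lme4)
fit2 <- lmer(X ~ 1 + logt + (1 + logt | unit), data = dat)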

I can write this model in matrix form. More specifically, I have the model (I write this out for a reason)

\begin{equation}
X = 1\eta + Zw + \varepsilon,
\end{equation}

where

\begin{equation}
X =
\begin{pmatrix}
X_1\\
\vdots \\
X_n
\end{pmatrix} \in \mathbb{R}^N,
\quad
\varepsilon =
\begin{pmatrix}
\varepsilon_1\\
\vdots \\
\varepsilon_n
\end{pmatrix} \in \mathbb{R}^N,
\quad
1 =
\begin{pmatrix}
1_{n_1}\\
\vdots \\
1_{n_n}
\end{pmatrix} \in \mathbb{R}^{N \times p},
\quad
w =
\begin{pmatrix}
w_1\\
\vdots \\
w_n
\end{pmatrix} \in \mathbb{R}^{2n},
\end{equation}

where $N = \sum_{i=1}^n n_i$. In addition,

\begin{equation}
Z =
\begin{pmatrix}
Z_1 & 0_{n_1 \times 2} & \dots & 0_{n_1 \times 2} \\
0_{n_2 \times 2} & Z_2 & \dots & 0_{n_2 \times 2} \\
\vdots & & \ddots & \vdots \\
0_{n_n \times 2} & \dots & & Z_n
\end{pmatrix} \in \mathbb{R}^{N \times 2n},
\quad
0_{n_i \times 2} =
\begin{pmatrix}
0 & 0 \\
\vdots & \vdots \\
0 & 0
\end{pmatrix} \in \mathbb{R}^{n_i \times 2}.
\end{equation}

Furthermore, we have that

\begin{equation}
\begin{bmatrix}
w\\
\varepsilon
\end{bmatrix} \sim
N\begin{bmatrix}
\begin{pmatrix}
0\\
0
\end{pmatrix}, & \sigma^2
\begin{pmatrix}
G(\gamma) & 0 \\
0 & R(\rho)
\end{pmatrix}
\end{bmatrix},
\end{equation}

where $\gamma$ and $\rho$ are $r \times 1$ and $s \times 1$ vectors of unknown variance parameters corresponding to $w$ and $\varepsilon$, respectively. Mathematically,

\begin{equation}
G = \frac{1}{\sigma^2}
\begin{pmatrix}
\Sigma_w & \dots & 0 \\
\vdots & \ddots & \vdots \\
0 & \dots & \Sigma_w
\end{pmatrix} \in \mathbb{R}^{2n \times 2n},
\quad
R =
\begin{pmatrix}
I_{n_1} & \dots & 0 \\
\vdots & \ddots & \vdots \\
0 & \dots & I_{n_n}
\end{pmatrix} \in \mathbb{R}^{N \times N},
\end{equation}

where $w_i \sim N(0, \Sigma_w)$ and $\varepsilon_i \sim N(0, \sigma^2 I_{n_i})$. Here $\gamma = (\sigma_1, \sigma_2, \rho)^\top$ and $\rho = \sigma^2$.

Imagine I now obtain a dataset for a new building with $n$ units. But now, unit $i$ is in the same room as unit $i+1$ for $i = 1,3,5,\dots, n-1$. How would I model the additional dependence between units in the same room? At first I thought of using the exact same model as above but changing $G$ to

\begin{equation}
G = \frac{1}{\sigma^2}
\begin{pmatrix}
\Sigma_w & \Sigma_{1,2} & \dots & 0 & 0 \\
\Sigma_{1,2} & \Sigma_w & \dots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \dots & \Sigma_w & \Sigma_{1799,1800} \\
0 & 0 & \dots & \Sigma_{1799,1800} & \Sigma_w
\end{pmatrix} \in \mathbb{R}^{2n \times 2n},
\end{equation}

where $\Sigma_{i, i+1}$ is the covariance matrix which models the dependence between units $i$ and $i+1$ for $i = 1,3, \dots, 1799$.

Is this a possible way to model the problem? I guess it would not be possible to use nlm in R to do it but it would be possible using an analytic solution.

What else could be done? I think a three-level hierarchical model (instead of a two-level model) could also work, but I am not sure how to formulate a three-level model.

Any advice based on past modelling experience, and on how to write down the three-level model, would be appreciated.
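For what it is worth, my own rough guess at a three-level formulation (random intercepts and log-time slopes for rooms, with unit-level effects nested inside rooms) would be something like the lme4 sketch below, again with made-up column names; I am not sure this captures the pairing structure correctly, which is part of my question.

# assumed columns: X = use-rate recording, logt = log(t_ij),
#   room = room identifier (shared by the two paired units), unit = machine id
library(lme4)
fit3 <- lmer(X ~ 1 + logt + (1 + logt | room) + (1 + logt | room:unit), data = dat)
summary(fit3)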


Get this bounty!!!