#StackBounty: #r #regression #least-squares #linear How do Lawson and Hanson solve the unconstrained least squares problem?

Bounty: 350

For the non-negative least squares problem min ||b - a %*% x|| subject to x >= 0, the Lawson-Hanson Fortran 77 implementation (fortran code, R package) often gives a different solution than modern methods for the same set of unconstrained variables.

Modern benchmark methods might include multiway::fnnls, sequential coordinate descent NNLS from NNLM, and TNT-NN. Note that Bounded-Variable Least Squares (bvls::bvls) uses the Lawson/Hanson source code and often returns the same results.

Technical background: Non-negative least squares problems consist of two independent subproblems:

  1. determining what variables will be constrained to zero (the active set)
  2. solving the least squares solution for all unconstrained variables

This question deals with the second subproblem. Note, however, that these subproblems are blurred in coordinate descent, which minimizes the error by coordinate-wise updates rather than by explicit active-set selection. In any case, NNLM::nnlm uses gradient descent and is unable to replicate LH77 results.

Functions

library(nnls)
library(multiway)

# my own take on TNT-NN: solve the unconstrained system, then repeatedly
# drop negative coefficients and re-solve on the remaining variables
fastnnls <- function(a, b){
  # a = crossprod(X), b = crossprod(X, y)
  x <- solve(a, b)                        # unconstrained solution
  while(any(x < 0)){
    ind <- which(x > 0)                   # keep only the positive coefficients
    x <- rep(0, length(b))
    x[ind] <- solve(a[ind, ind], b[ind])  # re-solve on the reduced system
  }
  as.vector(x)
}

Reproducible setup

set.seed(123)
X <- matrix(rnorm(2000),100,20)
y <- X %*% runif(20) + rnorm(100)*5
a <- crossprod(X)
b <- as.vector(crossprod(X, y))

Calculate solutions and residuals

x_fnnls <- as.vector(multiway::fnnls(a, b))
x_lh <- nnls::nnls(a, b)$x
x_fastnnls <- fastnnls(a, b)

all.equal(x_fnnls, x_fastnnls)
# [1] TRUE
all.equal(x_fnnls, x_lh)
# [1] "Mean relative difference: 0.08109434"
all.equal(x_fastnnls, x_lh)
# [1] "Mean relative difference: 0.08109434"

Note that multiway::fnnls is theoretically supposed to give the same result as the Lawson-Hanson implementation, but it rarely does. It does find the same active set, yet the solution on that active set is different, and often worse.

Residuals

# using the mean absolute residual

mean(abs(b - as.vector(a %*% x_fnnls)))
# [1] 6.515709
mean(abs(b - as.vector(a %*% x_fastnnls)))
# [1] 6.515709
mean(abs(b - as.vector(a %*% x_lh)))
# [1] 9.086841

In the case of this simulated random example, the error of the Lawson-Hanson algorithm is greater, but I’ve often seen it significantly lower than other methods for the same active set.

So, why is this? What is it about the LH77 solver that gives different results from a modern Cholesky/symmetric-positive-definite solver such as those used in R base::solve, Armadillo arma::solve, etc.? I have tried to read the Fortran code and had to throw in the towel.

Thoughts:

  • Does this come down to the method they use to solve the equations? For instance, do they use QR instead of Cholesky decomposition -> forward -> backward substitution, and would this matter? (See the sketch after this list.)
  • Do LH77 use gradient optimization after finding the final active set, and if so, why are the coordinate descent solvers in NNLM::nnlm unable to match the performance of LH77 in well-determined systems?
  • It appears that sometimes the LH77 active set is actually different as well, and this leads to a slightly different (and almost invariably better) result. Thus, this difference seems to be a feature of their solver entirely.
  • Is there some NNLS-specific method for solving least squares equations that Lawson/Hanson discovered that all follow-up studies failed to replicate, because we like out-of-the-box solvers?
  • Is the true solution of NNLS actually not solvable by coordinate descent, and the best we can get is an approximation while LH77 somehow finds that true minimum?
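To make the first bullet concrete, here is a minimal sketch (my own code, not LH77's) that solves the unconstrained subproblem on one active set in two ways: Cholesky on the normal equations and QR on the original overdetermined system. It reuses X, y, a, b and fastnnls() from the setup above; on a well-conditioned problem like this simulation, the two routes typically agree to near machine precision.

# sketch: unconstrained solve on one active set, two ways
ind <- which(fastnnls(a, b) > 0)          # the unconstrained variables

# (1) Cholesky on the normal equations: a[ind, ind] = R'R
R <- chol(a[ind, ind])
x_chol <- backsolve(R, forwardsolve(t(R), b[ind]))

# (2) QR on the original overdetermined system, avoiding X'X entirely
x_qr <- qr.solve(X[, ind], y)

all.equal(as.vector(x_chol), as.vector(x_qr))
# typically TRUE (to numerical tolerance) on well-conditioned problems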

This is a loaded question, I know, but I’m betting my reputation on it that somebody somewhere around here has a rabbit in their hat.


Get this bounty!!!

#StackBounty: #least-squares #econometrics #consistency Adapting OLS to a parametric regression coefficient

Bounty: 100

Consider the following linear model
$$
Y_i = X_{i1}\beta_1 + \eta_i X_{i2}\beta_2 + \epsilon_i
$$

Let $\beta \equiv (\beta_1,\beta_2)$ and $X_i \equiv (X_{i1}, X_{i2})$.

Assume that

[A1] We have an i.i.d. sample of observations, $\{Y_i, X_{i}\}_{i=1}^n$.

[A2] $E(\epsilon_i | X_{i})=0$.

[A3] $E(\eta_i|X_{i})$ is known by the analyst. Let me denote this expected value as $A(X_{i})$. Note that the function $A(\cdot)$ is known.

[A4] $\begin{pmatrix}
E(X_{i1}^2) & E(X_{i1} X_{i2} A(X_{i1}, X_{i2}))\\
E(X_{i1}X_{i2}) & E(X_{i2}^2 A(X_{i1}, X_{i2}))
\end{pmatrix}$
and its sample analogue are invertible.

My objective is to construct a consistent and asymptotically normal estimator for $\beta$.

This looks similar to the OLS case, with the only exception that $\eta_i$ premultiplies a covariate and is unobserved.

My claim is that, after some minor manipulations, we can still apply the "OLS machinery".

I report here the proof. Is it correct? Is this a well known result and, if yes, could you give me a reference?


Proof

Step 1: Observe that, by [A2], $\beta$ solves the following system of equations
$$
\begin{pmatrix}
E(X_{i1}Y_i)\\
E(X_{i2}Y_i)
\end{pmatrix}=\begin{pmatrix}
E(X_{i1}^2) & E(X_{i1} X_{i2} \eta_i)\\
E(X_{i1}X_{i2}) & E(X_{i2}^2 \eta_i)
\end{pmatrix}\begin{pmatrix}
\beta_1\\
\beta_2
\end{pmatrix}
$$

Step 2: By [A3] combined with the law of iterated expectations,
$$
\begin{pmatrix}
E(X_{i1}^2) & E(X_{i1} X_{i2} \eta_i)\\
E(X_{i1}X_{i2}) & E(X_{i2}^2 \eta_i)
\end{pmatrix}=\begin{pmatrix}
E(X_{i1}^2) & E(X_{i1} X_{i2} A(X_{i1}, X_{i2}))\\
E(X_{i1}X_{i2}) & E(X_{i2}^2 A(X_{i1}, X_{i2}))
\end{pmatrix}
$$

Step 3: By [A4],
$$
\begin{pmatrix}
\beta_1\\
\beta_2
\end{pmatrix}=\begin{pmatrix}
E(X_{i1}^2) & E(X_{i1} X_{i2} A(X_{i1}, X_{i2}))\\
E(X_{i1}X_{i2}) & E(X_{i2}^2 A(X_{i1}, X_{i2}))
\end{pmatrix}^{-1}\begin{pmatrix}
E(X_{i1}Y_i)\\
E(X_{i2}Y_i)
\end{pmatrix}
$$

Step 4: Take the sample analogue of Step 3
$$
\begin{pmatrix}
\hat{\beta}_{1,n}\\
\hat{\beta}_{2,n}
\end{pmatrix}=\begin{pmatrix}
\frac{1}{n}\sum_{i=1}^n X_{i1}^2 & \frac{1}{n}\sum_{i=1}^n X_{i1} X_{i2} A(X_{i1}, X_{i2})\\
\frac{1}{n}\sum_{i=1}^n X_{i1}X_{i2} & \frac{1}{n}\sum_{i=1}^n X_{i2}^2 A(X_{i1}, X_{i2})
\end{pmatrix}^{-1}\begin{pmatrix}
\frac{1}{n}\sum_{i=1}^n X_{i1}Y_i\\
\frac{1}{n}\sum_{i=1}^n X_{i2}Y_i
\end{pmatrix}
$$
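For what it's worth, a minimal R sketch of this plug-in estimator; the function name estimate_beta and the assumption that $A(\cdot)$ is available as an R function A(x1, x2) are mine, purely for illustration.

# x1, x2, y are length-n vectors of observations; A is the known
# function from [A3] (all names here are illustrative)
estimate_beta <- function(x1, x2, y, A) {
  a_i <- A(x1, x2)
  M <- rbind(c(mean(x1^2),    mean(x1 * x2 * a_i)),
             c(mean(x1 * x2), mean(x2^2 * a_i)))
  m <- c(mean(x1 * y), mean(x2 * y))
  solve(M, m)   # c(beta1_hat, beta2_hat)
}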

Step 5: We can now show that this estimator is consistent and asymptotically normal using the classical law of large numbers and central limit theorem, as we do for traditional OLS (under [A1] plus the existence of some moments). I'm not reporting this part here.


Get this bounty!!!

#StackBounty: #regression #least-squares #error #error-propagation Estimating the errors in parameters in the ordinary least square

Bounty: 50

I am reading the book An Introduction to Error Analysis by John R. Taylor. In Ch8: Least-Squares Fitting, he has derived expressions for parameters $A$ and $B$ in fitting the line $A+Bx$ to the set of $N$ observations $(x_i, y_i)$. The maximum likelihood estimates of the parameters are:

$$A = \frac{\sum x^2\sum y-\sum x\sum xy}{\Delta}$$
$$B = \frac{N\sum xy-\sum x\sum y}{\Delta}$$
where
$$\Delta = N\sum x^2 - \left(\sum x\right)^2$$

I have verified these to be correct. However, now he argues that the errors in the estimates of $A$ and $B$ can be found by a simple error propagation method, since the errors in all observations $y_i$ are assumed to be $\sigma_y$ (i.e. all the errors are equal), while the $x_i$ have no errors in them. He then claims that this should give the following errors on $A$ and $B$:

$$\sigma_A = \sigma_y\sqrt{\frac{\sum x^2}{\Delta}}$$ and
$$\sigma_B = \sigma_y\sqrt{\frac{N}{\Delta}}$$

However, my calculations don't match these. The way I calculate the error in $A$, for example, is this: since there are no errors on $x$, all the terms that contain only $x$ remain unaffected (e.g. $\sum x$ and $\Delta$). The uncertainty in $xy$ is $x\sigma_y$, and because all $y$ are distributed normally, I can get the uncertainty in $\sum xy$ by adding these uncertainties in quadrature: $\sqrt{\sum x^2\sigma_y^2}=\sigma_y\sqrt{\sum x^2}$. Similarly, the uncertainty in $\sum x^2\sum y$ is $\sigma_y \sqrt{N}\sum x^2$. Now here I face the first problem: how do we combine the errors in the two terms in the numerator of $A$? Since they are not independent, I feel that I cannot add their errors in quadrature. But adding the errors directly leads me nowhere close to what is given in the book. So let's just add the errors in quadrature to get the uncertainty in $A$ as:

$$\sigma_A = \frac{1}{\Delta}\sqrt{\left(\sigma_y\sqrt{N}\sum x^2\right)^2+\left(\sum x\,\sigma_y\sqrt{\sum x^2}\right)^2}$$
$$\sigma_A = \frac{1}{\Delta}\sigma_y\sqrt{\sum x^2}\sqrt{N\sum x^2 + \left(\sum x\right)^2}$$

Notice the similarity between the square-root expression above and the expression for $\Delta$ given at the start. If it were not for the plus sign inside the square root, the expression would have been exactly equal to $\Delta$, leading to immediate cancellation of $\sqrt{\Delta}$, and we would have been left with the expression for $\sigma_A$ given in the book! However, the plus sign messes up everything. (And this is my second problem!) What am I exactly missing?
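As an aside (this does not answer the propagation question), the book's formulas are easy to sanity-check with a small Monte Carlo in R; all values below are illustrative.

# Monte Carlo check of sigma_A and sigma_B against the book's formulas:
# equal, known errors sigma_y on the y_i only, x_i error-free
set.seed(1)
N <- 20
x <- runif(N, 0, 10)
sigma_y <- 0.5
A_true <- 2; B_true <- 3

fits <- replicate(5000, {
  y <- A_true + B_true * x + rnorm(N, 0, sigma_y)
  coef(lm(y ~ x))                     # (A_hat, B_hat)
})

Delta <- N * sum(x^2) - sum(x)^2
c(sd_A_sim = sd(fits[1, ]), sd_A_formula = sigma_y * sqrt(sum(x^2) / Delta))
c(sd_B_sim = sd(fits[2, ]), sd_B_formula = sigma_y * sqrt(N / Delta))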


Get this bounty!!!

#StackBounty: #regression #least-squares #stepwise-regression How does forward stepwise regression select variables

Bounty: 100

Suppose there are $p = 3$ variables total and suppose the forward stepwise procedure selects the third variable. The forward stepwise procedure will assign it a positive coefficient if and only if the following two conditions are true:
$$X_3^Ty/\|X_3\|_2 \geq \pm X_1^Ty/\|X_1\|_2$$
and
$$X_3^Ty/\|X_3\|_2 \geq \pm X_2^Ty/\|X_2\|_2$$
where $X_j$ is the $j$th column vector of the design matrix, $X \in \mathbb{R}^{N \times p}$, and $y \in \mathbb{R}^N$ is the response vector.

My question is: how does "the 3rd variable minimizes the residual sum error" translate to the above two conditions?

From my understanding, the procedure would select the third variable $X_3$ if $\sum_{i=1}^N (y_i - \hat{\beta}_3X_{3i})^2$ is smaller than $\sum_{i=1}^N (y_i - \hat{\beta}_1X_{1i})^2$ and $\sum_{i=1}^N (y_i - \hat{\beta}_2X_{2i})^2$, where $\hat{\beta}_j = \left(X_j^TX_j\right)^{-1}X_j^Ty$ is the OLS estimate for the $j$th variable. How does this translate to the 2 conditions I've listed above? I think the above conditions are saying the following:

"$hat{beta}_3 = frac{sum_{i=1}^N X_{3i}y_i}{sum_{i=1}^N X_{3i}^2}$ is positive iff $frac{sum_{i=1}^N X_{3i}y_i}{sqrt{sum_{i=1}^N X_{3i}^2}}geq pm frac{sum_{i=1}^N X_{1i}y_i}{sqrt{sum_{i=1}^N X_{1i}^2}}$ and $frac{sum_{i=1}^N X_{3i}y_i}{sqrt{sum_{i=1}^N X_{3i}^2}}geq pm frac{sum_{i=1}^N X_{2i}y_i}{sqrt{sum_{i=1}^N X_{2i}^2}}$"

"$hat{beta}_3 = frac{sum_{i=1}^N X_{3i}y_i}{sum_{i=1}^N X_{3i}^2}$ is negative iff $frac{-sum_{i=1}^N X_{3i}y_i}{sqrt{sum_{i=1}^N X_{3i}^2}}geq pm frac{sum_{i=1}^N X_{1i}y_i}{sqrt{sum_{i=1}^N X_{1i}^2}}$ and $frac{-sum_{i=1}^N X_{3i}y_i}{sqrt{sum_{i=1}^N X_{3i}^2}}geq pm frac{sum_{i=1}^N X_{2i}y_i}{sqrt{sum_{i=1}^N X_{2i}^2}}$"

But why are these two conditions true?
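A quick numerical check of the step being asked about, using the identity $RSS_j = \|y\|_2^2 - (X_j^Ty)^2/\|X_j\|_2^2$ for a single-variable, no-intercept fit: minimizing the one-variable residual sum of squares is the same as maximizing $|X_j^Ty|/\|X_j\|_2$. The data below are simulated purely for illustration.

# check (not a proof) that argmin_j RSS_j equals argmax_j |X_j'y| / ||X_j||_2
set.seed(42)
N <- 50; p <- 3
X <- matrix(rnorm(N * p), N, p)
y <- rnorm(N)

rss <- sapply(1:p, function(j) {
  beta_j <- sum(X[, j] * y) / sum(X[, j]^2)   # OLS estimate on X_j alone
  sum((y - beta_j * X[, j])^2)
})
score <- abs(as.vector(t(X) %*% y)) / sqrt(colSums(X^2))

which.min(rss) == which.max(score)   # TRUE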


Get this bounty!!!

#StackBounty: #regression #mathematical-statistics #multivariate-analysis #least-squares #covariance Robust Covariance in Multivariate …

Bounty: 50

Assume we are in the OLS setting with $y = X\beta + \epsilon$. When $y$ is a response vector, and $X$ are covariates, we can get two types of covariance estimates:

The homoskedastic covariance
$\mathrm{cov}(\hat{\beta}) = (X'X)^{-1} (e'e)$, and the robust covariance
$\mathrm{cov}(\hat{\beta}) = (X'X)^{-1} X' \mathrm{diag}(e^2) X (X'X)^{-1}$.

I’m looking for help on how to derive these covariances when $Y$ is a response matrix, and $E$ is a residual matrix. There is a fairly detailed derivation on slide 49 here, but I think there are some steps missing.

For the homoskedastic case, each column of $E$ is assumed to have a covariance structure of $\sigma_{kk} I$, which is the usual structure for a single vector response. Each row of $E$ is also assumed to be i.i.d. with covariance $\Sigma$.

The derivation starts with collapsing the $Y$ and $E$ matrices back into vectors. In this structure $\mathrm{Var}(\mathrm{Vec}(E)) = \Sigma \otimes I$.

First question: I understand the Kronecker product produces a block diagonal matrix with $\Sigma$ on the block diagonal, but where did $\sigma_{kk}$ go? Is it intentional that the $\sigma_{kk}$ values are pooled together so that the covariance is constant on the diagonal, similar to the vector response case?
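For reference, the object in question can be inspected directly in R (column-stacking vec convention; the numbers are illustrative):

# build Sigma %x% I and look at its diagonal: each sigma_kk appears n times
Sigma <- matrix(c(2.0, 0.3,
                  0.3, 0.5), 2, 2)   # sigma_11 = 2, sigma_22 = 0.5
n <- 3
V <- Sigma %x% diag(n)               # Kronecker product
diag(V)
# [1] 2.0 2.0 2.0 0.5 0.5 0.5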

Using $\Sigma \otimes I$, the author gives a derivation for $\mathrm{cov}(\hat{\beta})$ on slide 66.

$$
\begin{align}
\mathrm{cov}(\hat{\beta}) &= ((X'X)^{-1} X' \otimes I) (I \otimes \Sigma) (X (X'X)^{-1} \otimes I) \\
&= (X'X)^{-1} \otimes \Sigma
\end{align}
$$

The first line looks like a standard sandwich estimator. The second line is an elegant reduction because of the I matrix and properties of the kronecker product.

Second question: What is the extension for robust covariances?
I imagine we need to revisit the meat of the sandwich estimator, $I \otimes \Sigma$, which comes from the homoskedastic assumption per response in the $Y$ matrix. If we use robust covariances, we should say that each column of $E$ has variance $\mathrm{diag}(e_k^2)$. We can retain the second assumption that rows in $E$ are i.i.d. Since the different columns in $E$ no longer follow the pattern $\text{scalar} \times I$, I don't believe $\mathrm{Var}(\mathrm{Vec}(E))$ factors into a Kronecker product as it did before. Perhaps $\mathrm{Var}(\mathrm{Vec}(E))$ is some diagonal matrix, $D$?

Revisiting the sandwich-like estimator, is the extension for robust covariance

$$
\begin{align}
\mathrm{cov}(\hat{\beta}) &= ((X'X)^{-1} X' \otimes I) \, D \, (X (X'X)^{-1} \otimes I) \\
&= \; ?
\end{align}
$$

This product doesn't seem to reduce; we cannot invoke the mixed-product property because $D$ does not take the form of a scalar multiplier on $I$.

The first question is connected to this second question. In the first question on homoskedastic variances, $\sigma_{kk}$ disappeared, allowing $\mathrm{Var}(\mathrm{Vec}(E))$ to take the form $\Sigma \otimes I$. But if the diagonal of $\mathrm{Var}(\mathrm{Vec}(E))$ was not constant, it would actually have the same structure as the robust covariance case ($\mathrm{Var}(\mathrm{Vec}(E))$ is some diagonal matrix $D$). So, what allowed $\sigma_{kk}$ to disappear, and is there a similar trick for the robust case that would allow the $D$ matrix to factor?

Thank you for your help.


Get this bounty!!!

#StackBounty: #least-squares #measurement-error Measurement error in one indep variable in OLS with multiple regression

Bounty: 50

Suppose I regress (with OLS) $y$ on $x_1$ and $x_2$. Suppose I have an i.i.d. sample of size $n$, and that $x_1$ is observed with error but $y$ and $x_2$ are observed without error. What is the probability limit of the estimated coefficient on $x_1$?

Let us suppose for tractability that the measurement error of $x_1$ is "classical". That is, the measurement error is normally distributed with mean 0 and is uncorrelated with $x_2$ and with the error term.
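A small simulation sketch of this setup (it illustrates the attenuation but does not derive the probability limit; all parameter values are illustrative):

# classical measurement error in x1; y and x2 observed exactly
set.seed(1)
n <- 1e5
x2 <- rnorm(n)
x1 <- 0.5 * x2 + rnorm(n)            # true regressor, correlated with x2
y  <- x1 + x2 + rnorm(n)             # true coefficients on x1 and x2 are both 1
x1_obs <- x1 + rnorm(n)              # classical measurement error, mean 0
coef(lm(y ~ x1_obs + x2))            # coefficient on x1_obs is attenuated below 1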


Get this bounty!!!

#StackBounty: #regression #multiple-regression #least-squares #mse What's the MSE of $hat{Y}$ in ordinary least squares using bias…

Bounty: 100

Suppose I have the following model: $$Y = \mu + \epsilon = X\beta + \epsilon,$$ where $Y$ is $n \times 1$, $X$ is $n \times p$, $\beta$ is $p \times 1$, and $\epsilon$ is $n \times 1$. I assume that $\epsilon$ are independent with mean 0 and variance $\sigma^2I$.

In OLS, the fitted values are $\hat{Y} = HY$, where $H = X(X^TX)^{-1}X^T$ is the $N \times N$ hat matrix. I want to find the MSE of $\hat{Y}$.

By the bias-variance decomposition, I know that

\begin{align}
MSE(\hat{Y}) &= \mathrm{bias}^2(\hat{Y}) + \mathrm{var}(\hat{Y})\\
&= (E[HY] - \mu)^T(E[HY] - \mu) + \mathrm{var}(HY)\\
&= (H\mu - \mu)^T(H\mu - \mu) + \sigma^2H\\
&= 0 + \sigma^2H
\end{align}

I'm confused by the dimension in the last step. The $\mathrm{bias}^2$ term is a scalar. However, $\mathrm{var}(\hat{Y})$ is an $N \times N$ matrix. How can one add a scalar to an $N \times N$ matrix where $N \neq 1$?
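As a numerical aside (it does not resolve the dimension question), the $\sigma^2H$ variance term can be checked by simulation:

# Monte Carlo check that var(Y_hat) = sigma^2 * H (small illustrative setup)
set.seed(1)
n <- 6; sigma <- 1.5
X <- cbind(1, rnorm(n))
H <- X %*% solve(crossprod(X)) %*% t(X)        # hat matrix
mu <- as.vector(X %*% c(1, 2))                 # true mean X beta
Yhat <- replicate(20000, as.vector(H %*% (mu + rnorm(n, sd = sigma))))
max(abs(cov(t(Yhat)) - sigma^2 * H))           # small (Monte Carlo noise)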


Get this bounty!!!
