#StackBounty: #optimization #lasso #glmnet choosing lambda for multi-response lasso in glmnet

Bounty: 50

I know from Hastie et al.'s paper that, in the single-response $y$ lasso, the $\lambda$ values are chosen such that
$$N\alpha\lambda_{\max} = \max_l |\langle x_l, y \rangle|.$$
Also, $y$ is by default standardised before forming the grid of $\lambda$ values on the log scale. Then the grid is de-standardized by multiplying back by $\sigma_y$.

I’m trying to understand how this is done if $Y$ becomes a matrix (i.e., multiresponse). Any ideas on how the $\lambda$ sequence would then be formed?
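For reference, a minimal numerical sketch of the single-response rule above (Python/numpy; the function name is mine, and I'm assuming the columns of $X$ are already standardized, as glmnet does internally):

```python
import numpy as np

def lambda_max_single_response(X, y, alpha=1.0):
    """Largest lambda on the single-response grid, following
    N * alpha * lambda_max = max_l |<x_l, y>|.
    Sketch only; assumes the columns of X are already standardized."""
    n = X.shape[0]
    y_std = (y - y.mean()) / y.std()     # y is standardized by default
    inner = X.T @ y_std                  # <x_l, y> for every column l
    lam_max = np.max(np.abs(inner)) / (n * alpha)
    return lam_max * y.std()             # de-standardize by multiplying back by sigma_y

# toy check: at lambda_max, every lasso coefficient should be exactly zero
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X[:, 0] + 0.5 * rng.standard_normal(100)
print(lambda_max_single_response(X, y))
```

One way to probe the multiresponse case empirically would be to fit `cv.glmnet(..., family = "mgaussian")` on a small example and compare the first element of the returned lambda sequence against candidate formulas.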


Get this bounty!!!

#StackBounty: #regression #lasso #convergence #high-dimensional High-dimensional regression: why is $\log p/n$ special?

Bounty: 100

I am trying to read up on the research in the area of high-dimensional regression, i.e., when $p \gg n$. It seems like the term $\log p/n$ appears often in rates of convergence for regression estimators.

For example, here, equation (17) says that the lasso fit $\hat{\beta}$ satisfies
$$ \dfrac{1}{n}\|X\hat{\beta} - X \beta\|_2^2 = O_P\left(\sigma \sqrt{\dfrac{\log p}{n}} \, \|\beta\|_1\right).$$

Usually, this also implies that $\log p$ should be smaller than $n$.

  1. Is there any intuition as to why this ratio $\log p/n$ is so prominent?
  2. Also, it seems from the literature that the high-dimensional regression problem gets complicated when $\log p \geq n$. Why is that?
  3. Is there a good reference that discusses the issues with how fast $p$ and $n$ should grow compared to each other?
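Regarding question 1, a sketch of the standard heuristic (assuming Gaussian noise $\epsilon \sim N(0, \sigma^2 I)$ and columns normalized so that $\|X_j\|_2^2 = n$, neither of which is stated above): each $X_j^T \epsilon / n$ is $N(0, \sigma^2/n)$, and a union bound over the $p$ coordinates inflates the Gaussian tail by a factor of roughly $\sqrt{2 \log p}$, so
$$\max_{1 \le j \le p} \frac{1}{n}\left|X_j^T \epsilon\right| \lesssim \sigma \sqrt{\frac{2 \log p}{n}}$$
with high probability. This maximum is exactly the noise term that the penalty $\lambda \|\beta\|_1$ has to dominate, which is why choices like $\lambda \asymp \sigma\sqrt{\log p / n}$ and rates like the one above keep appearing; it is only a heuristic, not a substitute for the references asked for in question 3.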


Get this bounty!!!

#StackBounty: #regression #multiple-regression #lasso Lasso on squared parameter

Bounty: 100

Assume a linear regression problem where I want to force sparsity of some parameters. However, due to some physics, I know that one of my parameters is always positive. For instance, I have that

$$ y = \sum_i \beta_i x_i + \epsilon $$ where $\beta_5 \geq 0$.

Is it safe to find the parameter estimates by maximizing the penalized likelihood below while simply adding the constraint $\beta_5 \geq 0$?

$$ l_p = l(\boldsymbol{\beta}) - \lambda \sum_i |\beta_i| $$

By safe I mean: can we still interpret the sparsity results the same way we do in the lasso, and if so, why? Is there another way to do this using an $\ell_1$ norm, or does this optimization retain the lasso properties at the MLE?
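For concreteness, a minimal sketch (Python/numpy; the function is hypothetical, and I'm assuming a Gaussian likelihood so that the problem becomes $\min_\beta \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$ subject to $\beta_5 \ge 0$) of one way to compute such a sign-constrained lasso fit with proximal gradient descent, where the usual soft-thresholding step is followed by projecting the constrained coefficient onto $[0, \infty)$:

```python
import numpy as np

def constrained_lasso_ista(X, y, lam, nonneg_idx=(4,), n_iter=5000):
    """Proximal gradient (ISTA) for
        (1/(2n)) * ||y - X b||_2^2 + lam * ||b||_1   subject to b_j >= 0 for j in nonneg_idx.
    Sketch only; indices are 0-based, so the question's beta_5 is column index 4 here."""
    n, p = X.shape
    beta = np.zeros(p)
    step = n / (np.linalg.norm(X, 2) ** 2)           # 1 / Lipschitz constant of the smooth part
    idx = list(nonneg_idx)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n              # gradient of the quadratic loss
        z = beta - step * grad
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft-thresholding
        beta[idx] = np.maximum(beta[idx], 0.0)       # project the constrained coefficient(s)
    return beta
```

If R/glmnet is the tool of choice, my understanding is that the lower.limits argument to glmnet imposes exactly this kind of per-coefficient lower bound, which would be the more practical route.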


Get this bounty!!!

#StackBounty: #machine-learning #multiple-comparisons #regression-coefficients #lasso #high-dimensional High dimensional, correlated da…

Bounty: 50

I have a dataset with about 5,000 often correlated features / covariates and a binary response. The data was given to me; I didn’t collect it, and it exists for other reasons. I use Lasso and gradient boosting to build models. I use iterated, nested cross validation. I report the 40 largest (in absolute value) Lasso coefficients and the 40 most important features from the gradient-boosted trees. (There was nothing special about 40; it just seemed like a reasonable amount of information, not so much as to overwhelm the audience.) I also report the variance of these quantities over the folds of CV and the iterations of CV.

I have kept track of the feature names and analyze them according to domain expertise. At the end, I kind of muse over the “important” features, making no statements about p-values or causality or anything, but instead considering this process a kind of—albeit imperfect and involving randomness—microscope into some phenomenon otherwise opaque behind dimensionality.

Assuming I have done all this correctly (e.g., executed cross validation correctly, scaled for lasso), is this approach reasonable? Are there issues with this; e.g., with multiple hypothesis testing, post hoc analysis, false discovery?
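To make the reporting step concrete, here is a minimal sketch of the kind of selection-frequency tally described above (Python/scikit-learn; the function name, k = 40, and the fixed penalty C are placeholders, numpy arrays are assumed, and the nested CV used to tune the penalty is deliberately omitted):

```python
import numpy as np
from collections import Counter
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

def top_k_selection_counts(X, y, feature_names, k=40, C=0.1,
                           n_repeats=5, n_splits=5, seed=0):
    """On each training fold, fit an l1-penalized logistic regression,
    record which features land in the top-k by |coefficient|, and count
    how often each feature is selected across folds and repeats."""
    counts = Counter()
    for r in range(n_repeats):
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed + r)
        for train_idx, _ in cv.split(X, y):
            Xtr = StandardScaler().fit_transform(X[train_idx])   # scale within the fold
            model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
            model.fit(Xtr, y[train_idx])
            top = np.argsort(-np.abs(model.coef_[0]))[:k]        # top-k by |coefficient|
            counts.update(feature_names[j] for j in top)
    return counts.most_common(k)
```

Reporting how often a feature is selected across resamples, rather than a single coefficient estimate, is essentially the idea behind stability selection, which may be a useful keyword for the multiple-testing and false-discovery concerns.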


Get this bounty!!!

#StackBounty: #regression #optimization #lasso #lars #glmmlasso Why under joint least squares direction is it possible for some coeffic…

Bounty: 50

I think I understand how LARS regression works. It basically adds a feature to the model when it becomes as correlated with the residual as the features already in the model. Then, after adding the feature, it increases the coefficients in the joint least squares direction (which is the same as moving in the least-angle, i.e., equiangular, direction).

If the coefficients are increased in the joint least squares direction, then doesn’t that mean that they can’t decrease? Joint least squares means that the $\beta$’s move such that $\sum_i \beta_i^2$ is as low as possible, but the $\beta$’s must be increasing.

I’ve seen some plots where the $\beta$’s seem to be decreasing as LARS is finding its solution path. For example, the original paper shows the following plot at the top of page 4:

[Plot from the LARS paper: coefficient paths in which some $\beta$’s decrease as the algorithm builds its solution.]

Am I misunderstanding something about the LARS algorithm? Perhaps I’m not seeing how the joint least squares direction and the equiangular direction can both hold at the same time?
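For what it's worth, a quick numerical check (a sketch using scikit-learn's lars_path on the diabetes data, which I believe is the same dataset as in the paper's Figure 1):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import lars_path

X, y = load_diabetes(return_X_y=True)

# lars_path returns the coefficients at each breakpoint of the piecewise-linear path
alphas, active, coefs = lars_path(X, y, method="lar")

# does any |coefficient| shrink from one breakpoint to the next?
abs_coefs = np.abs(coefs)                     # shape: (n_features, n_breakpoints)
shrinks = np.diff(abs_coefs, axis=1) < 0
print("some coefficient decreases along the path:", bool(shrinks.any()))
```

As I understand it, the joint least squares direction points toward the least squares fit on the current active set, and that fit can assign a particular variable a smaller (or even opposite-signed) coefficient than it currently has, so an individual $\beta_j$ can move toward zero even though the step is equiangular with all active predictors.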


Get this bounty!!!

#StackBounty: #lasso #regularization What is the smallest $\lambda$ that gives a 0 component in lasso?

Bounty: 50

Define the lasso estimate $$\hat\beta^\lambda = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2n} \|y - X \beta\|_2^2 + \lambda \|\beta\|_1,$$ where the $i^{th}$ row $x_i \in \mathbb{R}^p$ of the design matrix $X \in \mathbb{R}^{n \times p}$ is a vector of covariates for explaining the stochastic response $y_i$ (for $i = 1, \dots, n$).

We know that for $\lambda \geq \frac{1}{n} \|X^T y\|_\infty$, the lasso estimate $\hat\beta^\lambda = 0$. (See, for instance, Lasso and Ridge tuning parameter scope.) In other notation, this is expressing that $\lambda_{\max} = \frac{1}{n} \|X^T y\|_\infty$.

Under a continuous distribution on $y$, we know that for $\lambda$ very small, the lasso estimate $\hat\beta^\lambda$ has no zero entries almost surely. In other words, when there is little regularization, the shrinkage doesn’t zero out any component. What is the value of $\lambda$ at which a component of $\hat\beta^\lambda$ is initially zero? That is, what is $$\lambda_\textrm{min}^{(1)} = \min_{\exists j \textrm{ s.t. } \hat\beta^\lambda_j \ne 0 \textrm{ and } \, \hat\beta^{\mu} = 0 \, \forall \mu < \lambda} \lambda$$ equal to, as a function of $X$ and $y$? It may be easier to compute $$\lambda^{(2)}_\textrm{min} = \sup_{\hat\beta^\lambda_j \ne 0 \, \forall j} \lambda,$$ recognizing that this change point doesn’t have a unique interpretation, since nonzero components don’t have to “stay” nonzero as $\lambda$ increases.

It seems that neither of these may be available in closed form; otherwise, lasso software would presumably take advantage of it when deciding how deep to extend the tuning-parameter grid. Be that as it may, what can be said about these quantities?
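Not a closed form, but for what it's worth, a grid-based numerical sketch (Python/scikit-learn; the function name is mine) that locates both change points; it relies on scikit-learn's lasso_path using the same $\frac{1}{2n}$-scaled objective as above:

```python
import numpy as np
from sklearn.linear_model import lasso_path

def zero_component_changepoints(X, y, n_grid=200):
    """On a log-spaced grid of lambdas, find (approximately) the smallest lambda
    at which some coefficient is zero and the largest lambda at which all
    coefficients are nonzero. Grid-based sketch, not an exact change point."""
    n = X.shape[0]
    lam_max = np.max(np.abs(X.T @ y)) / n                 # all-zero threshold from above
    grid = np.geomspace(lam_max, lam_max * 1e-4, n_grid)
    lambdas, coefs, _ = lasso_path(X, y, alphas=grid)     # coefs: (n_features, n_lambdas)
    some_zero = np.any(coefs == 0.0, axis=0)              # per lambda: is some beta_j zero?
    lam_some_zero = lambdas[some_zero].min() if some_zero.any() else None
    lam_all_nonzero = lambdas[~some_zero].max() if (~some_zero).any() else None
    return lam_some_zero, lam_all_nonzero
```

Between consecutive grid points the estimate could be refined by bisection, since the lasso solution path is piecewise linear in $\lambda$.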


Get this bounty!!!
