## #StackBounty: #feature-selection #linear-model Covariance-residual technique for linear regression feature selection

### Bounty: 50

When doing forward feature selection for linear regression, it is a well known trick that to select the next feature to add, we can compute the covariance of each candidate feature against the current set of residuals, and choose the one with maximum absolute value.

Intuitively, this makes sense to me, but I haven’t been able to find or derive a rigorous proof that this technique is equivalent to the naive approach of adding each candidate feature one-by-one, computing coefficients and a squared error for each one, and then choosing the feature yielding the minimum squared error.

Can someone share one, or provide a counterexample?
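One way to probe the question numerically is to compute both criteria for the same step and compare them. The sketch below is only an illustration under stated assumptions: the data are synthetic, and the names `forward_step`, `x1`, `x2`, and `w` are hypothetical. It relies on a standard least-squares identity: when the current model includes an intercept, adding candidate $x_j$ lowers the squared error by $(x_j^\top r)^2 / \lVert x_j^{\perp} \rVert^2$, where $r$ is the current residual and $x_j^{\perp}$ is $x_j$ residualized against the features already in the model. That suggests the two rules agree only when the candidates share the same residualized norm.

```python
import numpy as np

def forward_step(A, candidates, y):
    """Score every candidate column against the current design matrix A.

    A is assumed to contain an intercept column, so the residuals sum to
    zero and candidates.T @ resid / (n - 1) is the sample covariance.
    """
    n = len(y)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sse_now = resid @ resid

    cov_abs = np.abs(candidates.T @ resid) / (n - 1)
    sse_drop = np.empty(candidates.shape[1])
    perp_norm2 = np.empty(candidates.shape[1])
    for j, xj in enumerate(candidates.T):
        # Naive approach: refit with the candidate added.
        Aj = np.column_stack([A, xj])
        bj, *_ = np.linalg.lstsq(Aj, y, rcond=None)
        sse_drop[j] = sse_now - np.sum((y - Aj @ bj) ** 2)
        # Squared norm of the candidate after residualizing it against A.
        g, *_ = np.linalg.lstsq(A, xj, rcond=None)
        perp_norm2[j] = np.sum((xj - A @ g) ** 2)

    predicted_drop = (candidates.T @ resid) ** 2 / perp_norm2
    return cov_abs, sse_drop, predicted_drop

# Synthetic data (assumption): x1 nearly determines y, while x2 is a noisy
# but much larger-scale relative of x1.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
w = rng.normal(size=n)
y = x1 + 0.1 * rng.normal(size=n)
x2 = 100.0 * (0.5 * x1 + w)

A = np.ones((n, 1))                      # current model: intercept only
cands = np.column_stack([x1, x2])
cov_abs, sse_drop, predicted_drop = forward_step(A, cands, y)

pick_by_cov = int(np.argmax(cov_abs))    # covariance-with-residual rule
pick_by_sse = int(np.argmax(sse_drop))   # minimum-squared-error rule
print(pick_by_cov, pick_by_sse)
```

In this constructed case the raw covariance rule is driven by the scale of `x2`. Standardizing the candidates removes that particular effect, but by the identity above the two rules would still differ whenever the residualized norms differ.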


## #StackBounty: #machine-learning #correlation #linear-model #canonical-correlation Canonical correlation analysis with a tiny example an…

### Bounty: 150

I’ve tried reading many explanations of CCA, and I don’t understand it. For example, on Wikipedia, it refers to two “vectors” $a$ and $b$ such that $\rho = \text{corr}(a^{\top} X, b^{\top} Y)$ is maximal. But if $a$ and $X$ are vectors, isn’t $a^{\top} X$ a scalar? What does it mean for two scalars to be correlated?

Other explanations use matrix notation without any dimensions, e.g.

Can someone explain CCA with a small example and all the dimensions provided?
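A tiny numerical sketch may help with the dimensions. In the Wikipedia notation, $X$ and $Y$ are random vectors, so $a^{\top} X$ and $b^{\top} Y$ are scalar random variables, and the correlation is taken between those two random variables. With data, each becomes a column of $n$ scores. The example below assumes synthetic data with $n = 100$ samples, $p = 3$ columns in `X`, and $q = 2$ columns in `Y`, and finds the first canonical pair by whitening the covariances and taking an SVD, which is one common way to compute CCA and not the only one.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 100, 3, 2

# Synthetic data (assumption): the first column of Y depends on X[:, 0].
X = rng.normal(size=(n, p))                                  # shape (n, p)
Y = np.column_stack([X[:, 0] + 0.5 * rng.normal(size=n),
                     rng.normal(size=n)])                    # shape (n, q)

Xc = X - X.mean(axis=0)
Yc = Y - Y.mean(axis=0)

Sxx = Xc.T @ Xc / (n - 1)     # (p, p) covariance of X
Syy = Yc.T @ Yc / (n - 1)     # (q, q) covariance of Y
Sxy = Xc.T @ Yc / (n - 1)     # (p, q) cross-covariance

# Whiten with Cholesky factors: K = Lx^{-1} Sxy Ly^{-T}, shape (p, q).
Lx = np.linalg.cholesky(Sxx)
Ly = np.linalg.cholesky(Syy)
K = np.linalg.solve(Ly, np.linalg.solve(Lx, Sxy).T).T

U, s, Vt = np.linalg.svd(K)

a = np.linalg.solve(Lx.T, U[:, 0])   # weight vector for X, shape (p,)
b = np.linalg.solve(Ly.T, Vt[0])     # weight vector for Y, shape (q,)

u = Xc @ a                           # canonical scores, shape (n,)
v = Yc @ b                           # canonical scores, shape (n,)

rho = np.corrcoef(u, v)[0, 1]        # correlation across the n samples
print(a.shape, b.shape, u.shape, v.shape)
print(rho, s[0])                     # the two values should agree
```

Here `a` and `b` are ordinary weight vectors, and the quantity being maximized is the correlation between the two score columns `u` and `v`, computed across the samples.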


## #StackBounty: #multiple-regression #multivariate-analysis #linear-model #r-squared Coefficient of Determination in case of Multi-Respon…

### Bounty: 50

For a multiple linear model with a single response, I can write the population coefficient of determination as

$$\rho^2 = \frac{\sigma_{Xy}^t\Sigma_{XX}^{-1}\sigma_{Xy}}{\sigma_{yy}}$$

This gives the proportion of variance in the response explained by the model. What is the equivalent expression for a model with multiple responses? For instance, consider a model with $m$ responses,

$$E[\mathbf{Y}|X_1, X_2, \ldots, X_p] = \boldsymbol{\mu}_{Y} + \mathbf{B}\left(\mathbf{X} - \boldsymbol{\mu}_X\right)$$

When I estimate this model with OLS, I obtain an estimated coefficient of determination $R^2$ for each response. What is the population expression for the coefficient of determination? I suppose I would obtain an $m \times m$ matrix of $R^2$ values such as

$$\begin{bmatrix} \rho_1^2 & \rho_{12} & \ldots & \rho_{1m} \\ \rho_{21} & \rho_2^2 & \ldots & \rho_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{m1} & \rho_{m2} & \ldots & \rho_m^2 \\ \end{bmatrix}$$

I can imagine the diagonal elements of this matrix as explained variation, but what can be the interpretation of the off diagonal elements?

In summary, my main questions are:

• How can I calculate the population coefficient of determination from the covariance structure for a model with multiple response variables?
• What is the interpretation of the off-diagonal elements in the coefficient of determination matrix?
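To make the question concrete, here is one possible construction, offered as a sketch and not as an established definition. Take the explained covariance $\Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}$, which is the covariance of the fitted part $\mathbf{B}\mathbf{X}$, and scale it by the response standard deviations. The diagonal then reproduces the single-response $\rho^2$ formula for each response, and each off-diagonal entry is the covariance between the explained parts of two responses, scaled by their standard deviations. All numeric values below, and the names `B`, `Sxx`, and `See`, are assumptions made for illustration.

```python
import numpy as np

# Hypothetical population quantities: p = 2 predictors, m = 2 responses.
Sxx = np.array([[1.0, 0.3],
                [0.3, 1.0]])             # Cov(X), shape (p, p)
B = np.array([[0.8, 0.2],
              [-0.4, 0.5]])              # coefficient matrix, shape (m, p)
See = np.diag([0.5, 0.7])                # error covariance, shape (m, m)

Sxy = Sxx @ B.T                          # Cov(X, Y), shape (p, m)
Syy = B @ Sxx @ B.T + See                # Cov(Y), shape (m, m)

# Explained covariance: Sigma_YX Sigma_XX^{-1} Sigma_XY, shape (m, m).
explained = Sxy.T @ np.linalg.solve(Sxx, Sxy)

# Scale by response standard deviations (one possible normalization).
d = 1.0 / np.sqrt(np.diag(Syy))
R2_matrix = d[:, None] * explained * d[None, :]

# Per-response rho^2 from the single-response formula, for comparison.
rho2 = np.array([
    Sxy[:, j] @ np.linalg.solve(Sxx, Sxy[:, j]) / Syy[j, j]
    for j in range(Syy.shape[0])
])

print(R2_matrix)
print(rho2)    # should match the diagonal of R2_matrix
```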


## #StackBounty: #regression #self-study #linear-model #independence Implications of mean dependence in the classical linear model

### Bounty: 50

Consider the classical linear model

(1) $Y_i=X_i'\beta+\epsilon_i$

(2) $(Y_i, X_i)_{i=1}^n$ i.i.d.

(3) $E(\epsilon_i| X_1,\ldots, X_n)=0$

Could you help me to show step by step that

(1), $(Y_i, X_i)_{i=1}^n$ mutually independent, (3) $\Rightarrow$ $E(\epsilon_i|X_i)=0$

Could you help me to show step by step that

(1), $(Y_i, X_i)_{i=1}^n$ mutually independent, $E(\epsilon_i|X_i)=0$ $\Rightarrow$ (3)

Also, is it true that

(1), (2), $\epsilon_i \perp X_i$ $\Rightarrow$ $\epsilon_i \perp (X_1,\ldots, X_n)$

? Why?
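A sketch of how the three claims might be argued. The notation $X_{-i} = (X_j)_{j \neq i}$ is introduced here for convenience and is not part of the question. By (1), $\epsilon_i = Y_i - X_i'\beta$ is a function of $(Y_i, X_i)$, so mutual independence of the pairs gives $(\epsilon_i, X_i) \perp X_{-i}$. The first line uses only the law of iterated expectations, the second uses that conditioning on an independent block does not change the conditional mean, and the third factorizes probabilities over product sets $A$, $B$, $C$, which is enough to establish independence.

$$
\begin{aligned}
E(\epsilon_i \mid X_i) &= E\big( E(\epsilon_i \mid X_1,\ldots,X_n) \mid X_i \big) = E(0 \mid X_i) = 0, \\
E(\epsilon_i \mid X_1,\ldots,X_n) &= E(\epsilon_i \mid X_i, X_{-i}) = E(\epsilon_i \mid X_i) = 0, \\
P(\epsilon_i \in A,\, X_i \in B,\, X_{-i} \in C) &= P(\epsilon_i \in A,\, X_i \in B)\, P(X_{-i} \in C) \\
&= P(\epsilon_i \in A)\, P(X_i \in B)\, P(X_{-i} \in C) \\
&= P(\epsilon_i \in A)\, P(X_i \in B,\, X_{-i} \in C).
\end{aligned}
$$

The last equality in the third chain uses $X_i \perp X_{-i}$, which also follows from the independence of the pairs.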


## What is the difference between linear regression on y with x and x with y?

The Pearson correlation coefficient of x and y is the same, whether you compute pearson(x, y) or pearson(y, x). This suggests that doing a linear regression of y given x or x given y should be the same, but that’s not the case.

The best way to think about this is to imagine a scatter plot of points with y on the vertical axis and x represented by the horizontal axis. Given this framework, you see a cloud of points, which may be vaguely circular, or may be elongated into an ellipse. What you are trying to do in regression is find what might be called the ‘line of best fit’. However, while this seems straightforward, we need to figure out what we mean by ‘best’, and that means we must define what it would be for a line to be good, or for one line to be better than another, etc. Specifically, we must stipulate a loss function. A loss function gives us a way to say how ‘bad’ something is, and thus, when we minimize that, we make our line as ‘good’ as possible, or find the ‘best’ line.

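Traditionally, when we conduct a regression analysis, we find estimates of the slope and intercept so as to minimize the sum of squared errors. These are defined as follows, where $\hat{\beta}_0$ and $\hat{\beta}_1$ denote the estimated intercept and slope:

$$SSE = \sum_{i=1}^{N} \left(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right)^2$$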

In terms of our scatter plot, this means we are minimizing the sum of the squared vertical distances between the observed data points and the line.

On the other hand, it is perfectly reasonable to regress x onto y, but in that case, we would put x on the vertical axis, and so on. If we kept our plot as is (with x on the horizontal axis), regressing x onto y (again, using a slightly adapted version of the sum of squared errors with x and y switched) means that we would be minimizing the sum of the squared horizontal distances between the observed data points and the line. This sounds very similar, but is not quite the same thing. (The way to recognize this is to do it both ways, and then algebraically convert one set of parameter estimates into the terms of the other. Comparing the first model with the rearranged version of the second model, it becomes easy to see that they are not the same.)

Note that neither way would produce the same line we would intuitively draw if someone handed us a piece of graph paper with points plotted on it. In that case, we would draw a line straight through the center, but minimizing the vertical distance yields a line that is slightly flatter (i.e., with a shallower slope), whereas minimizing the horizontal distance yields a line that is slightly steeper.
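The flatter-versus-steeper claim can be checked with a small sketch. The data below are synthetic, and `ols_slope` is a hypothetical helper name. The code fits both regressions and re-expresses the x-on-y line as a slope in the usual plot, with x on the horizontal axis.

```python
import numpy as np

def ols_slope(pred, resp):
    """Least-squares slope of resp regressed on pred (with intercept)."""
    c = np.cov(pred, resp)            # 2x2 sample covariance matrix
    return c[0, 1] / c[0, 0]          # cov(pred, resp) / var(pred)

# Synthetic data (assumption): a noisy linear relationship.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.6 * x + rng.normal(scale=0.8, size=500)

b_yx = ols_slope(x, y)                # y regressed on x
b_xy = ols_slope(y, x)                # x regressed on y
r = np.corrcoef(x, y)[0, 1]

# The x-on-y line, x = c + b_xy * y, rearranged to y = (x - c) / b_xy,
# has slope 1 / b_xy when drawn with x on the horizontal axis.
slope_xy_in_plot = 1.0 / b_xy

print(b_yx, slope_xy_in_plot)         # flatter line vs. steeper line
print(b_yx * b_xy, r ** 2)            # product of the slopes equals r^2
```

Because the product of the two slopes equals $r^2$, the two lines coincide only when the correlation is perfect.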

A correlation is symmetrical: x is as correlated with y as y is with x. The Pearson product-moment correlation can be understood within a regression context, however. The correlation coefficient, r, is the slope of the regression line when both variables have been standardized first. That is, you first subtract the mean from each observation, and then divide the differences by the standard deviation. The cloud of data points will now be centered on the origin, and the slope will be the same whether you regress y onto x, or x onto y.
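The standardization claim can be illustrated the same way. This is a minimal sketch on synthetic data, with hypothetical variable names: after converting both variables to z-scores, the fitted slope in either direction should equal r.

```python
import numpy as np

# Synthetic data (assumption): variables on deliberately different scales.
rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=300)
y = 1.5 * x + rng.normal(scale=3.0, size=300)

# Standardize: subtract the mean, then divide by the standard deviation.
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)

# np.polyfit returns coefficients from highest degree down: [slope, intercept].
slope_y_on_x = np.polyfit(zx, zy, 1)[0]
slope_x_on_y = np.polyfit(zy, zx, 1)[0]
r = np.corrcoef(x, y)[0, 1]

print(slope_y_on_x, slope_x_on_y, r)  # all three should agree
```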

Now, why does this matter? Using our traditional loss function, we are saying that all of the error is in only one of the variables (viz., y). That is, we are saying that x is measured without error and constitutes the set of values we care about, but that y has sampling error. This is very different from saying the converse. This was important in an interesting historical episode: In the late 70’s and early 80’s in the US, the case was made that there was discrimination against women in the workplace, and this was backed up with regression analyses showing that women with equal backgrounds (e.g., qualifications, experience, etc.) were paid, on average, less than men. Critics (or just people who were extra thorough) reasoned that if this was true, women who were paid equally with men would have to be more highly qualified, but when this was checked, it was found that although the results were ‘significant’ when assessed the one way, they were not ‘significant’ when checked the other way, which threw everyone involved into a tizzy. See here for a famous paper that tried to clear the issue up.