#StackBounty: #mixed-model #modeling #linear-model #aggregation Aggregate repeated measures data or model it?

Bounty: 50

The structure of my data is causing me some trouble because I’d prefer to aggregate it but I’m not sure of the implications.

It’s from an experiment comparing dwell time. The subjects were randomly assigned to one of two habitats. The habitats are identical except for an area that has been modified with characteristics that mimic (possibly) preferred habitats.

There are numerous subjects but we measure how much time each spends in the modified area each day.

There is evidence that differing amounts of sunlight affect habitat utilization and since we can’t control the weather, we record whether each day is cloudy or sunny. We have gathered data on twenty-one consecutive days.

I would like to estimate the habitat effect on the difference in dwell times using a simple linear model such as:

$$E(y \mid \mathbf{x}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$$

where $x_1$ is the habitat indicator, $x_2$ is the sunny/cloudy indicator, and we consider the possibility that the habitat difference can be modulated by the weather by estimating the habitat and weather interaction: $x_1x_2$.

Obviously this completely ignores the sequential and grouped nature of the data since we measure each subject multiple times.

In order to assume independence of the error term, I think I would need something like

$$E(y \mid \mathbf{x}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_3 + \beta_5 x_5$$

where $x_1$ and $x_2$ are as above, but $x_3$ is a subject-level fixed effect, and $x_5$ is something like a subject-level AR(1) term to account for within subject autocorrelation.

So my question is this: what is the hazard in aggregating the data so that each subject’s outcome is the sum of time spent in the modified area on sunny days and cloudy days? If I simply add up each subject’s time, I can estimate my first model without worrying about the autocorrelation or the individual fixed effects, right?
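One way to see what aggregation does and does not cost is a quick simulation. The sketch below is my own toy setup (made-up effect sizes, pure NumPy): it fits the habitat-by-weather interaction model both to the day-level data and to per-subject weather averages. Because that model is saturated on the four habitat/weather cells, and every subject sees the same sequence of weather days, the point estimates coincide exactly; what aggregation changes is the error structure that any standard errors would rely on, not the estimates themselves.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sub, n_day = 30, 21
habitat = np.arange(n_sub) % 2               # subject-level indicator
sunny = rng.integers(0, 2, size=n_day)       # day-level, shared by all subjects

# made-up effects: habitat 10, weather 5, interaction 3, plus a subject
# random intercept and AR(1) day-to-day noise within each subject
subj_re = rng.normal(0.0, 4.0, size=n_sub)
eps = np.zeros((n_sub, n_day))
eps[:, 0] = rng.normal(0.0, 2.0, size=n_sub)
for t in range(1, n_day):
    eps[:, t] = 0.6 * eps[:, t - 1] + rng.normal(0.0, 2.0, size=n_sub)
dwell = (60 + 10 * habitat[:, None] + 5 * sunny[None, :]
         + 3 * habitat[:, None] * sunny[None, :]
         + subj_re[:, None] + eps)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# day-level fit: one row per subject-day, ignoring the error structure
h = np.repeat(habitat, n_day)
s = np.tile(sunny, n_sub)
Xd = np.column_stack([np.ones(h.size), h, s, h * s])
beta_day = ols(Xd, dwell.ravel())

# aggregated fit: one mean per subject per weather type
rows = [[1, habitat[i], w, habitat[i] * w, dwell[i, sunny == w].mean()]
        for i in range(n_sub) for w in (0, 1)]
A = np.array(rows, dtype=float)
beta_agg = ols(A[:, :4], A[:, 4])
print(beta_day, beta_agg)  # identical point estimates in this balanced design
```

So summing (or averaging) per subject and weather type does not bias the estimates here; the hazard is in the inference, since the aggregated model can no longer see the within-subject correlation that the day-level data contain.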

Get this bounty!!!

#StackBounty: #linear-model #regression-coefficients #regularization Regularized parameter overfitting the data (example)

Bounty: 50

Possible duplicate of

In Andrew Ng's machine learning course on Coursera, I came across the following example.

[figure: two-feature classification example with a black decision boundary ($-3 + x_1 = 0$) and a magenta decision boundary ($-1 + x_1 - x_2 = 0$)]

$C = 1/\lambda$, i.e. the inverse of the actual regularization parameter.

The L2 regularization cost expression is
$$R = \sum_{i=1}^{n} \theta_i^2$$

For the black classifier, we have $h_{\theta}(x) = -3 + x_1$, $\theta = [-3, 1, 0]$, $R = 10$.

For the magenta classifier, we have $h_{\theta}(x) = -1 + x_1 - x_2$, $\theta = [-1, 1, -1]$, $R = 3$.

The regularization cost for the magenta classifier is lower, yet it seems to overfit the data, and vice versa for the black classifier. What's going on? L2 regularization tends to make the coefficients close to zero, but how does that help reduce overfitting?

My intuition is that regularization prevents too much weight from being placed on any single feature. But isn't it sometimes necessary to focus on one feature (as with $x_1$ in the example above)?
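On that last point, one way to build intuition is to watch ridge's closed form act on simulated data where only one feature matters. This is my own toy setup, not from the course: as $\lambda$ grows, every coefficient is pulled toward zero, but the fit can still lean on the single informative feature, just less aggressively.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
# only the first feature carries signal
y = X @ np.array([2.0, 0.0, 0.0]) + 0.5 * rng.normal(size=50)

# closed-form ridge: beta = (X'X + lam * I)^{-1} X'y
betas = {}
for lam in (0.0, 10.0, 100.0):
    betas[lam] = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
    print(lam, np.round(betas[lam], 3))
# the coefficient norm shrinks as lam grows, yet the first (informative)
# coefficient remains the dominant one
```

So regularization does not forbid relying on one feature; it only charges a price for large coefficients, which is what discourages the wiggly fits that come from balancing several large, opposing weights.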


#StackBounty: #feature-selection #linear-model Covariance-residual technique for linear regression feature selection

Bounty: 50

When doing forward feature selection for linear regression, it is a well known trick that to select the next feature to add, we can compute the covariance of each candidate feature against the current set of residuals, and choose the one with maximum absolute value.

Intuitively, this makes sense to me, but I haven’t been able to find or derive a rigorous proof that this technique is equivalent to the naive approach of adding each candidate feature one-by-one, computing coefficients and a squared error for each one, and then choosing the feature yielding the minimum squared error.

Can someone share one, or provide a counterexample?
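Not a proof, but a numerical check is easy to set up. With a centered response and features scaled to unit norm, the SSE after adding feature $j$ alone is $\|y\|^2 - (x_j^\top y)^2$, so minimizing SSE and maximizing the absolute covariance with the residual pick the same feature. A sketch (synthetic data, names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)                  # center each feature
X /= np.linalg.norm(X, axis=0)       # scale each column to unit norm
y = X @ np.array([0.0, 2.0, 0.0, 0.5, 0.0]) + 0.1 * rng.normal(size=n)
y -= y.mean()

# trick: pick the feature with max |covariance| with the current residual
# (here the residual is just y, since no feature has been added yet)
by_cov = int(np.argmax(np.abs(X.T @ y)))

# naive: fit each single-feature model and pick the minimum squared error
# (with unit-norm columns the OLS coefficient is simply x_j . y)
sse = [np.sum((y - (X[:, j] @ y) * X[:, j]) ** 2) for j in range(p)]
by_sse = int(np.argmin(sse))

print(by_cov, by_sse)  # the two criteria select the same feature
```

Note the equal-norm scaling is doing real work here: without it, the covariance criterion and the SSE criterion can disagree, which is why implementations standardize the candidates first.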


#StackBounty: #machine-learning #correlation #linear-model #canonical-correlation Canonical correlation analysis with a tiny example an…

Bounty: 150

I’ve tried reading many explanations of CCA, and I don’t understand it. For example, on Wikipedia, it refers to two “vectors” $a$ and $b$ such that $\rho = \text{corr}(a^{\top} X, b^{\top} Y)$ is maximal. But if $a$ and $X$ are vectors, isn’t $a^{\top} X$ a scalar? What does it mean for two scalars to be correlated?

Other explanations use matrix notation without any dimensions, e.g.

[figure: a matrix-notation derivation of CCA with no dimensions given]

Can someone explain CCA with a small example and all the dimensions provided?
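Here is a tiny numerical sketch (the dimensions and data are invented by me) of the standard SVD route to CCA. The resolution of the scalar puzzle: for a single sample $x$, $a^{\top}x$ is indeed a scalar; what gets correlated are the length-$n$ vectors of scores obtained by applying $a$ and $b$ to every sample.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 100, 3, 2                      # n samples; X has p features, Y has q
X = rng.normal(size=(n, p))
Y = np.column_stack([X[:, 0] + 0.1 * rng.normal(size=n),
                     rng.normal(size=n)])

Xc, Yc = X - X.mean(0), Y - Y.mean(0)    # center
Sxx = Xc.T @ Xc / n                      # (p, p)
Syy = Yc.T @ Yc / n                      # (q, q)
Sxy = Xc.T @ Yc / n                      # (p, q)

def inv_sqrt(S):
    """Inverse matrix square root via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

# CCA as the SVD of the whitened cross-covariance
M = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)  # (p, q)
U, s, Vt = np.linalg.svd(M)
a = inv_sqrt(Sxx) @ U[:, 0]              # (p,) weight vector for X
b = inv_sqrt(Syy) @ Vt[0]                # (q,) weight vector for Y

# the scores are length-n vectors; CCA maximizes THEIR correlation
u, v = Xc @ a, Yc @ b                    # (n,) and (n,)
rho = np.corrcoef(u, v)[0, 1]
print(rho, s[0])                         # first canonical correlation, twice
```

The first singular value of the whitened cross-covariance equals the correlation of the score vectors $u$ and $v$, which is the quantity Wikipedia's $\text{corr}(a^{\top}X, b^{\top}Y)$ denotes once $X$ and $Y$ are read as random vectors rather than single observations.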


#StackBounty: #multiple-regression #multivariate-analysis #linear-model #r-squared Coefficient of Determination in case of Multi-Respon…

Bounty: 50

In case of single response multiple linear model, I can write the population coefficient of determination as,

$$\rho^2 = \frac{\sigma_{Xy}^{\top}\Sigma_{XX}^{-1}\sigma_{Xy}}{\sigma_{yy}}$$

This gives the proportion of variance in the response explained by the model. What is the equivalent expression for a model with multiple responses? For instance, consider a model with $m$ responses,

$$E[\mathbf{Y} \mid X_1, X_2, \ldots, X_p] = \boldsymbol{\mu}_{Y} + \mathbf{B}\left(\mathbf{X} - \boldsymbol{\mu}_X\right)$$

When I estimate this model with OLS, I obtain an estimated coefficient of determination $R^2$ for each response. What is the population expression for the coefficient of determination? I suppose I will obtain an $m \times m$ matrix of the form

$$\begin{bmatrix}
\rho_1^2 & \rho_{12} & \ldots & \rho_{1m} \\
\rho_{21} & \rho_2^2 & \ldots & \rho_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
\rho_{m1} & \rho_{m2} & \ldots & \rho_m^2
\end{bmatrix}$$

I can imagine the diagonal elements of this matrix as explained variation, but what can be the interpretation of the off diagonal elements?

In summary, my main questions are:

  • How can I calculate the population coefficient of determination from the covariance structure for a model with multiple response variables?
  • What is the interpretation of the off-diagonal elements of the coefficient-of-determination matrix?
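For the single-response formula quoted above, at least, the sample analogue can be checked directly: plugging sample covariances into $\sigma_{Xy}^{\top}\Sigma_{XX}^{-1}\sigma_{Xy}/\sigma_{yy}$ reproduces the OLS $R^2$ exactly. A sketch with simulated data (my own coefficients):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -0.5, 0.2]) + rng.normal(size=n)

Xc, yc = X - X.mean(0), y - y.mean()
Sxx = Xc.T @ Xc / n                  # sample Sigma_XX
sxy = Xc.T @ yc / n                  # sample sigma_Xy
syy = yc @ yc / n                    # sample sigma_yy
rho2 = sxy @ np.linalg.solve(Sxx, sxy) / syy

# compare with the usual OLS R^2 from the fitted residuals
beta = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)
resid = yc - Xc @ beta
r2 = 1.0 - (resid @ resid) / (yc @ yc)
print(rho2, r2)  # identical
```

The multivariate case would replace $\sigma_{Xy}$ by the $p \times m$ cross-covariance matrix, giving a matrix-valued generalization whose diagonal reduces to per-response versions of the computation above.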


#StackBounty: #regression #self-study #linear-model #independence Implications of mean dependence in the classical linear model

Bounty: 50

Consider the classical linear model

(1) $Y_i = X_i'\beta + \epsilon_i$

(2) $(Y_i, X_i)_{i=1}^n$ i.i.d.

(3) $E(\epsilon_i \mid X_1, \ldots, X_n) = 0$

Could you help me to show step by step that

(1), $(Y_i, X_i)_{i=1}^n$ mutually independent, (3) $\Rightarrow$ $E(\epsilon_i \mid X_i) = 0$

Could you help me to show step by step that

(1), $(Y_i, X_i)_{i=1}^n$ mutually independent, $E(\epsilon_i \mid X_i) = 0$ $\Rightarrow$ (3)

Also, is it true that

(1), (2), $\epsilon_i \perp X_i$ $\Rightarrow$ $\epsilon_i \perp (X_1, \ldots, X_n)$

? Why?
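A sketch of the first implication, using only the law of iterated expectations (since $\sigma(X_i) \subseteq \sigma(X_1, \ldots, X_n)$):

```latex
E(\epsilon_i \mid X_i)
  = E\big( E(\epsilon_i \mid X_1, \ldots, X_n) \,\big|\, X_i \big)
  = E(0 \mid X_i) = 0
```

For the converse, the independence across $i$ is what does the work: $\epsilon_i = Y_i - X_i'\beta$ is a function of $(Y_i, X_i)$ alone, which under mutual independence is independent of $\{(Y_j, X_j)\}_{j \neq i}$, so conditioning on the other regressors adds no information beyond $X_i$.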


What is the difference between linear regression on y with x and x with y?

The Pearson correlation coefficient of x and y is the same whether you compute pearson(x, y) or pearson(y, x). This suggests that doing a linear regression of y given x or x given y should be the same, but that's not the case.

The best way to think about this is to imagine a scatter plot of points with y on the vertical axis and x represented by the horizontal axis. Given this framework, you see a cloud of points, which may be vaguely circular, or may be elongated into an ellipse. What you are trying to do in regression is find what might be called the ‘line of best fit’. However, while this seems straightforward, we need to figure out what we mean by ‘best’, and that means we must define what it would be for a line to be good, or for one line to be better than another, etc. Specifically, we must stipulate a loss function. A loss function gives us a way to say how ‘bad’ something is, and thus, when we minimize that, we make our line as ‘good’ as possible, or find the ‘best’ line.

Traditionally, when we conduct a regression analysis, we find estimates of the slope and intercept so as to minimize the sum of squared errors:

$$SSE = \sum_{i=1}^{n} \left( y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right)^2$$

In terms of our scatter plot, this means we are minimizing the sum of the squared vertical distances between the observed data points and the line.

[figure: the regression line of y on x, with vertical segments from each data point to the line]

On the other hand, it is perfectly reasonable to regress x onto y, but in that case, we would put x on the vertical axis, and so on. If we kept our plot as is (with x on the horizontal axis), regressing x onto y (again, using a slightly adapted version of the above equation with x and y switched) means that we would be minimizing the sum of the horizontal distances between the observed data points and the line. This sounds very similar, but is not quite the same thing. (The way to recognize this is to do it both ways, and then algebraically convert one set of parameter estimates into the terms of the other. Comparing the first model with the rearranged version of the second model, it becomes easy to see that they are not the same.)

[figure: the regression line of x on y in the same axes, with horizontal segments from each data point to the line]

Note that neither way would produce the same line we would intuitively draw if someone handed us a piece of graph paper with points plotted on it. In that case, we would draw a line straight through the center, but minimizing the vertical distance yields a line that is slightly flatter (i.e., with a shallower slope), whereas minimizing the horizontal distance yields a line that is slightly steeper.
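This asymmetry is easy to see numerically. A small sketch with simulated data: the x-on-y fit, once re-expressed as a line in the original $(x, y)$ axes, is steeper than the y-on-x fit, and the two slopes multiply to $r^2$.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=300)
y = 0.5 * x + rng.normal(size=300)

cov = np.cov(x, y)[0, 1]
b_yx = cov / np.var(x, ddof=1)       # y on x: minimize vertical distances
b_xy = cov / np.var(y, ddof=1)       # x on y: minimize horizontal distances

# re-express x = a + b_xy * y as a line in the same (x, y) plot:
# y = -a/b_xy + (1/b_xy) * x, so its slope in these axes is 1/b_xy
slope_other = 1.0 / b_xy
r = np.corrcoef(x, y)[0, 1]
print(b_yx, slope_other)             # the x-on-y line is steeper
```

The identity $b_{yx} \cdot b_{xy} = r^2$ also shows why the two lines coincide only when the correlation is perfect.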

A correlation is symmetrical: x is as correlated with y as y is with x. The Pearson product-moment correlation can be understood within a regression context, however. The correlation coefficient, r, is the slope of the regression line when both variables have been standardized first. That is, you first subtract off the mean from each observation, and then divide the differences by the standard deviation. The cloud of data points will now be centered on the origin, and the slope will be the same whether you regress y onto x, or x onto y.
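A quick sketch of that claim: after standardizing both variables, the OLS slope in either direction is just Pearson's $r$.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = 0.6 * x + rng.normal(size=200)

zx = (x - x.mean()) / x.std()        # standardize both variables
zy = (y - y.mean()) / y.std()

slope_yx = (zx @ zy) / (zx @ zx)     # OLS slope of zy on zx (no intercept needed)
slope_xy = (zy @ zx) / (zy @ zy)     # OLS slope of zx on zy
r = np.corrcoef(x, y)[0, 1]
print(slope_yx, slope_xy, r)         # all three agree
```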

[figure: standardized data centered on the origin, where the regression slope is the same in either direction]

Now, why does this matter? Using our traditional loss function, we are saying that all of the error is in only one of the variables (viz., y). That is, we are saying that x is measured without error and constitutes the set of values we care about, but that y has sampling error. This is very different from saying the converse. This was important in an interesting historical episode: In the late 70’s and early 80’s in the US, the case was made that there was discrimination against women in the workplace, and this was backed up with regression analyses showing that women with equal backgrounds (e.g., qualifications, experience, etc.) were paid, on average, less than men. Critics (or just people who were extra thorough) reasoned that if this was true, women who were paid equally with men would have to be more highly qualified, but when this was checked, it was found that although the results were ‘significant’ when assessed the one way, they were not ‘significant’ when checked the other way, which threw everyone involved into a tizzy. See here for a famous paper that tried to clear the issue up.

Here’s another way to think about this that approaches the topic through the formulas instead of visually:

The formula for the slope of a simple regression line is a consequence of the loss function that has been adopted. If you are using the standard ordinary least squares loss function (noted above), you can derive the formula for the slope that you see in every intro textbook. This formula can be presented in various forms; one of them I call the 'intuitive' formula for the slope. Consider this form for both the situation where you are regressing y on x and where you are regressing x on y:

$$\hat{\beta}_1^{(y \text{ on } x)} = \frac{\text{Cov}(x, y)}{\text{Var}(x)} \qquad\qquad \hat{\beta}_1^{(x \text{ on } y)} = \frac{\text{Cov}(x, y)}{\text{Var}(y)}$$

Now, I hope it's obvious that these would not be the same unless $\text{Var}(x)$ equals $\text{Var}(y)$. If the variances are equal (e.g., because you standardized the variables first), then so are the standard deviations, and thus the variances would both also equal $SD(x)SD(y)$. In this case, $\hat{\beta}_1$ would equal Pearson's $r$, which is the same either way by virtue of the principle of commutativity:

$$\hat{\beta}_1 = \frac{\text{Cov}(x, y)}{SD(x)SD(y)} = \text{corr}(x, y) = \text{corr}(y, x)$$