#StackBounty: #regression #multicollinearity Determining statistical significance of linear regression coefficient in the presence of m…

Bounty: 100

Suppose I have a bunch of cities with different population sizes, and I want to see whether there is a positive linear relationship between the number of liquor stores in a city and the number of DUIs, where significance is judged by a t-test of the estimated regression coefficient.

Now clearly the population size of a city is going to be positively correlated with both the number of DUIs and the number of liquor stores. Thus if I run a simple linear regression on just liquor stores and see if its regression coefficient is statistically significant, I will likely run into a problem of multicollinearity, and overestimate the effect of liquor stores on DUIs.

Which of these methods should I use to correct for this?

  1. I should divide the number of liquor stores in the city by its population in order to get a liquor store per capita value and then regress on that.

  2. I should regress on both liquor stores and size, and then look to see if the liquor store coefficient is significant when controlling for size.

  3. Some other method?

I honestly can’t decide which is more sensible. I vacillate between them: whichever one I think about, I can convince myself that it’s the right way.

On the one hand liquor stores per capita seems like the right variable to use, since DUIs are committed by individuals, but that doesn’t seem very statistically rigorous. On the other hand, controlling for size seems statistically rigorous, but rather indirect. Furthermore, if I rescale after computing the liquor stores per capita variable, I get very similar regression coefficients between the two methods, but method 1 produces a smaller p-value.
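For concreteness, here is a minimal sketch (not part of the original question) of what the two candidate analyses look like in Python; the synthetic data, the variable names, and the use of statsmodels are assumptions made purely for illustration.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
population = rng.lognormal(mean=11, sigma=1, size=n)   # hypothetical city populations
stores = rng.poisson(population / 5000)                # store counts grow with population
duis = rng.poisson(population / 2000 + 2 * stores)     # DUIs driven by population and by stores

# Method 1: regress DUIs on liquor stores per capita
fit1 = sm.OLS(duis, sm.add_constant(stores / population)).fit()

# Method 2: regress DUIs on liquor stores, controlling for population
fit2 = sm.OLS(duis, sm.add_constant(np.column_stack([stores, population]))).fit()

print(fit1.params, fit1.pvalues)
print(fit2.params, fit2.pvalues)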


Get this bounty!!!

#StackBounty: #regression #confidence-interval #p-value #bootstrap #nonlinear-regression Efficient nonparametric estimation of confiden…

Bounty: 50

I’m estimating parameters for a complex, “implicit” nonlinear model $f(\mathbf{x}, \boldsymbol{\theta})$. It’s “implicit” in the sense that I don’t have an explicit formula for $f$: its value is the output of a complex fluid dynamics code (CFD). After NLS regression, I had a look at residuals, and they don’t look very normal at all. Also, I’m having a lot of issues with estimating their variance-covariance matrix: methods available in nlstools fail with an error.

I’m suspecting the assumption of normally distributed parameter estimators is not valid: thus I would like to use some nonparametric method to estimate confidence intervals, $p$-values and confidence regions for the three parameters of my model. I thought of bootstrap, but other approaches are welcome, so long as they don’t rely on normality of parameter estimators. Would this work:

  1. Given the data set $D=\{P_i=(\mathbf{x}_i,f_i)\}_{i=1}^N$, generate datasets $D_1,\dots,D_m$ by sampling with replacement from $D$
  2. For each $D_i$, use NLS (Nonlinear Least Squares) to estimate the model parameters $\boldsymbol{\theta}^*_i=(\theta^*_{1i},\theta^*_{2i},\theta^*_{3i})$
  3. I now have empirical distributions for the NLS parameter estimators. The sample mean of this distribution would be the bootstrap estimate for my parameters; the 2.5% and 97.5% quantiles would give me confidence intervals. I could also make scatterplot matrices of each parameter against the others, and get an idea of the correlation among them. This is the part I like the most, because I believe that one parameter is weakly correlated with the others, while the remaining two are extremely strongly correlated with each other.

Is this correct? Then how do I compute the $p$-values – what is the null hypothesis for nonlinear regression models? For example, for parameter $\theta_{3}$, is it that $\theta_{3}=0$ while the other two are not? How would I compute the $p$-value for such a hypothesis from my bootstrap sample $\boldsymbol{\theta}^*_1,\dots,\boldsymbol{\theta}^*_m$? I don’t see the connection with the null…

Also, each NLS fit takes me quite some time (let’s say a few hours) because I need to run my fluid dynamics code $p\times N$ times, where $N$ is the size of $D$ and $p$ is about 40 in my case. The total CPU time for the bootstrap is then $40\times N\times m$ times the cost of a single CFD run, which is a lot. I would need a faster way. What can I do? I thought of building a metamodel for my CFD code (for example, a Gaussian process model) and using that for bootstrapping instead of the CFD code. What do you think? Would that work?
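As a rough illustration of the percentile bootstrap described above (a sketch only: a cheap analytic function stands in for the expensive CFD model, which in practice would be replaced by a surrogate such as a Gaussian-process emulator, and all names and values are invented):

import numpy as np
from scipy.optimize import curve_fit

def model(x, t1, t2, t3):
    # stand-in for the expensive, implicitly defined f(x, theta)
    return t1 * np.exp(-t2 * x) + t3

rng = np.random.default_rng(1)
x = np.linspace(0, 5, 50)
y = model(x, 2.0, 1.3, 0.5) + rng.normal(scale=0.1, size=x.size)

m = 500                                   # number of bootstrap resamples
boot = np.empty((m, 3))
for b in range(m):
    idx = rng.integers(0, x.size, size=x.size)                 # resample (x_i, y_i) pairs with replacement
    popt, _ = curve_fit(model, x[idx], y[idx], p0=[1.0, 1.0, 0.0], maxfev=5000)
    boot[b] = popt

estimate = boot.mean(axis=0)                                   # bootstrap estimate of the parameters
ci = np.percentile(boot, [2.5, 97.5], axis=0)                  # percentile confidence intervals
corr = np.corrcoef(boot, rowvar=False)                         # correlation structure among the estimators
print(estimate, ci, corr, sep="\n")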


Get this bounty!!!

#StackBounty: #regression #scikit-learn #standardization #glmnet #weighted-regression Regularization and scaling feature matrix with we…

Bounty: 50

When using L1 or L2 regularization in a glm it is necessary to standardize the features to be variance 1. When applying weights to the glm, should the feature matrix be standardized so that it has a weighted variance of 1?

According to the glmnet paper (https://web.stanford.edu/~hastie/Papers/glmnet.pdf), there is no mention of standardizing to have weighted variance 1. Furthermore, there is a simplification in the unweighted case, made in equation 8, where the authors take advantage of the fact that column j has variance 1. When they describe the weighted case, they do not make any simplification that relies on column j having weighted variance 1. This makes it sound like standardization always happens without weights.

When I went through the source code of ElasticNet in sklearn, I was unable to find any references to weights:

https://github.com/scikit-learn/scikit-learn/blob/14031f6/sklearn/linear_model/coordinate_descent.py#L503

This doesn’t seem correct. If I have a data frame with 2 identical rows, they contribute twice to the mean and variance. This can be represented equivalently as a single row with a weight of 2, but if weights are ignored during standardization the algorithm will give a different result.
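A quick numerical check of that last point (a sketch with made-up numbers, independent of either library): a row with weight 2 and the same row duplicated yield the same mean and variance only if the weights enter the standardization.

import numpy as np

X_dup = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [2.0, 20.0]])      # second observation appears twice

X_w = np.array([[1.0, 10.0],
                [2.0, 20.0]])
w = np.array([1.0, 2.0])             # same information, expressed as a weight of 2

mu_dup = X_dup.mean(axis=0)
var_dup = X_dup.var(axis=0)

mu_w = np.average(X_w, axis=0, weights=w)
var_w = np.average((X_w - mu_w) ** 2, axis=0, weights=w)

print(np.allclose(mu_dup, mu_w), np.allclose(var_dup, var_w))   # True True
# Standardizing X_w while ignoring w would give different means/variances,
# and hence a different penalized fit.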


Get this bounty!!!

#HackerRank: Correlation and Regression Lines solutions

import numpy as np
import scipy as sp
import scipy.stats  # ensures sp.stats is available below; the original 'from scipy.stats import norm' was unused

Correlation and Regression Lines – A Quick Recap #1

Here are the test scores of 10 students in physics and history:

Physics Scores 15 12 8 8 7 7 7 6 5 3

History Scores 10 25 17 11 13 17 20 13 9 15

Compute Karl Pearson’s coefficient of correlation between these scores. Compute the answer correct to three decimal places.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: 0.255

This is NOT the actual answer – just the format in which you should provide your answer.

physicsScores=[15, 12,  8,  8,  7,  7,  7,  6, 5,  3]
historyScores=[10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
print(np.corrcoef(historyScores,physicsScores)[0][1])
0.144998154581

Correlation and Regression Lines – A Quick Recap #2

Here are the test scores of 10 students in physics and history:

Physics Scores 15 12 8 8 7 7 7 6 5 3

History Scores 10 25 17 11 13 17 20 13 9 15

Compute the slope of the line of regression obtained while treating Physics as the independent variable. Compute the answer correct to three decimal places.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: 0.255

This is NOT the actual answer – just the format in which you should provide your answer.

sp.stats.linregress(physicsScores,historyScores).slope
0.20833333333333331

Correlation and Regression Lines – A Quick Recap #3

Here are the test scores of 10 students in physics and history:

Physics Scores 15 12 8 8 7 7 7 6 5 3

History Scores 10 25 17 11 13 17 20 13 9 15

When a student scores 10 in Physics, what is his probable score in History? Compute the answer correct to one decimal place.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: 0.255

This is NOT the actual answer – just the format in which you should provide your answer.

def predict(pi, x, y):
    slope, intercept, rvalue, pvalue, stderr = sp.stats.linregress(x, y)
    return slope * pi + intercept

predict(10,physicsScores,historyScores)
15.458333333333332

Correlation and Regression Lines – A Quick Recap #4

The two regression lines of a bivariate distribution are:

4x – 5y + 33 = 0 (line of y on x)

20x – 9y – 107 = 0 (line of x on y).

Estimate the value of x when y = 7. Compute the correct answer to one decimal place.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: 7.2

This is NOT the actual answer – just the format in which you should provide your answer.

x=[i for i in range(0,20)]

'''
    4x - 5y + 33 = 0
    x = ( 5y - 33 ) / 4
    y = ( 4x + 33 ) / 5
    
    20x - 9y - 107 = 0
    x = (9y + 107)/20
    y = (20x - 107)/9
'''
t=7
print( ( 9 * t + 107 ) / 20 )
8.5

Correlation and Regression Lines – A Quick Recap #5

The two regression lines of a bivariate distribution are:

4x – 5y + 33 = 0 (line of y on x)

20x – 9y – 107 = 0 (line of x on y).

Find the variance of y when σx = 3.

Compute the correct answer to one decimal place.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: 7.2

This is NOT the actual answer – just the format in which you should provide your answer.

http://www.mpkeshari.com/2011/01/19/lines-of-regression/

Q.3. If the two regression lines of a bivariate distribution are 4x – 5y + 33 = 0 and 20x – 9y – 107 = 0,

  • calculate the arithmetic means of x and y respectively;
  • estimate the value of x when y = 7;
  • find the variance of y when σx = 3.
Solution:

We have,

4x – 5y + 33 = 0 => y = 4x/5 + 33/5 … (i)

And

20x – 9y – 107 = 0 => x = 9y/20 + 107/20 … (ii)

(i) Solving (i) and (ii), we get mean of x = 13 and mean of y = 17. [Ans.]

(ii) The second line is the line of x on y:

x = (9/20) × 7 + (107/20) = 170/20 = 8.5 [Ans.]

(iii) b_yx = r(σy/σx) => 4/5 = 0.6 × (σy/3), where r = √(b_yx · b_xy) = √((4/5)(9/20)) = 0.6, so σy = (4/5)(3/0.6) = 4. [Ans.]

Variance of y = σy² = 16.
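The same result can be checked numerically (a small sketch, not part of the original solution):

b_yx, b_xy, sigma_x = 4/5, 9/20, 3
r = (b_yx * b_xy) ** 0.5            # r = sqrt(b_yx * b_xy) = 0.6
sigma_y = b_yx * sigma_x / r        # from b_yx = r * sigma_y / sigma_x
print(sigma_y ** 2)                 # 16.0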

What is the difference between linear regression on y with x and x with y?

The Pearson correlation coefficient of x and y is the same whether you compute pearson(x, y) or pearson(y, x). This suggests that doing a linear regression of y given x or of x given y should be the same, but that’s not the case.

The best way to think about this is to imagine a scatter plot of points with y on the vertical axis and x represented by the horizontal axis. Given this framework, you see a cloud of points, which may be vaguely circular, or may be elongated into an ellipse. What you are trying to do in regression is find what might be called the ‘line of best fit’. However, while this seems straightforward, we need to figure out what we mean by ‘best’, and that means we must define what it would be for a line to be good, or for one line to be better than another, etc. Specifically, we must stipulate a loss function. A loss function gives us a way to say how ‘bad’ something is, and thus, when we minimize that, we make our line as ‘good’ as possible, or find the ‘best’ line.

Traditionally, when we conduct a regression analysis, we find estimates of the slope and intercept so as to minimize the sum of squared errors. That is, β^0 and β^1 are chosen to minimize SSE = Σ_i (y_i − (β^0 + β^1 x_i))².

In terms of our scatter plot, this means we are minimizing the sum of the vertical distances between the observed data points and the line.


On the other hand, it is perfectly reasonable to regress x onto y, but in that case, we would put x on the vertical axis, and so on. If we kept our plot as is (with x on the horizontal axis), regressing x onto y (again, using a slightly adapted version of the above equation with x and y switched) means that we would be minimizing the sum of the horizontal distances between the observed data points and the line. This sounds very similar, but is not quite the same thing. (The way to recognize this is to do it both ways, and then algebraically convert one set of parameter estimates into the terms of the other. Comparing the first model with the rearranged version of the second model, it becomes easy to see that they are not the same.)
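As a quick illustration of that last point (a sketch with simulated data, not part of the original answer): fitting both ways and converting the second fit back to the original axes shows the slopes differ.

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)

b_y_on_x = np.polyfit(x, y, 1)[0]    # minimizes vertical distances
b_x_on_y = np.polyfit(y, x, 1)[0]    # minimizes horizontal distances

# On the original axes the second line has slope 1 / b_x_on_y, which does not
# equal b_y_on_x unless the points lie exactly on a straight line.
print(b_y_on_x, 1 / b_x_on_y)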


Note that neither way would produce the same line we would intuitively draw if someone handed us a piece of graph paper with points plotted on it. In that case, we would draw a line straight through the center, but minimizing the vertical distance yields a line that is slightly flatter (i.e., with a shallower slope), whereas minimizing the horizontal distance yields a line that is slightly steeper.

A correlation is symmetrical: x is as correlated with y as y is with x. The Pearson product-moment correlation can be understood within a regression context, however. The correlation coefficient, r, is the slope of the regression line when both variables have been standardized first. That is, you first subtract off the mean from each observation and then divide the differences by the standard deviation. The cloud of data points will now be centered on the origin, and the slope will be the same whether you regress y onto x or x onto y.
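This is easy to verify numerically (again a sketch with simulated data):

import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=500)
y = 0.5 * x + rng.normal(size=500)

zx = (x - x.mean()) / x.std()        # standardize both variables
zy = (y - y.mean()) / y.std()

slope_y_on_x = np.polyfit(zx, zy, 1)[0]
slope_x_on_y = np.polyfit(zy, zx, 1)[0]
r = np.corrcoef(x, y)[0, 1]
print(slope_y_on_x, slope_x_on_y, r)  # all three agree up to floating point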


Now, why does this matter? Using our traditional loss function, we are saying that all of the error is in only one of the variables (viz., y). That is, we are saying that x is measured without error and constitutes the set of values we care about, but that y has sampling error. This is very different from saying the converse. This was important in an interesting historical episode: In the late 70’s and early 80’s in the US, the case was made that there was discrimination against women in the workplace, and this was backed up with regression analyses showing that women with equal backgrounds (e.g., qualifications, experience, etc.) were paid, on average, less than men. Critics (or just people who were extra thorough) reasoned that if this was true, women who were paid equally with men would have to be more highly qualified, but when this was checked, it was found that although the results were ‘significant’ when assessed the one way, they were not ‘significant’ when checked the other way, which threw everyone involved into a tizzy. See here for a famous paper that tried to clear the issue up.

Here’s another way to think about this that approaches the topic through the formulas instead of visually:

The formula for the slope of a simple regression line is a consequence of the loss function that has been adopted. If you are using the standard Ordinary Least Squares loss function (noted above), you can derive the formula for the slope that you see in every intro textbook. This formula can be presented in various forms, one of which I call the ‘intuitive’ formula for the slope. Consider this form for both the situation where you are regressing y on x and where you are regressing x on y:

regressing y on x:  β^1 = Cov(x, y) / Var(x)

regressing x on y:  β^1 = Cov(x, y) / Var(y)

Now, I hope it’s obvious that these would not be the same unless Var(x) equals Var(y). If the variances are equal (e.g., because you standardized the variables first), then so are the standard deviations, and thus the variances would both also equal SD(x)SD(y). In this case, β^1 would equal Pearson’s r, which is the same either way by virtue of the principle of commutativity:

r = Cov(x, y) / (SD(x) SD(y)) = Cov(y, x) / (SD(y) SD(x))

Source

#StackBounty: #regression #time-series #least-squares References Request (Least-Squares Estimates for non i.i.d. Processes)

Bounty: 50

I am interested in suggestions concerning possible applications/problems within applied statistics with respect to least-squares estimates for non-stationary designs. In particular, I would like to know if there are current problems in statistics in which it is important to approach the best average response $Y$ from a possibly non-stationary (but, say, independent and bounded) set of data $(X_{1},Y_{1}),\dots,(X_{n},Y_{n})$ in $\mathbb{R}^{d}\times\mathbb{R}$.

In more precise terms, consider the following observation: if $(X_{1},Y_{1}),\dots,(X_{n},Y_{n})$ is a sequence of (possibly non-i.i.d.) data in a statistical experiment, then the least squares approximation
$$f^{*}:=\arg\min_{f\in\mathcal{F}}\frac{1}{n}\sum_{k=1}^{n}|f(X_{k})-Y_{k}|^{2} \qquad (1)$$
within a family $\mathcal{F}$ of functions $\mathbb{R}^{d}\to\mathbb{R}$ is the natural “simultaneous” estimator in $L^{2}$ of the conditional expectations of the response variable given the explanatory variable. That is, (1) is the natural empirical estimate of
$$\arg\min_{f\in\mathcal{F}}\frac{1}{n}\sum_{k=1}^{n}E\big(Y_{k}-f(X_{k})\big)^{2}= \arg\min_{f\in\mathcal{F}}\frac{1}{n}\sum_{k=1}^{n}E\big(E[Y_{k}|X_{k}]-f(X_{k})\big)^{2}.\qquad (2)$$

Note that in the i.i.d. case, this leads exactly to the classical least squares regression problem, because $E[Y_{k}|X_{k}]$ does not depend on $k$. Note also that, if $X_{k}$ is stationary (perhaps dependent) and the expectation of $Y_{k}$ given $X_{k}=x$ does not depend on $k$, the setting is still an instance of the classical least-squares problem (for the same reason). Possible ramifications follow easily considering for example the case in which the response variables $E[Y_{k}|X_{k}]$ converge in some sense.

The question therefore is: do you know of any interesting applications/references in which the problem (2) has current relevance? (preferably under the independent, non identically distributed case to begin with, but also in situations of dependence).

Reason for this request: in short, I have been working with some collaborators on the problem of non-stationary least squares regression for a particular kind of Markovian evolution related to Monte Carlo methods.

It turns out that our results so far, if correct, seem to address several intermediate cases, and we have been wondering whether there are applications of relevance within the community devoted to these methods. This would hopefully give us a useful set of assumptions against which to further test our results.

Until now I have seen quite a number of articles in which the problem of estimating conditional expectations is addressed via kernel methods for nonindependent, and sometimes nonstationary evolutions. Time Series Analysis, in particular, seems to be an area where these problems are important.

For some reason, nonetheless, the classical least-squares method seems not to have been explored much in that direction. My mathematical intuition tells me that these results could also be of great interest at the practical level and, as pointed out, we would like to confirm this and even address other existing problems in the statistical community.


Get this bounty!!!

#StackBounty: #regression #generalized-linear-model #panel-data Can I get away with using GLM models on "pseudo-panel" big N …

Bounty: 50

Suppose I have a kind of panel data set in which we track the investment totals of a great many customers; these totals may be highly variable and are measured on a monthly basis over the course of 7-10 years. While we are ultimately interested in forecasting the total level of investment of all of our customers, for various reasons we find it desirable to factor the analysis into two stages. First, forecast the rate at which old customers close their accounts (and a separate analysis for new customers, which we’ll forget about for this question). Second, forecast the average investment of each remaining customer. Unfortunately, this is not a true panel data set and we are unable to track one customer across time. Strange, I know!

We hypothesize that length-of-customer-relationship and one or two specific macroeconomic variables might drive the two processes.

The current approach is to model customer attrition with a logistic regression model, and to model average customer investment with a gamma GLM. This seems reasonable to me, having previous experience with only non-economic data. However, I have a concern. As this data is a pseudo panel data set, with large N but small time T, it’s not clear that regression assumptions are satisfied. There is surely some correlation within accounts for a given month/year. And furthermore, though I only understand it vaguely, there are issues when some of our macroeconomic predictors are stochastic and non-stationary.

Can we use these GLM models for this data set? If we had a true panel data set I would at least know the literature to read, but in this case my intuition tells me that a GLM should work, although we have to be careful about interpreting the standard errors, etc.

Should I radically reconsider the GLM models or is there a way to justify it in this context? Is there some specific literature that I might be advised to look into?
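For what it’s worth, the two-stage approach itself is straightforward to set up; the open question is really the standard errors. A minimal sketch of the two stages with statsmodels (fake data, made-up variable names, and a recent statsmodels version assumed):

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 5000
df = pd.DataFrame({
    "tenure_months": rng.integers(1, 120, size=n),
    "macro_rate": rng.normal(2.0, 0.5, size=n),
})
p_close = 1 / (1 + np.exp(-(-3 + 0.01 * df["tenure_months"] - 0.2 * df["macro_rate"])))
df["closed"] = rng.binomial(1, p_close)
df["investment"] = rng.gamma(2.0, np.exp(8 + 0.005 * df["tenure_months"]) / 2.0)

# Stage 1: customer attrition via logistic regression
attrition = smf.glm("closed ~ tenure_months + macro_rate", data=df,
                    family=sm.families.Binomial()).fit()

# Stage 2: average investment of remaining customers via a gamma GLM with log link
investment = smf.glm("investment ~ tenure_months + macro_rate",
                     data=df[df["closed"] == 0],
                     family=sm.families.Gamma(link=sm.families.links.Log())).fit()

# Cluster-robust covariance (cov_type="cluster" in .fit()) is the usual guard against
# within-period correlation, but whether it is adequate here is exactly the question.
print(attrition.summary())
print(investment.summary())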


Get this bounty!!!

#StackBounty: #r #regression #calibration #nls Calibration of a computer model: how to deal with parameter vectors such that for some d…

Bounty: 50

I have an experimental data set $D=\{\mathbf{x}_i,y_i\}_{i=1}^N$, and a computer code with inputs $\mathbf{x}$ and calibration parameters $\boldsymbol{\theta}$, returning a value $s=f(\mathbf{x},\boldsymbol{\theta})$. Assuming Gaussian i.i.d. errors, I want to calibrate my code against the data set $D$ using nls or similar, more advanced functions in R. If you need more details, you can find them in this question. Note that since my model is implicitly defined by my computer code and not by an explicit formula, any nonlinear least squares function must use numerical, not analytical, Jacobians.

Now, the problem is that for some vector $\boldsymbol{\theta}^{(j)}$, the computer code doesn’t converge at all points of $D$. It converges for some data points, but it doesn’t converge for others. From a theoretical point of view, I believe the likelihood $\mathcal{L}(D;\boldsymbol{\theta}^{(j)})$ is not even defined.

Because of various issues, I finally switched to using nonlinear least squares functions such as nlfb from package nlmrt and nls.lm from package minpack.lm, which do not require a formula argument but instead take a function resfn that accepts the parameter vector $\boldsymbol{\theta}$ and returns the vector of residuals $\mathbf{r}=(y_1-f(\mathbf{x}_1,\boldsymbol{\theta}),\dots,y_N-f(\mathbf{x}_N,\boldsymbol{\theta}))$. Suppose that for a certain value $\boldsymbol{\theta}^{(j)}$ proposed by the NLS algorithm, my computer code does not converge at points $\mathbf{x}_{i_1},\dots,\mathbf{x}_{i_m}$. Now, what should I return as the corresponding components $r_{i_1},\dots,r_{i_m}$ of the residual vector $\mathbf{r}$?

  • Should I return NA? I don’t think that would be handled by nlfb or by nls.lm.
  • Should I return an arbitrary, very large number $z$? I think this would “encourage” the NLS algorithm to move away from similar parameter values, and move towards other regions of the parameter space, where hopefully my code would converge on all of $D$. Sounds very ad-hoc.

I tried dealing with this issue empirically: I removed those few points in $D$ that correspond to conditions which are not very interesting for calibration and which are also known to pose convergence issues for my computer code. Since I have $N=319$ experimental points and only 3 parameters to calibrate (for now!), deleting a few points wasn’t an issue. However, the NLS regression is still not converging, and I need to solve this issue. Can you give me some tips?
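Regarding the second bullet, here is what that “large residual” idea looks like in a Python sketch (the question itself concerns R’s nlfb/nls.lm, but the mechanics are the same; the toy simulator, the failure region, and the penalty value are all invented for illustration):

import numpy as np
from scipy.optimize import least_squares

PENALTY = 1e6   # arbitrary large value: exactly the ad-hoc part the question worries about

def simulator(x, theta):
    # stand-in for the CFD code; returns NaN where it "fails to converge"
    out = theta[0] * np.exp(-theta[1] * x) + theta[2]
    out[np.abs(theta[1] * x) > 10] = np.nan
    return out

def residuals(theta, x, y):
    r = y - simulator(x, theta)
    return np.where(np.isnan(r), PENALTY, r)   # replace non-converged points by the penalty

rng = np.random.default_rng(5)
x = np.linspace(0, 5, 40)
y = 2.0 * np.exp(-1.3 * x) + 0.5 + rng.normal(scale=0.05, size=x.size)

fit = least_squares(residuals, x0=[1.0, 1.0, 0.0], args=(x, y))
print(fit.x)
# Note the constant penalty has zero gradient, so the optimizer is repelled from,
# but not guided out of, the non-convergent region -- which is why this feels ad hoc.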


Get this bounty!!!

#StackBounty: #regression #heteroscedasticity #clustered-standard-errors Moulton Factor and Clustering of Standard Errors with Heterosc…

Bounty: 50

When running a linear regression, standard errors may need to be adjusted for clustering. One way to do so is to multiply the “conventional” variance estimate by the Moulton factor, defined for equicorrelated regressors as $1 + \left[\frac{V(n_g)}{\overline{n}}+\overline{n} -1\right]\rho_e$, where $\overline{n}$ represents the average cluster size (average number of observations per cluster), $n_g$ is the number of observations per cluster, $V$ is the variance and $\rho_e$ is the residual correlation.

For a reference see for example Angrist and Pischke’s “Mostly Harmless Econometrics” page 311.

To my understanding, however, this Moulton factor, which comes up in every treatment of clustering, is derived for homoscedastic errors. In that case the standard errors estimated assuming homoscedasticity are multiplied by the square root of the Moulton factor. This is also what the Stata “moulton.ado” file provided on the website of “Mostly Harmless Econometrics” does.

Does the same formula hold for heteroscedastic errors? Is it correct to multiply heteroscedasticity robust standard errors by this same factor? If not, how does one adjust for clustering parametrically (using the Moulton factor) while also dealing with heteroscedasticity?
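For reference, the adjustment in the homoscedastic case is simple to compute; a small sketch (the cluster sizes and the residual correlation are made-up inputs):

import numpy as np

def moulton_factor(cluster_sizes, rho_e):
    # Moulton variance-inflation factor for equicorrelated errors within clusters
    n_g = np.asarray(cluster_sizes, dtype=float)
    n_bar = n_g.mean()
    return 1 + (n_g.var() / n_bar + n_bar - 1) * rho_e

# Hypothetical example: 50 clusters of 20 observations each, residual correlation 0.05
factor = moulton_factor(np.full(50, 20), rho_e=0.05)
print(factor, np.sqrt(factor))   # multiply conventional standard errors by sqrt(factor)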


Get this bounty!!!