#StackBounty: #regression Descriptive modelling using 100 runs

Bounty: 50

I refer you to Breiman’s paper Statistical Modeling: The Two Cultures, where he illustrates some examples of descriptive modelling. In Section 11.1, 100 runs of regression were performed, each time leaving out a randomly selected 10% of the data. May I know what the purpose is of performing 100 runs and averaging the error rate and the estimated relationship between the outcome and the explanatory variables?

I understand that the objective of creating test sets in predictive modelling is to prevent overestimating the predictive accuracy of the model, but I do not quite understand the purpose in the context of descriptive modelling.
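For concreteness, here is a minimal Python sketch of that kind of procedure: 100 random 90/10 splits, recording the held-out error and the fitted coefficients each time. The model choice, and the assumption that X and y are NumPy arrays holding the predictors and the outcome, are illustrative only.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit

# 100 runs, each time holding out a randomly selected 10% of the data
splitter = ShuffleSplit(n_splits=100, test_size=0.10, random_state=0)

errors, coefs = [], []
for train_idx, test_idx in splitter.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    resid = y[test_idx] - model.predict(X[test_idx])
    errors.append(np.mean(resid ** 2))   # held-out error for this run
    coefs.append(model.coef_)            # fitted relationship for this run

print(np.mean(errors))          # average held-out error across the 100 runs
print(np.mean(coefs, axis=0))   # average estimated coefficients across the 100 runs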


Get this bounty!!!

#StackBounty: #regression #machine-learning #variance #cross-validation #predictive-models Does $K$-fold CV with $K=N$ (LOO) provide th…

Bounty: 50

TL;DR: It appears that, contrary to oft-repeated advice, leave-one-out cross-validation (LOO-CV) — that is, $K$-fold CV with $K$ (the number of folds) equal to $N$ (the number of training observations) — yields estimates of the generalization error that are the least variable for any $K$, not the most variable, assuming a certain stability condition on either the model/algorithm, the dataset, or both (I’m not sure which is correct as I don’t really understand this stability condition).

  • Can someone clearly explain what exactly this stability condition is?
  • Is it true that linear regression is one such “stable” algorithm, implying that in that context, LOO-CV is strictly the best choice of CV as far as bias and variance of the estimates of generalization error are concerned?

The conventional wisdom is that the choice of $K$ in $K$-fold CV follows a bias-variance tradeoff, such that lower values of $K$ (approaching 2) lead to estimates of the generalization error that have a more pessimistic bias but lower variance, while higher values of $K$ (approaching $N$) lead to estimates that are less biased but have greater variance. The conventional explanation for this phenomenon of variance increasing with $K$ is given perhaps most prominently in The Elements of Statistical Learning (Section 7.10.1):

With K=N, the cross-validation estimator is approximately unbiased for the true (expected) prediction error, but can have high variance because the N “training sets” are so similar to one another.

The implication is that the $N$ validation errors are more highly correlated, so that their sum is more variable. This line of reasoning has been repeated in many answers on this site (e.g., here, here, here, here, here, here, and here) as well as on various blogs, etc. But a detailed analysis is virtually never given; instead we get only an intuition or a brief sketch of what an analysis might look like.

One can, however, find contradictory statements, usually citing a certain “stability” condition that I don’t really understand. For example, this contradictory answer quotes a couple of paragraphs from a 2015 paper which says, among other things, “For models/modeling procedures with low instability, LOO often has the smallest variability” (emphasis added). This paper (Section 5.2) seems to agree that LOO represents the least variable choice of $K$ as long as the model/algorithm is “stable.” Taking yet another stance on the issue, there is also this paper (Corollary 2), which says “The variance of $k$-fold cross validation […] does not depend on $k$,” again citing a certain “stability” condition.

The explanation about why LOO might be the most variable $K$-fold CV is intuitive enough, but there is a counter-intuition. The final CV estimate of the mean squared error (MSE) is the mean of the MSE estimates in each fold. So as $K$ increases up to $N$, the CV estimate is the mean of an increasing number of random variables. And we know that the variance of a mean decreases with the number of variables being averaged over. So in order for LOO to be the most variable $K$-fold CV, it would have to be true that the increase in variance due to the increased correlation among the MSE estimates outweighs the decrease in variance due to the greater number of folds being averaged over. And it is not at all obvious that this is true.
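To make the counter-intuition concrete, suppose (purely as an idealization) that the $K$ per-fold MSE estimates share a common variance $\sigma^2$ and a common pairwise correlation $\rho$. Then
$$\operatorname{Var}\left(\frac{1}{K}\sum_{i=1}^{K}\widehat{\mathrm{MSE}}_i\right)=\frac{\sigma^2}{K}+\frac{K-1}{K}\,\rho\,\sigma^2,$$
so increasing $K$ shrinks the first term while (on the conventional account) increasing $\rho$ inflates the second, and it is not obvious a priori which effect dominates (note also that $\sigma^2$ itself changes with $K$, since the fold size does).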

Having become thoroughly confused thinking about all this, I decided to run a little simulation for the linear regression case. I simulated 10,000 datasets with $N$=50 and 3 uncorrelated predictors, each time estimating the generalization error using $K$-fold CV with $K$=2, 5, 10, or 50=$N$. The R code is here. Here are the resulting means and variances of the CV estimates across all 10,000 datasets (in MSE units):

         k = 2 k = 5 k = 10 k = n = 50
mean     1.187 1.108  1.094      1.087
variance 0.094 0.058  0.053      0.051

These results show the expected pattern that higher values of $K$ lead to a less pessimistic bias, but also appear to confirm that the variance of the CV estimates is lowest, not highest, in the LOO case.
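For reference, a rough Python analogue of that simulation is sketched below; the linked R code is the authoritative version, and the true coefficients, noise level, and number of replications here are assumptions for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def cv_estimate(k, n=50, rng=None):
    # one simulated dataset with 3 uncorrelated predictors; coefficients and unit noise are assumed
    X = rng.normal(size=(n, 3))
    y = X @ np.array([1.0, 1.0, 1.0]) + rng.normal(size=n)
    scores = cross_val_score(LinearRegression(), X, y, cv=k,
                             scoring="neg_mean_squared_error")
    return -scores.mean()   # K-fold CV estimate of the MSE

rng = np.random.default_rng(1)
for k in (2, 5, 10, 50):
    estimates = [cv_estimate(k, rng=rng) for _ in range(2000)]   # 10,000 in the original
    print(k, np.mean(estimates), np.var(estimates))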

So it appears that linear regression is one of the “stable” cases mentioned in the papers above, where increasing $K$ is associated with decreasing rather than increasing variance in the CV estimates. But what I still don’t understand is:

  • What precisely is this “stability” condition? Does it apply to models/algorithms, datasets, or both to some extent?
  • Is there an intuitive way to think about this stability?
  • What are other examples of stable and unstable models/algorithms or datasets?
  • Is it relatively safe to assume that most models/algorithms or datasets are “stable” and therefore that $K$ should generally be chosen as high as is computationally feasible?


Get this bounty!!!

#StackBounty: #regression #multiple-regression #multivariate-analysis How to model multiple inputs to multiple outputs?

Bounty: 50

1. My purpose

I am trying to build a model with multiple inputs and multiple outputs, which is something like this:
[figure omitted: diagram of the multiple-input, multiple-output model]

I am not sure whether I should first aggregate the x_i into X and the y_i into Y, so as to make the model simpler, like this:
[figure omitted: diagram with the inputs aggregated into X and the outputs into Y]

2. Features of the training data

There are several features for the training data:

  • There are some missing factors for different entries due to lack of resources.
    For example,
    X1 may not have x3 and x5, Y1 may not have y4 and y5,
    X2 may not have x1 and x2, Y2 may not have y4.
  • There might be multiple projections for X=g(Y). For example
    [figure omitted: example of multiple projections for X = g(Y)]

Of course, I would like to minimise the number of projections so as to build a more generalised model.

3. My question

So may I ask how could I tackle this problem?

I feel that regression models (e.g. polynomial regression) and classification models (e.g. logistic regression, neural networks) only allow a single output for each entry.

I also do not think PLS is the right answer, as PLS essentially models multiple x variables against a single y_i rather than considering Y = Σy_i as a whole.

It seems that this is about multivariate (not multivariable) regression. But how can I deal with the missing factors?
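For what it is worth, one common starting point is to treat this as a single multivariate (multi-output) regression after imputing the missing entries. A minimal scikit-learn sketch is below; X_raw, Y_raw and X_new are hypothetical arrays, and mean imputation plus a linear model are placeholders rather than recommendations.

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# X_raw: (n_samples, n_x_factors), Y_raw: (n_samples, n_y_factors),
# with np.nan wherever a factor is missing for a given entry (hypothetical arrays)
model = make_pipeline(SimpleImputer(strategy="mean"),   # fill in missing x factors
                      LinearRegression())               # accepts a 2-D Y (multi-output)

Y_train = SimpleImputer(strategy="mean").fit_transform(Y_raw)   # simple fill for missing y factors
model.fit(X_raw, Y_train)
Y_pred = model.predict(X_new)   # jointly predicts all y_i for new inputs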


Get this bounty!!!

#StackBounty: #regression #multicollinearity Determining statistical significance of linear regression coefficient in the presence of m…

Bounty: 100

Suppose I have a bunch of cities with different population sizes, and I want to see whether there is a positive linear relationship between the number of liquor stores in a city and the number of DUIs, where significance is determined by a t-test of the estimated regression coefficient.

Now clearly the population size of a city is going to be positively correlated with both the number of DUIs and the number of liquor stores. Thus if I run a simple linear regression on liquor stores alone and test whether its regression coefficient is statistically significant, I will likely run into a problem of multicollinearity and over-estimate the effect of liquor stores on DUIs.

Which of the following methods should I use to correct for this?

  1. I should divide the number of liquor stores in the city by its population in order to get a liquor store per capita value and then regress on that.

  2. I should regress on both liquor stores and size, and then look to see if the liquor store coefficient is significant when controlling for size.

  3. Some other method?

I honestly can’t decide which seems more sensible. I vacillate between them; whichever one I happen to be thinking about, I can convince myself that it is the right way.

On the one hand liquor stores per capita seems like the right variable to use, since DUIs are committed by individuals, but that doesn’t seem very statistically rigorous. On the other hand, controlling for size seems statistically rigorous, but rather indirect. Furthermore, if I rescale after computing the liquor stores per capita variable, I get very similar regression coefficients between the two methods, but method 1 produces a smaller p-value.
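To see the two approaches side by side in code, here is a hedged statsmodels sketch; the data frame df with columns duis, stores and pop is hypothetical.

import statsmodels.formula.api as smf

# df is a hypothetical pandas DataFrame with columns "duis", "stores", "pop"
df["stores_pc"] = df["stores"] / df["pop"]

m1 = smf.ols("duis ~ stores_pc", data=df).fit()      # method 1: per-capita predictor
m2 = smf.ols("duis ~ stores + pop", data=df).fit()   # method 2: control for population size

print(m1.summary())   # t-test / p-value for the per-capita coefficient
print(m2.summary())   # t-test / p-value for stores, holding population fixed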


Get this bounty!!!

#StackBounty: #regression #confidence-interval #p-value #bootstrap #nonlinear-regression Efficient nonparametric estimation of confiden…

Bounty: 50

I’m estimating parameters for a complex, “implicit” nonlinear model $f(\mathbf{x}, \boldsymbol{\theta})$. It’s “implicit” in the sense that I don’t have an explicit formula for $f$: its value is the output of a complex computational fluid dynamics (CFD) code. After the NLS regression, I had a look at the residuals, and they don’t look very normal at all. Also, I’m having a lot of issues estimating their variance-covariance matrix: the methods available in nlstools fail with an error.

I suspect the assumption of normally distributed parameter estimators is not valid; thus I would like to use some nonparametric method to estimate confidence intervals, $p$-values and confidence regions for the three parameters of my model. I thought of the bootstrap, but other approaches are welcome, so long as they don’t rely on normality of the parameter estimators. Would this work:

  1. given data set $D=\{P_i=(\mathbf{x}_i,f_i)\}_{i=1}^N$, generate datasets $D_1,\dots,D_m$ by sampling with replacement from $D$
  2. For each $D_i$, use NLS (Nonlinear Least Squares) to estimate model parameters $\boldsymbol{\theta}^*_i=(\theta^*_{1i},\theta^*_{2i},\theta^*_{3i})$
  3. I now have empirical distributions for the NLS parameter estimators. The sample mean of each distribution would be the bootstrap estimate of the corresponding parameter; the 2.5% and 97.5% quantiles would give me confidence intervals. I could also make scatterplot matrices of each parameter against the others and get an idea of the correlations among them. This is the part I like most, because I believe that one parameter is weakly correlated with the others, while the remaining ones are extremely strongly correlated among themselves.

Is this correct? Then how do I compute the $p$-values – what is the null hypothesis for nonlinear regression models? For example, for parameter $\theta_{3}$, is it that $\theta_{3}=0$ while the other two are unrestricted? How would I compute the $p$-value for such a hypothesis from my bootstrap sample $\boldsymbol{\theta}^*_1,\dots,\boldsymbol{\theta}^*_m$? I don’t see the connection with the null…

Also, each NLS fit takes quite some time (say, a few hours) because I need to run my fluid dynamics code $p\times N$ times, where $N$ is the size of $D$ and $p$ is about 40 in my case. The total CPU time for the bootstrap is then $40\times N\times m$ times the time of a single CFD run, which is a lot. I would need a faster way. What can I do? I thought of building a metamodel for my CFD code (for example, a Gaussian process model) and using that for bootstrapping instead of the CFD. What do you think? Would that work?
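As a rough sketch of steps 1–3, with scipy’s curve_fit standing in for the actual NLS fit and a placeholder model function (since the real $f$ is only available through the CFD code), the bootstrap loop might look like this; x_data and y_data stand for the observed dataset $D$.

import numpy as np
from scipy.optimize import curve_fit

def f(x, t1, t2, t3):
    # placeholder: in reality each evaluation would call the CFD code (or a metamodel of it)
    return t1 * np.exp(-t2 * x) + t3

rng = np.random.default_rng(0)
n, m = len(x_data), 1000   # x_data, y_data: the observed dataset D (assumed 1-D here)

boot_thetas = []
for _ in range(m):
    idx = rng.integers(0, n, size=n)   # resample D with replacement
    theta, _ = curve_fit(f, x_data[idx], y_data[idx], p0=[1.0, 1.0, 1.0])
    boot_thetas.append(theta)
boot_thetas = np.array(boot_thetas)

point_est = boot_thetas.mean(axis=0)                   # bootstrap point estimates
ci = np.percentile(boot_thetas, [2.5, 97.5], axis=0)   # percentile confidence intervals
corr = np.corrcoef(boot_thetas, rowvar=False)          # correlation among the three parameters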


Get this bounty!!!

#StackBounty: #regression #scikit-learn #standardization #glmnet #weighted-regression Regularization and scaling feature matrix with we…

Bounty: 50

When using L1 or L2 regularization in a glm it is necessary to standardize the features to have variance 1. When applying weights to the glm, should the feature matrix be standardized so that it has a weighted variance of 1?

The glmnet paper (https://web.stanford.edu/~hastie/Papers/glmnet.pdf) makes no mention of standardizing to have weighted variance 1. Furthermore, there is a simplification in the unweighted case, made in equation (8), where the authors take advantage of the fact that column j has variance 1. When they describe the weighted case, they do not make any simplification that relies on column j having weighted variance 1. This makes it sound like standardization always happens without weights.

When I went through the source code of ElasticNet in sklearn, I was unable to find any references to weights:

https://github.com/scikit-learn/scikit-learn/blob/14031f6/sklearn/linear_model/coordinate_descent.py#L503

This doesn’t seem correct. If I have a data frame with two identical rows, they will contribute twice to the mean and variance. This can equivalently be represented as one row with a weight of 2, but if weights are ignored during standardization the algorithm will give a different result.
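To make the duplicated-row argument concrete, here is a small NumPy sketch of weighted standardization (weighted mean and weighted variance), under which a row repeated twice and the same row given weight 2 are treated identically:

import numpy as np

def standardize(X, w):
    # scale each column to weighted mean 0 and weighted variance 1
    w = w / w.sum()
    mean = np.average(X, axis=0, weights=w)
    var = np.average((X - mean) ** 2, axis=0, weights=w)
    return (X - mean) / np.sqrt(var)

X_dup = np.array([[1.0, 2.0], [3.0, 4.0], [3.0, 4.0]])   # last row duplicated
X_wtd = np.array([[1.0, 2.0], [3.0, 4.0]])               # same data, expressed via weights

a = standardize(X_dup, np.array([1.0, 1.0, 1.0]))
b = standardize(X_wtd, np.array([1.0, 2.0]))
print(np.allclose(a[:2], b))   # True: the two representations standardize identically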


Get this bounty!!!

#HackerRank: Correlation and Regression Lines solutions

import numpy as np
import scipy as sp
import scipy.stats   # imported so that sp.stats.linregress is available below

Correlation and Regression Lines – A Quick Recap #1

Here are the test scores of 10 students in physics and history:

Physics Scores 15 12 8 8 7 7 7 6 5 3

History Scores 10 25 17 11 13 17 20 13 9 15

Compute Karl Pearson’s coefficient of correlation between these scores. Compute the answer correct to three decimal places.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: 0.255

This is NOT the actual answer – just the format in which you should provide your answer.

physicsScores=[15, 12,  8,  8,  7,  7,  7,  6, 5,  3]
historyScores=[10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
print(np.corrcoef(historyScores,physicsScores)[0][1])
0.144998154581

Correlation and Regression Lines – A Quick Recap #2

Here are the test scores of 10 students in physics and history:

Physics Scores 15 12 8 8 7 7 7 6 5 3

History Scores 10 25 17 11 13 17 20 13 9 15

Compute the slope of the line of regression obtained while treating Physics as the independent variable. Compute the answer correct to three decimal places.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: 0.255

This is NOT the actual answer – just the format in which you should provide your answer.

sp.stats.linregress(physicsScores,historyScores).slope
0.20833333333333331

Correlation and Regression Lines – A Quick Recap #3

Here are the test scores of 10 students in physics and history:

Physics Scores 15 12 8 8 7 7 7 6 5 3

History Scores 10 25 17 11 13 17 20 13 9 15

When a student scores 10 in Physics, what is his probable score in History? Compute the answer correct to one decimal place.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: 0.255

This is NOT the actual answer – just the format in which you should provide your answer.

def predict(pi, x, y):
    slope, intercept, rvalue, pvalue, stderr = sp.stats.linregress(x, y)
    return slope * pi + intercept

predict(10,physicsScores,historyScores)
15.458333333333332

Correlation and Regression Lines – A Quick Recap #4

The two regression lines of a bivariate distribution are:

4x - 5y + 33 = 0 (line of y on x)

20x - 9y - 107 = 0 (line of x on y).

Estimate the value of x when y = 7. Compute the correct answer to one decimal place.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: 7.2

This is NOT the actual answer – just the format in which you should provide your answer.

x=[i for i in range(0,20)]

'''
    4x - 5y + 33 = 0
    x = ( 5y - 33 ) / 4
    y = ( 4x + 33 ) / 5
    
    20x - 9y - 107 = 0
    x = (9y + 107)/20
    y = (20x - 107)/9
'''
t=7
print( ( 9 * t + 107 ) / 20 )
8.5

Correlation and Regression Lines – A Quick Recap #5

The two regression lines of a bivariate distribution are:

4x - 5y + 33 = 0 (line of y on x)

20x - 9y - 107 = 0 (line of x on y).

Find the variance of y when σx = 3.

Compute the correct answer to one decimal place.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: 7.2

This is NOT the actual answer – just the format in which you should provide your answer.

http://www.mpkeshari.com/2011/01/19/lines-of-regression/

Q.3. If the two regression lines of a bivariate distribution are 4x - 5y + 33 = 0 and 20x - 9y - 107 = 0,

  • calculate the arithmetic means of x and y respectively;
  • estimate the value of x when y = 7;
  • find the variance of y when σx = 3.

Solution:

We have,

4x - 5y + 33 = 0 => y = 4x/5 + 33/5 ————— (i)

And

20x - 9y - 107 = 0 => x = 9y/20 + 107/20 ————— (ii)

(i) Solving (i) and (ii), we get mean of x = 13 and mean of y = 17. [Ans.]

(ii) The second line is the line of x on y:

x = (9/20) × 7 + (107/20) = 170/20 = 8.5 [Ans.]

(iii) b_yx = r(σy/σx) => 4/5 = 0.6 × (σy/3), where r = √(b_yx · b_xy) = √((4/5)(9/20)) = 0.6, so σy = (4/5)(3/0.6) = 4 [Ans.]

variance = σy² = 16
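A quick numeric check of the above in Python:

import math

b_yx = 4 / 5        # slope of y on x, from 4x - 5y + 33 = 0
b_xy = 9 / 20       # slope of x on y, from 20x - 9y - 107 = 0
r = math.sqrt(b_yx * b_xy)          # correlation coefficient = 0.6
sigma_x = 3
sigma_y = b_yx * sigma_x / r        # from b_yx = r * (sigma_y / sigma_x)
print(sigma_y ** 2)                 # 16.0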

What is the difference between linear regression on y with x and x with y?

The Pearson correlation coefficient of x and y is the same whether you compute pearson(x, y) or pearson(y, x). This suggests that doing a linear regression of y given x or of x given y should be the same, but that’s not the case.

The best way to think about this is to imagine a scatter plot of points with y on the vertical axis and x represented by the horizontal axis. Given this framework, you see a cloud of points, which may be vaguely circular, or may be elongated into an ellipse. What you are trying to do in regression is find what might be called the ‘line of best fit’. However, while this seems straightforward, we need to figure out what we mean by ‘best’, and that means we must define what it would be for a line to be good, or for one line to be better than another, etc. Specifically, we must stipulate a loss function. A loss function gives us a way to say how ‘bad’ something is, and thus, when we minimize that, we make our line as ‘good’ as possible, or find the ‘best’ line.

Traditionally, when we conduct a regression analysis, we find estimates of the slope and intercept so as to minimize the sum of squared errors. These are defined as follows:
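In the usual notation, with intercept and slope estimates $\hat\beta_0$ and $\hat\beta_1$, the quantity being minimized is the sum of squared errors
$$\mathrm{SSE}=\sum_{i=1}^{N}\big(y_i-(\hat\beta_0+\hat\beta_1 x_i)\big)^2 .$$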

In terms of our scatter plot, this means we are minimizing the sum of the vertical distances between the observed data points and the line.

[figure omitted: scatterplot showing the regression of y on x and the vertical residuals]

On the other hand, it is perfectly reasonable to regress x onto y, but in that case, we would put x on the vertical axis, and so on. If we kept our plot as is (with x on the horizontal axis), regressing x onto y (again, using a slightly adapted version of the above equation with x and y switched) means that we would be minimizing the sum of the horizontal distances between the observed data points and the line. This sounds very similar, but is not quite the same thing. (The way to recognize this is to do it both ways, and then algebraically convert one set of parameter estimates into the terms of the other. Comparing the first model with the rearranged version of the second model, it becomes easy to see that they are not the same.)

[figure omitted: scatterplot showing the regression of x on y and the horizontal residuals]

Note that neither way would produce the same line we would intuitively draw if someone handed us a piece of graph paper with points plotted on it. In that case, we would draw a line straight through the center, but minimizing the vertical distance yields a line that is slightly flatter (i.e., with a shallower slope), whereas minimizing the horizontal distance yields a line that is slightly steeper.

A correlation is symmetrical: x is as correlated with y as y is with x. The Pearson product-moment correlation can be understood within a regression context, however. The correlation coefficient, r, is the slope of the regression line when both variables have been standardized first. That is, you first subtract off the mean from each observation, and then divide the differences by the standard deviation. The cloud of data points will now be centered on the origin, and the slope would be the same whether you regressed y onto x, or x onto y.
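A small simulated illustration of these claims, reusing scipy.stats.linregress from earlier in this post (the data here are synthetic):

import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)

b_yx = linregress(x, y).slope        # regress y on x
b_xy = linregress(y, x).slope        # regress x on y: a different line
r = np.corrcoef(x, y)[0, 1]

print(b_yx, b_xy)                    # not equal in general
print(b_yx * b_xy, r ** 2)           # but their product equals r^2
xs = (x - x.mean()) / x.std()
ys = (y - y.mean()) / y.std()
print(linregress(xs, ys).slope, r)   # after standardizing, the slope equals r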

[figure omitted: scatterplot of the standardized data, centered on the origin]

Now, why does this matter? Using our traditional loss function, we are saying that all of the error is in only one of the variables (viz., y). That is, we are saying that x is measured without error and constitutes the set of values we care about, but that y has sampling error. This is very different from saying the converse. This was important in an interesting historical episode: In the late ’70s and early ’80s in the US, the case was made that there was discrimination against women in the workplace, and this was backed up with regression analyses showing that women with equal backgrounds (e.g., qualifications, experience, etc.) were paid, on average, less than men. Critics (or just people who were extra thorough) reasoned that if this was true, women who were paid equally with men would have to be more highly qualified, but when this was checked, it was found that although the results were ‘significant’ when assessed one way, they were not ‘significant’ when checked the other way, which threw everyone involved into a tizzy. See here for a famous paper that tried to clear the issue up.

Here’s another way to think about this that approaches the topic through the formulas instead of visually:

The formula for the slope of a simple regression line is a consequence of the loss function that has been adopted. If you are using the standard Ordinary Least Squares loss function (noted above), you can derive the formula for the slope that you see in every intro textbook. This formula can be presented in various forms; one of which I call the ‘intuitive’ formula for the slope. Consider this form for both the situation where you are regressing y on x, and where you are regressing x on y:
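In the usual notation, that intuitive form is the covariance divided by the variance of whichever variable plays the role of the predictor:
$$\hat\beta_{1}^{\,(y\text{ on }x)}=\frac{\operatorname{Cov}(x,y)}{\operatorname{Var}(x)}, \qquad \hat\beta_{1}^{\,(x\text{ on }y)}=\frac{\operatorname{Cov}(x,y)}{\operatorname{Var}(y)}.$$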

Now, I hope it’s obvious that these would not be the same unless Var(x) equals Var(y). If the variances are equal (e.g., because you standardized the variables first), then so are the standard deviations, and thus the variances would both also equal SD(x)SD(y). In this case, $\hat\beta_1$ would equal Pearson’s r, which is the same either way by virtue of the principle of commutativity:
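In symbols, since $\operatorname{Cov}(x,y)=\operatorname{Cov}(y,x)$,
$$r=\frac{\operatorname{Cov}(x,y)}{SD(x)\,SD(y)}=\frac{\operatorname{Cov}(y,x)}{SD(y)\,SD(x)}.$$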

 

Source

#StackBounty: #regression #time-series #least-squares References Request (Least-Squares Estimates for non i.i.d. Processes)

Bounty: 50

I am interested in suggestions concerning possible applications/problems within applied statistics with respect to least-squares estimates for non-stationary designs. In particular, I would like to know whether there are current problems in statistics in which it is important to approximate the best average response $Y$ from a possibly non-stationary (but, say, independent and bounded) set of data $(X_{1},Y_{1}),\dots,(X_{n},Y_{n})$ in $\mathbb{R}^{d}\times\mathbb{R}$.

In more precise terms, consider the following observation: if $(X_{1},Y_{1}),\dots,(X_{n},Y_{n})$ is a sequence of (possibly non-i.i.d.) data in a statistical experiment, then the least-squares approximation
$$f^{*}:=\arg\min_{f\in\mathcal{F}}\frac{1}{n}\sum_{k=1}^{n}|f(X_{k})-Y_{k}|^{2} \qquad (1)$$
within a family $\mathcal{F}$ of functions $\mathbb{R}^{d}\to\mathbb{R}$ is the natural “simultaneous” estimator in $L^{2}$ of the conditional expectations of the response variable given the explanatory variable. That is, (1) is the natural empirical estimate of
$$\arg\min_{f\in\mathcal{F}}\frac{1}{n}\sum_{k=1}^{n}E\big(Y_{k}-f(X_{k})\big)^{2}= \arg\min_{f\in\mathcal{F}}\frac{1}{n}\sum_{k=1}^{n}E\big(E[Y_{k}|X_{k}]-f(X_{k})\big)^{2}.\qquad(2)$$

Note that in the i.i.d. case this leads exactly to the classical least-squares regression problem, because $E[Y_{k}|X_{k}]$ does not depend on $k$. Note also that if $X_{k}$ is stationary (perhaps dependent) and the expectation of $Y_{k}$ given $X_{k}=x$ does not depend on $k$, the setting is still an instance of the classical least-squares problem (for the same reason). Possible ramifications follow easily by considering, for example, the case in which the conditional expectations $E[Y_{k}|X_{k}]$ converge in some sense.

The question therefore is: do you know of any interesting applications/references in which problem (2) has current relevance? (Preferably in the independent, non-identically-distributed case to begin with, but also in situations of dependence.)

Reason for this request: in short, I have been working with some collaborators on the problem of non-stationary least-squares regression for a particular kind of Markovian evolution related to Monte Carlo methods.

It turns out that our results so far, if correct, seem to address several intermediate cases, and we have been wondering whether there are applications of relevance within the community devoted to these methods. This would hopefully give us a useful set of assumptions against which to further test our results.

Until now I have seen quite a number of articles in which the problem of estimating conditional expectations is addressed via kernel methods for non-independent, and sometimes non-stationary, evolutions. Time series analysis, in particular, seems to be an area where these problems are important.

For some reason, nonetheless, the classical least-squares method does not seem to have been explored much in that direction. My mathematical intuition tells me that these results can also be of great interest at the practical level and, as pointed out, we would like to confirm this and even address other existing problems in the statistical community.


Get this bounty!!!