#StackBounty: #regression #ridge-regression #shrinkage #penalized If shrinkage is applied in a clever way, does it always work better f…

Bounty: 50

Suppose I have two estimators $\widehat{\beta}_1$ and $\widehat{\beta}_2$ that are consistent estimators of the same parameter $\beta_0$ and such that
$$\sqrt{n}(\widehat{\beta}_1 -\beta_0) \stackrel{d}{\rightarrow} \mathcal{N}(0, V_1), \quad \sqrt{n}(\widehat{\beta}_2 -\beta_0) \stackrel{d}{\rightarrow} \mathcal{N}(0, V_2)$$
with $V_1 \leq V_2$ in the p.s.d. sense. Thus, asymptotically $\widehat{\beta}_1$ is more efficient than $\widehat{\beta}_2$. These two estimators are based on different loss functions.

Now I want to look for some shrinkage techniques to improve finite-sample properties of my estimators.

Suppose that I found a shrinkage technique that improves the estimator $\widehat{\beta}_2$ in a finite sample and yields an MSE of $\widehat{\gamma}_2$. Does this imply that I can find a suitable shrinkage technique to apply to $\widehat{\beta}_1$ that will give me an MSE no greater than $\widehat{\gamma}_2$?

In other words, if shrinkage is applied cleverly, does it always work better for more efficient estimators?
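For intuition, here is a scalar sketch under assumptions that are mine, not the question's: suppose $\widehat{\beta}$ is exactly unbiased for $\beta_0$ with variance $V/n$, and consider shrinkage toward zero, $\widehat{\beta}_\lambda=\lambda\widehat{\beta}$ with $\lambda\in[0,1]$. Then
$$\mathrm{MSE}(\widehat{\beta}_\lambda)=\lambda^2\frac{V}{n}+(1-\lambda)^2\beta_0^2,$$
which is minimized at $\lambda^*=\beta_0^2/(\beta_0^2+V/n)$, giving
$$\mathrm{MSE}(\widehat{\beta}_{\lambda^*})=\frac{\beta_0^2\,V/n}{\beta_0^2+V/n},$$
an increasing function of $V$. So in this special case the optimally shrunk version of the lower-variance estimator does attain the smaller MSE; whether anything like this holds in general is exactly what the question asks.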


Get this bounty!!!

#StackBounty: #regression #self-study #generalized-linear-model #data-visualization #model How to calculate the increments in the mean …

Bounty: 50

Suppose that I have the following model
$$g(\mu)=\beta_0+\beta_1(x_1-\bar{x}_1)+\beta_2(x_2-\bar{x}_2)+\beta_3(x_2-\bar{x}_2)^2$$
where $g(\mu)$ is the complementary log-log function.

I calculated the increments in the mean for a unit change in $x_1$ (holding $x_2$ fixed) and for a unit change in $x_2$ (holding $x_1$ fixed).

Fixing the value of $x_2$, I can calculate the increment in $g(\mu)$ as

$$g_1(\mu)=\beta_0+\beta_1((x_1+1)-\bar{x}_1)+\beta_2(x_2-\bar{x}_2)+\beta_3(x_2-\bar{x}_2)^2$$
$$=g(\mu)+\beta_1$$

Now fixing the value of $x_1$,

$$g_1(\mu)=\beta_0+\beta_1(x_1-\bar{x}_1)+\beta_2((x_2+1)-\bar{x}_2)+\beta_3((x_2+1)-\bar{x}_2)^2$$
$$=g(\mu)+\beta_2+\beta_3+2\beta_3(x_2-\bar{x}_2)$$

These are the increments in $g(\mu)$. To calculate the increments in $\mu$, do I just apply the inverse of the link function?
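For concreteness, here is a minimal numeric sketch (with made-up coefficient and covariate values, not taken from the question) of applying the inverse complementary log-log link, $\mu=1-\exp(-\exp(\eta))$, to the linear predictor before and after a unit change in $x_1$. Because the link is nonlinear, the implied change in $\mu$ depends on the baseline linear predictor, not on $\beta_1$ alone.

```python
import numpy as np

def inv_cloglog(eta):
    """Inverse complementary log-log link: mu = 1 - exp(-exp(eta))."""
    return 1.0 - np.exp(-np.exp(eta))

# Illustrative (made-up) coefficients and covariate values -- not from the question.
b0, b1, b2, b3 = -0.5, 0.3, 0.2, -0.1
x1, x2, x1_bar, x2_bar = 1.0, 2.0, 0.8, 1.5

eta = b0 + b1 * (x1 - x1_bar) + b2 * (x2 - x2_bar) + b3 * (x2 - x2_bar) ** 2
eta_shifted = eta + b1                        # linear predictor after a unit increase in x1

mu, mu_shifted = inv_cloglog(eta), inv_cloglog(eta_shifted)
print("increment in g(mu):", eta_shifted - eta)   # equals b1
print("increment in mu   :", mu_shifted - mu)     # depends on the baseline eta, not on b1 alone
```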

Edit:
The increments in the complementary log-log in the first case are
$$g_1(\mu)-g(\mu)=\beta_1$$
so the increment in $\mu$ is
$$1-\exp(-\exp(\beta_1))$$

In the second case the increments in $g$ are
$$g_1(\mu)-g(\mu)=\beta_2+\beta_3+2\beta_3(x_2-\bar{x}_2)$$
so the increments in $\mu$ are
$$1-\exp(-\exp(\beta_2+\beta_3+2\beta_3(x_2-\bar{x}_2)))$$

Is it right?


Get this bounty!!!

#StackBounty: #regression #multiple-regression #terminology #scikit-learn #software How to fit to sum of observations?

Bounty: 100

From a practical point of view, how does one go about fitting a model to training data that consists of sums of a dependent variable over multiple conditions? For example, fitting a model to predict the incomes of individuals given only the incomes of households and the descriptors of each member of each household.

To be more precise: I wish to predict a dependent variable $y$ from $n$ independent variables described by a vector $\vec{x}$. My training data does not consist of the usual sort of observations $(\vec{x}_i, y_i)$. Rather, I have the sums of various mutually exclusive subsets of $\{y_i\}$. In other words, my training data consists of $\{\vec{x}_i\}$ and $\{Y_k\}$, where

$$Y_k=\sum_{i\in J_k}{y_i}$$

The sets $\{J_k\}$ are known and mutually exclusive, meaning that each $i$ is contained in one and only one $J_k$.

I would like to train a model $f(\vec{x})$ with this data using off-the-shelf tools, for example using scikit-learn to fit a random forest regression. It’s not clear to me how to do this through the API, which seems to require the training data to contain observations $(\vec{x}_i, y_i)$.
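For the linear special case only (not the random-forest setting mentioned above), one workaround is to exploit that a linear model's predictions are additive in the features, so $\sum_{i\in J_k}\vec{x}_i^{\top}\beta=\big(\sum_{i\in J_k}\vec{x}_i\big)^{\top}\beta$, and one can regress the group totals $Y_k$ on group-summed feature vectors. The data and group labels below are made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Made-up example: 40 individuals with 3 features each, grouped into 12 "households".
X = rng.normal(size=(40, 3))
groups = rng.integers(0, 12, size=40)        # J_k: which household each individual belongs to
beta_true = np.array([1.0, -2.0, 0.5])
y_individual = 3.0 + X @ beta_true           # unobserved individual outcomes (intercept 3.0)

labels = np.unique(groups)
Y = np.array([y_individual[groups == k].sum() for k in labels])   # observed household sums Y_k

# Sum the feature rows within each group; include the group size as a column
# so the summed intercept (size * intercept) is handled explicitly.
X_agg = np.array([X[groups == k].sum(axis=0) for k in labels])
sizes = np.array([(groups == k).sum() for k in labels])
X_agg = np.column_stack([sizes, X_agg])

# Because a linear model's predictions are additive, fitting on the
# aggregated data recovers the individual-level coefficients.
model = LinearRegression(fit_intercept=False).fit(X_agg, Y)
print(model.coef_)                           # approx [3.0, 1.0, -2.0, 0.5]

# Individual-level predictions from the aggregate-level fit:
y_hat = model.coef_[0] + X @ model.coef_[1:]
```

This trick does not carry over to nonlinear learners such as random forests, which is presumably why the question also asks for the general terminology and tooling.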

Also, what is the best terminology to describe this sort of optimization problem? Is there a specific name for it?


Get this bounty!!!

#StackBounty: #regression #bayesian #autocorrelation #bayes-factors How to account for temporal autocorrelation when computing a Bayes …

Bounty: 50

I have data of some physiological measure, represented as a vector of 185 measurements taken every 2 seconds. I can model this response in two different ways, and I wish to compare the fit of my two competing models so that I can extract a Bayes factor for the ratio of the likelihoods of obtaining this data given the first or the second model.
My approach so far was to fit a BLUE linear regression model to the data (with only one regressor, namely the expected response under the first or the second model), and to use the formula for the log-likelihood of a regression model with the MLE coefficients, $-\frac{N}{2}\left(\log(2\pi)+\log(\sigma^{2})+1\right)$. I then subtracted the two terms to get the log of my Bayes factor.

But this assumes my samples are independent, when in reality they are serially autocorrelated. I guess that shouldn’t bias my BF, but it should nonetheless make it more extreme (polarize it) than it ought to be. How can I overcome this, other than thinning my vector? I guess I should somehow correct my $N$?
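One heuristic in the spirit of “correcting $N$” (an assumption on my part, not something settled by the question) is to replace $N$ with an effective sample size based on the lag-1 autocorrelation of the residuals, $N_{\mathrm{eff}}=N(1-\rho)/(1+\rho)$, before plugging into the log-likelihood formula above; a more principled alternative would be to model the autocorrelation directly, e.g. with AR(1) errors. A rough sketch with placeholder data:

```python
import numpy as np

rng = np.random.default_rng(0)

def loglik_ols(y, x):
    """Maximized Gaussian log-likelihood of a one-regressor OLS fit,
    reported both with N and with an AR(1)-style effective sample size."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n = len(y)
    sigma2 = resid @ resid / n                      # MLE of the error variance
    rho = np.corrcoef(resid[:-1], resid[1:])[0, 1]  # lag-1 autocorrelation of residuals
    n_eff = n * (1 - rho) / (1 + rho)               # heuristic effective sample size
    ll = lambda m: -0.5 * m * (np.log(2 * np.pi) + np.log(sigma2) + 1)
    return ll(n), ll(n_eff)

# Placeholder data: 185 samples, two candidate regressors (the expected responses).
t = np.arange(185)
regressor1, regressor2 = np.sin(t / 15), np.exp(-t / 60)
y = regressor1 + np.convolve(rng.normal(size=200), np.ones(16) / 16, mode="valid")

log_bf = loglik_ols(y, regressor1)[1] - loglik_ols(y, regressor2)[1]  # effective-N version
print("log Bayes factor (heuristic):", log_bf)
```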


Get this bounty!!!

#StackBounty: #regression #cross-validation #modeling #explanatory-models Purpose of leave-one-out cross-validation in descriptive mode…

Bounty: 50

I refer you to Breiman’s paper Statistical Modeling: The Two Cultures, where he illustrates some examples of descriptive modelling. In section 11.1, 100 runs of regression were performed, each time leaving out a randomly selected 10% of the data. May I know what the purpose is of performing 100 runs and averaging the error rate and the estimated relationship between the outcome and the explanatory variables?

I understand that the objective of creating test sets in predictive modelling is to prevent overestimating the predictive accuracy of the model, but I do not quite understand the purpose in the context of descriptive modelling.

Follow-up question: would validation replace the usual methods for assessing a descriptive model’s performance (e.g., goodness-of-fit tests and $R^2$)?


Get this bounty!!!

#StackBounty: #regression #time-series #logistic #lags #random-walk Lag between predicted output and real output in time series predict…

Bounty: 50

I modeled a directional prediction of a time series: at every step, I predict the next direction of the series (up or down). Currently I have a lag in the predicted outputs compared to the real outputs.

[Figures: real directions (top) and predicted directions (bottom)]

For example, in the figures above, the first figure shows the real directions and the second figure shows the predicted directions (red star: next up trend, blue star: next down trend). We can see that the predicted outputs have a 1-step lag. Overall, I would get better results if I did not have this lag in the predictions. I saw this link, which mentions that this is a problem related to the “naive predictor”. Do we have the same behavior in this problem (inputs: different lags of the time series, output: 1 or 0)?
How can I resolve this? Currently, I’m using logistic regression in this model.

  • I checked my input data using a unit root test. The input data was non-stationary, so I transformed it to stationary using different methods (difference, detrend, etc.), but I still have the same problem. Is this the same problem mentioned HERE? (A quick diagnostic check is sketched below.)
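As a quick diagnostic for the “naive predictor” behavior described above, one can compare the model's accuracy against the actual directions with its accuracy against the directions shifted by one step; the arrays below are placeholders, not data from the post.

```python
import numpy as np

# y_true and y_pred are 0/1 direction series of equal length (placeholder values).
y_true = np.array([1, 0, 0, 1, 1, 0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1, 1, 0])

acc = np.mean(y_pred == y_true)                  # accuracy against the real directions
acc_lagged = np.mean(y_pred[1:] == y_true[:-1])  # accuracy against yesterday's direction

print(f"accuracy vs. actual:       {acc:.2f}")
print(f"accuracy vs. lag-1 actual: {acc_lagged:.2f}")
# If the second number is much higher, the model is mostly echoing the
# previous observation (the "naive predictor" behavior discussed above).
```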


Get this bounty!!!

#StackBounty: #regression #machine-learning #variance #cross-validation #predictive-models Does $K$-fold CV with $K=N$ (LOO) provide th…

Bounty: 50

TL;DR: It appears that, contrary to oft-repeated advice, leave-one-out cross-validation (LOO-CV) — that is, $K$-fold CV with $K$ (the number of folds) equal to $N$ (the number of training observations) — yields estimates of the generalization error that are the least variable, not the most variable, of any $K$, assuming a certain stability condition on either the model/algorithm, the dataset, or both (I’m not sure which is correct as I don’t really understand this stability condition).

  • Can someone clearly explain what exactly this stability condition is?
  • Is it true that linear regression is one such “stable” algorithm, implying that in that context, LOO-CV is strictly the best choice of CV as far as bias and variance of the estimates of generalization error are concerned?

The conventional wisdom is that the choice of $K$ in $K$-fold CV follows a bias-variance tradeoff, such that lower values of $K$ (approaching 2) lead to estimates of the generalization error that have more pessimistic bias, but lower variance, while higher values of $K$ (approaching $N$) lead to estimates that are less biased, but with greater variance. The conventional explanation for this phenomenon of variance increasing with $K$ is given perhaps most prominently in The Elements of Statistical Learning (Section 7.10.1):

With K=N, the cross-validation estimator is approximately unbiased for the true (expected) prediction error, but can have high variance because the N “training sets” are so similar to one another.

The implication being that the $N$ validation errors are more highly correlated so that their sum is more variable. This line of reasoning has been repeated in many answers on this site (e.g., here, here, here, here, here, here, and here) as well as on various blogs and etc. But a detailed analysis is virtually never given, instead only an intuition or brief sketch of what an analysis might look like.

One can, however, find contradictory statements, usually citing a certain “stability” condition that I don’t really understand. For example, this contradictory answer quotes a couple of paragraphs from a 2015 paper which says, among other things, “For models/modeling procedures with low instability, LOO often has the smallest variability” (emphasis added). This paper (section 5.2) seems to agree that LOO represents the least variable choice of $K$ as long as the model/algorithm is “stable.” Taking yet another stance on the issue, there is also this paper (Corollary 2), which says “The variance of $k$ fold cross validation […] does not depend on $k$,” again citing a certain “stability” condition.

The explanation about why LOO might be the most variable $K$-fold CV is intuitive enough, but there is a counter-intuition. The final CV estimate of the mean squared error (MSE) is the mean of the MSE estimates in each fold. So as $K$ increases up to $N$, the CV estimate is the mean of an increasing number of random variables. And we know that the variance of a mean decreases with the number of variables being averaged over. So in order for LOO to be the most variable $K$-fold CV, it would have to be true that the increase in variance due to the increased correlation among the MSE estimates outweighs the decrease in variance due to the greater number of folds being averaged over. And it is not at all obvious that this is true.

Having become thoroughly confused thinking about all this, I decided to run a little simulation for the linear regression case. I simulated 10,000 datasets with $N$=50 and 3 uncorrelated predictors, each time estimating the generalization error using $K$-fold CV with $K$=2, 5, 10, or 50=$N$. The R code is here. Here are the resulting means and variances of the CV estimates across all 10,000 datasets (in MSE units):

           k = 2   k = 5   k = 10   k = n = 50
mean       1.187   1.108   1.094    1.087
variance   0.094   0.058   0.053    0.051

These results show the expected pattern that higher values of $K$ lead to a less pessimistic bias, but also appear to confirm that the variance of the CV estimates is lowest, not highest, in the LOO case.
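For readers who want to reproduce something along these lines, here is a scaled-down Python sketch of the kind of simulation described above (the standard normal predictors, unit coefficients, and unit-variance noise are my assumptions; the asker’s original R code is linked in the post but not reproduced in this excerpt):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(1)
n, p, n_datasets = 50, 3, 500            # scaled down from the 10,000 datasets in the post
ks = [2, 5, 10, 50]
estimates = {k: [] for k in ks}

for _ in range(n_datasets):
    X = rng.normal(size=(n, p))                              # 3 uncorrelated predictors
    y = X @ np.array([1.0, 1.0, 1.0]) + rng.normal(size=n)   # unit-variance noise
    for k in ks:
        cv = KFold(n_splits=k, shuffle=True, random_state=0)
        mse = -cross_val_score(LinearRegression(), X, y, cv=cv,
                               scoring="neg_mean_squared_error").mean()
        estimates[k].append(mse)

# Mean and variance of the CV estimates across simulated datasets, per K:
for k in ks:
    vals = np.asarray(estimates[k])
    print(f"K={k:>2}: mean CV-MSE = {vals.mean():.3f}, variance = {vals.var():.3f}")
```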

So it appears that linear regression is one of the “stable” cases mentioned in the papers above, where increasing $K$ is associated with decreasing rather than increasing variance in the CV estimates. But what I still don’t understand is:

  • What precisely is this “stability” condition? Does it apply to models/algorithms, datasets, or both to some extent?
  • Is there an intuitive way to think about this stability?
  • What are other examples of stable and unstable models/algorithms or datasets?
  • Is it relatively safe to assume that most models/algorithms or datasets are “stable” and therefore that $K$ should generally be chosen as high as is computationally feasible?


Get this bounty!!!