#StackBounty: #machine-learning #correlation #nonlinear-regression #r-squared Generalization of Adjusted R-Squared to Nonlinear Models

Bounty: 50

We are in the prototypical machine learning setting. We have a set of random variables $X = (X_1, \ldots, X_p)$ representing predictors, and a random variable $Y$ representing the dependent variable. We assume that $Y = f(X) + \epsilon$, where $\epsilon$ is a random variable with mean $0$ and $f$ is some function.

We define the amount of variance explained as:

$$1 - \frac{\operatorname{Var}(\epsilon)}{\operatorname{Var}(Y)}.$$

I am wondering how best to estimate the amount of variance explained in general, but most importantly for the case of only one predictor ($p=1$).

For the special case of $f$ being linear, and both $X$ and $\epsilon$ being Gaussian, this problem has received a lot of attention in statistics and led to the development of adjusted $R^2$.

Dropping those assumptions, estimating the predictive ability of a learning algorithm, as done in machine learning, seems closely related but is a slightly different question. In particular, the prediction error in machine learning decomposes as $$\text{expected prediction error} = \text{bias}^2 + \text{variance} + \text{irreducible error}.$$ Under $L^2$ loss, $\operatorname{Var}(\epsilon) = \text{irreducible error}$. Thus, the prediction error is larger than $\operatorname{Var}(\epsilon)$. From a machine learning perspective, $\operatorname{Var}(\epsilon)$ essentially quantifies how well the optimal prediction function would perform.
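Spelled out at a fixed input $x$, with $\hat{f}$ the learned predictor and the expectation taken over training sets and the noise, this decomposition reads:

$$\mathbb{E}\big[(Y - \hat{f}(x))^2\big] = \underbrace{\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2}_{\text{bias}^2} + \underbrace{\operatorname{Var}\big(\hat{f}(x)\big)}_{\text{variance}} + \underbrace{\operatorname{Var}(\epsilon)}_{\text{irreducible error}}.$$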

Two naive ideas, both of which seem to work remarkably well for $p=1$, are as follows.

Statistical approach: Do polynomial regression with, say, a polynomial of degree 10, and then calculate adjusted R-squared as usual. This has two problems. First, one has to choose the degree of the polynomial. Second, it assumes that $f$ lies in the chosen set of polynomial functions.
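A minimal sketch of this approach in R, assuming a data frame `d` with a single predictor column `x` and response `y` (the names are illustrative):

    # Statistical approach: degree-10 polynomial fit, then adjusted R-squared.
    fit <- lm(y ~ poly(x, 10), data = d)
    summary(fit)$adj.r.squared  # estimate of the variance explained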

Machine Learning approach: Use a flexible learner. I used a support vector machine with a radial basis function kernel. Then act as if $\text{expected prediction error} = \text{irreducible error}$; that is, just use the estimate of the prediction error, as obtained from, for example, cross-validation, as the estimate of $\operatorname{Var}(\epsilon)$. For a flexible learner, this should be a consistent estimator of $\operatorname{Var}(\epsilon)$, since as the sample size $N \rightarrow \infty$, both $\text{bias}^2$ and variance should converge to $0$. Getting $\text{bias}^2$ to converge to $0$ was the reasoning behind choosing a flexible learner. As an estimator of $\operatorname{Var}(Y)$, just use the usual unbiased variance estimator. This approach could possibly be improved by estimating $\text{bias}^2$ and variance, which seems to be possible, and subtracting those values from the estimated prediction error.
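A corresponding sketch in R, again assuming a data frame `d` with columns `x` and `y`, and using `svm()` from the e1071 package (whose default kernel for regression is the radial basis function):

    # Machine Learning approach: estimate Var(epsilon) by the 10-fold
    # cross-validated MSE of an RBF-kernel SVM, then plug it into
    # 1 - Var(epsilon)/Var(Y).
    library(e1071)
    set.seed(1)
    k <- 10
    folds <- sample(rep(1:k, length.out = nrow(d)))
    cv_mse <- sapply(1:k, function(i) {
      fit <- svm(y ~ x, data = d[folds != i, ])  # radial kernel by default
      mean((d$y[folds == i] - predict(fit, newdata = d[folds == i, ]))^2)
    })
    1 - mean(cv_mse) / var(d$y)  # estimated fraction of variance explained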


Get this bounty!!!

#StackBounty: #r #goodness-of-fit #r-squared #instrumental-variables #endogeneity Can I ignore the negative R-squared value when I am u…

Bounty: 50

I am running an instrumental variable regression using the ‘ivreg’ command in R. I find that all my validity tests related to endogeneity are satisfied, except that the R-squared value is negative. May I know whether I can ignore this negative R-squared value and leave it unreported? If not, what is an alternative way to resolve this issue? The code is below:

    > Y2_ivreg2=ivreg(Y2~x1+x2+x3+x4+x5+x6+x7|x2+x8+x9+x10+x5+x6+x7,data=DATA)
    > summary(Y2_ivreg2,diagnostics=TRUE)

    Call:
    ivreg(formula = Y2 ~ x1 + x2 + x3 + x4 + x5 + 
        x6 + x7 | x2 + x8 + x9 + x10 + 
        x5 + x6 + x7, data = DATA2)

    Residuals:
          Min        1Q    Median        3Q       Max 
    -0.747485 -0.053721 -0.009349  0.044285  1.085256 

    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
    (Intercept)  0.0979178  0.0319244   3.067  0.00218 ** 
    x1           0.0008438  0.0004927   1.712  0.08691 .  
    x2           0.0018515  0.0009135   2.027  0.04277 *  
    x3          -0.0130133  0.0073484  -1.771  0.07667 .  
    x4          -0.0018486  0.0009552  -1.935  0.05303 .  
    x5          -0.0000294  0.0000126  -2.333  0.01971 *  
    x6           0.0018214  0.0008908   2.045  0.04096 *  
    x7          -0.0024457  0.0005488  -4.456 8.61e-06 ***

    Diagnostic tests:
                              df1  df2 statistic p-value    
    Weak instruments (x1)    3 3313   185.440  <2e-16 ***
    Weak instruments (x3)    3 3313  3861.526  <2e-16 ***
    Weak instruments (x4)    3 3313  3126.315  <2e-16 ***
    Wu-Hausman               3 3310     1.943   0.121    
    Sargan                   0   NA        NA      NA    
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.1142 on 3313 degrees of freedom
    Multiple R-Squared: -0.009029,  Adjusted R-squared: -0.01116 
    Wald test: 4.231 on 7 and 3313 DF,  p-value: 0.0001168 
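For reference, the reported value can be recomputed from the fitted object; in two-stage least squares the residual sum of squares can exceed the total sum of squares, which is what drives the R-squared below zero. A minimal check, assuming the `Y2_ivreg2` object from above:

    # Recompute the reported R-squared as 1 - RSS/TSS.
    # In 2SLS, RSS can exceed TSS, so the value can go negative.
    y   <- fitted(Y2_ivreg2) + residuals(Y2_ivreg2)  # observed response
    rss <- sum(residuals(Y2_ivreg2)^2)
    tss <- sum((y - mean(y))^2)
    1 - rss / tss  # should match the Multiple R-Squared above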


Get this bounty!!!

#StackBounty: #r #goodness-of-fit #r-squared #instrumental-variables #endogeneity Can I ignore negative R-squared value when I am using…

Bounty: 50

I am running an instrumental variable regression using the ‘ivreg’ command in R. I find that the R-squared value is negative. May I know whether I can ignore this negative R-squared value and leave it unreported? If not, what is an alternative way to resolve this issue? The code is below:

    > Y2_ivreg2=ivreg(Y2~x1+x2+x3+x4+x5+x6+x7+x8+x9|x2+x10+x11+x12+x5+x6+x7+x8+x9,data=DATA2)
    > summary(Y2_ivreg2,diagnostics=TRUE)

    Call:
    ivreg(formula = Y2 ~ x1 + x2 + x3 + x4 + x5 + 
        x6 + x7 + x8 + x9 | x2 + x10 + 
        x11 + x12 + x5 + x6 + x7 + x8 + 
        x9, data = DATA2)

    Residuals:
          Min        1Q    Median        3Q       Max 
    -0.754860 -0.054511 -0.008602  0.044721  1.098549 

    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
    (Intercept)  2.532e-01  6.376e-02   3.971 7.32e-05 ***
    x1           1.543e-03  7.497e-04   2.058   0.0397 *  
    x2           2.687e-03  1.072e-03   2.505   0.0123 *  
    x3          -1.051e-02  6.245e-03  -1.683   0.0925 .  
    x4          -1.290e-03  7.494e-04  -1.722   0.0852 .  
    x5          -2.010e-02  8.384e-03  -2.398   0.0166 *  
    x6          -9.806e-01  8.123e-01  -1.207   0.2275    
    x7          -2.594e-05  1.253e-05  -2.070   0.0385 *  
    x8           1.664e-03  6.785e-04   2.452   0.0143 *  
    x9          -2.716e-03  5.700e-04  -4.766 1.96e-06 ***

    Diagnostic tests:
                              df1  df2 statistic p-value    
    Weak instruments (x1)    3 3311   142.916  <2e-16 ***
    Weak instruments (x3)    3 3311  4686.511  <2e-16 ***
    Weak instruments (x4)    3 3311  2745.649  <2e-16 ***
    Wu-Hausman               3 3308     3.629  0.0124 *  
    Sargan                   0   NA        NA      NA    
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.1151 on 3311 degrees of freedom
    Multiple R-Squared: -0.02389,   Adjusted R-squared: -0.02667 
    Wald test: 4.735 on 9 and 3311 DF,  p-value: 2.793e-06 


Get this bounty!!!

#StackBounty: #r #goodness-of-fit #r-squared #instrumental-variables #endogeneity Can I ignore R-squared value of one when I am using i…

Bounty: 50

I am running an instrumental variable regression using the ‘ivreg’ command in R. I am using a lagged dependent variable and find that all my validity tests are satisfied, except that the R-squared value is one. May I know whether I can ignore this R-squared value of one and leave it unreported? If not, what is an alternative way to resolve this issue? The code is below:

    > Y_ivreg=ivreg(Y~lag(Y,1)+x1+x2+x3+x4+x5+x6+x7+x8|x9+x10+x11+x12+x3+x4+x6+x7+x8,data=DATA2)
    > summary(Y_ivreg,diagnostics=TRUE)

    Call:
    ivreg(formula = Y ~ lag(Y, 1) + x1 + x2 + x3 + x4 + 
        x5 + x6 + x7 + x8 | x9 + x10 + 
        x11 + x12 + x3 + x4 + x6 + x7 + 
        x8, data = DATA2)

    Residuals:
           Min         1Q     Median         3Q        Max 
    -5.895e-14  1.998e-15  2.476e-14  3.364e-14  6.517e-14 

    Coefficients:
                  Estimate Std. Error    t value Pr(>|t|)    
    (Intercept)  5.896e-13  4.433e-14  1.330e+01  < 2e-16 ***
    lag(Y, 1)    1.000e+00  3.999e-14  2.501e+13  < 2e-16 ***
    x1           1.414e-14  1.488e-15  9.501e+00  < 2e-16 ***
    x2           2.515e-15  2.328e-16  1.080e+01  < 2e-16 ***
    x3          -2.434e-14  2.759e-15 -8.822e+00  < 2e-16 ***
    x4          -1.925e-12  1.972e-13 -9.764e+00  < 2e-16 ***
    x5           2.055e-14  3.955e-15  5.195e+00 2.17e-07 ***
    x6           2.565e-17  3.230e-18  7.940e+00 2.72e-15 ***
    x7          -2.147e-15  2.427e-16 -8.846e+00  < 2e-16 ***
    x8           4.688e-16  1.900e-16  2.468e+00   0.0136 *  

    Diagnostic tests:
                                df1 df2  statistic  p-value    
    Weak instruments (lag(Y, 1))  4 3358    455.18  <2e-16 ***
    Weak instruments (x1)         4 3358    998.12  <2e-16 ***
    Weak instruments (x2)         4 3358   1077.84  <2e-16 ***
    Weak instruments (x5)         4 3358    913.43  <2e-16 ***
    Wu-Hausman                    4 3354      0.78   0.538    
    Sargan                        0   NA        NA      NA    
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 3.497e-14 on 3358 degrees of freedom
    Multiple R-Squared:     1,      Adjusted R-squared:     1 
    Wald test: 5.536e+26 on 9 and 3358 DF,  p-value: < 2.2e-16 
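One detail worth checking with this specification: base R's `stats::lag()` shifts only the time attribute of a time series and leaves the values of a plain vector unchanged, so inside a regression formula on an ordinary data frame, `lag(Y, 1)` can end up identical to `Y`, which would force a perfect fit. A minimal check, assuming the `DATA2` data frame from the call above:

    # stats::lag() on a plain (non-ts) vector returns the same values;
    # if this prints TRUE, the model is effectively regressing Y on itself.
    all.equal(as.numeric(stats::lag(DATA2$Y, 1)), as.numeric(DATA2$Y))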


Get this bounty!!!