#StackBounty: #interaction #econometrics #instrumental-variables #endogeneity Interaction with endogenous variables in the first stage

Bounty: 50

I am working with industry-level data and trying to solve an omitted variables bias problem by using an instrument. The problem with my instrument is that it varies relatively little: for most of my industries its value is the same. In other words, it varies only across groups encompassing several industries.

One of my teachers told me that I can fix this problem by putting additional explanatory variables in the first stage and interacting them with the instrument, and that these variables do not have to be exogenous (with regard to my outcome variable).

The following is what I have thought about this; if someone has a better idea of how to solve the problem I describe above, I would very much appreciate that too!

So basically, first I was thinking of this:

Equation of interest: $$ y_i = \alpha_0 + \alpha_1 x_i + \alpha_2 \mathbf{X} + \epsilon_i $$

Where $y_i$ is the outcome, $\alpha_1$ the coefficient of interest, $x_i$ is the endogenous variable, and $\mathbf{X}$ are covariates.

First stage: $$ x_i = \beta_0 + \beta_1 z_i + \beta_2 w_i + \beta_3 z_i w_i + \beta_4 \mathbf{X} + \eta_i $$
This is used to get $\hat x_i$; $z_i$ is an instrument that has an effect on $y_i$ only through $x_i$, and $w_i$ is a covariate correlated with $\epsilon_i$.

Second stage: $$ y_i = \gamma_1 + \alpha_1 \hat x_i + \gamma_2 w_i + \gamma_3 \mathbf{X} + e_i $$

Then I thought I must have misunderstood something, because this seems kind of weird. But now I have read a paper – Nizalova and Murtazashvili (2014) – (https://www.degruyter.com/document/doi/10.1515/jem-2013-0012/html) that explains why interaction effects of one exogenous and one endogenous variable are still consistently estimated.

And another paper – Bun and Harrison (2019) – (https://www.tandfonline.com/doi/full/10.1080/07474938.2018.1427486) that argues similarly; specifically, they write:

… we have that, even if we have an endogenous regressor x, the OLS estimator of the coefficient $\beta_{xw}$ is consistent and standard heteroskedasticity-robust OLS inference applies.

and

… we show that endogeneity bias can be reduced to zero for the OLS estimator as far as the interaction term is concerned.

Does this mean what I outlined above works? I.e., the interaction term is not endogenous (with respect to $y_i$) even though $w_i$ is?
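One way to sanity-check the mechanics of this setup is a quick simulation. The sketch below (Python/numpy, my own simulated data, not from the question) runs the two stages by hand with a coarse instrument $z$, a covariate $w$, and the interaction $z \cdot w$ in the instrument set. For simplicity $w$ is kept exogenous here; the cited papers spell out the extra conditions needed when $w$ is endogenous.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Coarse instrument: varies only across groups of "industries"
z = rng.integers(0, 2, n).astype(float)
w = rng.normal(size=n)                      # covariate (exogenous in this sketch)
u = rng.normal(size=n)                      # confounder making x endogenous
x = 0.5 * z + 0.3 * w + 0.4 * z * w + u + rng.normal(size=n)
y = 1.0 + 2.0 * x + 0.5 * w + u + rng.normal(size=n)   # true alpha_1 = 2

def ols(X, y):
    """OLS coefficients via least squares."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# First stage: x on constant, z, w, z*w
Z1 = np.column_stack([np.ones(n), z, w, z * w])
xhat = Z1 @ ols(Z1, x)

# Second stage: y on constant, xhat, w
X2 = np.column_stack([np.ones(n), xhat, w])
alpha1_hat = ols(X2, y)[1]
print(round(alpha1_hat, 2))   # close to the true value 2
```

With only the coarse $z$ (no interaction) the first stage would have far less usable variation; the interaction adds within-group variation to the predicted $\hat x_i$.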


Get this bounty!!!

#StackBounty: #regression #econometrics #intuition #instrumental-variables #endogeneity Question about Instrumental variables, endogene…

Bounty: 50

I have seen this notation used to describe the instrumental variable framework, and I wish to make sure I understand it. $y$ is the dependent variable, $x$ is the treatment, and $z$ is the instrument:

$y = f(x,\epsilon)$

$x = g(z,\eta)$

and the endogeneity structure is defined as: $\operatorname{cov}(\epsilon,\eta)\neq 0$, $\operatorname{cov}(z,\epsilon)=0$, $\operatorname{cov}(z,\eta)=0$.

I want to make sure I understand what this is saying.

  1. First, is any variable $z$ that satisfies these conditions an instrument?

  2. If I approximate these functions with linear equations, say $x = \pi z + \eta$, is this saying that we can partition the entire variation of $x$ into the variation explained by $z$ and all the remaining variation $\eta$, and that the endogeneity can be expressed as $\operatorname{cov}(\epsilon,\eta)\neq 0$? I am confused because usually this is simply expressed as $\operatorname{cov}(x,\epsilon)\neq 0$, and I am not familiar with writing this all in terms of errors. Is this the same, since I can just plug in the model for $x$: $\operatorname{cov}(\pi z + \eta,\epsilon) = \operatorname{cov}(\eta,\epsilon)$, given the exogeneity of $z$?

  3. Is this equivalent to saying that there exists some subset of variables, $r \in \epsilon$ and $r \in \eta$, i.e. omitted variables that determine both $x$ and $y$?
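The algebra in point 2 is easy to confirm numerically. A small numpy sketch (my own simulated data, not from the question), with $z$ exogenous and $\eta$ built to be correlated with $\epsilon$ through a shared omitted factor:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
pi = 1.5

z = rng.normal(size=n)               # exogenous instrument: cov(z, eps) = 0
common = rng.normal(size=n)          # shared omitted factor
eps = common + rng.normal(size=n)    # structural error
eta = common + rng.normal(size=n)    # first-stage error, so cov(eps, eta) = 1
x = pi * z + eta

cov_x_eps = np.cov(x, eps)[0, 1]
cov_eta_eps = np.cov(eta, eps)[0, 1]
print(round(cov_x_eps, 2), round(cov_eta_eps, 2))  # approximately equal
```

Since $\operatorname{cov}(x,\epsilon) = \pi\operatorname{cov}(z,\epsilon) + \operatorname{cov}(\eta,\epsilon)$ and the first term is zero by exogeneity of $z$, the two sample covariances agree up to sampling noise, exactly as the plug-in argument in point 2 suggests.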



#StackBounty: #regression #econometrics #probit #endogeneity #2sls 2SLS with a boolean regressor

Bounty: 50

So, I have the following linear model:
$$y = \alpha + \beta x + u$$
and $x \in \{0,1\}$, i.e. the variable $x$ is boolean. Moreover, $x$ may be endogenous, and I have a set of exogenous instrumental variables $\boldsymbol{z}$. In this situation one usually runs a simple 2SLS regression and that’s it. But I was wondering whether one could first regress $x$ on $\boldsymbol{z}$ through probit, and then take the fitted values $\hat{x}$ as an instrument in the second step, i.e. use $\hat{x}$ as an instrumental variable for $x$ in an IV regression.
So I have replaced the OLS regression of the first step with a probit regression.

Is the result of this kind of two step regression consistent? Does it make sense to do so?

Thanks!
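For intuition: the crucial detail (this is Wooldridge's well-known treatment of binary endogenous regressors) is that the nonlinear fitted values are used as an *instrument* for $x$, not plugged in as a regressor; used that way, the procedure stays consistent even if the first-stage model is misspecified. A numpy sketch of that idea (my own simulation, using a logit first stage for simplicity in place of probit):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

z = rng.normal(size=n)
u = rng.normal(size=n)                        # latent confounder
x = (0.8 * z + u > 0).astype(float)           # boolean endogenous treatment
y = 1.0 + 1.5 * x + 0.7 * u + rng.normal(size=n)   # true beta = 1.5

# First step: fit P(x=1|z) by logit MLE (Newton-Raphson); probit works the same way
Z = np.column_stack([np.ones(n), z])
g = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-Z @ g))
    W = p * (1.0 - p)
    g += np.linalg.solve(Z.T @ (Z * W[:, None]), Z.T @ (x - p))
phat = 1.0 / (1.0 + np.exp(-Z @ g))

# Second step: use phat as the INSTRUMENT for x (simple IV / Wald estimator),
# not as a regressor in place of x
beta_iv = np.cov(phat, y)[0, 1] / np.cov(phat, x)[0, 1]
print(round(beta_iv, 2))   # close to the true 1.5
```

Because `phat` is a function of $z$ alone, it inherits the exogeneity of $z$; plugging `phat` directly into the second stage as a regressor (the "forbidden regression") would not have this robustness.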



#StackBounty: #count-data #negative-binomial #poisson-regression #endogeneity #quasi-likelihood Testing for endogeneity in a negative b…

Bounty: 150

I’m trying to fit a negative binomial model to my data because the dependent variable exhibits overdispersion. However, one of my reviewers insists that I also test for endogeneity. He or she is worried that two independent variables are potentially endogenous (one of them might well be…). My question is how one goes about testing for endogeneity in a negative binomial model, ideally in R. Can it be done simultaneously for two variables? I have already found a potential instrument for the more problematic of these two variables (correlated with the endogenous independent variable but uncorrelated with the dependent variable). I’m just not sure how to proceed from here… I see papers that implement a two-step Heckman procedure, running the negative binomial regression with the inverse Mills ratio. However, I have also read that this might not be appropriate…

My current model looks like this (I’m using R). Basically, I’m pooling three years of data from two different countries. I’m primarily interested in the differences between these two countries. I have 2 control variables and 9 independent variables of interest. X1 and X3 are the potentially problematic variables. Y is a count of the different countries in which firms are present, and the independent variables are things like international experience, international education, board independence, etc. Endogeneity arises, for instance, because international firms might hire people with more international experience/education than their local counterparts.

negbin <- glm.nb(Y~ Control1 + Control2 + Year + Country
                 + X1*Country
                 + X2*Country
                 + X3*Country
                 + X4*Country
                 + X5*Country
                 + X6*Country 
                 + X7*Country
                 + X8*Country
                 + X9*Country
                 + X10*Country, data = mydata)
summary(negbin)
car::vif(negbin)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.04651  -1.16581  -0.56598   0.01105   3.00675  

Coefficients:
                         Estimate Std. Error z value Pr(>|z|)    
(Intercept)              1.588771   1.742045   0.912 0.361761    
Control1                 0.240602   0.086086   2.795 0.005191 ** 
Control2                -0.013200   0.003732  -3.537 0.000404 ***
YearThree                0.152904   0.277186   0.552 0.581203    
YearTwo                  0.085071   0.276648   0.308 0.758459    
Country                 -1.899136   2.604823  -0.729 0.465950    
X1                       1.609189   0.652992   2.464 0.013727 *  
X2                       0.146868   0.111476   1.317 0.187674    
X3                      -4.792707   0.748956  -6.399 1.56e-10 ***
X4                       4.352965   0.677561   6.424 1.32e-10 ***
X5                      -0.054561   0.015381  -3.547 0.000389 ***
X6                      -1.497622   0.374987  -3.994 6.50e-05 ***
X7                      -2.689511   0.768235  -3.501 0.000464 ***
X8                      -0.078919   0.069243  -1.140 0.254394    
X9                       4.237630   1.544278   2.744 0.006068 ** 
X10                      3.333337   1.258869   2.648 0.008100 ** 
Country:X1               0.584704   0.992207   0.589 0.555662    
Country:X2              -0.635671   0.332893  -1.910 0.056193 .  
Country:X3               4.508881   0.884777   5.096 3.47e-07 ***
Country:X4              -7.823156   1.411851  -5.541 3.01e-08 ***
Country:X5              -0.003909   0.032332  -0.121 0.903779    
Country:X6               1.001702   0.570836   1.755 0.079294 .  
Country:X7               4.870946   0.991810   4.911 9.05e-07 ***
Country:X8               0.403581   0.100593   4.012 6.02e-05 ***
Country:X9              -2.151496   1.953145  -1.102 0.270655    
Country:X10            -21.951529   4.102211  -5.351 8.74e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
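A standard way to test (and correct) for endogeneity in exponential-mean count models like this is a control-function / two-stage residual inclusion approach: regress the suspect variable on the instrument plus exogenous controls, then add the first-stage residual as an extra regressor in the count model and test its coefficient (in R, that residual would simply be another column passed to glm.nb). The sketch below (Python/numpy, my own simulated data, and Poisson rather than negative binomial for brevity) illustrates the mechanics:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

z = rng.normal(size=n)                        # instrument
v = rng.normal(scale=0.5, size=n)             # first-stage error = confounder
x = 1.0 * z + v                               # endogenous regressor
y = rng.poisson(np.exp(0.2 + 0.5 * x + 0.8 * v))   # v also drives the count

def poisson_fit(X, y, iters=50):
    """Poisson regression via Newton-Raphson."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = np.exp(np.clip(X @ b, -20, 20))  # clip for numerical safety
        b += np.linalg.solve(X.T @ (X * mu[:, None]), X.T @ (y - mu))
    return b

# First stage: OLS of x on (1, z); the residual vhat proxies the confounder
Zc = np.column_stack([np.ones(n), z])
vhat = x - Zc @ np.linalg.lstsq(Zc, x, rcond=None)[0]

# Control function: include vhat in the count model.
# A significant coefficient on vhat signals endogeneity of x.
Xc = np.column_stack([np.ones(n), x, vhat])
b0, bx, bv = poisson_fit(Xc, y)
print(round(bx, 2), round(bv, 2))   # bx near 0.5 once vhat is included; bv near 0.8
```

Without `vhat` the coefficient on `x` would absorb the confounding; with it included, the test of `vhat`'s coefficient doubles as the endogeneity test. Testing two suspect variables simultaneously works the same way with two first stages and two residuals, but requires at least one instrument per endogenous variable.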



#StackBounty: #r #stata #instrumental-variables #endogeneity #hausman What are the differences between tests for overidentification in …

Bounty: 50

I am using 2SLS for my research and I want to test for overidentification. I started out with the Hausman test, of which I have a reasonable grasp.

The problem I have is that I am getting very different results from the Hausman test and the Sargan test.

The Sargan test is done by ivmodel from library(ivmodel). I copied the Hausman test from “Using R for Introductory Econometrics”, page 226, by Florian Heiss.

[1] "############################################################"
[1] "***Hausman Test for Overidentification***"
[1] "############################################################"
[1] "***R2***"
[1] 0.0031
[1] "***Number of observations (nobs)***"
[1] 8937
[1] "***nobs*R2***"
[1] 28
[1] "***p-value***"
[1] 0.00000015


Sargan Test Result:

Sargan Test Statistics=0.31, df=1, p-value is 0.6

On top of this, I am also using ivtobit from Stata, which provides a Wald test of exogeneity.

Lastly, I read about a fourth test, the Hansen J statistic.

What is the difference between all of these tests?
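For reference, the Sargan statistic itself is simple to compute by hand: regress the 2SLS residuals on all the instruments and take $nR^2$ from that auxiliary regression, which is asymptotically $\chi^2$ with degrees of freedom equal to the number of overidentifying restrictions. A numpy sketch (my own simulated data, two valid instruments for one endogenous regressor):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10_000

z1, z2 = rng.normal(size=n), rng.normal(size=n)   # two valid instruments
u = rng.normal(size=n)
x = z1 + 0.5 * z2 + u                             # one endogenous regressor
y = 1.0 + 2.0 * x + u + rng.normal(size=n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# 2SLS by hand
Z = np.column_stack([np.ones(n), z1, z2])
xhat = Z @ ols(Z, x)
b = ols(np.column_stack([np.ones(n), xhat]), y)
resid = y - np.column_stack([np.ones(n), x]) @ b   # residuals use the ACTUAL x

# Sargan: n * R^2 from regressing the residuals on all instruments
fit = Z @ ols(Z, resid)
r2 = 1.0 - np.sum((resid - fit) ** 2) / np.sum((resid - resid.mean()) ** 2)
sargan = n * r2
print(round(sargan, 2))   # ~ chi-squared(1) here, so small when instruments are valid
```

The Hansen J statistic generalizes this to a heteroskedasticity-robust GMM criterion, while the Wald test of exogeneity from ivtobit tests whether the regressor is endogenous at all, which is a different null hypothesis from overidentification.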



#StackBounty: #r #goodness-of-fit #r-squared #instrumental-variables #endogeneity Can I ignore the negative R-squared value when I am u…

Bounty: 50

I am running an instrumental variable regression using the ‘ivreg’ command in R. I find that all my validity tests related to endogeneity are satisfied except for the R-squared value, which is negative. May I know whether I can ignore this negative R-squared value without reporting it? If not, how can I resolve this issue? The code is as below:

    > Y2_ivreg2=ivreg(Y2~x1+x2+x3+x4+x5+x6+x7|x2+x8+x9+x10+x5+x6+x7,data=DATA2)
    > summary(Y2_ivreg2,diagnostics=TRUE)

    Call:
    ivreg(formula = Y2 ~ x1 + x2 + x3 + x4 + x5 + 
        x6 + x7 | x2 + x8 + x9 + x10 + 
        x5 + x6 + x7, data = DATA2)

    Residuals:
          Min        1Q    Median        3Q       Max 
    -0.747485 -0.053721 -0.009349  0.044285  1.085256 

    Coefficients:
              Estimate  Std. Error  t value Pr(>|t|)    
 (Intercept)  0.0979178  0.0319244   3.067  0.00218 ** 
    x1        0.0008438  0.0004927   1.712  0.08691 .  
    x2        0.0018515  0.0009135   2.027  0.04277 *  
    x3       -0.0130133  0.0073484  -1.771  0.07667 .  
    x4       -0.0018486  0.0009552  -1.935  0.05303 .  
    x5       -0.0000294  0.0000126  -2.333  0.01971 *  
    x6        0.0018214  0.0008908   2.045  0.04096 *  
    x7       -0.0024457  0.0005488  -4.456 8.61e-06 ***

    Diagnostic tests:
                              df1  df2 statistic p-value    
    Weak instruments (x1)    3 3313   185.440  <2e-16 ***
    Weak instruments (x3)    3 3313  3861.526  <2e-16 ***
    Weak instruments (x4)    3 3313  3126.315  <2e-16 ***
    Wu-Hausman               3 3310     1.943   0.121    
    Sargan                   0   NA        NA      NA    
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.1142 on 3313 degrees of freedom
    Multiple R-Squared: -0.009029,  Adjusted R-squared: -0.01116 
    Wald test: 4.231 on 7 and 3313 DF,  p-value: 0.0001168 
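For context: a negative R-squared is a normal possibility in 2SLS, because the reported residuals are computed with the actual (not fitted) endogenous regressor, and the 2SLS coefficients do not minimize that residual sum of squares, so RSS can exceed TSS. A small numpy sketch (my own simulation) where the IV estimate is correct yet the R-squared is clearly negative:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000

z = rng.normal(size=n)
u = rng.normal(scale=2.0, size=n)             # strong confounder
x = z + u
y = 2.0 * x - 2.0 * u + rng.normal(size=n)    # true beta = 2; OLS is biased toward 0

beta_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]   # simple IV (Wald) estimator

resid = (y - y.mean()) - beta_iv * (x - x.mean())    # residuals use the ACTUAL x
r2 = 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
print(round(beta_iv, 2), round(r2, 2))   # beta_iv near 2, r2 well below zero
```

Here the structural error is large relative to the total variation in $y$, so the correctly estimated structural residuals have more variance than $y$ itself. This is why R-squared carries no information about instrument validity in IV settings and is routinely left unreported.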



#StackBounty: #r #goodness-of-fit #r-squared #instrumental-variables #endogeneity Can I ignore negative R-squared value when I am using…

Bounty: 50

I am running an instrumental variable regression using the ‘ivreg’ command in R. I find that the R-squared value is negative. May I know whether I can ignore this negative R-squared value without reporting it? If not, how can I resolve this issue? The code is as below:

    > Y2_ivreg2=ivreg(Y2~x1+x2+x3+x4+x5+x6+x7+x8+x9|x2+x10+x11+x12+x5+x6+x7+x8+x9,data=DATA2)
    > summary(Y2_ivreg2,diagnostics=TRUE)

    Call:
    ivreg(formula = Y2 ~ x1 + x2 + x3 + x4 + x5 + 
        x6 + x7 + x8 + x9 | x2 + x10 + 
        x11 + x12 + x5 + x6 + x7 + x8 + 
        x9, data = DATA2)

    Residuals:
          Min        1Q    Median        3Q       Max 
    -0.754860 -0.054511 -0.008602  0.044721  1.098549 

    Coefficients:
              Estimate  Std. Error t value Pr(>|t|)    
 (Intercept)  2.532e-01  6.376e-02   3.971 7.32e-05 ***
    x1        1.543e-03  7.497e-04   2.058   0.0397 *  
    x2        2.687e-03  1.072e-03   2.505   0.0123 *  
    x3       -1.051e-02  6.245e-03  -1.683   0.0925 .  
    x4       -1.290e-03  7.494e-04  -1.722   0.0852 .  
    x5       -2.010e-02  8.384e-03  -2.398   0.0166 *  
    x6       -9.806e-01  8.123e-01  -1.207   0.2275    
    x7       -2.594e-05  1.253e-05  -2.070   0.0385 *  
    x8        1.664e-03  6.785e-04   2.452   0.0143 *  
    x9       -2.716e-03  5.700e-04  -4.766 1.96e-06 ***

    Diagnostic tests:
                              df1  df2 statistic p-value    
    Weak instruments (x1)    3 3311   142.916  <2e-16 ***
    Weak instruments (x3)    3 3311  4686.511  <2e-16 ***
    Weak instruments (x4)    3 3311  2745.649  <2e-16 ***
    Wu-Hausman               3 3308     3.629  0.0124 *  
    Sargan                   0   NA        NA      NA    
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.1151 on 3311 degrees of freedom
    Multiple R-Squared: -0.02389,   Adjusted R-squared: -0.02667 
    Wald test: 4.735 on 9 and 3311 DF,  p-value: 2.793e-06 



#StackBounty: #r #goodness-of-fit #r-squared #instrumental-variables #endogeneity Can I ignore R-squared value of one when I am using i…

Bounty: 50

I am running an instrumental variable regression using the ‘ivreg’ command in R. I am using a lagged dependent variable and find that all my validity tests are satisfied except for the R-squared value, which is one. May I know whether I can ignore this R-squared value of one without reporting it? If not, how can I resolve this issue? The code is as below:

    > Y_ivreg=ivreg(Y~lag(Y,1)+x1+x2+x3+x4+x5+x6+x7+x8|x9+x10+x11+x12+x3+x4+x6+x7+x8,data=DATA2)
    > summary(Y_ivreg,diagnostics=TRUE)

    Call:
    ivreg(formula = Y ~ lag(Y, 1) + x1 + x2 + x3 + x4 + 
        x5 + x6 + x7 + x8 | x9 + x10 + 
        x11 + x12 + x3 + x4 + x6 + x7 + 
        x8, data = DATA2)

    Residuals:
           Min         1Q     Median         3Q        Max 
    -5.895e-14  1.998e-15  2.476e-14  3.364e-14  6.517e-14 

        Coefficients:
            Estimate Std. Error      t value    Pr(>|t|)
(Intercept)   5.896e-13  4.433e-14  1.330e+01  < 2e-16 ***
  lag(Y, 1)   1.000e+00  3.999e-14  2.501e+13  < 2e-16 ***
        x1    1.414e-14  1.488e-15  9.501e+00  < 2e-16 ***
        x2    2.515e-15  2.328e-16  1.080e+01  < 2e-16 ***
        x3   -2.434e-14  2.759e-15 -8.822e+00  < 2e-16 ***
        x4   -1.925e-12  1.972e-13 -9.764e+00  < 2e-16 ***
        x5    2.055e-14  3.955e-15  5.195e+00 2.17e-07 ***
        x6    2.565e-17  3.230e-18  7.940e+00 2.72e-15 ***
        x7   -2.147e-15  2.427e-16 -8.846e+00  < 2e-16 ***
        x8    4.688e-16  1.900e-16  2.468e+00   0.0136 *

    Diagnostic tests:
                                df1 df2  statistic  p-value    
    Weak instruments (lag(Y, 1))  4 3358    455.18  <2e-16 ***
    Weak instruments (x1)         4 3358    998.12  <2e-16 ***
    Weak instruments (x2)         4 3358   1077.84  <2e-16 ***
    Weak instruments (x5)         4 3358    913.43  <2e-16 ***
    Wu-Hausman                    4 3354      0.78   0.538    
    Sargan                        0   NA        NA      NA    
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 3.497e-14 on 3358 degrees of freedom
    Multiple R-Squared:     1,      Adjusted R-squared:     1 
    Wald test: 5.536e+26 on 9 and 3358 DF,  p-value: < 2.2e-16 



#StackBounty: #regression #logistic #instrumental-variables #endogeneity #identification Do I need Sargan test with equal numbers of in…

Bounty: 50

I have an instrumental variable logistic regression that I run with three instruments (z1, z2, z3) and three endogenous variables (k1, k2, k3).

Therefore, since the number of instruments (3) equals the number of endogenous variables (3), the model is just-identified, and standard theory says I do not need to run the Sargan test to check for over-identification.

My model also shows strong instruments (passing the instrument relevance tests) and passes the endogeneity tests (the tests for exogeneity) with these three instruments and endogenous variables.

However, my economic interpretation says that z1 may affect not only k1 but also k2.

Does this mean I still have an over-identification problem? Even if I do, I will still have zero degrees of freedom, which means I cannot run the Sargan test at all.

Shall I assume that I still do not need an over-identification test in this case?

May I also have some literature supporting these arguments, please?

Also, if I interpret z1 as affecting not only k1 but also k2, is this generally acceptable as part of the interpretation of instrumental variable regression results? If not, is there a way to make this interpretation acceptable?

I need a clear answer that does not disregard the last question above, please.

