#StackBounty: #r #categorical-data #interaction #instrumental-variables #2sls A 2SLS when the instrumented variable has two interaction…

Bounty: 50

I am using ivreg and ivmodel in R to run a 2SLS regression.

I would like to instrument one variable, namely $x_1$, which appears in two interaction terms. In this example $x_1$ is a factor variable. The regression is specified in this manner because the ratio between $a$ and $b$ is what matters.

$$y = ax_1x_2 + bx_1x_3 + cx_4 + e$$

For this instrumented variable I have two instruments, $z_1$ and $z_2$. For both of them the following causal diagram applies ($Z$ has only an indirect effect on $Y$, through $X$).

[Causal diagram: $Z \rightarrow X \rightarrow Y$; $Z$ affects $Y$ only through $X$]

For this problem, what is the correct way to instrument $x_1$?

In the data

Translated to some (fake) sample data, the problem looks like this:

$$happiness = a(factor:income) + b(factor:sales) + c(educ) + e$$
which corresponds to
$$y = ax_1x_2 + bx_1x_3 + cx_4 + e$$

Here the instrument $z_1$ is urban and $z_2$ is size. This is where I get confused about how to write the regression.

For the first stage:

What is my dependent variable here?

For the second stage, should I do:

$$happiness = a(urban:income) + b(urban:sales) + c(educ) + e$$
$$happiness = a(size:income) + b(size:sales) + c(educ) + e$$

Or should I just do:

$$happiness = urban(a:income+b:sales) + c(educ) + e$$
$$happiness = size(a:income+b:sales) + c(educ) + e$$

Either way, how should I specify this in R?

library(data.table)
library(ivmodel)
library(AER)
panelID = c(1:50)   
year= c(2001:2010)
country = c("NLD", "BEL", "GER")
urban = c("A", "B", "C")
indust = c("D", "E", "F")
sizes = c(1,2,3,4,5)
n <- 2   # observations per panelID
set.seed(123)
DT <- data.table(panelID = rep(sample(panelID), each = n),
                    country = rep(sample(country, length(panelID), replace = T), each = n),
                    year = c(replicate(length(panelID), sample(year, n))),
                    some_NA = sample(0:5, 6),                                             
                    Factor = sample(0:5, 6), 
                    industry = rep(sample(indust, length(panelID), replace = T), each = n),
                    urbanisation = rep(sample(urban, length(panelID), replace = T), each = n),
                    size = rep(sample(sizes, length(panelID), replace = T), each = n),
                    income = round(runif(100)/10,2),
                    Y_Outcome= round(rnorm(10,100,10),2),
                    sales= round(rnorm(10,10,10),2),
                    happiness = sample(10,10),
                    Sex = round(rnorm(10,0.75,0.3),2),
                    Age = sample(100,100),
                    educ = round(rnorm(10,0.75,0.3),2))        
DT[, uniqueID := .I]                                                          # Creates a unique row ID
DT <- as.data.frame(DT)

To make it slightly easier for someone who is not familiar with the packages to help, I have added what the call structure of the two packages I use looks like.

The structure of the second stage of ivreg is as follows:

second_stage <- ivreg(happiness ~ Factor:income + Factor:sales + educ | urbanisation:income + urbanisation:sales + educ, data = DT)

The structure for ivmodel is:

second_stage <- ivmodel(Y = DT$happiness, D = DT$Factor, Z = DT[, c("urbanisation", "size")], X = DT$educ, na.action = na.omit)
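
For concreteness, one syntactically valid way to include both instruments in the instrument part of the ivreg formula would be something like the following (purely a sketch of the syntax, not a claim that this is the correct way to instrument $x_1$):

both_instruments <- ivreg(happiness ~ Factor:income + Factor:sales + educ |
                            urbanisation:income + urbanisation:sales +
                            size:income + size:sales + educ,
                          data = DT)
summary(both_instruments, diagnostics = TRUE)   # weak-instrument and Sargan diagnostics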

Any help with figuring out how to do this properly would be greatly appreciated!



#StackBounty: #large-data #instrumental-variables #hausman Interpretation of the Hausman test (overidentification in relation to IV'…

Bounty: 50

I am using survey data with a huge number of observations, such as the World Value Surveys. Large sample sizes are obviously very nice, but I have encountered some downsides as well.

To give an example: in almost every econometric model I specify, about 90% of the variables are highly significant. So I have to decide whether an estimate that is statistically significant is also economically significant, which is not always easy.

The biggest issue, however, is that when I resort to instrumental variables, the Hausman test for overidentification is always very, very, very significant. See THIS POST in that regard.

How do I deal with this consequence of large sample sizes?

The only thing I can think of is to reduce the sample size. This, however, seems a very arbitrary way to get the test statistic down.
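
Just to make the mechanics explicit: for a fixed auxiliary $R^2$, an $nR^2$-type test statistic grows linearly with $n$, so with enough observations even a tiny $R^2$ produces a vanishing p-value. A minimal illustration with made-up numbers:

R2 <- 0.003                            # hypothetical auxiliary R^2, held fixed
for (n in c(500, 2000, 10000, 100000)) {
  stat <- n * R2                       # n * R^2 statistic, 1 degree of freedom here
  p    <- pchisq(stat, df = 1, lower.tail = FALSE)
  cat("n =", n, " statistic =", stat, " p-value =", signif(p, 2), "\n")
}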



#StackBounty: #r #stata #instrumental-variables #endogeneity #hausman What are the differences between tests for overidentification in …

Bounty: 50

I am using 2SLS for my research and I want to test for overidentification. I started out with the Hausman test, of which I have a reasonable grasp.

The problem is that I am getting very different results from the Hausman and the Sargan test.

The Sargan test is computed by ivmodel from library(ivmodel). I copied the Hausman test from “Using R for Introductory Econometrics”, page 226, by Florian Heiss.

[1] "############################################################"
[1] "***Hausman Test for Overidentification***"
[1] "############################################################"
[1] "***R2***"
[1] 0.0031
[1] "***Number of observations (nobs)***"
[1] 8937
[1] "***nobs*R2***"
[1] 28
[1] "***p-value***"
[1] 0.00000015


Sargan Test Result:

Sargan Test Statistics=0.31, df=1, p-value is 0.6
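
For reference, here is a minimal sketch of how I understand an nR²-type overidentification statistic like the one above is computed from the 2SLS residuals (the model and the names y, x, w, z1, z2 and dat are hypothetical, not my actual specification):

library(AER)
# Hypothetical illustration: one endogenous regressor x, one exogenous control w,
# and two instruments z1 and z2, in a data frame called dat.
fit <- ivreg(y ~ x + w | z1 + z2 + w, data = dat)
# Auxiliary regression of the 2SLS residuals on all instruments and exogenous
# regressors; n * R^2 is compared to a chi-squared with (#instruments - #endogenous) df.
aux  <- lm(residuals(fit) ~ z1 + z2 + w, data = dat)
stat <- length(residuals(fit)) * summary(aux)$r.squared
pchisq(stat, df = 1, lower.tail = FALSE)   # here: 2 instruments - 1 endogenous = 1 df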

On top of this I am also using ivtobit from Stata, which provides a Wald test of exogeneity.

Lastly, I read about a fourth test, which is the Hansen J statistic.

What is the difference between all of these tests?



#StackBounty: #econometrics #instrumental-variables #matching #random-allocation Matching / Scoring in an Experiment, using instrumenta…

Bounty: 50

Maybe this is better suited here than in Economics, I don’t know. Please excuse the econ language, I cannot do any better.

I did an experiment with random assignment to two treatment groups and one control group.

I had problems encouraging participation in one treatment (treatment 1), so in the end I had to randomly assign a lot of people to treatment 2 and the control group.

I used the initial treatment assignment as an instrument for actual treatment status, in order to obtain LATEs (local average treatment effects).

However, I do not find any statistically significant effects for treatment 1 (maybe due to the smaller sample size, as some effects are economically meaningful but the standard errors (SEs) are huge).

At a virtual conference, people encouraged me to use either entropy balancing or propensity score matching. However, I have not been able to find any examples where people use these methods with experiments or with instruments.

It seems to me that people use matching/balancing methods when they have no control group, no experiment, and no instrument.

Can any of you help and provide hints on how (or whether) to use balancing or matching with instrumental variables/experiments?

Thank you kindly in advance!

(This is a crosspost from Economics StackExchange)



#StackBounty: #instrumental-variables #probit #bivariate #recursive-model Not recovering true coefficient with recursive bivariate prob…

Bounty: 50

I have built a simulated dataset to try to develop my intuition about the recursive bivariate probit model. The challenge I’m running into is that I’m unable to recover the true coefficient in my simulated data in the presence of an unobserved confounder, despite having an instrumental variable.

The true model for observation $i$ is as follows:

\begin{align*}
X^*_i &= \beta_0+\beta_1Z_i+\beta_2O_i+\beta_3U_i+\zeta_i \\
X_i &= \mathbb{I}[X^*_i > 0] \\
Y^*_i &= \beta_4+\beta_5X_i+\beta_6O_i+\beta_7U_i+\epsilon_i \\
Y_i &= \mathbb{I}[Y^*_i > 0]
\end{align*}

I want to learn the relationship between $X_i$ and $Y_i$ (so the goal is to accurately estimate $\beta_5$); I have instrumental variable $Z_i$ (which affects $X_i$ but not $Y_i$), as well as observed factor $O_i$ and unobserved factor $U_i$, both of which impact both $X_i$ and $Y_i$.

I simulate $Z_i$, $O_i$, $U_i$, $\zeta_i$, and $\epsilon_i$ as IID standard normal random variables, and I use coefficients $\beta_0=\beta_1=\beta_4=\beta_5=-1$ and $\beta_2=\beta_3=\beta_6=\beta_7=1$. With the full model specification (including access to the unobserved confounder $U_i$), I can perfectly estimate the coefficients using the recursive bivariate probit model:

bivariateProbit(cbind(b0=1, b1=Z, b2=O, b3=U), cbind(b4=1, b5=X, b6=O, b7=U), X, Y)
#            b0            b1            b2            b3            b4 
# -1.0004121410 -0.9883708493  0.9916502262  1.0030378739 -0.9986777626 
#            b5            b6            b7           rho 
# -1.0003586241  0.9967899947  1.0005921044  0.0007548157 

However, once I drop $U_i$, my estimate for $\hat\beta_5$ becomes -0.717, quite far from the true value of -1:

bivariateProbit(cbind(b0=1, b1=Z, b2=O), cbind(b4=1, b5=X, b6=O), X, Y)
#         b0         b1         b2         b4         b5         b6        rho 
# -0.7077195 -0.6983809  0.7046800 -0.7028858 -0.7172548  0.7091427  0.5080552 

What is causing the estimate to vary so far from the true value once I drop my unobserved confounder? Are there better approaches to obtain an estimate closer to the true value of -1?


R code to construct the example (note that bivariateProbit is a function I implemented to make sure I understood how the bivariate probit model worked):

set.seed(144)
N <- 100000  # observations
Z <- rnorm(N) ; O <- rnorm(N) ; U <- rnorm(N) ; zeta <- rnorm(N) ; eps <- rnorm(N)
b0 <- -1 ; b1 <- -1 ; b2 <- 1 ; b3 <- 1 ; b4 <- -1 ; b5 <- -1 ; b6 <- 1 ; b7 <- 1
X <- as.numeric(b0 + b1*Z + b2*O + b3*U + zeta > 0)
Y <- as.numeric(b4 + b5*X + b6*O + b7*U + eps > 0)

library(pbivnorm)
# Bivariate probit by ML: X1, X2 are the design matrices of the two equations,
# y1, y2 the binary outcomes; the last element of the parameter vector is rho.
bivariateProbit <- function(X1, X2, y1, y2) {
  k1 <- ncol(X1) ; k2 <- ncol(X2)
  nll <- function(beta) {                                 # negative log-likelihood
    eta1 <- as.numeric(X1 %*% beta[seq_len(k1)])          # linear index, equation 1
    eta2 <- as.numeric(X2 %*% beta[k1 + seq_len(k2)])     # linear index, equation 2
    rho  <- beta[k1 + k2 + 1]
    -sum(log(pbivnorm((2*y1-1)*eta1, (2*y2-1)*eta2, (2*y1-1)*(2*y2-1)*rho)))
  }
  optim(setNames(rep(0, k1+k2+1), c(colnames(X1), colnames(X2), "rho")), nll,
        lower = rep(c(-Inf, -1), c(k1+k2, 1)), upper = rep(c(Inf, 1), c(k1+k2, 1)),
        method = "L-BFGS-B")$par
}

