#StackBounty: #count-data #negative-binomial #poisson-regression #endogeneity #quasi-likelihood Testing for endogeneity in a negative b…

Bounty: 150

I’m trying to fit a negative binomial model to my data because the dependent variable exhibits overdispersion. However, one of my reviewers insists that I also test for endogeneity: he or she worries that two of the independent variables are potentially endogenous (one of them in particular). My question is how one goes about testing for endogeneity in a negative binomial model, ideally in R. Can it be done simultaneously for two variables? I have already found a potential instrument for the more problematic of the two variables (correlated with the endogenous independent variable but uncorrelated with the dependent variable). I’m just not sure how to proceed from here… I see papers that implement a two-step Heckman procedure, running the negative binomial regression with the inverse Mills ratio included, but I have also read that this might not be appropriate…

My current model, fitted in R, looks like the following. Basically, I’m pooling three years of data from two different countries, and I’m primarily interested in the differences between these two countries. I have 2 control variables and 10 independent variables of interest; X1 and X3 are the potentially problematic ones. Y is a count of the different countries in which a firm is present, and the independent variables are things like international experience, international education, board independence, etc. Endogeneity arises, for instance, because international firms might hire people with more international experience/education than their local counterparts.

library(MASS)  # provides glm.nb

# Negative binomial model with each X interacted with Country
negbin <- glm.nb(Y ~ Control1 + Control2 + Year + Country
                 + X1*Country
                 + X2*Country
                 + X3*Country
                 + X4*Country
                 + X5*Country
                 + X6*Country
                 + X7*Country
                 + X8*Country
                 + X9*Country
                 + X10*Country, data = mydata)
summary(negbin)
car::vif(negbin)  # check for multicollinearity among predictors

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.04651  -1.16581  -0.56598   0.01105   3.00675  

Coefficients:
                         Estimate Std. Error z value Pr(>|z|)    
(Intercept)              1.588771   1.742045   0.912 0.361761    
Control1                 0.240602   0.086086   2.795 0.005191 ** 
Control2                -0.013200   0.003732  -3.537 0.000404 ***
YearThree                0.152904   0.277186   0.552 0.581203    
YearTwo                  0.085071   0.276648   0.308 0.758459    
Country                 -1.899136   2.604823  -0.729 0.465950    
X1                       1.609189   0.652992   2.464 0.013727 *  
X2                       0.146868   0.111476   1.317 0.187674    
X3                      -4.792707   0.748956  -6.399 1.56e-10 ***
X4                       4.352965   0.677561   6.424 1.32e-10 ***
X5                      -0.054561   0.015381  -3.547 0.000389 ***
X6                      -1.497622   0.374987  -3.994 6.50e-05 ***
X7                      -2.689511   0.768235  -3.501 0.000464 ***
X8                      -0.078919   0.069243  -1.140 0.254394    
X9                       4.237630   1.544278   2.744 0.006068 ** 
X10                      3.333337   1.258869   2.648 0.008100 ** 
Country:X1               0.584704   0.992207   0.589 0.555662    
Country:X2              -0.635671   0.332893  -1.910 0.056193 .  
Country:X3               4.508881   0.884777   5.096 3.47e-07 ***
Country:X4              -7.823156   1.411851  -5.541 3.01e-08 ***
Country:X5              -0.003909   0.032332  -0.121 0.903779    
Country:X6               1.001702   0.570836   1.755 0.079294 .  
Country:X7               4.870946   0.991810   4.911 9.05e-07 ***
Country:X8               0.403581   0.100593   4.012 6.02e-05 ***
Country:X9              -2.151496   1.953145  -1.102 0.270655    
Country:X10            -21.951529   4.102211  -5.351 8.74e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
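In case it helps, one way to operationalise an instrument in a count model is a control-function / two-stage residual inclusion (2SRI) approach, which is often considered more suitable for nonlinear count models than the Heckman inverse-Mills route. Below is a minimal sketch, not a definitive recipe: Z1 is a hypothetical name for the instrument for X1, and the same idea extends to X3 given a second instrument (include both residual series and test them jointly). A t-test on the first-stage residuals doubles as an endogeneity test; the second-stage standard errors should be bootstrapped, since the residuals are themselves estimates.

library(MASS)  # for glm.nb

# Stage 1 (hypothetical instrument Z1): regress the suspect regressor on
# the instrument plus all exogenous covariates, and keep the residuals.
stage1 <- lm(X1 ~ Z1 + Control1 + Control2 + Year + Country, data = mydata)
mydata$res1 <- residuals(stage1)

# Stage 2: refit the count model with the first-stage residuals included.
# A significant coefficient on res1 is evidence that X1 is endogenous;
# with res1 included, the X1 coefficient is the control-function estimate.
negbin_cf <- glm.nb(Y ~ Control1 + Control2 + Year + Country + X1 + res1,
                    data = mydata)
summary(negbin_cf)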



#StackBounty: #r #heteroscedasticity #poisson-regression #weighted-regression Closest approximation of a Poisson GLM using weighted lea…

Bounty: 100

I have an application where I would like to approximate a Poisson GLM with identity link, i.e. a glm of the form

fit = glm(y~x1+x2+x3+..., family=poisson(link=identity), data=data)

using a (non-iterative) weighted least squares model fit instead, i.e. using

fit = lm(y~x1+x2+x3+..., weights=..., data=data)

which comes down to a weighted nonnegative least squares (nnls) fit in which both y and the covariate matrix are multiplied by sqrt(weights).
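To make that equivalence concrete, here is a tiny sketch (with made-up data) checking that lm with a weights argument and an ordinary least squares fit on the sqrt(weights)-scaled response and design matrix give the same coefficients:

set.seed(1)
x <- 1:20
y <- rpois(length(x), lambda = x)
w <- 1/(y + 1e-10)

fit_w  <- lm(y ~ x, weights = w)             # built-in weighted LS
X      <- cbind(1, x)                        # design matrix with intercept
fit_sc <- lm.fit(X * sqrt(w), y * sqrt(w))   # ordinary LS on scaled data
all.equal(unname(coef(fit_w)), unname(fit_sc$coefficients))  # TRUE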

I was wondering, though, what would be the correct/best weights to use to take the mean/variance relationship into account?

For Poisson the variance equals the mean, so would that imply that I should use weights = 1/variance, i.e. weights = 1/(y+epsilon) with a small epsilon (e.g. 1E-10)? I would like to use weighted OLS instead of a GLM mainly for computational reasons (an lm.fit typically being 5-10x faster than a glm.fit) and also because the identity-link Poisson GLM in R doesn’t seem to be implemented in a very robust way (it often complains about not finding valid starting values). A quick test shows that such a weighted LS fit is indeed very close to the corresponding Poisson GLM.
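One alternative to the epsilon guard (a sketch, not something from the question): take the weights from fitted means rather than from the raw counts, i.e. do one extra reweighting pass. For an identity-link Poisson GLM the IRLS weights are exactly 1/mu, so this mimics a single IRLS iteration:

fit0 <- lm(y ~ x, data = data)                    # unweighted first pass
mu0  <- pmax(fitted(fit0), 1e-10)                 # guard against non-positive means
fit1 <- lm(y ~ x, weights = 1/mu0, data = data)   # weights = 1/fitted mean = 1/variance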

When working with a single covariate, sqrt-transforming x and y to stabilize the variance and doing a regular OLS fit on the transformed data also seems to give a fit nearly identical to the Poisson GLM. This trick, however, only works with a single covariate.

I was wondering whether both of these procedures are generally accepted approximations of Poisson GLMs, how they would typically be referred to, and what a good citation would be for each recipe?

To enforce nonnegativity constraints on the fitted coefficients I would then also like to use nnls or glmnet with lower.limits=0 (as glmnet does not support identity-link Poisson; h2o does but does not support nonnegativity constraints). So would using 1/variance weights within a regular (nonnegative) weighted least squares framework be an OK way to approach a Poisson error structure?
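For the constrained variant, here is a minimal sketch using the nnls package and the same scaling trick, assuming x, y and weights as in the quick test below; note that nnls constrains every coefficient, including the intercept, to be nonnegative:

library(nnls)

X  <- cbind(1, x)            # design matrix with intercept column
Xw <- X * sqrt(weights)      # scale each row by sqrt(weights)
yw <- y * sqrt(weights)      # scale the response the same way
fit_nnls <- nnls(Xw, yw)
fit_nnls$x                   # fitted coefficients, all constrained >= 0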

Quick test:

x = 1:100
set.seed(1)
y = rpois(n=length(x), lambda=x)
epsilon = 1E-10
weights = 1/(y+epsilon)                   # 1/variance weights (variance = mean for Poisson)
weights = length(x)*weights/sum(weights)  # normalise weights to sum to n

data = data.frame(x=x, y=y, weights=weights)
fit = glm(y~x, family=poisson(link=identity), data=data) # Poisson GLM
lmfit = lm(sqrt(y)~sqrt(x), data=data) # OLS with x & y sqrt transformed to stabilize variance
wlmfit = lm(y~x, weights=weights, data=data) # weighted OLS with 1/variance weights
preds.glm = do.call(cbind, predict(fit, type="link", se.fit=TRUE)[1:2])
preds.glm = data.frame(ypred=preds.glm[,1],
                       y.lwr=preds.glm[,1]-1.96*preds.glm[,2],
                       y.upr=preds.glm[,1]+1.96*preds.glm[,2])
preds.lm = do.call(cbind, predict(lmfit, se.fit=TRUE)[1:2])
preds.lm = data.frame(ypred=preds.lm[,1]^2,   # back-transform from the sqrt scale
                      y.lwr=(preds.lm[,1]-1.96*preds.lm[,2])^2,
                      y.upr=(preds.lm[,1]+1.96*preds.lm[,2])^2)
preds.wlm = do.call(cbind, predict(wlmfit, se.fit=TRUE)[1:2])
preds.wlm = data.frame(ypred=preds.wlm[,1],   # already on the response scale
                       y.lwr=preds.wlm[,1]-1.96*preds.wlm[,2],
                       y.upr=preds.wlm[,1]+1.96*preds.wlm[,2])
plot(x, y, pch=16, cex=0.5,
     main="Red=Poisson GLM,\nBlue=Unweighted LS on sqrt() transformed data,\nGreen=Weighted LS with 1/variance weights")
lines(x, preds.glm$ypred, col="red",    lty=1, lwd=6)
lines(x, preds.glm$y.lwr, col="red",    lty=1, lwd=4)
lines(x, preds.glm$y.upr, col="red",    lty=1, lwd=4)
lines(x, preds.lm$ypred,  col="blue",   lty=2, lwd=6)
lines(x, preds.lm$y.lwr,  col="blue",   lty=2, lwd=4)
lines(x, preds.lm$y.upr,  col="blue",   lty=2, lwd=4)
lines(x, preds.wlm$ypred, col="green4", lty=3, lwd=6)
lines(x, preds.wlm$y.lwr, col="green4", lty=3, lwd=4)
lines(x, preds.wlm$y.upr, col="green4", lty=3, lwd=4)

[Plot: the three fits and their confidence bands produced by the code above]

Also relevant perhaps is this link.



#StackBounty: #hypothesis-testing #heteroscedasticity #poisson-regression #overdispersion How to perform over-dispersion test where nul…

Bounty: 50

If I understand correctly, a quasi-Poisson regression assumes roughly that
$$
\mbox{E}\left[y\left|x\right.\right] = \exp\left(x^{\top}\beta\right),
\quad
\mbox{VAR}\left(y\left|x\right.\right) = \sigma^2 \exp\left(x^{\top}\beta\right),
$$

and estimates both $\beta$ and $\sigma^2$. (Poisson regression further assumes $\sigma = 1$, rather than estimating it.)
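(For reference, this is what R’s quasipoisson family does; a short sketch with hypothetical data dat, pulling out the estimated dispersion $\sigma^2$:)

fit_qp <- glm(y ~ x, family = quasipoisson, data = dat)  # dat is hypothetical
summary(fit_qp)$dispersion   # Pearson-based estimate of sigma^2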

I would like to test the variance assumption. That is, assuming
$$
\mbox{E}\left[y\left|x\right.\right] = \exp\left(x^{\top}\beta\right),
\quad
\mbox{VAR}\left(y\left|x\right.\right) = f\left(x\right) \exp\left(x^{\top}\beta\right),
$$

I would like to test the null hypothesis
$$
H_0: f\left(x\right) = c,\,\,\mbox{for some}\,c.
$$

The most widely used test, as given by Cameron & Trivedi (and implemented in AER::dispersiontest), seems to assume
$$
\mbox{E}\left[y\left|x\right.\right] = \mu = \exp\left(x^{\top}\beta\right),
\quad
\mbox{VAR}\left(y\left|x\right.\right) = \mu + \alpha\, g\left(\mu\right),
$$

for some specified function $g\left(\cdot\right)$ (typically a linear or quadratic function), and tests the null hypothesis
$$
H_0: \alpha = 0.
$$

Is it possible to adapt this test for my purposes? Concretely, can I, through some perverted usage of dispersiontest, somehow assume
$$
\mbox{VAR}\left(y\left|x\right.\right) = \alpha_1 \mu + \alpha_2 \mu^2,
$$

and test $H_0: \alpha_2 = 0$ regardless of $\alpha_1$?

(I have tried to use dispersiontest on a glm object fitted with the quasipoisson family, and I get an error claiming that “only Poisson GLMs can be tested”. So some extra fiddling will be required.)
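One possible adaptation (a sketch in the spirit of Cameron & Trivedi’s auxiliary regression, not the exact statistic that dispersiontest computes): fit the Poisson mean, regress the squared residuals on $\mu$ and $\mu^2$ without an intercept, and t-test the $\mu^2$ coefficient while leaving the $\mu$ coefficient free, as the null requires. Data and variable names are hypothetical.

fit <- glm(y ~ x, family = poisson, data = dat)  # mean model (hypothetical data)
mu  <- fitted(fit)

# Auxiliary OLS regression: E[(y - mu)^2] = alpha1*mu + alpha2*mu^2.
# Under H0 (variance proportional to the mean), alpha2 = 0 while alpha1 is unrestricted.
aux <- lm(I((dat$y - mu)^2) ~ 0 + mu + I(mu^2))
summary(aux)  # look at the t-test on the I(mu^2) term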

