#StackBounty: #count-data #negative-binomial #poisson-regression #endogeneity #quasi-likelihood Testing for endogeneity in a negative b…

Bounty: 150

I’m trying to fit a negative binomial model to my data because the dependent variable exhibits overdispersion. However, one of my reviewers insists that I also test for endogeneity: they are worried that two of my independent variables are potentially endogenous (one of them more plausibly so than the other). My question is how one goes about testing for endogeneity in a negative binomial model, ideally in R. Can it be done simultaneously for two variables? I have already found a potential instrument for the more problematic of the two variables (it is correlated with the endogenous independent variable but uncorrelated with the dependent variable). I’m just not sure how to proceed from here. I have seen papers that implement a two-step Heckman procedure, running the negative binomial regression with the inverse Mills ratio as an additional regressor, but I have also read that this might not be appropriate…

My current model looks like this (I’m using R). Basically, I’m pooling three years of data from two different countries, and I’m primarily interested in the differences between these two countries. I have 2 control variables and 10 independent variables of interest; X1 and X3 are the potentially problematic ones. Y is a count of the different countries in which a firm is present, and the independent variables are things like international experience, international education, board independence, etc. Endogeneity arises, for instance, because international firms might hire people with more international experience/education than their local counterparts do.

``````library(MASS)  # glm.nb() is in the MASS package

negbin <- glm.nb(Y ~ Control1 + Control2 + Year + Country
                 + X1*Country
                 + X2*Country
                 + X3*Country
                 + X4*Country
                 + X5*Country
                 + X6*Country
                 + X7*Country
                 + X8*Country
                 + X9*Country
                 + X10*Country, data = mydata)
summary(negbin)
car::vif(negbin)

Deviance Residuals:
Min        1Q    Median        3Q       Max
-2.04651  -1.16581  -0.56598   0.01105   3.00675

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)              1.588771   1.742045   0.912 0.361761
Control1                 0.240602   0.086086   2.795 0.005191 **
Control2                -0.013200   0.003732  -3.537 0.000404 ***
YearThree                0.152904   0.277186   0.552 0.581203
YearTwo                  0.085071   0.276648   0.308 0.758459
Country                 -1.899136   2.604823  -0.729 0.465950
X1                       1.609189   0.652992   2.464 0.013727 *
X2                       0.146868   0.111476   1.317 0.187674
X3                      -4.792707   0.748956  -6.399 1.56e-10 ***
X4                       4.352965   0.677561   6.424 1.32e-10 ***
X5                      -0.054561   0.015381  -3.547 0.000389 ***
X6                      -1.497622   0.374987  -3.994 6.50e-05 ***
X7                      -2.689511   0.768235  -3.501 0.000464 ***
X8                      -0.078919   0.069243  -1.140 0.254394
X9                       4.237630   1.544278   2.744 0.006068 **
X10                      3.333337   1.258869   2.648 0.008100 **
Country:X1               0.584704   0.992207   0.589 0.555662
Country:X2              -0.635671   0.332893  -1.910 0.056193 .
Country:X3               4.508881   0.884777   5.096 3.47e-07 ***
Country:X4              -7.823156   1.411851  -5.541 3.01e-08 ***
Country:X5              -0.003909   0.032332  -0.121 0.903779
Country:X6               1.001702   0.570836   1.755 0.079294 .
Country:X7               4.870946   0.991810   4.911 9.05e-07 ***
Country:X8               0.403581   0.100593   4.012 6.02e-05 ***
Country:X9              -2.151496   1.953145  -1.102 0.270655
Country:X10            -21.951529   4.102211  -5.351 8.74e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
``````
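Given that an instrument is already available, a common alternative to the two-step Heckman/inverse-Mills route for count outcomes is a control-function approach (two-stage residual inclusion, 2SRI): regress the endogenous variable on the instrument and the exogenous covariates, then add the first-stage residuals to the negative binomial model. A significant coefficient on those residuals is evidence of endogeneity. Below is a minimal sketch with simulated data; the names `Z`, `u`, `X1`, `X2` are placeholders, not variables from the model above, and in practice the second-stage standard errors should be bootstrapped because the residuals are estimated.

```r
library(MASS)  # for glm.nb()

set.seed(42)
n  <- 500
Z  <- rnorm(n)                       # instrument: drives X1, excluded from Y
u  <- rnorm(n)                       # unobserved confounder causing endogeneity
X1 <- 0.8 * Z + 0.5 * u + rnorm(n)   # endogenous regressor
X2 <- rnorm(n)                       # exogenous regressor
Y  <- rnbinom(n, mu = exp(0.2 + 0.3 * X1 + 0.2 * X2 + 0.5 * u), size = 1.5)

# Stage 1: project the endogenous variable on the instrument(s) and the
# exogenous covariates; keep the residuals (the "control function").
stage1 <- lm(X1 ~ Z + X2)
res1   <- residuals(stage1)

# Stage 2: negative binomial regression with the first-stage residuals added.
# A Wald test on res1 doubles as an endogeneity test; with two suspect
# regressors you would need (at least) one instrument each and would include
# both residual series here.
stage2 <- glm.nb(Y ~ X1 + X2 + res1)
summary(stage2)
```

This only sketches the mechanics; whether 2SRI is preferable to a GMM/IV count estimator for a given application is a separate modelling decision.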

Get this bounty!!!

#StackBounty: #r #heteroscedasticity #poisson-regression #weighted-regression Closest approximation of a Poisson GLM using weighted lea…

Bounty: 100

I have an application where I would like to approximate a Poisson GLM with identity link, i.e. a `glm` of the form

``````fit = glm(y~x1+x2+x3+..., family=poisson(link=identity), data=data)
``````

using a (non-iterative) weighted least squares fit instead, i.e. using

``````fit = lm(y~x1+x2+x3+..., weights=..., data=data)
``````

What would be the correct/best weights to use, though, to take the mean/variance relationship into account?

For a Poisson, variance = mean, so would that imply I should use weights = 1/variance, i.e. `weights = 1/(y+epsilon)` with a small `epsilon` (e.g. 1E-10)? I would like to use weighted OLS instead of a GLM mainly for computational reasons (`lm.fit` typically being 5–10x faster than `glm.fit`), and also because the identity-link Poisson GLM in R does not seem to be implemented very robustly (it often complains about not finding valid starting values). A quick test indeed shows that such a weighted LS fit is very close to the corresponding Poisson GLM.

When working with a single covariate, `sqrt`-transforming `x` and `y` to stabilize the variance and doing a regular OLS fit on the transformed values also gives a nearly identical fit to the Poisson GLM. But this only works with a single covariate.

I was wondering whether both of these procedures are generally accepted approximations of Poisson GLMs, how they would typically be referred to, and what a good citation would be for each recipe?

To enforce nonnegativity on the fitted coefficients I would then also like to use `nnls` or `glmnet` with `lower.limits=0` (`glmnet` does not support identity-link Poisson; `h2o` does, but does not support nonnegativity constraints). So would using 1/variance weights within a regular (nonnegative) weighted least squares framework be an OK way to approximate a Poisson error structure?
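For the nonnegativity part, one way to combine 1/variance weights with NNLS is to scale each row of the design matrix and the response by the square root of its weight, which turns the weighted least squares problem into a plain NNLS problem. A minimal sketch, assuming the CRAN package `nnls` (function `nnls(A, b)`) is available:

```r
library(nnls)  # CRAN package providing nnls(A, b)

set.seed(1)
x1 <- runif(200, 0, 10)
x2 <- runif(200, 0, 10)
y  <- rpois(200, lambda = 1 + 2 * x1 + 0.5 * x2)

# 1/variance weights with a small epsilon guarding against y = 0;
# note that zero counts still get an enormous weight, so the choice
# of epsilon matters in practice.
w <- 1 / (y + 1e-10)

# Row-scaling by sqrt(w) converts weighted LS into unweighted LS,
# which nnls() then solves subject to coefficients >= 0.
A <- cbind(1, x1, x2) * sqrt(w)
b <- y * sqrt(w)
fit <- nnls(A, b)
fit$x  # nonnegative estimates of (intercept, x1, x2)
```

The same row-scaling trick applies to `glmnet` with `lower.limits = 0` on the gaussian family, under the same caveat about the weights.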

Quick test :

``````x = 1:100
set.seed(1)
y = rpois(n = length(x), lambda = x)
epsilon = 1E-10
weights = 1/(y + epsilon)
weights = length(y)*weights/sum(weights)  # normalise weights to sum to n

data = data.frame(x = x, y = y, weights = weights)
fit = glm(y ~ x, family = poisson(link = identity), data = data) # Poisson GLM
lmfit = lm(sqrt(y) ~ sqrt(x), data = data)        # OLS on sqrt-transformed x & y
wlmfit = lm(y ~ x, weights = weights, data = data) # weighted OLS, 1/variance weights

preds.glm = do.call(cbind, predict(fit, se.fit = TRUE)[1:2])
preds.glm = data.frame(ypred = preds.glm[,1],
                       y.lwr = preds.glm[,1] - 1.96*preds.glm[,2],
                       y.upr = preds.glm[,1] + 1.96*preds.glm[,2])
preds.lm = do.call(cbind, predict(lmfit, se.fit = TRUE)[1:2])
preds.lm = data.frame(ypred = preds.lm[,1]^2,  # back-transform from sqrt scale
                      y.lwr = (preds.lm[,1] - 1.96*preds.lm[,2])^2,
                      y.upr = (preds.lm[,1] + 1.96*preds.lm[,2])^2)
preds.wlm = do.call(cbind, predict(wlmfit, se.fit = TRUE)[1:2])
preds.wlm = data.frame(ypred = preds.wlm[,1],  # already on the response scale
                       y.lwr = preds.wlm[,1] - 1.96*preds.wlm[,2],
                       y.upr = preds.wlm[,1] + 1.96*preds.wlm[,2])

plot(x, y, pch = 16, cex = 0.5,
     main = "Red = Poisson GLM\nBlue = Unweighted LS on sqrt() transformed data\nGreen = Weighted LS with 1/variance weights")
lines(x, preds.glm$ypred, col = "red", lty = 1, lwd = 6)
lines(x, preds.glm$y.lwr, col = "red", lty = 1, lwd = 4)
lines(x, preds.glm$y.upr, col = "red", lty = 1, lwd = 4)
lines(x, preds.lm$ypred, col = "blue", lty = 2, lwd = 6)
lines(x, preds.lm$y.lwr, col = "blue", lty = 2, lwd = 4)
lines(x, preds.lm$y.upr, col = "blue", lty = 2, lwd = 4)
lines(x, preds.wlm$ypred, col = "green4", lty = 3, lwd = 6)
lines(x, preds.wlm$y.lwr, col = "green4", lty = 3, lwd = 4)
lines(x, preds.wlm$y.upr, col = "green4", lty = 3, lwd = 4)
``````

Also relevant perhaps is this link.


#StackBounty: #hypothesis-testing #heteroscedasticity #poisson-regression #overdispersion How to perform over-dispersion test where nul…

Bounty: 50

If I understand correctly, a quasi-Poisson regression assumes roughly that
$$\mathrm{E}\left[y \mid x\right] = \exp\left(x^{\top}\beta\right), \quad \mathrm{Var}\left(y \mid x\right) = \sigma^2 \exp\left(x^{\top}\beta\right),$$
and estimates both $\beta$ and $\sigma^2$. (Poisson regression further assumes $\sigma = 1$, rather than estimating it.)

I would like to test the variance assumption. That is, assuming
$$\mathrm{E}\left[y \mid x\right] = \exp\left(x^{\top}\beta\right), \quad \mathrm{Var}\left(y \mid x\right) = f\left(x\right) \exp\left(x^{\top}\beta\right),$$
I would like to test the null hypothesis
$$H_0: f\left(x\right) = c \quad \text{for some } c.$$

The most widely used test, as given by Cameron & Trivedi (and implemented in `AER::dispersiontest`), seems to assume
$$\mathrm{E}\left[y \mid x\right] = \mu = \exp\left(x^{\top}\beta\right), \quad \mathrm{Var}\left(y \mid x\right) = \mu + \alpha\, g\left(\mu\right),$$
for some specified function $g\left(\cdot\right)$ (typically a linear or quadratic function), and tests the null hypothesis
$$H_0: \alpha = 0.$$
Is it possible to adapt this test for my purposes? Concretely, can I, through some perverted usage of `dispersiontest`, somehow assume
$$\mathrm{Var}\left(y \mid x\right) = \alpha_1 \mu + \alpha_2 \mu^2,$$
and test $H_0: \alpha_2 = 0$ regardless of $\alpha_1$?

(I have tried to use `dispersiontest` on a `glm` object fit with `quasipoisson` family, and get an error claiming that “only Poisson GLMs can be tested”. So some extra fiddling will be required.)
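If `dispersiontest` itself cannot be coaxed into this form, the Cameron & Trivedi auxiliary regression behind it can be hand-rolled. Under Var(y|x) = a1·mu + a2·mu^2 and E[y|x] = mu, we have E[(y − mu)^2 − y | x] = (a1 − 1)·mu + a2·mu^2, so regressing ((y − mu)^2 − y)/mu on mu, with an intercept absorbing a1 − 1, yields a t-test of a2 = 0 that is agnostic about a1. A minimal sketch with simulated data that is truly Poisson (so a1 = 1, a2 = 0 and the slope should be near zero):

```r
set.seed(123)
n <- 1000
x <- rnorm(n)
y <- rpois(n, lambda = exp(1 + 0.5 * x))

# Fit the plain Poisson GLM to get the fitted means mu-hat
pois <- glm(y ~ x, family = poisson)
mu   <- fitted(pois)

# Under Var(y|x) = a1*mu + a2*mu^2:
#   E[(y - mu)^2 - y | x] = (a1 - 1)*mu + a2*mu^2,
# so after dividing by mu the slope on mu in this auxiliary OLS is a2:
z   <- ((y - mu)^2 - y) / mu
aux <- lm(z ~ mu)

# t-test on the mu coefficient tests H0: a2 = 0 regardless of a1
summary(aux)$coefficients["mu", ]
```

Cameron & Trivedi recommend heteroskedasticity-robust standard errors for this auxiliary regression (e.g. `lmtest::coeftest(aux, vcov = sandwich::vcovHC(aux))`), so treat the plain `lm` t-statistic above as a rough version of the test.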
