## #StackBounty: #goodness-of-fit #kolmogorov-smirnov-test #power-law #zipf Is KS test really appropriate when validating a power law/esti…

### Bounty: 50

I’m attempting to find out whether some highly skewed data are drawn from a power law distribution, following the popular paper by Clauset, Shalizi and Newman, 2009.

Clauset et al. use the Kolmogorov-Smirnov (KS) statistic to measure the goodness-of-fit of the data to the hypothesised power law distribution. However, in an old paper on the Whitworth distribution by Nancy Geller, she mentions that once observations are ranked, they are no longer independently and identically distributed and therefore the KS test becomes invalid.

My question: Does this mean that the KS test is also invalid when considering any power law where a quantity `x` is scaled according to its rank (i.e. Zipf’s law)? Or, is it still valid since Clauset et al. did a simulation test and, in Fig. 4 (p. 673), it appears as though the KS test performs fine anyway?
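
For reference, the goodness-of-fit procedure in Clauset et al. calibrates the KS statistic by simulation rather than by the standard KS tables. The following is a minimal base-R sketch of that idea, not the authors' code. It assumes a continuous power law with a fixed, known lower cutoff `xmin` (Clauset et al. also estimate `xmin`), and the helper names `rplaw`, `pplaw`, `fit_alpha` and `ks_dist` are made up for illustration.

```r
# Sketch only: continuous power law, x_min assumed known (an assumption,
# the paper also estimates x_min), parametric bootstrap of the KS distance.
set.seed(42)

# Inverse-CDF sampler, CDF and MLE for p(x) ~ x^(-alpha), x >= xmin
rplaw <- function(n, alpha, xmin) xmin * (1 - runif(n))^(-1 / (alpha - 1))
pplaw <- function(x, alpha, xmin) 1 - (x / xmin)^(-(alpha - 1))
fit_alpha <- function(x, xmin) 1 + length(x) / sum(log(x / xmin))

# KS distance between the empirical CDF and the fitted power-law CDF
ks_dist <- function(x, alpha, xmin) {
  x <- sort(x)                      # sorting is only used to build the ECDF
  n <- length(x)
  i <- seq_len(n)
  cdf <- pplaw(x, alpha, xmin)
  max(cdf - (i - 1) / n, i / n - cdf)
}

xmin <- 1
x <- rplaw(1000, alpha = 2.5, xmin = xmin)   # simulated stand-in for real data
alpha_hat <- fit_alpha(x, xmin)
d_obs <- ks_dist(x, alpha_hat, xmin)

# Refit on each synthetic sample so the reference distribution of the KS
# distance accounts for alpha being estimated
B <- 200
d_boot <- replicate(B, {
  xb <- rplaw(length(x), alpha_hat, xmin)
  ks_dist(xb, fit_alpha(xb, xmin), xmin)
})
p_value <- mean(d_boot >= d_obs)
```

Here the sample is sorted only to evaluate the empirical CDF; whether ranking in the Zipf sense changes the picture is exactly the question above.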

Apologies if this is a silly question and please feel free to point me in the right direction if I’ve missed something more basic here.

References

Clauset, A., C. R. Shalizi, and M. E. J. Newman. 2009. “Power-Law Distributions in Empirical Data.” SIAM Review 51 (4): 661–703. doi:10.1137/070710111

Geller, N. L. 1979. “A Test of Significance for the Whitworth Distribution.” Journal of the American Society for Information Science 30 (4): 229.


## #StackBounty: #regression #chi-squared #least-squares #goodness-of-fit #weighted-regression Goodness of Fit of Least Squares with known…

### Bounty: 150

Given a linear model $$Y=X\beta+e$$ with $$e\sim N(0,\Omega^{-1}) = N(0,\sigma^2W^{-1})$$, where $$W_{ii}=w_i=\frac{\sigma^2}{\sigma_i^2}$$ and $$\Omega_{ii}=\frac{1}{\sigma_i^2}$$.

Assume that, thanks to a lot of repeated measurements, we know the underlying measurement uncertainty $$\sigma_i$$ of the response variable at the i-th measured point. We measure a total of $$n$$ points.

Given the heteroskedasticity, we use weighted least squares.

The residual analysis yields good results, showing that the residuals are independent and normally distributed, and that the weights enable studentized residuals with constant variance.

Now, what is the best way to assess the goodness of fit?

1.) Reduced chi-square: $$\chi_{red}^2$$ should be close to 1. (1) (2)
$$\chi_{red}^2 = \frac{\chi^2}{\nu} = \frac{r'\Omega r}{\nu} = \frac{1}{\nu} \cdot \sum_i^n\frac{r_i^2}{\sigma_i^2}$$

N.B.: This corresponds to a comparison of the unbiased estimate of error variance $$\hat{\sigma}^2$$ and the known mean measurement uncertainty $$\sigma^2$$.
$$\frac{\hat{\sigma}^2}{\sigma^2} = \frac{r'Wr}{\nu} \cdot \frac{1}{\sigma^2} = \frac{\frac{1}{\nu} \sum_i^n r_i^2 \cdot w_i}{\sigma^2} = \frac{\frac{1}{\nu} \sum_i^n r_i^2 \cdot \frac{\sigma^2}{\sigma_i^2}}{\sigma^2} = \frac{1}{\nu} \cdot \sum_i^n\frac{r_i^2}{\sigma_i^2}$$

or

2.) Evaluation of the variance of the standardized/studentized residuals, which should be close to 1. Note that the value for $$\sigma$$ would be the one given by the prior repeated measurements and not the MSE, where:

Standardized residuals $$\sim \mathcal{N}(0,\,1)$$, so $$Var(r_{i,Stand}) \approx 1$$
$$r_{i,Stand} = \frac{r_i}{\sigma}$$
Internally studentized residuals:
$$r_{i,IntStud} = \frac{r_i}{\sqrt{Var(r_i)}} = \frac{r_i}{\sqrt{\sigma^2 (\frac{1}{w_{i}} - H_{ii})}}$$
Externally studentized residuals $$\sim t(\nu)$$, so $$Var(r_{i,ExtStud}) \approx \frac{\nu}{\nu-2}$$
$$r_{i,ExtStud} = \frac{r_i}{\sqrt{\sigma_{(i)}^2 (\frac{1}{w_{i}} - H_{ii})}}$$

or another alternative?
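
To make the two options concrete, here is a minimal base-R sketch on simulated data. Everything in it is an assumption for illustration: the straight-line model, the made-up known uncertainties `sigma_i`, and the choice of weights `1 / sigma_i^2` (that is, $$\sigma^2 = 1$$). The leverages come from `hatvalues()` of the weighted fit, which may be scaled differently from the $$H_{ii}$$ used in the formulas above.

```r
# Sketch only: simulated heteroskedastic data with known per-point sigma_i
set.seed(1)
n <- 200
x <- seq(0, 10, length.out = n)
sigma_i <- 0.2 + 0.1 * x                 # "known" uncertainties (made up)
y <- 1 + 2 * x + rnorm(n, sd = sigma_i)

# Weighted least squares with weights 1 / sigma_i^2
fit <- lm(y ~ x, weights = 1 / sigma_i^2)
r <- residuals(fit)                      # raw residuals y - X %*% beta_hat
nu <- df.residual(fit)                   # n minus number of coefficients

# Option 1: reduced chi-square and its upper-tail probability
chi2 <- sum(r^2 / sigma_i^2)
chi2_red <- chi2 / nu
p_upper <- pchisq(chi2, df = nu, lower.tail = FALSE)

# Option 2: residuals scaled by their known standard deviation,
# sigma_i * sqrt(1 - h_ii), with h_ii the leverage of the weighted fit
h <- hatvalues(fit)
r_std <- r / (sigma_i * sqrt(1 - h))
var_std <- var(r_std)

c(chi2_red = chi2_red, p_upper = p_upper, var_std = var_std)
```

Both summaries use the known `sigma_i` rather than the residual standard error estimated by `lm()`, which is the distinction drawn in option 2.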


## #StackBounty: #r #distributions #bootstrap #goodness-of-fit #fitting Applying bootstrapping to test if the data followed a certain dist…

### Bounty: 50

I have a dataset with a large sample size (around 80,000) and would like to test whether the data follow a certain distribution. I can fit a distribution, such as log-normal or gamma, to the entire dataset in R, for example with the `fitdist` function from the `fitdistrplus` package, and I can look at diagnostic plots to evaluate whether the fit is good. However, with this much data I cannot rely on a goodness-of-fit test such as the Kolmogorov-Smirnov or Anderson-Darling test: the large sample size makes these tests so sensitive that any slight deviation from the fitted distribution leads to rejection of the null hypothesis at `p = 0.05`.

As a result, I am thinking of applying bootstrapping to my dataset: run the goodness-of-fit test on each sub-sample and then compute the proportion of sub-samples for which the p-value is smaller than `0.05`. If the p-value is not smaller than `0.05` most of the time, I will conclude that my data follow the fitted distribution.

Below is some sample code in R:

``````
# Load the packages for distribution fitting and goodness-of-fit testing
library(fitdistrplus)
library(goftest)

# Set seed and generate simulated data
set.seed(1)

s <- rgamma(80000, shape = 2, rate = 1)

# Add some random noises to the data
y <- runif(80000, min = 0, max = 0.2)
x <- s + y

# Fit a distribution to x
fit_x <- fitdist(x, distr = "gamma")

# Plot the data
plot(fit_x)

# Apply the Anderson-Darling test to see whether x is consistent with the fitted theoretical distribution
ad.test(x, null = "pgamma", shape = fit_x$estimate[["shape"]], rate = fit_x$estimate[["rate"]])
# Anderson-Darling test of goodness-of-fit
# Null hypothesis: Gamma distribution
# with parameters shape = 2.29115085990351, rate = 1.09151800140921
# Parameters assumed to be fixed
#
# data:  x
# An = 14.253, p-value = 7.5e-09

# The p-value is small

### Bootstrap the data and apply the Anderson-Darling test to each sub-sample

B <- 10000           # Number of bootstrap sub-samples
result <- numeric(B) # A vector storing the p-values

for (i in seq_len(B)) {
  temp <- sample(x, size = 500, replace = TRUE)
  temp_p <- ad.test(temp, null = "pgamma", shape = fit_x$estimate[["shape"]], rate = fit_x$estimate[["rate"]])
  result[i] <- temp_p[["p.value"]]
}

# The percentage of sub-samples with a p-value smaller than 0.05
mean(result < 0.05) * 100
#  5.84
``````

Given that the p-value is smaller than 0.05 only 5.84% of the time, I would like to conclude that my original dataset likely follows the gamma distribution.

Please let me know if the proposed steps make sense or if there are any concerns.
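
One way to see how much the rejection percentage depends on the sub-sample size of 500 is to repeat the same procedure at several sizes. The following is a rough base-R sketch under stated assumptions: `ks.test` stands in for `ad.test`, moment estimates stand in for the `fitdist` fit, sub-samples are drawn without replacement to avoid ties, and the helper `reject_rate` and the sizes tried are made up for illustration.

```r
# Sketch only: rejection rate of a goodness-of-fit test versus sub-sample size
set.seed(1)
x <- rgamma(80000, shape = 2, rate = 1) + runif(80000, min = 0, max = 0.2)

# Moment estimates of the gamma parameters (assumption: used here instead of
# the maximum-likelihood fit from fitdist, to stay within base R)
shape_hat <- mean(x)^2 / var(x)
rate_hat  <- mean(x) / var(x)

# Proportion of sub-samples of size m whose KS p-value falls below 0.05
reject_rate <- function(m, B = 200) {
  p <- replicate(B, {
    sub <- sample(x, size = m, replace = FALSE)  # no duplicated values
    ks.test(sub, "pgamma", shape = shape_hat, rate = rate_hat)$p.value
  })
  mean(p < 0.05)
}

sizes <- c(500, 5000, 40000)
rates <- sapply(sizes, reject_rate)
data.frame(size = sizes, reject_rate = rates)
```

Comparing the rows shows whether the conclusion would change with a different choice of sub-sample size.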

Here is a related post on Cross-Validated (How to bootstrap the best fit distribution to a sample?).


## #StackBounty: #r #goodness-of-fit #r-squared #instrumental-variables #endogeneity Can I ignore the negative R-squared value when I am u…

### Bounty: 50

I am running an instrumental variable regression using the `ivreg` command in R. All my validity tests related to endogeneity are satisfied; the only exception is the R-squared value, which is negative. Can I ignore this negative R-squared value and leave it unreported? If not, what is an alternative way to resolve this issue? The code and output are below:

``````
> Y2_ivreg2=ivreg(Y2~x1+x2+x3+x4+x5+x6+x7|x2+x8+x9+x10+x5+x6+x7,data=DATA2)
> summary(Y2_ivreg2,diagnostics=TRUE)

Call:
ivreg(formula = Y2 ~ x1 + x2 + x3 + x4 + x5 +
x6 + x7 | x2 + x8 + x9 + x10 +
x5 + x6 + x7, data = DATA2)

Residuals:
Min        1Q    Median        3Q       Max
-0.747485 -0.053721 -0.009349  0.044285  1.085256

Coefficients:
Estimate  Std. Error  t value Pr(>|t|)
(Intercept)  0.0979178  0.0319244   3.067  0.00218 **
x1        0.0008438  0.0004927   1.712  0.08691 .
x2        0.0018515  0.0009135   2.027  0.04277 *
x3       -0.0130133  0.0073484  -1.771  0.07667 .
x4       -0.0018486  0.0009552  -1.935  0.05303 .
x5       -0.0000294  0.0000126  -2.333  0.01971 *
x6        0.0018214  0.0008908   2.045  0.04096 *
x7       -0.0024457  0.0005488  -4.456 8.61e-06 ***

Diagnostic tests:
df1  df2 statistic p-value
Weak instruments (x1)    3 3313   185.440  <2e-16 ***
Weak instruments (x3)    3 3313  3861.526  <2e-16 ***
Weak instruments (x4)    3 3313  3126.315  <2e-16 ***
Wu-Hausman               3 3310     1.943   0.121
Sargan                   0   NA        NA      NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1142 on 3313 degrees of freedom
Multiple R-Squared: -0.009029,  Adjusted R-squared: -0.01116
Wald test: 4.231 on 7 and 3313 DF,  p-value: 0.0001168
``````
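
As background on how a negative value can arise: in two-stage least squares the residuals are computed with the original regressors rather than the first-stage fitted values, so a quantity of the form 1 - RSS/TSS is not bounded below by zero. The following is a minimal base-R sketch on simulated data. The variables `z`, `u`, `x`, `y` and the data-generating process are invented for illustration and have nothing to do with the model above, and 2SLS is done by hand with `lm()` rather than with `ivreg`.

```r
# Sketch only: hand-rolled 2SLS on simulated data with an endogenous regressor
set.seed(123)
n <- 5000
z <- rnorm(n)                       # instrument
u <- rnorm(n)                       # unobserved confounder
x <- z + u + rnorm(n)               # endogenous regressor
y <- 0.3 * x - 2 * u + rnorm(n)     # outcome; true slope is 0.3

# Stage 1: project x on the instrument. Stage 2: regress y on the projection.
x_hat <- fitted(lm(x ~ z))
b_iv <- coef(lm(y ~ x_hat))

# IV residuals use the ORIGINAL x together with the 2SLS coefficients
res_iv <- y - (b_iv[1] + b_iv[2] * x)
tss <- sum((y - mean(y))^2)
r2_iv <- 1 - sum(res_iv^2) / tss

# OLS minimises the residual sum of squares, so its R-squared is never lower
r2_ols <- summary(lm(y ~ x))$r.squared

c(r2_iv = unname(r2_iv), r2_ols = r2_ols)
```

In this simulated setup the IV coefficients are closer to the true slope than the OLS ones, yet they fit the observed `y` worse than a constant does, which is what a negative value of 1 - RSS/TSS expresses.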


## #StackBounty: #r #goodness-of-fit #r-squared #instrumental-variables #endogeneity Can I ignore negative R-squared value when I am using…

### Bounty: 50

I am running an instrumental variable regression using the `ivreg` command in R. I find that the R-squared value is negative. Can I ignore this negative R-squared value and leave it unreported? If not, what is an alternative way to resolve this issue? The code and output are below:

``````
> Y2_ivreg2=ivreg(Y2~x1+x2+x3+x4+x5+x6+x7+x8+x9|x2+x10+x11+x12+x5+x6+x7+x8+x9,data=DATA2)
> summary(Y2_ivreg2,diagnostics=TRUE)

Call:
ivreg(formula = Y2 ~ x1 + x2 + x3 + x4 + x5 +
x6 + x7 + x8 + x9 | x2 + x10 +
x11 + x12 + x5 + x6 + x7 + x8 +
x9, data = DATA2)

Residuals:
Min        1Q    Median        3Q       Max
-0.754860 -0.054511 -0.008602  0.044721  1.098549

Coefficients:
Estimate  Std. Error t value Pr(>|t|)
(Intercept)  2.532e-01  6.376e-02   3.971 7.32e-05 ***
x1        1.543e-03  7.497e-04   2.058   0.0397 *
x2        2.687e-03  1.072e-03   2.505   0.0123 *
x3       -1.051e-02  6.245e-03  -1.683   0.0925 .
x4       -1.290e-03  7.494e-04  -1.722   0.0852 .
x5       -2.010e-02  8.384e-03  -2.398   0.0166 *
x6       -9.806e-01  8.123e-01  -1.207   0.2275
x7       -2.594e-05  1.253e-05  -2.070   0.0385 *
x8        1.664e-03  6.785e-04   2.452   0.0143 *
x9       -2.716e-03  5.700e-04  -4.766 1.96e-06 ***

Diagnostic tests:
df1  df2 statistic p-value
Weak instruments (x1)    3 3311   142.916  <2e-16 ***
Weak instruments (x3)    3 3311  4686.511  <2e-16 ***
Weak instruments (x4)    3 3311  2745.649  <2e-16 ***
Wu-Hausman               3 3308     3.629  0.0124 *
Sargan                   0   NA        NA      NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1151 on 3311 degrees of freedom
Multiple R-Squared: -0.02389,   Adjusted R-squared: -0.02667
Wald test: 4.735 on 9 and 3311 DF,  p-value: 2.793e-06
``````


## #StackBounty: #r #goodness-of-fit #r-squared #instrumental-variables #endogeneity Can I ignore R-squared value of one when I am using i…

### Bounty: 50

I am running an instrumental variable regression using the `ivreg` command in R. I am using a lagged dependent variable, and all my validity tests are satisfied; the only exception is the R-squared value, which is one. Can I ignore this R-squared value of one and leave it unreported? If not, what is an alternative way to resolve this issue? The code and output are below:

``````
> Y_ivreg=ivreg(Y~lag(Y,1)+x1+x2+x3+x4+x5+x6+x7+x8|x9+x10+x11+x12+x3+x4+x6+x7+x8,data=DATA2)
> summary(Y_ivreg,diagnostics=TRUE)

Call:
ivreg(formula = Y ~ lag(Y, 1) + x1 + x2 + x3 + x4 +
x5 + x6 + x7 + x8 | x9 + x10 +
x11 + x12 + x3 + x4 + x6 + x7 +
x8, data = DATA2)

Residuals:
Min         1Q     Median         3Q        Max
-5.895e-14  1.998e-15  2.476e-14  3.364e-14  6.517e-14

Coefficients:
Estimate Std. Error      t value    Pr(>|t|)
(Intercept)   5.896e-13  4.433e-14  1.330e+01  < 2e-16 ***
lag(Y, 1)   1.000e+00  3.999e-14  2.501e+13  < 2e-16 ***
x1    1.414e-14  1.488e-15  9.501e+00  < 2e-16 ***
x2    2.515e-15  2.328e-16  1.080e+01  < 2e-16 ***
x3   -2.434e-14  2.759e-15 -8.822e+00  < 2e-16 ***
x4   -1.925e-12  1.972e-13 -9.764e+00  < 2e-16 ***
x5    2.055e-14  3.955e-15  5.195e+00 2.17e-07 ***
x6    2.565e-17  3.230e-18  7.940e+00 2.72e-15 ***
x7   -2.147e-15  2.427e-16 -8.846e+00  < 2e-16 ***
x8    4.688e-16  1.900e-16  2.468e+00   0.0136 *

Diagnostic tests:
df1 df2  statistic  p-value
Weak instruments (lag(Y, 1))  4 3358    455.18  <2e-16 ***
Weak instruments (x1)         4 3358    998.12  <2e-16 ***
Weak instruments (x2)         4 3358   1077.84  <2e-16 ***
Weak instruments (x5)         4 3358    913.43  <2e-16 ***
Wu-Hausman                    4 3354      0.78   0.538
Sargan                        0   NA        NA      NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.497e-14 on 3358 degrees of freedom
Multiple R-Squared:     1,      Adjusted R-squared:     1
Wald test: 5.536e+26 on 9 and 3358 DF,  p-value: < 2.2e-16
``````
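
One thing that may be worth checking for the output above: if `lag()` resolves to `stats::lag` and `Y` is a plain numeric column rather than a `ts` object, then `lag(Y, 1)` only shifts the time attributes and leaves the values unchanged, so the regressor would be a copy of `Y`. A small base-R check, using a made-up vector `y` and a hypothetical hand-built lag `y_lag1` (neither comes from `DATA2`):

```r
# Sketch only: what stats::lag does to a plain numeric vector
y <- c(3, 1, 4, 1, 5, 9, 2, 6)

# stats::lag attaches shifted time attributes but keeps the values in place
y_statslag <- stats::lag(y, 1)
same_values <- identical(as.numeric(y_statslag), y)
same_values

# A lag built by hand: drop the last value and pad the front with NA
y_lag1 <- c(NA, head(y, -1))
cbind(y = y, stats_lag = as.numeric(y_statslag), manual_lag = y_lag1)
```

If `same_values` is `TRUE` for the real column as well, a coefficient of one on `lag(Y, 1)` and residuals at machine precision would be the expected result of regressing `Y` on itself. In panel data the manual lag would also need to be built within each unit.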


## #StackBounty: #r #regression #logistic #goodness-of-fit #instrumental-variables How can I get goodness-of-fit measures for "ivglm…

### Bounty: 50

I am trying to get goodness-of-fit measures, such as R-squared, chi-squared, etc., from the `ivglm` function in the `ivtools` package in R.

However, I could not find a way to get these from its output.

For reference, each variable also has a different number of missing values.

For instance, I run the following code and get this output:

``````
reg_X.LZ=glm(reg[,5]+reg[,3]+reg[,6]~reg[,14]+reg[,25]+reg[,15]+reg[,46], data=reg)

> summary(reg_logit)

Call:
ivglm(estmethod = "ts", fitX.LZ = reg_X.LZ, fitY.LX = reg_Y.LX,
data = reg, family = binomial(link = "logit"))

Coefficients:
Estimate   Std. Error z value Pr(>|z|)
(Intercept)              2.582e+00  1.673e+00   1.543 0.122738
reg[, 5]                -7.177e-02  4.150e-03 -17.293  < 2e-16 ***
reg[, 7]                 1.666e+00  1.163e-01  14.331  < 2e-16 ***
reg[, 6]                -1.339e-01  2.393e-02  -5.596 2.19e-08 ***
reg[, 3]                -1.678e-04  2.763e-05  -6.075 1.24e-09 ***
reg[, 4]                 1.016e-01  3.873e-03  26.235  < 2e-16 ***
reg[, 9]                 2.169e-02  6.504e-03   3.335 0.000854 ***
reg[, 10]               -2.127e-01  1.870e-01  -1.137 0.255463
reg[, 13]               -4.391e+00  1.899e+00  -2.313 0.020721 *
reg[, 11]                4.420e-02  1.112e-02   3.976 7.01e-05 ***
reg[, 12]                3.070e-01  6.807e-02   4.510 6.48e-06 ***
reg[, 14]                1.919e-01  7.351e-02   2.610 0.009046 **
reg[, 10]:reg[, 13]      4.545e-01  2.138e-01   2.126 0.033488 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
``````

