#StackBounty: #hypothesis-testing #statistical-significance #sampling #binomial #fishers-exact Test if two averages of two binomial dis…

Bounty: 50

First, this may be a duplicate of:

statistical significance

I’m unsure if that post covers my exact situation; if so, just mark this as a duplicate. Let’s say I have a list of 100k potential clients and I sell cars. I select 10k clients based on their predicted probability of buying a car, produced by a model with many features; each client has a different probability. I track whether each of the 10k clients buys a car over different periods of time and bucket the clients into four buckets based on, let’s say, their income level.

For week 4, I have 1200 clients in the income bucket "$50k-$75k", and 1000 of those bought a car. I also have 1600 clients in the income bucket "$76k-$100k", and 1100 of them bought a car. Can I use Fisher’s exact test to calculate the p-value for the difference between these two subgroups of clients?

A couple of things I am unsure about are the definition of a sample in this experiment and whether the counts are too large for Fisher’s exact test. Also, I’m assuming this is sampling without replacement; does that work with Fisher’s exact test?

Are there better options than Fisher’s exact test?
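For reference, here is a minimal R sketch (using only the counts quoted above) of how this 2x2 table would be passed to Fisher’s exact test, with a large-sample two-proportion test shown as an alternative:

# 2x2 table from the example: bought vs did not buy, by income bucket
tab <- matrix(c(1000, 1200 - 1000,    # "$50k-$75k":  1000 of 1200 bought
                1100, 1600 - 1100),   # "$76k-$100k": 1100 of 1600 bought
              nrow = 2, byrow = TRUE,
              dimnames = list(bucket  = c("50-75k", "76-100k"),
                              outcome = c("bought", "did_not_buy")))

fisher.test(tab)                      # exact test; still valid (just slower) for large counts
prop.test(x = c(1000, 1100),          # large-sample two-proportion test for the same comparison
          n = c(1200, 1600))          # (chi-squared with continuity correction)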



#StackBounty: #estimation #binomial #beta-distribution #measurement-error How to model errors around the estimation of proportions – wi…

Bounty: 100

I have a situation I’m trying to model. I would appreciate any ideas on how to model this, or if there are known names for such a situation.

Background:

Let’s assume we have a large number of movies ($M$). For each movie, I’d like to know the proportion of people in the population who enjoy watching it. So for movie $m_1$ we’d say that a proportion $p_1$ of the population would answer “yes” to the question “did you enjoy watching this movie?”. The same goes for movie $m_j$, with proportion $p_j$ (up to movie $m_M$).

We sample $n$ people and ask each of them whether they enjoyed watching each of the movies $m_1, m_2, \ldots, m_M$. We can now easily build estimates of $p_1, \ldots, p_M$ using standard point estimates, and build confidence intervals for these estimates using the standard methods (ref).

But there is a problem.

Problem: measurement error

Some of the people in the sample do not bother to answer truthfully. They instead just answer yes/no to the question regardless of their true preference. Luckily, for some subset of the $M$ movies we know the true proportion of people who like the movie. So let’s assume that $M$ is very large, but that for the first 100 movies (under some indexing) we know the real proportions.
So we know the real values of $p_1, p_2, \ldots, p_{100}$, and we have their estimates $\hat p_1, \hat p_2, \ldots, \hat p_{100}$. We still want confidence intervals for $p_{101}, p_{102}, \ldots, p_M$, based on the estimators $\hat p_{101}, \hat p_{102}, \ldots, \hat p_M$, that take this measurement error into account.

I could imagine some simple model such as:

$$\hat p_i \sim N(p_i, \epsilon^2 + \eta^2)$$

where $\eta^2$ accounts for the measurement error.

Questions:

  1. Are there other reasonable models for this type of situation?
  2. What are good ways to estimate $\eta^2$ (for the purpose of building confidence intervals)? For example, would using $\hat\eta^2 = \frac{1}{n-1}\sum (p_i - \hat p_i)^2$ make sense? Or would it make sense to first apply some transformation to the $p_i$ and $\hat p_i$ values (logit, probit, or some other map from $(0,1)$ to $(-\infty, \infty)$)? A sketch of this idea follows the list below.
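To make question 2 concrete, here is a rough R sketch with simulated data (the respondent count, the random-answering mechanism, and all numbers are made up) of estimating $\eta^2$ from the 100 calibration movies by subtracting the expected binomial sampling variance from the total observed squared error:

set.seed(1)
n      <- 500                            # hypothetical number of respondents per movie
p_true <- runif(100, 0.2, 0.8)           # known true proportions for the calibration movies

# hypothetical measurement-error mechanism: 10% of respondents answer at random
noise <- 0.10
p_obs <- (1 - noise) * p_true + noise * 0.5
p_hat <- rbinom(100, n, p_obs) / n       # observed estimates

total_mse    <- mean((p_hat - p_true)^2)          # total squared error vs the known truth
sampling_var <- mean(p_hat * (1 - p_hat) / n)     # estimated binomial sampling variance
eta2_hat     <- max(total_mse - sampling_var, 0)  # excess attributed to measurement error
eta2_hat

The same decomposition could be done after a logit transform of the $p_i$ and $\hat p_i$, with the sampling variance on that scale approximated via the delta method.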



#StackBounty: #binomial #beta-distribution #inverse-problem Distribution of population size $n$ given binomial sampled quantity $k$ and…

Bounty: 50

Given a sample of $k$ items drawn (without replacement) from a population of $n$ items, where each item is included with known probability $\pi$ (so that $k \sim \mathrm{Binomial}(n, \pi)$), is there a function which gives the distribution of the likely population size $n$ from which these $k$ were sampled? For instance, let’s say we have $k=315$ items randomly selected with known probability $\pi=0.34$ from a population of $n$ items. Here the most likely value is $\hat{n}=926$, but what is the probability distribution for $n$? Is there a distribution which gives $p(n)$?

I know that $p(\pi \mid k, n)$ is given by the beta distribution and that $p(k \mid \pi, n)$ is the binomial distribution. I’m looking for that third creature, $p(n \mid \pi, k)$, properly normalized of course such that $\sum_{n=k}^{\infty} p(n) = 1$.

My first “attempt” at this, starting from the normal approximation to the binomial, $p(k \mid \pi, n) \approx \mathcal{N}(n\pi,\, n\pi(1-\pi))$: is it then the case that $p(n \mid \pi, k) \approx \mathcal{N}(k/\pi,\, k\pi(1-\pi))$?
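For comparison, here is a short R sketch of one possible (flat-prior) answer: the normalized likelihood $p(n \mid \pi, k) \propto \binom{n}{k}\pi^k(1-\pi)^{n-k}$ for $n \ge k$ can simply be tabulated:

k    <- 315
prob <- 0.34
n    <- k:3000                                 # truncate the support well into the tail

logw <- lchoose(n, k) + k * log(prob) + (n - k) * log(1 - prob)
p_n  <- exp(logw - max(logw))
p_n  <- p_n / sum(p_n)                         # normalize so that sum(p_n) = 1

n[which.max(p_n)]                              # mode: 926, matching the value above
sum(n * p_n)                                   # mean of the distribution over n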



#StackBounty: #logistic #binomial #model-comparison #pseudo-r-squared Develop granularity-invariant criteria for comparison of logistic…

Bounty: 200

I have a model with a logistic (binomial) likelihood, with the numbers of successes and failures as the response variable. I am comparing various models, which can be of different granularity. Different granularity means that the binomial observations can be either:

  • grouped together (successes and failures summed up) for each site, or
  • evaluated separately for each visit (there can be multiple visits to each site).

So I am looking for model quality criteria that don’t change with the site/visit granularity, i.e. that produce the same result regardless of how the binomial observations are grouped.

I developed a bunch of model comparison criteria, but as you can see below, apart from the AUC all of them change with granularity. Below is the evaluation of a single model using the different criteria; the first column shows the site-level granularity, the second column the visit-level granularity:

                  per_site  per_visit
AUC_1h          0.97175420 0.97175420
AUC_1h_weighted 0.97033082 0.97033082
R2_avgScore     0.49352020 0.42906301
R2_dev          0.68408469 0.53648654
R2_LR           0.62293855 0.53648654

R2_dev is a pseudo-$R^2$ based on the deviance; R2_LR is the likelihood-based (McFadden’s) pseudo-$R^2$ – see definitions here.

The problem with the binomial likelihood,

$$\prod_{i}{n_i \choose x_i}\, p_i^{x_i}(1-p_i)^{n_i-x_i},$$

is that it contains the binomial coefficient ${n_i \choose x_i}$, which is the only term that depends on the granularity.
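A quick R sketch with made-up data (hypothetical sites, visits, and covariate) illustrates the point: the full log-likelihoods differ between granularities because of the ${n_i \choose x_i}$ terms, while the coefficient-free part $\sum_i \left( x_i \log p_i + (n_i - x_i)\log(1 - p_i) \right)$ is the same:

set.seed(1)
site <- rep(1:20, each = 5)                    # 20 sites, 5 visits each
x    <- runif(20)[site]                        # a site-level covariate
n    <- rpois(100, 30) + 1                     # trials per visit
succ <- rbinom(100, n, plogis(-1 + 2 * x))

visit_dat <- data.frame(site, x, succ, fail = n - succ)
site_dat  <- aggregate(cbind(succ, fail) ~ site + x, data = visit_dat, FUN = sum)

fit_visit <- glm(cbind(succ, fail) ~ x, family = binomial, data = visit_dat)
fit_site  <- glm(cbind(succ, fail) ~ x, family = binomial, data = site_dat)

c(logLik(fit_visit), logLik(fit_site))         # differ: they include the choose() terms

llik_core <- function(fit, dat) {              # likelihood without the binomial coefficients
  p <- fitted(fit)
  sum(dat$succ * log(p) + dat$fail * log(1 - p))
}
c(llik_core(fit_visit, visit_dat), llik_core(fit_site, site_dat))  # equal up to numerical tolerance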

Since I don’t want to stick just to AUC, I looked through other pseudo-R-squared methods for one that would be granularity-invariant. The Cox & Snell pseudo-$R^2$ did look promising:

$$R^2_{CS} = 1 - \left(\frac{L_0}{L_{\hat\beta}}\right)^{2/N}$$ (where $L_0$ is the likelihood of the intercept-only model, $L_{\hat\beta}$ that of the fitted model, and $N$ the number of records),

because the binomial coefficients would cancel each other out in the fraction. However, there are two problems with this:

  1. It needs a modification: $N$ needs to be set up so that it is granularity-invariant. So instead of taking $N$ as the number of records, one would take $N$ as the total sum of all successes and failures (which doesn’t change with granularity). Would that make sense? Or is there any conceptual problem with this modification?

  2. The maximum of this criterion is not one, which makes it difficult to interpret. This is addressed by the Nagelkerke / Cragg & Uhler pseudo-$R^2$:

$$R^2_{N} = \frac{R^2_{CS}}{1 - L_0^{2/N}},$$

but here the denominator ruins the granularity-invariance, as it again depends on the binomial coefficients.

So how to address this?

  1. Is there a way to reasonably modify Cox & Snell? (See the two points above; a sketch of the modification follows this list.)
  2. Or would it make sense to just use all of these likelihood-based criteria and simply compute the likelihood without the binomial coefficients? Would that make sense?
  3. Is there another reasonable, granularity-invariant criterion?
  4. Is my way of thinking alright, or is it conceptually broken (for example because the granularity is so important that it doesn’t make sense to look for granularity-invariant criteria)? Why?
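Regarding point 1, here is a sketch (reusing fit_visit and fit_site from the sketch above) of the modified Cox & Snell criterion with $N$ taken as the total number of Bernoulli trials; the ${n_i \choose x_i}$ terms cancel in the likelihood ratio, so the value does not change with the grouping:

r2_cs_trials <- function(fit) {
  fit0 <- update(fit, . ~ 1)                  # intercept-only (null) model
  N    <- sum(fit$prior.weights)              # total trials: sum of all successes and failures
  as.numeric(1 - exp(2 * (logLik(fit0) - logLik(fit)) / N))
}

c(r2_cs_trials(fit_visit), r2_cs_trials(fit_site))   # same value at both granularities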



#StackBounty: #r #binomial #autocorrelation #glmm #spatio-temporal Spatio-temporal autocorrelation

Bounty: 100

I have a huge data frame (300k+ rows) of GPS animal positions.
I want to model the probability of presence of chamois, taking into consideration as variables: distance (from a disturbance), intensity (of the disturbance), and altitude.

      ID idAnimal        date               lat   lon   alt  dist     intens    park

 1     1 animal_1        11/07/2018 12:00  45.7  6.71  2351    170       143   name2
 2     2 animal_3        11/07/2018 18:00  45.7  6.71  2371    131        71   name5
 3     3 animal_4        12/07/2018 00:00  45.7  6.70  2323     90       102   name5
 4     4 animal_1        12/07/2018 06:00  45.7  6.69  2379    119         6   name3
 5     5 animal_2        12/07/2018 12:00  45.7  6.69  2372    141       152   name5
 6     6 animal_1        12/07/2018 18:00  45.7  6.70  2364    121        25   name2
 7     7 animal_4        13/07/2018 00:00  45.7  6.70  2217    135        39   name1
 8     8 animal_2        13/07/2018 06:00  45.7  6.72  2605    137        96   name2
 9     9 animal_2        13/07/2018 12:00  45.7  6.72  2602     16       100   name1
10    10 animal_1        13/07/2018 18:00  45.7  6.71  2424     48        72   name2

I want to create a model that takes into account the spatio-temporal autocorrelation of the data. I tried to build a binomial GLMM by adding fictitious absence points (pseudo-absences), but I have no idea if this is correct. I also do not know how to take the autocorrelation of the data into account.
I was thinking of splitting the data into a list of data frames with the following condition:
“one observation per day per animal ID”.
Then I would run the model on each of the created subsets.
However, I’m not sure how to get a single output from many models and, most of all, whether this process can remove the problem of autocorrelation.
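As a starting point rather than a full answer, here is a hedged glmmTMB sketch of a binomial GLMM with an AR(1) term for temporal autocorrelation within each animal; the data frame name dat, the 0/1 response presence (GPS fixes plus pseudo-absences), and the construction of the time index are all assumptions on top of the columns shown above:

library(glmmTMB)

# assumes rows are already sorted by date within each animal
dat$idAnimal <- factor(dat$idAnimal)
dat$timeStep <- factor(ave(seq_len(nrow(dat)), dat$idAnimal, FUN = seq_along))

fit <- glmmTMB(
  presence ~ scale(dist) + scale(intens) + scale(alt) +
    (1 | park) +                     # random intercept for park
    ar1(timeStep + 0 | idAnimal),    # AR(1) correlation over time within each animal
  family = binomial(link = "logit"),
  data   = dat
)
summary(fit)

glmmTMB also offers spatial covariance structures (e.g. exp() or mat() over coordinates built with numFactor()), which could be one way to add the spatial part, though that is not shown here.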



#StackBounty: #generalized-linear-model #binomial #generalized-least-squares Which error families are allowed in generalized least squa…

Bounty: 100

Which error families are allowed in generalized least squares (GLS) models? Can I have, for example, a binomial GLM and define a covariance structure in it (which, I guess, makes it a GLS)? See the example below.

model <- glmmTMB(response ~ predictor + ar1(time + 0 | group), data = data, family = binomial(link = logit))

Also, should I call a GLM with a covariance structure a GLS?
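For reference, the classical GLS machinery in R, nlme::gls, fits a Gaussian response with a user-specified correlation structure; a minimal sketch with the same hypothetical column names as the glmmTMB call above:

library(nlme)

# Gaussian response with AR(1) within-group correlation
fit_gls <- gls(response ~ predictor,
               correlation = corAR1(form = ~ time | group),
               data = data)
summary(fit_gls)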



#StackBounty: #confidence-interval #binomial Wikipedia's text about the Clopper-Pearson interval for binomial proportions

Bounty: 50

I’m trying to understand the following text, currently (as of 2019-09-25) in Wikipedia, about the Clopper–Pearson interval:

The Clopper–Pearson interval is an early and very common method for
calculating binomial confidence intervals.[8] This is often called an
‘exact’ method, because it is based on the cumulative probabilities of
the binomial distribution (i.e., exactly the correct distribution
rather than an approximation). However, in cases where we know the
population size, the intervals may not be the smallest possible,
because they include impossible proportions: for instance, for a
population of size 10, an interval of [0.35, 0.65] would be too large
as the true proportion cannot lie between 0.35 and 0.4, or between 0.6
and 0.65.

I do understand that in the given example it would be impossible to get an outcome that would represent a binomial proportion of 0.35 (as this would require 3.5 successes, which is not a possible outcome).

However, I believe the CP-interval is meant to represent the range of underlying probabilities of success (the ‘true proportions’) that have some minimum probability to produce the observed (integer) outcome. As far as I can see, these ‘true proportions’ can take values between 0.35 and 0.4, or between 0.6 and 0.65.

Am I seeing this wrong, or is the cited text incorrect?
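For concreteness, here is how a Clopper–Pearson interval is produced in R; the point is that the returned endpoints describe a continuous range of candidate success probabilities, not a set of achievable sample proportions (the counts are just an illustration):

# exact (Clopper-Pearson) 95% interval for 4 successes out of 10 trials
binom.test(x = 4, n = 10)$conf.int
# roughly [0.12, 0.74]: a continuous range for the underlying probability p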




#StackBounty: #distributions #binomial #fat-tails Discrepancies with Actual vs Expected Probabilities for Distribution?

Bounty: 50

I am attempting to estimate consecutive days’ worth of sales of specific items in each store:

For most stores, the probability that an item continues to sell over a period of n consecutive days (n=1 k=1, n=2 k=2, etc…) approximately fits a binomial distribution. Therefore, at most stores, the probability of n consecutive days of sales can be derived from the actual (historical) probability of one day of sales.

However, for a subset of locations (in the same geographical region), the sales data are well above the modeled binomial data and are heavily fat-tailed: consecutive sales are roughly two orders of magnitude more probable than the fitted binomial distribution predicts. The skewness and kurtosis are finite, and an internal assessment determined that there could be some day-to-day dependency. This is negligible at the well-modeled stores but could potentially cause an issue at the poorly modeled stores.

It would be of great assistance if another distribution or approach could be suggested to better model the high-probability tail events.

Example:

Binomial:

+----------+------+-----+-----+-----+-----+-----+-----+-----+
| Location | N=1  | N=2 | N=3 | N=4 | N=5 | N=6 | N=7 | N=8 |
+----------+------+-----+-----+-----+-----+-----+-----+-----+
| I        |  509 |  81 |  18 |   4 |   1 |   1 |   0 |   0 |
| J        |  509 |  81 |  18 |   4 |   1 |   1 |   0 |   0 |
| K        |  721 | 133 |  34 |  11 |   4 |   1 |   0 |   0 |
| L        |  463 |  71 |  17 |   3 |   2 |   1 |   0 |   0 |
| M        |  312 |  49 |  10 |   2 |   1 |   1 |   0 |   0 |
| N        |  431 |  64 |  12 |   3 |   2 |   1 |   0 |   0 |
| O        |  685 | 111 |  31 |   7 |   3 |   1 |   0 |   0 |
| P        |  580 | 108 |  23 |   6 |   2 |   1 |   0 |   0 |
| Q        | 1142 | 226 |  65 |  24 |   6 |   3 |   0 |   0 |
+----------+------+-----+-----+-----+-----+-----+-----+-----+

Fat-tailed

+----------+------+-----+-----+-----+-----+-----+-----+-----+
| Location | N=1  | N=2 | N=3 | N=4 | N=5 | N=6 | N=7 | N=8 |
+----------+------+-----+-----+-----+-----+-----+-----+-----+
| A        | 1127 | 239 | 136 |  75 |  42 |  23 |  16 |  11 |
| B        | 2223 | 488 | 227 | 113 |  54 |  32 |  20 |  11 |
| C        |  925 | 172 |  87 |  46 |  30 |  20 |  15 |   9 |
| D        |  925 | 172 |  87 |  46 |  30 |  20 |  15 |   9 |
| E        |  861 | 166 |  87 |  46 |  24 |  15 |  12 |   8 |
| F        |  705 | 160 |  92 |  50 |  26 |  16 |   9 |   8 |
| G        | 1047 | 249 | 126 |  55 |  30 |  20 |  12 |   5 |
| H        | 1402 | 307 | 130 |  58 |  35 |  22 |  11 |   5 |
+----------+------+-----+-----+-----+-----+-----+-----+-----+

This image shows the discrepancy between values predicted from N=1 using the binomial distribution (theoretical) and the actual values for fat-tailed locations.

[Figure: actual vs. theoretical (binomial) sales counts for a fat-tailed store]
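One possible direction, sketched here only as an illustration: treat a run as continuing each day with probability p and let p vary across items/days with a Beta distribution, giving a beta-geometric run-length model whose tail is much heavier than the geometric/binomial one. The R snippet below fits it by maximum likelihood to the row for location A, assuming the N=k column counts runs of exactly k consecutive selling days (an assumption about the table) and ignoring truncation beyond N=8:

counts <- c(1127, 239, 136, 75, 42, 23, 16, 11)   # location A, N = 1..8
n      <- seq_along(counts)

# beta-geometric pmf: P(N = n) = B(a + n - 1, b + 1) / B(a, b), with p ~ Beta(a, b)
log_pmf <- function(n, a, b) lbeta(a + n - 1, b + 1) - lbeta(a, b)

negloglik <- function(par) {
  a <- exp(par[1]); b <- exp(par[2])              # keep a, b positive
  -sum(counts * log_pmf(n, a, b))
}

fit <- optim(c(0, 1), negloglik)
a <- exp(fit$par[1]); b <- exp(fit$par[2])

rbind(observed = counts,                          # compare observed counts with
      fitted   = round(sum(counts) * exp(log_pmf(n, a, b))))  # the fitted beta-geometric

If the excess of long runs is driven by day-to-day dependence rather than heterogeneity, a Markov-chain model of the daily sell/no-sell sequence would be the other natural candidate.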



#StackBounty: #r #regression #binomial #overdispersion #beta-binomial How do I carry out a significance test with Tarone's Z-statis…

Bounty: 50

Context

In this blog post the author suggests using Tarone’s Z-statistic to test for overdispersion in a binomial model, to determine whether it is necessary to use a beta-binomial model instead. In their example they generate some synthetic data from binomial and beta-binomial distributions, calculate the Z-statistic for each, and plot them along with a theoretical curve of the null distribution to demonstrate that the metric works.

Question

How do I actually calculate/use this to test for overdispersion? I found the author’s code difficult to follow, and I don’t quite understand how I could use it to formally test for overdispersion.

I have searched around but all I can turn up about Tarone’s Z-statistic are the two links I have included.

I am working in R using the lme4 and glmmTMB packages and I would greatly appreciate an answer in that form. I know this question kind of straddles the bounds of CV and Stack Overflow, but I considered this a “non-trivial problem” – if the community disagrees, I am happy to migrate it!

Update:

I have managed to adapt C.A. Kapourani’s code and write a function for calculating the Z-statistic for my models (see below), but I still have the problem of how to compare the resulting value against a critical value. Is it sensible to find the Z-score whose cumulative probability equals the critical probability (i.e. 0.05), as with other Z-scores? If so, can anyone recommend how I might do this in R?

taronesZStat <- function(model){

  # Tarone's Z-statistic uses the grouped binomial counts (x_i successes out of
  # n_i trials), not model residuals. The extraction below assumes a stats::glm
  # fit with a cbind(successes, failures) response; for lme4/glmmTMB fits,
  # compute the same quantities from the raw counts instead.
  n <- model$prior.weights   # trial sizes n_i
  x <- model$y * n           # observed numbers of successes x_i

  # Pooled success probability under the null of no overdispersion
  p_hat <- sum(x) / sum(n)

  # Tarone's statistic
  S <- sum((x - n * p_hat)^2 / (p_hat * (1 - p_hat)))

  Z_score <- (S - sum(n)) / sqrt(2 * sum(n * (n - 1)))

  return(Z_score)
}
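On the comparison question: under the null hypothesis of no overdispersion, Tarone’s Z is approximately standard normal, so one way to use it (a sketch, with fit standing for a hypothetical binomial glm) is a one-sided test:

Z <- taronesZStat(fit)                  # fit: a hypothetical binomial glm with cbind() response
p_value <- pnorm(Z, lower.tail = FALSE) # one-sided: large Z indicates overdispersion
p_value < 0.05                          # evidence of overdispersion at the 5% level?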

