#StackBounty: #hypothesis-testing #sample-size Intuition – Impact of baseline conversion rate on sample size

Bounty: 50

In an A/B test we calculate the needed sample size before we run the test. The required sample size is dependent on the significance level, the power, the minimum detectable effect (MDE) and the baseline conversion rate.

Let’s say we set those values to

  • Significance level: 5 %
  • Power: 80 %
  • Relative MDE: 2 %

And plug them into a sample size calculator.

For different baseline conversion rates we get different sample sizes. The higher the baseline, the lower the sample size.

  • Baseline 10 %: 354,139 per variant
  • Baseline 20 %: 157,328 per variant
  • Baseline 30 %: 91,725 per variant

The relative change we are trying to detect stays the same. I am trying to get an intuition for why we need bigger samples when the baseline is lower.
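This can be roughly reproduced in R with power.prop.test (a sketch; the exact figures depend on which variance formula a given calculator uses):

baselines <- c(0.10, 0.20, 0.30)
sapply(baselines, function(p) {
  # two-sided two-proportion test, relative MDE of 2% (p2 = 1.02 * p1)
  power.prop.test(p1 = p, p2 = 1.02 * p, sig.level = 0.05, power = 0.80)$n
})
# roughly 356,000 / 158,000 / 92,000 per variant, close to the calculator's figures above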


Get this bounty!!!

#StackBounty: #r #hypothesis-testing #repeated-measures #nested-data #plm comparing groups in repeated measures FE models, with a neste…

Bounty: 50

I have estimated some repeated-measures fixed effects (FE) models, with a nested error component, using plm in R. I am now interested to

  1. test if the full models are significantly different, i.e. $$H_0: \beta_{Female} = \beta_{Male}$$ where $\beta_{Female}$ is the full model for Females and $\beta_{Male}$ is the full model for Males, and
  2. subsequently test selected regression coefficients between the two groups, i.e. $$H_0: \beta_{Female,\,year1.5} = \beta_{Male,\,year1.5}$$ where $\beta_{Female,\,year1.5}$ is the regression coefficient for females at year1.5, and $\beta_{Male,\,year1.5}$ is the regression coefficient for males at year1.5.

I will illustrate the situation using the below working example,

First, some packages needed,

# install.packages(c("plm","texreg","tidyverse","lmtest","mlmRev"), dependencies = TRUE)
library(plm); library(lmtest); require(tidyverse)

Second, some data preparation,

data(egsingle, package = "mlmRev")
dta <- egsingle %>% mutate(Female = recode(female, .default = 0L, `Female` = 1L))  # 1 = Female, 0 = Male

Third, I estimate a set of models, one for each gender in the data:

MoSpc <- as.formula(math ~ Female + size + year)
dfMo = dta %>% group_by(female) %>%
    do(fitMo = plm(update(MoSpc, . ~ . -Female), 
       data = ., index = c("childid", "year", "schoolid"), model="within") )

Fourth, let's look at the two estimated models:

texreg::screenreg(dfMo[[2]], custom.model.names = paste0('FE: ', dfMo[[1]]))
#> ===================================
#>            FE: Female   FE: Male   
#> -----------------------------------
#> year-1.5      0.79 ***     0.88 ***
#>              (0.07)       (0.10)   
#> year-0.5      1.80 ***     1.88 ***
#>              (0.07)       (0.10)   
#> year0.5       2.51 ***     2.56 ***
#>              (0.08)       (0.10)   
#> year1.5       3.04 ***     3.17 ***
#>              (0.08)       (0.10)   
#> year2.5       3.84 ***     3.98 ***
#>              (0.08)       (0.10)   
#> -----------------------------------
#> R^2           0.77         0.79    
#> Adj. R^2      0.70         0.72    
#> Num. obs.  3545         3685       
#> ===================================
#> *** p < 0.001, ** p < 0.01, * p < 0.05

Now, I want to test whether these two (linear OLS) models are significantly different (cf. point 1 above). Looking around SO and the internet, some suggest that I need to use plm::pFtest(), also suggested here. I have tried that, but I am not convinced and wonder if someone here has experience with this and could possibly help me.

I tried,

plm::pFtest(dfMo[[1,2]], dfMo[[2,2]])
# >
# > F test for individual effects
# >
# >data:  update(MoSpc, . ~ . - Female)
# >F = -0.30494, df1 = 113, df2 = 2693, p-value = 1
# >alternative hypothesis: significant effects

Second, I am interested in comparing regression coefficients between the two groups (cf. point 2 above). Say, is the female estimate for year1.5 of 3.04 significantly different from the male estimate of 3.17?
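To make this concrete, a naive Wald-type comparison using the two group-specific estimates and their standard errors would look like the sketch below (it ignores any dependence between the two fitted models, which is part of what I am unsure about):

bF  <- coef(dfMo[[1, 2]])["year1.5"]              # estimate for females
bM  <- coef(dfMo[[2, 2]])["year1.5"]              # estimate for males
seF <- sqrt(diag(vcov(dfMo[[1, 2]])))["year1.5"]
seM <- sqrt(diag(vcov(dfMo[[2, 2]])))["year1.5"]
z   <- (bF - bM) / sqrt(seF^2 + seM^2)
2 * pnorm(-abs(z))                                # two-sided p-value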

Please ask if any of the above is not clear and I will be happy to elaborate. Any help will be greatly appreciated!

I realize this question is a bit programming-like, but I initially posted it on SO. However, DWin was kind enough to point out that the question belonged on CrossValidated and migrated it here.



Get this bounty!!!

#StackBounty: #hypothesis-testing #loss-functions Utility or loss functions and statistical testing

Bounty: 50

Which would be better: a statistical method that yields false positive errors on 5% of occasions and false negative errors on 20%, or a statistical method that yields 10% false positives but only 5% false negatives?

The answer, of course, has to begin with “It depends…”. But what factors is it dependent upon, and how are those factors taken into account in real world application of hypothesis testing?

I would say that it depends upon some kind of utility function that takes into account circumstances and consequences of decisions, but I have never noticed any specification or discussion of such a function in research papers in my area of basic pharmacology, and I assume that they are similarly absent from research papers from many other areas of science. Does that matter?
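To illustrate the kind of calculation I mean, here is a sketch with entirely made-up costs and a made-up prior probability that the null hypothesis is true:

cost_fp <- 1    # hypothetical cost of a false positive
cost_fn <- 5    # hypothetical cost of a false negative
p_null  <- 0.5  # hypothetical prior probability that the null is true

expected_loss <- function(fp_rate, fn_rate)
  p_null * fp_rate * cost_fp + (1 - p_null) * fn_rate * cost_fn

expected_loss(0.05, 0.20)  # method 1: 5% false positives, 20% false negatives
expected_loss(0.10, 0.05)  # method 2: 10% false positives, 5% false negatives

With these made-up numbers the second method has the lower expected loss, but different costs or priors could easily reverse that ordering, which is exactly the "it depends" I am asking about.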

It would be safe to assume that researchers are responsible for the experimental design and analysis in most of the research papers that I read, but at least sometimes a statistician will be consulted (usually after the data are in hand). Do statisticians discuss loss functions with researchers before advising on or performing a data analysis, or do they just use one that is unconsidered and implicit?


Get this bounty!!!

#StackBounty: #hypothesis-testing #correlation #covariance #monte-carlo #measurement-error Assessing significance when measurements are…

Bounty: 50

I am trying to estimate a function $f(x)$ at $x=0.1, 0.2, 0.3, 0.4, 0.5$.
I make this estimate through some complex procedure based on a set of independent but noisy data (with known uncertainties). The outcome is a set of correlated estimates.

The null hypothesis of my study is that $f(x)=0$. I want to assess, based on any/all of these measurements, whether $f(x) \neq 0$ at some/any $x$.

I have run $N$ Monte Carlo simulations of the measurement procedure by repeating the measurement with random realizations of the data (the data uncertainties are known, but the uncertainty of the resulting estimates is not known analytically). Now I have $N$ sets of estimates at these five points. I can use the $N$ trials to estimate the mean and variance of each $f(x)$, as well as the covariance between the different measurements.
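Concretely (a sketch, assuming the Monte Carlo results are stored in an N x 5 matrix est, one column per value of x):

m <- colMeans(est)      # mean estimate of f(x) at each of the five x values
s <- apply(est, 2, sd)  # spread of each estimate across the N trials
C <- cov(est)           # 5 x 5 covariance matrix between the five estimates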

Taking the mean and standard deviation across all of the $N$ trials, I find that each of the measurements is roughly $f(x) = -0.2 \pm 0.1$. At the usual thresholds of significance this would be a $2\sigma$ result and therefore not significantly different from 0. However, since all of the measurements are below zero, perhaps the combined result is significant. I do not know how to assess this, given that the measurements are correlated.


Get this bounty!!!

#StackBounty: #hypothesis-testing #mixed-model #lme4-nlme #degrees-of-freedom Satterthwaite degrees of freedom in a mixed model change …

Bounty: 50

I have a couple of MLM models created using lme4:

y1 ~ x1 + x2 + x3 + x4 + (1+x4|id)
y2 ~ x1 + x2 + x3 + x4 + (1+x4|id)

Notice that the only difference between them is the DV (the dependent variable).

When I use lmerTest to get p-values, I notice that the degrees of freedom for some of the predictors change quite drastically between the two models. For example, in model 1 the df for x4 might be 38.50, while in model 2 the df for the same predictor might be 260.50.

Is that expected behavior?

Given that my predictor variables are identical in both cases (i.e. this can’t be a case of one model having more missing data than the other), why is there such a difference in the degrees of freedom when only the DV is changed?

Is there something about the Satterthwaite approximation that takes into account the DV, and hence degrees of freedom are expected to be so different?

EDIT

Comparing Satterthwaite and Kenward-Roger below (I'd prefer to use the regular summary(model), as it gives me more information, like the random effects estimates and the beta estimates).

I'm not sure why there are minor fluctuations in df across the board between the two models (for both Satterthwaite and Kenward-Roger), but, more importantly, notice how the x4 df is roughly 10x larger in model 2 when using Satterthwaite.

model 1:

library(lmerTest)  # loads lme4 and adds Satterthwaite df and p-values to summary()
model <- lmer(step_mean ~ x1 + x2 + x3 + x4 + (1+x4|id), data=df, REML=T)

summary(model)

Fixed effects:
                     Estimate Std. Error        df t value Pr(>|t|)    
(Intercept)          -0.06003    0.12845  35.07528  -0.467 0.643161    
x1                    0.49117    0.12548  35.12842   3.914 0.000398 ***
x2                   -0.01394    0.01225 259.84143  -1.138 0.256368    
x3                    0.01414    0.28512  34.47940   0.050 0.960745    
x4                   -0.04091    0.01086  25.53492  -3.767 0.000874 ***

anova(model, ddf='Kenward-Roger')

Analysis of Variance Table of type III  with  Kenward-Roger 
approximation for degrees of freedom
                     Sum Sq Mean Sq NumDF   DenDF F.value    Pr(>F)    
x1                  0.47355 0.47355     1  35.119 14.5328 0.0005336 ***
x2                  0.04145 0.04145     1 259.955  1.2721 0.2604083    
x3                  0.00007 0.00007     1  34.435  0.0023 0.9621706    
x4                  0.43630 0.43630     1  27.832 13.3899 0.0010463 ** 

model 2:

model <- lmer(stride ~ x1 + x2 + x3 + x4 + (1+x4|id), data=df, REML=T)

summary(model)

Fixed effects:
                     Estimate Std. Error        df t value Pr(>|t|)
(Intercept)          -0.05924    0.09010  35.35792  -0.657    0.515
x1                    0.08257    0.08865  34.98204   0.931    0.358
x2                   -0.03555    0.05087 295.62573  -0.699    0.485
x3                    0.08774    0.20271  35.43835   0.433    0.668
x4                    0.02290    0.04407 260.86367   0.520    0.604

anova(model, ddf='Kenward-Roger')

Analysis of Variance Table of type III  with  Kenward-Roger 
approximation for degrees of freedom
                     Sum Sq Mean Sq NumDF   DenDF F.value Pr(>F)
x1                  0.52223 0.52223     1  34.736 0.82341 0.3704
x2                  0.29974 0.29974     1 294.418 0.47260 0.4923
x3                  0.11041 0.11041     1  35.005 0.17409 0.6791
x4                  0.15516 0.15516     1  24.510 0.24464 0.6253


Get this bounty!!!

#StackBounty: #hypothesis-testing #statistical-significance #estimation #bias #lognormal Measuring accuracy of estimates from lognormal…

Bounty: 50

Our org needs to make pre-release estimates of movie box office results and then evaluate the actual results relative to those estimates.

We know that, generally, box office results are lognormally distributed.
We can determine a good fit of a lognormal distribution to a large portfolio of actual box office results, and it matches pretty well.

My question has to do with diagnosing the cause of errors in both individual estimates and portfolio-total estimates.

E.g., based on factors like budget, cast, director, genre, and size of release, we make an estimate of the box office to be obtained, and we commit an amount of marketing spend to support that estimate.
So if we estimate that a film will do 50MM in box office, and we spend marketing dollars accordingly, but the film only does 22MM, does that error look like an “outlier” (signalling that we were over-optimistic in our estimation) or not? Put another way, is there some p-value we can measure against which says: if our estimate is unbiased, then the actual result should be within x% of the estimate? Or is there no way to make a judgement as to whether a single trial like this indicates anything about the bias of our “estimation engine” (e.g. a bunch of people sitting around talking)?

Likewise, on a portfolio of, say, 10 movies, how do we figure out whether the delta between the estimated total box office and the actual total box office demonstrates that our estimates are biased high? In the portfolio case we simply count how often we exceeded the estimate versus fell short of it as a measure of our bias, and feel OK if we were high roughly half the time and low roughly half the time, but I'm sure there is a better measure. However, given that we have a history of only 10 films, I wonder whether that is enough for the portfolio distribution to be treated as symmetric, given the asymmetry of the sampling distribution and the relatively low n. So would we expect that, say, the 95% confidence interval should be tighter on the low side and wider on the high side due to the asymmetry of the lognormal distribution?
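To illustrate the last point, here is a rough simulation sketch with entirely made-up numbers (10 hypothetical estimates in MM, each treated as the median of a lognormal with an assumed log-scale sd of 0.5):

set.seed(1)
estimates <- c(50, 30, 20, 80, 10, 45, 25, 60, 15, 35)  # hypothetical pre-release estimates (MM)
sdlog <- 0.5                                            # assumed spread on the log scale
totals <- replicate(10000,
  sum(rlnorm(length(estimates), meanlog = log(estimates), sdlog = sdlog)))
quantile(totals, c(0.025, 0.5, 0.975)) / sum(estimates) # how asymmetric is the 95% range?

The ratio of the simulated portfolio totals to the total of the estimates shows how asymmetric an interval from an unbiased "estimation engine" would be with only 10 films.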

Many thanks!


Get this bounty!!!

#StackBounty: #hypothesis-testing #correlation #multiple-comparisons #spearman-rho How to compare two Spearman correlation matrices?

Bounty: 50

I have two non-parametric rank correlation matrices, emp and sim (for example, based on Spearman's $\rho$ rank correlation coefficient):

emp <- matrix(c(
1.0000000, 0.7771328, 0.6800540, 0.2741636,
0.7771328, 1.0000000, 0.5818167, 0.2933432,
0.6800540, 0.5818167, 1.0000000, 0.3432396,
0.2741636, 0.2933432, 0.3432396, 1.0000000), ncol=4)

sim <- matrix(c(
1.0000000, 0.7616454, 0.6545774, 0.3081403,
0.7616454, 1.0000000, 0.5360392, 0.3146167,
0.6545774, 0.5360392, 1.0000000, 0.3739758,
0.3081403, 0.3146167, 0.3739758, 1.0000000), ncol=4)

The emp matrix contains the correlations between the empirical values (time series); the sim matrix contains the correlations between the simulated values.

I have read the Q&A How to compare two or more correlation matrices?; in my case it is known that the empirical values do not come from a normal distribution, so I cannot use Box's M test.

I need to test the null hypothesis $H_0$: matrices emp and sim are drawn from the same distribution.

Question. Which test can I use? Is it possible to use the Wishart statistic?

Edit.
Following Stephan Kolassa's comment, I have done a simulation.

I have tried to compare the two Spearman correlation matrices emp and sim with Box's M test. The test returned

# Chi-squared statistic = 2.6163, p-value = 0.9891

Then I simulated the correlation matrix sim 1000 times and plotted the distribution of the chi-squared statistic $M(1-c) \sim \chi^2(df)$.

[Figure: histogram of the 1000 simulated statistics $(M(1-c))_i$; the blue and green lines referred to below are marked on it.]

After that I computed the 5% quantile of the simulated statistics $(M(1-c))_i$. It equals

quantile(dfr$stat, probs = 0.05)
#       5% 
# 1.505046

One can see that the 5% quantile is less than the observed chi-squared statistic, 1.505046 < 2.6163 (blue line on the figure); therefore, the emp statistic $M(1-c)$ does not fall in the left tail of the $(M(1-c))_i$.

Edit 2.
Following Stephan Kolassa's second comment, I have calculated the 95% quantile of the simulated statistics (blue line on the figure). It equals

quantile(dfr$stat, probs = 0.95)
#      95% 
# 7.362071

One can see that the emp statistic $M(1-c)$ does not fall in the right tail of the $(M(1-c))_i$ either.

Edit 3. I have calculated the empirical $p$-value (green line on the figure) through the empirical cumulative distribution function:

ecdf(dfr$stat)(2.6163)
[1] 0.239

One can see that $p$-value=0.239 is greater than $0.05$.

Edit 4.

Dominik Wied (2014), A Nonparametric Test for a Constant Correlation Matrix, Econometric Reviews, DOI: 10.1080/07474938.2014.998152

Joël Bun, Jean-Philippe Bouchaud and Mark Potters (2016), Cleaning correlation matrices, Risk.net, April 2016

David X. Li (1999), On Default Correlation: A Copula Function Approach, available at SSRN: https://ssrn.com/abstract=187289 or http://dx.doi.org/10.2139/ssrn.187289

G. E. P. Box (1949), A General Distribution Theory for a Class of Likelihood Criteria, Biometrika, 36(3/4), 317-346

M. S. Bartlett (1937), Properties of Sufficiency and Statistical Tests, Proc. R. Soc. Lond. A, 160, 268-282

Robert I. Jennrich (1970), An Asymptotic χ2 Test for the Equality of Two Correlation Matrices, Journal of the American Statistical Association, 65(330), 904-912
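If I understand correctly, the Jennrich (1970) test is implemented in the R package psych as cortest.jennrich(). A sketch of how I might apply it, with the sample sizes behind emp and sim assumed (hypothetically) to be 100 each:

# install.packages("psych")
library(psych)
cortest.jennrich(emp, sim, n1 = 100, n2 = 100)  # n1, n2 are placeholders, not my real sample sizes

However, as far as I can tell this test also rests on normal-theory asymptotics, which is exactly my concern above.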

Edit 5.

The first paper I have found that does not make the normality assumption:

Reza Modarres and Robert W. Jernigan (1993), A robust test for comparing correlation matrices, Journal of Statistical Computation and Simulation, 46(3-4), 169-181


Get this bounty!!!

#StackBounty: #hypothesis-testing #statistical-significance #ordinal-data statistical test for order of movements

Bounty: 50

I have 10 data sets in which I have to identify the order of movement of particles. For example, the first particle to move is ranked 1, the second to move is ranked 2, and so on.

So for each data set I have a list of particles and their order of movement. The maximum number of particles is 6. There are cases where particles move at the same time (or where the order is not clear); these were given the same (tied) rank.

I want to know whether there is a statistical test to check if the order of movement found across the data sets might be 'random' or not; in other words, I want to find out whether the ordering is 'significant'.

Please point me to the correct statistical test for this. How would you formulate the hypothesis in this case? Your insights will be extremely helpful.


Get this bounty!!!