#StackBounty: #machine-learning #hypothesis-testing #statistical-significance #anova #repeated-measures How to Compare Two Algorithms w…

Bounty: 50

I have two computational methods (A and B), each with random behavior, i.e., if you run the same method 10 times, you get 10 different results (usually with small variance). To compare the methods, we selected 5 different databases (it's hard to get more) and ran methods A and B 10 times each on each of the five databases. This resulted in a 10x5 matrix of measurements (a row for each run and a column for each database) for each method. All measurements are paired between the two methods, because we can control the seed for each run and the database can be reused for both methods, i.e., $\text{run}_i$ of $\text{database}_j$ uses the same $\text{seed}_i$ for both methods.

Example (the values in the tables are the accuracies of the methods):

Method A

| Run/Database|   1    |   2    |   3    |   4    |   5    |
|-------------|--------|--------|--------|--------|--------|
|           1 | 88.92% | 44.60% | 69.49% | 73.37% | 85.63% |
|           2 | 89.00% | 42.72% | 64.10% | 71.94% | 85.92% |
|           3 | 88.35% | 45.07% | 65.13% | 72.14% | 85.78% |
|           4 | 88.92% | 43.66% | 67.95% | 72.76% | 85.28% |
|           5 | 87.94% | 50.23% | 67.18% | 71.94% | 85.92% |
|           6 | 87.78% | 43.19% | 68.72% | 73.47% | 86.27% |
|           7 | 89.08% | 45.54% | 66.41% | 71.33% | 85.56% |
|           8 | 88.83% | 42.72% | 66.15% | 72.45% | 86.77% |
|           9 | 88.43% | 45.07% | 68.97% | 72.45% | 86.49% |
|          10 | 88.59% | 40.38% | 66.15% | 73.67% | 86.13% |

Method B

| Run/Database|   1    |   2    |   3    |   4    |   5    |
|-------------|--------|--------|--------|--------|--------|
|           1 | 22.73% | 53.99% | 59.74% | 65.20% | 79.59% |
|           2 | 75.97% | 46.95% | 58.46% | 71.63% | 84.42% |
|           3 | 76.94% | 53.05% | 58.97% | 68.37% | 85.06% |
|           4 | 76.54% | 42.25% | 46.67% | 68.67% | 85.92% |
|           5 | 46.60% | 52.11% | 52.82% | 68.98% | 85.14% |
|           6 | 76.78% | 48.83% | 55.90% | 68.27% | 78.38% |
|           7 | 79.37% | 47.89% | 58.72% | 71.12% | 85.06% |
|           8 | 77.83% | 54.93% | 50.77% | 72.14% | 87.06% |
|           9 | 83.01% | 46.95% | 56.15% | 67.96% | 84.92% |
|          10 | 78.24% | 49.30% | 58.21% | 67.96% | 81.29% |

Which statistical method should I use to find out which method is best in terms of overall performance? Or to find out whether method A is statistically different from method B in terms of average accuracy?

I investigated Student's t-test and one- and two-way repeated-measures ANOVA, but they didn't seem appropriate for this analysis. Any suggestion of a valid statistical analysis is appreciated.
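
Not an answer from the question itself, but one defensible sketch (shown in Python for illustration): treat the database, not the run, as the replication unit, average the 10 paired runs per database, and apply a paired test to the 5 database-level means. With only 5 pairs, the exact Wilcoxon signed-rank test is a natural nonparametric companion to the paired t-test. The matrices below are the accuracy tables from the question.

```python
import numpy as np
from scipy import stats

# rows = runs, cols = databases (accuracies in %, from the tables above)
A = np.array([
    [88.92, 44.60, 69.49, 73.37, 85.63], [89.00, 42.72, 64.10, 71.94, 85.92],
    [88.35, 45.07, 65.13, 72.14, 85.78], [88.92, 43.66, 67.95, 72.76, 85.28],
    [87.94, 50.23, 67.18, 71.94, 85.92], [87.78, 43.19, 68.72, 73.47, 86.27],
    [89.08, 45.54, 66.41, 71.33, 85.56], [88.83, 42.72, 66.15, 72.45, 86.77],
    [88.43, 45.07, 68.97, 72.45, 86.49], [88.59, 40.38, 66.15, 73.67, 86.13]])
B = np.array([
    [22.73, 53.99, 59.74, 65.20, 79.59], [75.97, 46.95, 58.46, 71.63, 84.42],
    [76.94, 53.05, 58.97, 68.37, 85.06], [76.54, 42.25, 46.67, 68.67, 85.92],
    [46.60, 52.11, 52.82, 68.98, 85.14], [76.78, 48.83, 55.90, 68.27, 78.38],
    [79.37, 47.89, 58.72, 71.12, 85.06], [77.83, 54.93, 50.77, 72.14, 87.06],
    [83.01, 46.95, 56.15, 67.96, 84.92], [78.24, 49.30, 58.21, 67.96, 81.29]])

a_means = A.mean(axis=0)   # per-database mean accuracy for method A
b_means = B.mean(axis=0)   # per-database mean accuracy for method B

# paired tests on the 5 database means
t, p_t = stats.ttest_rel(a_means, b_means)
w, p_w = stats.wilcoxon(a_means, b_means)
print(p_t, p_w)
```

With n = 5 databases neither test can reach a very small p-value, which is why adding databases helps far more than adding runs per database.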

Get this bounty!!!

#StackBounty: #hypothesis-testing #post-hoc #friedman-test How must the null hypothesis be rejected for Nemenyi test to be allowed?

Bounty: 50

Suppose we have algorithms $A_k$ and $B_k$, $1\leq k\leq n$, and $\alpha = 0.05$.

The Friedman test for the algorithms $A_k$ shows that the null hypothesis that all $A_k$s perform equally well can be rejected. Suppose $p_A = 10^{-5}\ll\alpha$.

The Friedman test for the algorithms $B_k$ shows that the null hypothesis that all $B_k$s perform equally well cannot be rejected. Suppose $p_B = 0.4$.

My problem is that the $p$-value $p_{AB}$ from the Friedman test for the $A_k$s and $B_k$s together is greater than $\alpha$, hence I cannot apply the Nemenyi post-hoc test if I blindly follow the procedure

  • perform Friedman on algorithms
  • if $H_0$ is rejected, proceed to Nemenyi on the same algorithms

Question: Are the results from Nemenyi test for all algorithms still valid?

I am not sure, but I would say yes because

  • I have shown that already not all $A_k$s have the same performance (and the results would still be significant with $2p_A$, i.e., if I used Bonferroni correction for multiple-hypothesis testing)
  • Nemenyi for $A_k$s and $B_k$s found two groups of algorithms, i.e., some differences.
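
The gatekeeping procedure in question can be sketched as follows (Python illustration with hypothetical scores; scipy provides the Friedman test, while Nemenyi itself lives in third-party packages such as scikit-posthocs, so only the gate is shown):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# hypothetical benchmark: 3 algorithms scored on 12 datasets,
# with algorithm 3 shifted upward so H0 should be rejected
scores = rng.normal(size=(12, 3))
scores[:, 2] += 2.0

# Friedman test: are the per-dataset rank distributions equal?
stat, p = stats.friedmanchisquare(scores[:, 0], scores[:, 1], scores[:, 2])

alpha = 0.05
if p < alpha:
    # only now is a post-hoc test (e.g. Nemenyi) justified by the procedure
    print("H0 rejected (p = %.2g) -> proceed to Nemenyi post-hoc" % p)
else:
    print("H0 not rejected (p = %.2g) -> stop" % p)
```

Note this sketch illustrates the mechanical procedure only; whether post-hoc results remain valid when the omnibus test over *all* algorithms fails is exactly the statistical question being asked.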

Get this bounty!!!

#StackBounty: #r #hypothesis-testing #multiple-regression #regression-coefficients #car Probing effects in a multivariate multiple regr…

Bounty: 50

I’m trying to run a multivariate multiple regression in R, i.e. including multiple predictors and multiple outcome variables in the same linear regression model. Does anybody know how to pull out the coefficients and p-values for the relationship between each predictor and outcome pair in a multivariate multiple regression? I cannot seem to work out how to do that (I’ve been trying!).

Let me explain with an example dataset, if it helps:

#Create example dataset
df <- data.frame(pid=factor(501), y1=numeric(501), y2=numeric(501), 
y3=numeric(501), y4=numeric(501), x1=factor(501), x2=factor(501), 
x3=factor(501), x4=numeric(501), x5=numeric(501))
df$pid <- seq(1,501, by=1)
df$y1 <- seq(1,101, by=0.2)
df$y2 <- seq(401,201, by=-0.4)
df$y3 <- sqrt(rnorm(501, 7, 0.5))^3
df$x1 <- c(rep(c("sad","happy"), each=250), "sad")
df$x2 <- c(rep(c("human","vehicle","animal"), each=167))
df$x3 <- c(rep(seq(1,10, by=0.1), each=5), seq(1,46, by=1))
df$x4 <- rnorm(501, 3, .24)
df$x5 <- sqrt(rnorm(501, 23, 3.5))    

I then create the model using this:

#Specify the regression model
model <- lm(cbind(y1, y2, y3) ~ x1 + x2 + x3 + x4 + x5, data=df)

I can’t simply use summary(model), as doing so reports separate per-outcome regressions without accounting for familywise error, nor does it account for the dependent variables possibly being correlated.

Reiterating my question: does anybody know how to pull out the output so I can work out the coefficients and p-values for each predictor–outcome pair within the same model? For example, I want to work out the coefficients and p-values of:

x1, x2, x3, x4 and x5 on y1
x1 and x2 on y2
x1, x2, x3, x4 and x5 on y3
... etc etc

I tried the car package:

modelanova <- car::Anova(model)

However, I couldn’t get it to break the results down by outcome variable; it only produces overall tests (as if a composite outcome variable had been created).

Any ideas would be wonderful. I know I could run several univariate multiple regressions but I am particularly interested in running a single multivariate multiple regression.
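
One fact worth keeping in mind (sketched below in Python with numpy, using hypothetical data, since the question's code is R): in a multivariate linear model $Y = XB + E$, the OLS coefficient matrix is identical to fitting each outcome separately. The multivariate machinery changes the *tests* (e.g. Pillai's trace from `car::Anova`), not the point estimates, so per-pair coefficients can always be read off the joint fit.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
# design matrix: intercept plus two hypothetical predictors x1, x2
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
B_true = np.array([[1.0, 0.0],
                   [0.5, 0.2],
                   [0.0, 1.5]])
# two hypothetical outcomes y1, y2 with correlated-by-construction errors
Y = X @ B_true + rng.normal(size=(n, 2))

# joint fit: one least-squares call for both outcomes at once
B_joint, *_ = np.linalg.lstsq(X, Y, rcond=None)

# equation-by-equation fits, one outcome at a time
B_sep = np.column_stack([np.linalg.lstsq(X, Y[:, j], rcond=None)[0]
                         for j in range(Y.shape[1])])

# the coefficient matrices agree exactly (up to floating point)
print(np.allclose(B_joint, B_sep))
```

In R terms, `coef(model)` on the `lm(cbind(...) ~ ...)` fit gives this same coefficient matrix; the multivariate-aware p-values per predictor then come from tests such as `car::Anova(model)` or `car::linearHypothesis`, rather than from the coefficients themselves.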

Get this bounty!!!

#StackBounty: #r #hypothesis-testing #distributions #statistical-significance How to check if a value doesn't appear by chance with…

Bounty: 50

I have the following randomly generated distribution:

mean=100; sd=15
x <- seq(-4,4,length=100)*sd + mean
hx <- dnorm(x,mean,sd)

plot(x, hx, type="l", lty=2, xlab="x value",
     ylab="Density", main="Some random distribution")


And a “non-random” value

x <- seq(-4,4,length=100)*10 + mean
ux <- dunif(x = x, min=10, max=100)
non_random_value <- ux[1]
# [1] 0.01111111

I’d like a statistic that shows whether non_random_value is
significant, i.e., unlikely to arise by chance with respect to hx.

What is a reasonable statistical test for that?
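
One caveat, then a sketch: `non_random_value` above is a *density* value (output of `dunif`), and a density cannot be directly compared against another density curve; a test needs an observed value on the same scale as the distribution. Assuming the intended question is "how extreme is a given x value under N(100, 15)?", a minimal formalisation is a z-score with a two-sided tail probability (Python illustration; the observed value is hypothetical):

```python
from scipy import stats

mean, sd = 100, 15          # parameters of the null distribution above
value = 142.0               # hypothetical observed value to test

# standardize and compute the two-sided tail probability under N(mean, sd)
z = (value - mean) / sd
p_two_sided = 2 * stats.norm.sf(abs(z))
print(z, p_two_sided)
```

Here z = 2.8, so the value lies 2.8 standard deviations from the mean and would arise by chance with probability about 0.005 under the null.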

Get this bounty!!!

#StackBounty: #time-series #hypothesis-testing #statistical-significance #t-test #binomial Hypothesis / significance tests to study cor…

Bounty: 50

Assume I have two time series, A(t) and B(t). I know the exact signals in A(t). I want to find out when a known signal in A(t) is causing a signal in B(t).

(Figure illustrating the two time series)

  • In case of pure noise the data in both is distributed around the mean (not necessarily Gaussian, as there is systematic noise even after pre-whitening/detrending).
  • Any signal will increase in a time-dependent manner, peak, and decrease.
  • A(t) can cause a signal in B(t), but will not always cause one.
  • If it causes a signal, the impact is immediate (at lag 0).
  • Both have independent (systematic) noise properties.
  • If there is no signal in A(t), no signal is expected in B(t)
    (though there might be systematic noise that looks similar).

My current analyses:

  1. First, I look at the data from the time-series point of view, and perform auto-correlation, cross-correlation and rolling correlation analyses. I use the signal-to-noise ratio in e.g. the cross-correlation as a handle on ‘significance’ (not in a statistical sense).
  2. Second, I try to explore other statistical options. Assume I know when in time the signal in A(t) starts and ends. I extract the time points in B(t) at these times as a separate sample, B*.
    • I perform a t-test to see whether B* data is distributed around the mean of B(t).
    • I perform a binomial test to see whether B* data is randomly distributed around the mean, or biased in some direction.

    I.e., if B* data contains a signal, both tests will reject the null hypotheses. However, this somewhat decreases my signal-to-noise ratio, as it includes the ramping parts of the signal, which are not that far off the mean.

My questions are:

  1. are any of these tests redundant?
  2. how do I retrieve values of ‘statistical significance’ from the cross-correlation etc?
  3. are there any other possibilities to analyse this data that I am not thinking of (either as time series or as the separate sample B*)?
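
For question 2, one standard way to attach a p-value to the lag-0 cross-correlation is a permutation test: shuffle B, recompute the correlation, and count how often the shuffled correlation is at least as extreme as the observed one. A minimal Python sketch with hypothetical data (note the important caveat in the comment):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
# hypothetical series: B shares a lag-0 component with A plus its own noise
a = rng.normal(size=n)
b = 0.3 * a + rng.normal(size=n)

# observed lag-0 correlation
obs = np.corrcoef(a, b)[0, 1]

# permutation null: shuffling b breaks any real a-b relationship
n_perm = 2000
perm = np.empty(n_perm)
for i in range(n_perm):
    perm[i] = np.corrcoef(a, rng.permutation(b))[0, 1]

# two-sided p-value with the +1 correction for a valid permutation test
p = (np.sum(np.abs(perm) >= abs(obs)) + 1) / (n_perm + 1)
print(obs, p)
```

Caveat: naive shuffling destroys autocorrelation, so with the systematic (autocorrelated) noise described above, this null is too optimistic; block permutation or surrogate series that preserve the autocorrelation structure would be the safer variant of the same idea.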

Get this bounty!!!

#StackBounty: #hypothesis-testing #modeling #covariate Hypothesis testing: Why is a null model that fits the data well better than one …

Bounty: 50

Let’s say we have two models: a null model, $M_0$, and an alternative model $M_1$. The only difference between them is that, in $M_0$ one parameter is fixed at $0$ and in $M_1$, that parameter is fixed at the value that maximizes the likelihood of model $M_1$. This is a typical setup for a likelihood ratio test.

My intuition is that the better $M_0$ describes the data-generating process, and thus the less the residual variation in the fitted model, the better. By “better”, I mean for a given sample size, effect size, and false positive rate, I will have more power to reject the null.

That’s a bit hand-wavey. I’ll make a simulation with a linear regression model.


lm_lrs_no_covar <- function(y, x) {
  2*(logLik(lm(formula = y ~ x)) - logLik(lm(formula = y ~ 1)))
}

lm_lrs_yes_covar <- function(y, x, z) {
  2*(logLik(lm(formula = y ~ x + z)) - logLik(lm(formula = y ~ z)))
}

n <- 1e2
num_sims <- 1e4

no_covar <- yes_covar <- rep(NA, num_sims)

for (sim_idx in 1:num_sims) {

  x <- runif(n = n)
  z <- runif(n = n)
  y <- rnorm(n = n, mean = 0.2*x + z)

  yes_covar[sim_idx] <- lm_lrs_yes_covar(y = y, x = x, z = z)
  no_covar[sim_idx] <- lm_lrs_no_covar(y = y, x = x)
}

plot(x = sort(no_covar),
     y = sort(yes_covar),
     type = 'l')
abline(a = 0, b = 1)

In fact, this plot shows that the LR statistic from the model with the covariate has the same distribution as the LR statistic from the model without it. Since both the “with covariate” and “without covariate” tests have the same difference in degrees of freedom between null and alternative models (1), they both have the same power to reject the null.


But in the context of Wald testing, it seems obvious that improving the model fit, and thus reducing the residual variance, must improve the standard error of the estimate and therefore the power to reject the null at any given effect size.

Where have I gone wrong? Surely a model that fits the data well must outperform one that doesn’t? Surely any benefit reaped by a Wald test must somehow also be reaped by an LR test? (Neyman-Pearson lemma)
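
As a cross-check, the same simulation can be re-run comparing estimated *power* rather than sorted statistics (a Python/numpy sketch of the setup above, not the asker's exact code; it uses the identity that for nested Gaussian linear models the LR statistic equals n·log(RSS0/RSS1)):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, num_sims = 100, 2000

def lr_stat(y, X0, X1):
    # 2*(logLik alt - logLik null) = n * log(RSS0 / RSS1) for Gaussian lm
    def rss(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return np.sum((y - X @ beta) ** 2)
    return y.size * np.log(rss(X0) / rss(X1))

crit = stats.chi2.ppf(0.95, df=1)   # 5% cutoff, 1 df difference
rej_no = rej_yes = 0
ones = np.ones(n)
for _ in range(num_sims):
    x = rng.uniform(size=n)
    z = rng.uniform(size=n)
    y = rng.normal(loc=0.2 * x + z)
    # test for the x effect, without and with the covariate z
    rej_no += lr_stat(y, ones[:, None], np.column_stack([ones, x])) > crit
    rej_yes += lr_stat(y, np.column_stack([ones, z]),
                       np.column_stack([ones, x, z])) > crit

print(rej_no / num_sims, rej_yes / num_sims)
```

Note that because z is uniform on (0,1), its variance (1/12) is small relative to the unit error variance, so any power gain from adjusting for it is correspondingly small in this particular simulation.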

Get this bounty!!!

#StackBounty: #hypothesis-testing #correlation #statistical-significance #anova #multilevel-analysis Correlation or difference? appropr…

Bounty: 50

In a comparative study on human motion and perception in both reality and virtual reality, I am examining differences and relations between specific entities.

In two separate experiments, TTG Estimation and Distance Estimation data were collected from the same set of test persons who had following tasks:

  • TTG Estimation task: estimation of a Time-To-Go in [seconds] – test subjects are asked to signal the point in time (in Reality and Virtual Reality) at which they feel there is just not enough time left to safely cross the street before the arrival of a vehicle (at 3 different vehicle speeds)
  • Distance Estimation task: test subjects are asked to estimate the distance in [m] between them and a pylon (in Reality and Virtual Reality)

The following table visualises the experiment design with the corresponding variables to be examined:

|                 | TTG Estimation Experiment   | Distance Estimation Experiment  |
+                 +-----------------------------+---------------------------------+
|                 | 30 km/h | 35 km/h | 40 km/h | 8m | 9m | 10m | 11m | 12m | 13m |
| Reality         |         |         |         |    |    |     |     |     |     |
| Virtual Reality |         |         |         |    |    |     |     |     |     |
| Test Variables  |     abs(TTG_R - TTG_VR)     |      abs(Dist_R - Dist_VR)      |

So in general I’m interested in the answer to the following question:
Is there a relation between the estimation ability/quality for TTG and the estimation ability/quality for Distance? Or in other words …
Do test persons with a good estimation for the TTG also give a good estimation for the Distance?

So my questions are:
a) Which hypothesis approach is more appropriate: a comparison between the two test variables, or a correlation analysis?
I am confused by the fact that the test variables |TTG_R – TTG_VR| and |Dist_R – Dist_VR| have different numbers of levels (TTG: 3 vehicle speeds; Distance: 6 distances) and are measured in different units (TTG: s; Distance: m).

b) What exact method/test can I use to address my point of interest? Does a difference hypothesis make sense (e.g. ANOVA with multiple factors)?
I have thought of normalising the test variables via z-scores and checking for bivariate correlation, but then I immediately realise there are different numbers of test levels for each variable (TTG: 3 vehicle speeds; Distance: 6 distances).
And regarding a difference hypothesis, I am not sure whether the test levels can be used as factors.
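
One concrete version of the correlation approach, under stated assumptions (Python sketch with entirely hypothetical data; the study's software is Matlab/SPSS): collapse each subject's absolute R–VR differences to a single error score per task by averaging over the levels, which sidesteps the unequal-levels problem, and then use a rank correlation, which is unit-free and so unaffected by seconds vs. metres.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_subj = 30
# hypothetical per-subject |R - VR| errors: TTG has 3 levels (speeds),
# distance has 6 levels; a shared "estimation skill" links the two tasks
skill = rng.normal(size=n_subj)
ttg_err  = np.abs(rng.normal(size=(n_subj, 3)) + skill[:, None])  # unit: s
dist_err = np.abs(rng.normal(size=(n_subj, 6)) + skill[:, None])  # unit: m

# one summary score per subject and task (averaging removes the
# different-number-of-levels asymmetry)
ttg_score  = ttg_err.mean(axis=1)
dist_score = dist_err.mean(axis=1)

# Spearman rank correlation: are good TTG estimators also good
# distance estimators? Rank-based, so the units do not matter.
rho, p = stats.spearmanr(ttg_score, dist_score)
print(rho, p)
```

Averaging over levels discards any speed- or distance-specific effects; if those matter, a multilevel model with level as a within-subject factor would be the finer-grained alternative to this collapse-then-correlate sketch.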

Available software for evaluation:

  • Matlab
  • SPSS

Get this bounty!!!

#StackBounty: #r #hypothesis-testing #classification #experiment-design #double-blind Design a double-blind expert test when there is i…

Bounty: 50

I would like to submit to an expert group a set of images from two genetically different types of plants to see if they can find a difference. I have 20 images for each of the two types. The images show cells that are so specific that only a small group of 10 experts is able to see the potential differences (without knowing which type is which).

To have a powerful test, I chose to work with a two-out-of-five test: I show the experts five images, of which 3 are from type 1 and 2 are from type 2, and they have to sort them into two groups. I chose this test because of its power: the probability of grouping the images correctly by chance is only 1/10. Because the number of experts is really small, this test is, in my view, more powerful than a triangle test or a simple two-image test.

The intra-type variability of the images is quite high. If there is a difference between the two types, it will be small within this set of 2*20 images. Hence, if there is a difference, I would like to know which images are the most different. I cannot show more than ~50 sets of 5 images to my experts, as they do not have a lot of time to perform the analysis. The problem is that there are about 400000 possible sets of 5 images (3 type 1 – 2 type 2).

I think that choosing my 50 sets randomly among the 400000, for each expert, is not really representative. It would be more interesting to choose the sets so that each image is compared to every other image, so that I can decide whether they are similar or different (which can be inferred from the 2-out-of-5 grouping). With random selection I cannot be sure that I have tested all images against all others (at least indirectly) with my only 10 experts. Thus, I decided to create a sample of 50 sets, different for each expert, that maximises the number of image pairs compared either directly or indirectly. However, with this sampling some pairs of images are compared more often than others, which is why I constrained my selection so that each pair is compared at least 3 times, directly or indirectly.

This seems to be a quite complicated sampling procedure. To me it is the only one that guarantees I will be able to correctly compare all my images in a few shots per expert, knowing that once I have shown the sets to my experts, I will not have another chance to find a second expert group, and they won’t be able to take more time to do it again.

To summarize, I want to know :

  1. Are my two types different ?
  2. Which cell images are more different than the others ?

Do you think I am going about this in too complicated a way? Is random selection a better solution, even if I will only be able to test 50 sets * 10 experts among the 400000 possible sets? How can I be sure not to bias my set-selection procedure?
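
For question 1 (are the two types different at all?), the natural analysis of two-out-of-five outcomes is an exact binomial test against the 1/10 chance level. A quick Python sanity check of that chance level plus the test one would run (the counts are hypothetical; note also that trials sharing images are not strictly independent, which this simple test ignores):

```python
from math import comb
from scipy import stats

# chance of correctly picking which 2 of the 5 images form the minority
# group: one correct grouping out of C(5,2) possible ones
p_chance = 1 / comb(5, 2)
print(p_chance)  # 0.1

# hypothetical results: 10 experts x 50 sets = 500 trials, 89 correct
res = stats.binomtest(k=89, n=500, p=p_chance, alternative='greater')
print(res.pvalue)
```

If the experts perform above chance overall, question 2 (which images differ most?) can then be approached by tallying, per image pair, how often the pair was grouped together versus apart across the sets in which both appeared.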

By the way, I work with R, but I don’t think this is really important for this question

Get this bounty!!!

#StackBounty: #time-series #hypothesis-testing #population Population models – statistical analysis over time of subpopulation expansion

Bounty: 50

I am struggling with a statistical method to assess whether two populations have “changed” over time.

In my case I have a clonal bacterial species: some strains carry a mutation conferring resistance to drugs and others do not (i.e., they are susceptible). Under drug therapy, one would expect a small resistant clonal subpopulation to eventually dominate the population (because everything else is eventually killed off).

In my case, I will first attempt some test-tube experiments. Obviously, one can set up a highly controlled experiment where the starting condition is 1% of a single resistant mutant and 99% of the susceptible clonal strain, add drugs, and watch over time. In this case I was going to try a non-parametric test for trend, such as Mann-Kendall, and then use Sen’s slope estimation.
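
The Mann-Kendall test mentioned above is simple enough to sketch from scratch (Python illustration using the normal approximation, ignoring the ties correction for the variance; the resistant-fraction series is hypothetical):

```python
import numpy as np
from scipy import stats

def mann_kendall(x):
    """Mann-Kendall trend test: S statistic and two-sided p-value."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # S = (number of increasing pairs) - (number of decreasing pairs)
    s = 0.0
    for i in range(n - 1):
        s += np.sum(np.sign(x[i + 1:] - x[i]))
    # variance of S under H0 (no ties correction in this sketch)
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    z = (s - np.sign(s)) / np.sqrt(var_s)   # continuity correction
    return s, 2 * stats.norm.sf(abs(z))

# hypothetical resistant-strain proportions measured over time
frac = [0.01, 0.02, 0.02, 0.05, 0.08, 0.15, 0.30, 0.55, 0.80, 0.95]
s, p = mann_kendall(frac)
print(s, p)
```

This works directly on proportions, which matters here since the genetic assay yields proportions rather than counts; for the multi-mutation 'nature' setting, one such trend test per mutation (with a multiplicity correction) is one pragmatic starting point.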

However, I want to also look at this situation in ‘nature’ where there are multiple mutations and I am struggling to conceptualize how to analyze my data. Also, to make things worse, I will instead of being able to count population size, unfortunately my genetic tests will only give me the proportion of the sample (0-100%) for each mutation.

The closest analogous situation I can find is Clonal Interference (see bottom image below which comes from https://en.wikipedia.org/wiki/Clonal_interference).

Any suggestions?

Clonal interference

Get this bounty!!!

#StackBounty: #hypothesis-testing #variance #heteroscedasticity #breusch-pagan Test of heteroscedasticity for a categorical/ordinal pre…

Bounty: 100

I have different numbers of measurements from various classes. I used a one-way ANOVA to see if the means of the observations differ between classes. This used the ratio of the between-class variance to the total variance.

Now, I want to test whether some classes (basically those with more observations) have a larger variance than expected by chance. What statistical test should I do? I can calculate the sample variance for each class, and then find the $R^2$ and p-value for the correlation of the sample variance vs. class size. Or in R, I could do

summary(lm(sampleVar ~ classSize))

But the variance of the estimator of the variance (the sample variance) depends on the sample size, even for random data.

For example, I generate some random data:

dt <- as.data.table(data.frame(obs=rnorm(4000), clabel=as.factor(sample(x = c(1:200),size = 4000, replace = T, prob = 5+c(1:200)))))

I compute the sample variance and class sizes

dt[,classSize := length(obs),by=clabel]; dt[,sampleVar := var(obs),by=clabel]

and then test to see if variance depends on the class size

summary(lm(data=unique(dt[,.(sampleVar, classSize),by=clabel]),formula = sampleVar ~ classSize))
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.858047   0.056605  15.159   <2e-16 ***
classSize   0.006035   0.002393   2.521   0.0125 *  

There seems to be a dependence of the variance with the class size, but this is simply because the variance of the estimator depends on the sample size. How do I construct a statistical test to see if the variances in the different classes are actually dependent on the class sizes?

If the variable I was regressing against were a continuous variable instead of the ordinal variable classSize, I could have used the Breusch-Pagan test.

For example, I could do
fit <- lm(data=dt, formula= obs ~ clabel)
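
One assumption-light way to build the desired test, sketched here in Python (the question's code is R/data.table): under the null, observations are exchangeable across classes, so permuting the observations while keeping the class labels (and hence the class sizes) fixed preserves the sample-size-driven variability of the variance estimator but breaks any genuine size-variance link. Comparing the observed size-variance correlation against this permutation null avoids the artifact the lm() approach suffers from.

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_classes = 4000, 200
# class sizes skewed as in the R example: larger labels are more probable
weights = 5 + np.arange(n_classes)
labels = rng.choice(n_classes, size=n, p=weights / weights.sum())
obs_data = rng.normal(size=n)   # null-true data: same variance everywhere

def size_var_corr(x, lab):
    """Correlation between class size and within-class sample variance."""
    sizes = np.bincount(lab, minlength=n_classes)
    keep = sizes > 1                         # variance needs >= 2 points
    variances = np.array([x[lab == k].var(ddof=1)
                          for k in np.flatnonzero(keep)])
    return np.corrcoef(sizes[keep], variances)[0, 1]

obs = size_var_corr(obs_data, labels)

# permutation null: shuffle observations, keep the class-size structure
perm = np.array([size_var_corr(rng.permutation(obs_data), labels)
                 for _ in range(200)])
p = (np.sum(np.abs(perm) >= abs(obs)) + 1) / (len(perm) + 1)
print(obs, p)
```

Because the permuted datasets share the exact class-size distribution, a spurious "variance grows with class size" pattern caused purely by the estimator's sampling variability shows up in the null distribution too, and so does not produce a small p-value.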

Get this bounty!!!