#StackBounty: #hypothesis-testing #nonparametric Runs test based on the median intuition

Bounty: 50

The following document http://www.ifp.illinois.edu/~ywang11/paper/ECE461Proj.pdf contains a very nice summary of some nonparametric tests for randomness. My question concerns the popular Runs test based on the median. In this test, under the null hypothesis of randomness, every arrangement of "+" and "-" signs is supposedly equiprobable.

According to this document, assume that the ordered sequence has $n$ samples, $n_{1}$ of "+", $n_{2}$ of "-", and $n=n_{1}+n_{2}$. Also, we denote the total number of runs of "+" as $R_{1}$, the number of runs of "-" as $R_{2}$, and the total number of runs as $R=R_{1}+R_{2}$. The author derives, for an even number of runs $r$, the pdf of $R$ is
$$
f_{R}(r)=\frac{\binom{n_{1}-1}{r/2-1}\binom{n_{2}-1}{r/2-1}}{\binom{n_{1}+n_{2}}{n_{1}}}
$$
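
For intuition about the null model (every arrangement of $n_{1}$ "+" signs and $n_{2}$ "-" signs equally likely, conditional on $n_{1}$ and $n_{2}$), here is a small simulation of the null distribution of $R$ in R. It is a sketch of my own, not from the linked notes, and the choice $n_{1}=6$, $n_{2}=4$ is arbitrary:

# Simulate the null distribution of the total number of runs R, given n1 and n2
set.seed(1)
n1 <- 6; n2 <- 4
count_runs <- function(x) sum(x[-1] != x[-length(x)]) + 1  # total number of runs

R <- replicate(1e5, {
  s <- sample(c(rep("+", n1), rep("-", n2)))  # one equally likely arrangement
  count_runs(s)
})
round(prop.table(table(R)), 3)  # empirical null distribution of R given n1, n2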

I have two questions:

  1. Why are $n_{1}$ and $n_{2}$ allowed to take on any values and still
    be compatible with randomness? For instance, imagine that $n_{1}=n-1$
    and $n_{2}=1$, so that there is only a single "-" and every other
    position is occupied by a "+". This does not seem random to me. In
    other words, shouldn't a null distribution of randomness imply values
    of $n_{1}$ and $n_{2}$ as well?
  1. How is the pdf derived? Why do we subtract $1$ from $n_{1}$, $n_{2}$,
    and $r/2$?

Any guidance on this is much appreciated.


Get this bounty!!!

#StackBounty: #hypothesis-testing #nonparametric Runs Up and Down Test intuition

Bounty: 50

The following document http://www.ifp.illinois.edu/~ywang11/paper/ECE461Proj.pdf contains a very nice summary of some nonparametric tests for randomness. My question concerns the popular runs up and down test. In this test, under the null hypothesis of randomness, every arrangement of "+" and "-" signs is supposedly equiprobable.

According to this document, assume that the ordered sequence has $n$ samples, $n_{1}$ of "+", $n_{2}$ of "-", and $n=n_{1}+n_{2}$. Also, we denote the total number of runs of "+" as $R_{1}$, the number of runs of "-" as $R_{2}$, and the total number of runs as $R=R_{1}+R_{2}$. The author derives, for an even number of runs $r$, the pdf of $R$ is
$$
f_{R}(r)=\frac{\binom{n_{1}-1}{r/2-1}\binom{n_{2}-1}{r/2-1}}{\binom{n_{1}+n_{2}}{n_{1}}}
$$

I have two questions:

  1. Why are $n_{1}$ and $n_{2}$ allowed to take on any values and still
    be compatible with randomness? For instance, imagine that $n_{1}=n-1$
    and $n_{2}=1$, so that there is only a single "-" and every other
    position is occupied by a "+". This does not seem random to me. In
    other words, shouldn't a null distribution of randomness imply values
    of $n_{1}$ and $n_{2}$ as well?
  1. How is the pdf derived? Why do we subtract $1$ from $n_{1}$, $n_{2}$,
    and $r/2$?

Any guidance on this is much appreciated.


Get this bounty!!!

#StackBounty: #nonparametric #stata Which one is the correct specification to estimate Nonparametric regressions with discrete and cont…

Bounty: 50

I was trying to manually implement the estimation of a nonparametric regression using a local-linear approximation with a mixture of discrete and continuous data.
Consider a simple model:
$y=f(xc,xd)$
where xc is continuous and xd is discrete.

Say that I want to estimate this model nonparametrically. Which of the following two regressions is the correct one (assuming local-linear estimation)?

1:
$$y = a_0 + a_1 (x_c - c) + e$$
2:
$$y = a_0 + a_1 (x_c - c) + a_2 x_d + e$$

Assume that both models are estimated using the correct kernel weights and that xd is a dummy.

I thought the correct model was (1), but npregress in Stata uses (2). Which one would be the correct one?

Thank you

EDIT:
Perhaps a different way to ask the same question:
Say that you have 3 variables, y, xc (continuous), and xd (discrete), and that you want to estimate, nonparametrically using local-linear kernel estimation, the model
$$y=f(x_c,x_d)$$
Empirically, how would you estimate this model using WLS? Which is the correct specification, equation 1 or equation 2 (assuming the weights are appropriately obtained)?
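
For concreteness, here is a minimal R sketch of what I mean by the two specifications, fitted by weighted least squares at a single evaluation point. The Gaussian kernel for xc, the Aitchison-Aitken-style weight for the dummy, and the bandwidth values are arbitrary choices for illustration, not tuned ones:

# Local-linear fit at a point (c0, d0) via weighted least squares,
# contrasting specification (1) with specification (2)
set.seed(1)
n  <- 200
xc <- runif(n)                      # continuous regressor
xd <- rbinom(n, 1, 0.5)             # dummy regressor
y  <- sin(2 * pi * xc) + 0.5 * xd + rnorm(n, sd = 0.2)

c0  <- 0.5                          # evaluation point for xc
d0  <- 1                            # evaluation point for xd
h   <- 0.1                          # bandwidth for xc (assumed, not cross-validated)
lam <- 0.3                          # discrete-kernel parameter in [0, 1] (assumed)

w_c <- dnorm((xc - c0) / h)         # Gaussian kernel weights for xc
w_d <- ifelse(xd == d0, 1, lam)     # Aitchison-Aitken-style weights for xd
w   <- w_c * w_d

fit1 <- lm(y ~ I(xc - c0), weights = w)        # specification (1): xd only in the weights
fit2 <- lm(y ~ I(xc - c0) + xd, weights = w)   # specification (2): xd also in the local polynomial

coef(fit1)[1]                                  # estimate of f(c0, d0) under (1)
coef(fit2)[1] + coef(fit2)["xd"] * d0          # estimate under (2)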


Get this bounty!!!

#StackBounty: #references #nonparametric Nonparametric theory textbook(s)?

Bounty: 200

I am looking for a nonparametric statistical theory textbook that does not avoid tools from measure-theoretic probability and covers proofs on topics outside of rank-based hypothesis tests: e.g., kernel density estimation, nonparametric estimation, and resampling methods (i.e., bootstrapping, jackknife).

The closest textbook I can find to this is Tsybakov’s Introduction to Nonparametric Estimation. What can I use to supplement this textbook?

Wasserman’s All of Nonparametric Statistics has the appropriate topic coverage, but does not really cover proofs.


Get this bounty!!!

#StackBounty: #probability #hypothesis-testing #nonparametric #distance #vegan How is pairwise PERMANOVA/adonis a valid non-parametric …

Bounty: 50

Assume that we have taken independent random samples of several individuals from 5 locations that represent 5 populations. The design is fairly unbalanced: the number of individuals sampled from each location is variable (example in code). For each individual, we measured some continuous random variable (e.g., the concentration of some chemical, assuming its value is purely a function of where the respective individual was sampled), and we wish to understand which locations are different using this variable (the concentration of the chemical).
I will simulate example of these data here:

set.seed(123)
data <- data.frame(group = factor(rep(c(paste0("G",1:5)), c(10,24,10,12,9))),
                   val = c(rnorm(10, mean=1.34,sd=0.17), 
                           rnorm(24, mean = 1.14, sd=0.11),
                           rnorm(10, mean=1.19, sd=0.15),
                           rnorm(12, mean=1.06, sd=0.11),
                           rnorm(9, mean=1.09, sd = 0.10)))

Let us assume that group denotes the 5 sampling locations, and val denotes the concentration of the chemical in each sample.

Assume that we have conducted a series of omnibus non-parametric tests (using distance matrices, PERMANOVA) that suggest there are differences in the concentration of the chemical between the groups. We want to conduct pairwise analyses to see which groups are different, so we calculate a dissimilarity matrix (using the vegan package in R for this example) for these data using Euclidean distance:

library(vegan)
dis.mat <- vegdist(data$val, method = "euclidean")

There are several packages out there with functions that conduct pairwise versions of what is generally referred to as the PERMANOVA / permutational MANOVA / adonis test (based on the approach of Marti J. Anderson: Anderson, M.J. 2001. A new method for non-parametric multivariate analysis of variance. Austral Ecology, 26: 32-46).
Some examples can be found in these links:
https://www.rdocumentation.org/packages/RVAideMemoire/versions/0.9-77/topics/pairwise.perm.manova
https://github.com/pmartinezarbizu/pairwiseAdonis/blob/master/pairwiseAdonis/R/pairwise.adonis.R
https://rdrr.io/github/gauravsk/ranacapa/man/pairwise_adonis.html
https://rdrr.io/github/GuillemSalazar/EcolUtils/man/adonis.pair.html

Most of the examples I run across that demonstrate how to use and interpret these approaches in R are applied to count data (mostly species composition in ecological data, which corresponds with Anderson's intentions). However, there are examples of these approaches being applied in other situations, notably, in my case of interest, to continuous data such as that presented above. In a situation like the one I have proposed, we can calculate a dissimilarity matrix using Euclidean distance and perform the pairwise version of this procedure (I will use the pairwise.adonis() function by @pmartinezarbizu, 2nd link above):

library(pairwiseAdonis)
#default is 999 permutations
res<-pairwiseAdonis::pairwise.adonis(dis.mat, data[,"group"])
res[,3:5] <- round(res[,3:5],2)  
res

      pairs Df SumsOfSqs F.Model   R2 p.value p.adjusted sig
1  G1 vs G2  1      0.26   19.95 0.38   0.001       0.01   *
2  G1 vs G3  1      0.09    5.06 0.22   0.038       0.38    
3  G1 vs G4  1      0.28   23.98 0.55   0.001       0.01   *
4  G1 vs G5  1      0.34   18.72 0.52   0.001       0.01   *
5  G2 vs G3  1      0.02    1.78 0.05   0.172       1.00    
6  G2 vs G4  1      0.01    0.95 0.03   0.323       1.00    
7  G2 vs G5  1      0.04    2.92 0.09   0.093       0.93    
8  G3 vs G4  1      0.05    3.87 0.16   0.077       0.77    
9  G3 vs G5  1      0.09    4.62 0.21   0.047       0.47    
10 G4 vs G5  1      0.01    0.78 0.04   0.376       1.00    

Am I incorrect in saying that this is equivalent to a simple series of pairwise ANOVAs, with p-values given by the probability of the observed F statistic under the empirical null distribution generated through random permutations of group membership (or "location" membership in this case)? If so, how can this be a valid non-parametric approach to pairwise comparisons? Let me explain:

Let's use groups 1 and 5 (G1 and G5) as an example:

library(dplyr)
ex <- c("G1","G5")
se <- function(x) sd(x) / sqrt(length(x))
data%>%
  dplyr::filter(., group %in% ex)%>%
  group_by(group)%>%
  summarise_at(., "val", list(mean=mean,med=median,sd=sd,se=se))%>%
  mutate(across(where(is.numeric), ~ round(.x, 2)))

# A tibble: 2 x 5
  group  mean   med    sd    se
  <fct> <dbl> <dbl> <dbl> <dbl>
1 G1     1.35  1.33  0.16  0.05
2 G5     1.05  1.06  0.07  0.02

We know they were heteroscedastic to begin with, but let's still fit an lm and look at the residual plots:

ex <- c("G1","G5")
dat2 <- data%>%dplyr::filter(., group %in% ex)
summary(lm(val~group, dat2))
plot(lm(val~group, dat2))

Residuals:
     Min       1Q   Median       3Q      Max 
-0.22775 -0.08020 -0.00070  0.06125  0.27887 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.35269    0.04061  33.306  < 2e-16 ***
groupG5     -0.29792    0.05901  -5.049  9.9e-05 ***

[Residual diagnostic plots for the fitted linear model]

Now let's run a regular ANOVA:

ex <- c("G1","G5")
dat2 <- data%>%dplyr::filter(., group %in% ex)
summary(aov(val ~ group, data = dat2))

            Df Sum Sq Mean Sq F value  Pr(>F)    
group        1 0.4204  0.4204   25.49 9.9e-05 ***
Residuals   17 0.2804  0.0165    

Now let's run the pairwise adonis procedure for just these two groups:

set.seed(123)
dis.mat <- vegdist(dat2$val, method="euclidean")
res<-pairwiseAdonis::pairwise.adonis(dis.mat, dat2[,"group"])
res[,3:5] <- round(res[,3:5],2)  
res

     pairs Df SumsOfSqs F.Model  R2 p.value p.adjusted sig
1 G1 vs G5  1      0.42   25.49 0.6   0.001      0.001  **

We get the same observed model, as we should, and a p-value that is based on permutations. A little look under the hood shows me that all this function really does is run adonis() on each pair (or the single pair in this case). To demonstrate, we get the same answer by doing this:

set.seed(123)
dis.mat <- vegdist(dat2$val, method="euclidean")
res2<- adonis(dis.mat ~ dat2$group, method = "euclidean")
res2$aov.tab

Terms added sequentially (first to last)

           Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)    
dat2$group  1   0.42041 0.42041  25.488 0.59989  0.001 ***
Residuals  17   0.28041 0.01649         0.40011           
Total      18   0.70082                 1.00000     

So all we have really done here (still using the G1 vs G5 example) is

  1. calculated an F statistic for the raw comparison (of the dissimilarity values, which are mathematically equivalent to differences in the raw data, since we used Euclidean distance),
  2. shuffled the raw data and calculated a new F statistic,
  3. repeated step 2 999 times to create an empirical F distribution (generating the null model),
  4. and finally calculated the probability of the observed F value occurring under the null model:

perm <- permustats(res2)
densityplot(perm)

[Density plot of the observed and permuted F statistics]
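
For intuition, a hand-rolled version of steps 1-4 (my own sketch of the procedure described above, not the pairwiseAdonis internals), reusing dat2 from the G1 vs G5 example, would look something like this:

# Manual permutation F test for G1 vs G5 (sketch; assumes dat2 defined above)
set.seed(123)
f_obs <- summary(aov(val ~ group, data = dat2))[[1]]$`F value`[1]

n_perm <- 999
f_perm <- replicate(n_perm, {
  shuffled <- sample(dat2$group)                        # permute the group labels
  summary(aov(dat2$val ~ shuffled))[[1]]$`F value`[1]   # F statistic under the null
})

# permutation p-value: proportion of F values at least as large as the observed one
(sum(f_perm >= f_obs) + 1) / (n_perm + 1)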

So if all we are really doing is comparing F values calculated from the usual ANOVA/linear model, is this really a valid "non-parametric" approach for making pairwise comparisons?


Get this bounty!!!

#StackBounty: #nonparametric #gaussian-process #kernel-trick #spectral-analysis Feature map corresponding to diffusion kernel defined o…

Bounty: 50

It is well known that there exists an explicit feature map representation for every kernel $K$, i.e., for any two points $x_1$ and $x_2$,
\begin{align}
K(x_1, x_2) = \Phi(x_1)^T\Phi(x_2)
\end{align}

where $\Phi(x_1)$ and $\Phi(x_2)$ are the featurized representations of the inputs $x_1$ and $x_2$, respectively.

Diffusion kernels are popular kernels defined on discrete spaces (graphs). Given a graph $G$ with $n$ nodes that represents the input space, the diffusion kernel matrix over the nodes of the graph is given as follows:
\begin{align}
K = U\exp(-\beta\Lambda)U^T
\end{align}

where $U$ and $\Lambda$ are the eigenvector and eigenvalue matrices of the graph Laplacian of $G$, and $\beta$ is a hyperparameter. Is it possible to find the corresponding featurized representation for each node in the graph from this kernel definition?
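
To make the construction concrete, here is a small R sketch for a toy graph (my own, not from any reference), assuming the combinatorial Laplacian $L = D - A$ and an arbitrary $\beta$. The last lines check the factorization $K = \Phi^{T}\Phi$ with $\Phi = \exp(-\beta\Lambda/2)U^{T}$, which follows directly from the eigendecomposition above:

# Diffusion kernel on a 4-node path graph (illustrative sketch)
A <- rbind(c(0, 1, 0, 0),
           c(1, 0, 1, 0),
           c(0, 1, 0, 1),
           c(0, 0, 1, 0))                 # adjacency matrix
L <- diag(rowSums(A)) - A                 # combinatorial graph Laplacian
beta <- 0.5                               # hyperparameter (arbitrary here)

eig    <- eigen(L, symmetric = TRUE)      # L = U Lambda U^T
U      <- eig$vectors
lambda <- eig$values

K <- U %*% diag(exp(-beta * lambda)) %*% t(U)     # K = U exp(-beta Lambda) U^T

# One explicit (n-dimensional) feature map implied by the eigendecomposition:
Phi <- diag(exp(-beta * lambda / 2)) %*% t(U)     # column i = feature vector of node i
max(abs(K - t(Phi) %*% Phi))                      # ~ 0 up to floating-point error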


Get this bounty!!!

#StackBounty: #r #repeated-measures #nonparametric #sample-size #friedman-test Repeated measurements with different measurement methods…

Bounty: 50

I am struggling to find the appropriate statistical test to analyze my data. I hope that my question will be understandable.

I have the following setup:

  • A porcine spine with three vertebral bodies (L1,L2,L3).

  • The spine was scanned on three different imaging modalities (Modality A,B,C)

  • On each of the modalities, different rings of fat were wrapped around the spine resulting in 5 different simulated sizes (size 1 to 5).
  • For each vertebral body of each of the sizes of each modality, I can measure the bone density (BD) as BD.L1, BD.L2, BD.L3

Here are the first 10 rows of the table structure, with some fictional values for the BD:

my.df <- structure(list(Modality = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L), .Label = c("A", "B", "C"), class = "factor"), 
    Size = structure(c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L
    ), .Label = c("1", "2", "3", "4", "5"), class = "factor"), 
    Repeat = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 
    2L), .Label = c("1", "2", "3"), class = "factor"), BD.L1 = c(1.3, 
    1.5, 2.2, 1.2, 1.8, 1.7, 0.7, 2.3, 2.5, 1.3), BD.L2 = c(1.2, 
    1.7, 1.6, 1.6, 1.1, 1.3, 1, 1.3, 1.2, 1.5), BD.L3 = c(1.6, 
    1, 1.8, 1.2, 1, 1.1, 1.6, 1.5, 1.6, 1.8)), row.names = c(NA, 
10L), class = "data.frame")

I would like to answer the following questions:

  1. Are there significant differences in bone density (BD) measurements among the three modalities for each phantom size?
  2. Are there significant differences in bone density (BD) measurements among the sizes within each modality?

Here is the tricky part: for modality A, all sizes were scanned twice (2 repeats), while for modalities B and C, all sizes were scanned thrice (3 repeats).

Because the data points are very few, I thought I would compare the BD measurements for each size not on a per-vertebra basis, but by pooling the BD measurements of all three vertebrae for each modality and size.
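
A sketch of that pooling step (assuming the tidyr package; my.df is the data frame defined above):

library(tidyr)

# Reshape from one column per vertebra (BD.L1, BD.L2, BD.L3) to long format,
# so each Modality x Size cell contributes 3 x (number of repeats) BD values
my.df.long <- pivot_longer(my.df,
                           cols      = c(BD.L1, BD.L2, BD.L3),
                           names_to  = "Level",
                           values_to = "BD")
head(my.df.long)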

Specific questions:

Regarding Analysis 1: I was thinking about using the Friedman test. However, I have unequal sample sizes (2 repeats for modality A vs. 3 repeats for modalities B and C). Which non-parametric test could I use here with unequal sample sizes?

Regarding Analysis 2: Are the different sizes paired? If I add additional fat rings to the spine, is it still considered the same sample or an independent one? If independent, is it correct to use Kruskal-Wallis with a Dunn post-hoc test to make comparisons among the five sizes?

I hope that my question is understandable.

Thank you very much!

Update:

For reproducibility, a dataset representing the full data with fictitious values has been added:

set.seed(23)

df <- data.frame(
  Modality = c(rep("A",30),rep("B",45),rep("C",45)),
  Size = factor(c(rep(rep(1:5,each=2),3),rep(rep(1:5,each=3),6)), levels=c(1,2,3,4,5),ordered=TRUE),
  Repeat = factor(c(rep(1:2,15),rep(rep(1:3,15),2))),
  Level = c(rep(c("L1","L2","L3"),each=10),rep(rep(c("L1","L2","L3"),each=15),2)),
  BD = c(runif(30,1,3),runif(45,2,4),runif(45,3,5))
)


str(df)
'data.frame':   120 obs. of  5 variables:
 $ Modality: Factor w/ 3 levels "A","B","C": 1 1 1 1 1 1 1 1 1 1 ...
 $ Size    : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 1 1 2 2 3 3 4 4 5 5 ...
 $ Repeat  : Factor w/ 3 levels "1","2","3": 1 2 1 2 1 2 1 2 1 2 ...
 $ Level   : Factor w/ 3 levels "L1","L2","L3": 1 1 1 1 1 1 1 1 1 1 ...
 $ BD      : num  2.15 1.45 1.66 2.42 2.64 ...


Get this bounty!!!

#StackBounty: #repeated-measures #nonparametric #sample-size #kruskal-wallis #friedman-test Repeated measurements with different measur…

Bounty: 50

I am struggling to find the appropriate statistical test to analyze my data. I hope that my question will be understandable.

I have the following setup:

  • A porcine spine with three vertebral bodies (L1,L2,L3).

  • The spine was scanned on three different imaging modalities (Modality A,B,C)

  • On each of the modalities, different rings of fat were wrapped around the spine resulting in 5 different simulated sizes (size 1 to 5).
  • For each vertebral body of each of the sizes of each modality, I can measure the bone density (BD) as BD.L1, BD.L2, BD.L3

Here are the first 10 rows of the table structure, with some fictional values for the BD:

my.df <- structure(list(Modality = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L), .Label = c("A", "B", "C"), class = "factor"), 
    Size = structure(c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L
    ), .Label = c("1", "2", "3", "4", "5"), class = "factor"), 
    Repeat = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 
    2L), .Label = c("1", "2", "3"), class = "factor"), BD.L1 = c(1.3, 
    1.5, 2.2, 1.2, 1.8, 1.7, 0.7, 2.3, 2.5, 1.3), BD.L2 = c(1.2, 
    1.7, 1.6, 1.6, 1.1, 1.3, 1, 1.3, 1.2, 1.5), BD.L3 = c(1.6, 
    1, 1.8, 1.2, 1, 1.1, 1.6, 1.5, 1.6, 1.8)), row.names = c(NA, 
10L), class = "data.frame")

I would like to answer the following questions:

  1. Are there significant differences in bone density (BD) measurements among the three modalities for each phantom size?
  2. Are there significant differences in bone density (BD) measurements among the sizes within each modality?

Here is the tricky part: for modality A, all sizes were scanned twice (2 repeats), while for modalities B and C, all sizes were scanned thrice (3 repeats).

Because the data points are very few, I thought I would compare the BD measurements for each size not on a per-vertebra basis, but by pooling the BD measurements of all three vertebrae for each modality and size.

Specific questions:

Regarding Analysis 1: I was thinking about using the Friedman test. However, I have unequal sample sizes (2 repeats for modality A vs. 3 repeats for modalities B and C). Which non-parametric test could I use here with unequal sample sizes?

Regarding Analysis 2: Are the different sizes paired? If I add additional fat rings to the spine, is it still considered the same sample or an independent one? If independent, is it correct to use Kruskal-Wallis with a Dunn post-hoc test to make comparisons among the five sizes?

I hope that my question is understandable.

Thank you very much!


Get this bounty!!!

#StackBounty: #regression #nonparametric #kernel-smoothing #mse #kernel Minimizing MISE to find consistent estimator

Bounty: 50

Consider kernel regression estimation of the mean function $m$ of the process

$$y_t = m(x_t) + \epsilon_t,$$ where the $\epsilon_t$'s are correlated with covariance function $R(s,t) = \exp\{-\lambda|s-t|\}$. In my scenario, $x_t$ is a function of a parameter $\alpha$ and $R(s,t)$ is a function of $\lambda$.

In a situation where $\alpha$ and $\lambda$ are known, we minimize the mean integrated squared error (MISE) to find an appropriate $h$.

My question is:
In my scenario, to find the smoothing parameter $h$ and estimate $\alpha$ and $\lambda$, can I minimize the MISE simultaneously with respect to $h$, $\alpha$, and $\lambda$? Will the minimizing values of $\alpha$ and $\lambda$ be consistent (maybe under some additional conditions)?

Another option is: if I minimize the MISE with respect to $h$ for fixed $\alpha$ and $\lambda$, plug the resulting $h$, as an expression in $\alpha$ and $\lambda$, back into the MISE, and finally minimize it with respect to $\alpha$ and $\lambda$ simultaneously, will the estimators of $\alpha$ and $\lambda$ be consistent (maybe under some additional conditions)?

Or, in another way, if I use an iterative algorithm: fix $\alpha = \alpha_0, \lambda = \lambda_0$ to find $h = h_0$, put the value of $h = h_0$ in the MISE, minimize it with respect to $\alpha, \lambda$ to get $\alpha = \alpha_1, \lambda = \lambda_1$, and so on, how can I prove theoretically that the algorithm converges?
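
Structurally, this is an alternating (coordinate-wise) minimization of the same objective. A bare-bones R sketch of that loop, with a purely hypothetical placeholder mise() so it runs (the real criterion would be the estimated MISE of the kernel regression for given $h$, $\alpha$, $\lambda$):

# Alternating minimization sketch (placeholder objective, not the real MISE)
mise <- function(h, alpha, lambda) {
  (h - 0.2 * alpha)^2 + (lambda - 1)^2 + h^2   # stand-in, for illustration only
}

alpha <- 1; lambda <- 0.5                      # alpha_0, lambda_0
for (k in 1:100) {
  # step 1: minimize over h with (alpha, lambda) fixed
  h <- optimize(function(h) mise(h, alpha, lambda), interval = c(1e-4, 10))$minimum
  # step 2: minimize over (alpha, lambda) with h fixed
  par <- optim(c(alpha, lambda), function(p) mise(h, p[1], p[2]))$par
  if (max(abs(par - c(alpha, lambda))) < 1e-8) break   # stop when the updates stall
  alpha <- par[1]; lambda <- par[2]
}
c(h = h, alpha = alpha, lambda = lambda)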

Any suggestions and/or related articles will be greatly appreciated!


Get this bounty!!!

#StackBounty: #anova #linear-model #nonparametric #kruskal-wallis #ranks Can I use multiple regression on a ranked response variable as…

Bounty: 200

This blog post illustrates the relationship between inference tests on groups (t-test, ANOVA, etc.) and equivalent linear models. It also claims that, for reasonable sample sizes, regression on a ranked response variable approaches the nonparametric versions of these tests. The author links to some simulations.

For example, the author claims that for a non-normal response variable, and N > 11,

lm(rank(y) ~ X1 + X2 + X3 + ...)

would be roughly equivalent to the Kruskal-Wallis test. I was under the impression that KW could only handle two groups.
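
As a quick check of that claim (my own simulation, not from the blog post), the F-test on the rank-transformed response and Kruskal-Wallis give very similar p-values for a single grouping factor with three groups:

# Compare Kruskal-Wallis with an F-test on the rank-transformed response
set.seed(42)
g <- factor(rep(c("A", "B", "C"), each = 20))           # three groups
y <- rexp(60, rate = rep(c(1, 1.5, 2.5), each = 20))    # skewed (non-normal) response
kruskal.test(y ~ g)$p.value
anova(lm(rank(y) ~ g))$`Pr(>F)`[1]                      # should be close to the KW p-value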

This would be fantastic because I am working with highly skewed genomic data but have multiple confounding demographic variables. For example, we are testing the significance of a response variable's relationship to disease state. However, other variables such as Age and Gender not only correlate with disease state, they also independently correlate with the response variable.

Would a journal accept this approach? Are there some references I could back it up with?


Get this bounty!!!