#StackBounty: #r #lme4-nlme #multiple-comparisons #post-hoc #lsmeans Post hoc for a specific variable part of an interaction, linear mi…

Bounty: 50

There are many posts about post hoc testing but I did not find an answer to my question.

My model:

 mod<-lmer(T ~ A*B + C + (1|D), REML=TRUE, data=dat) 

A, B and C are categorical with 2, 4 and 2 levels, respectively.

I want to check the effect of the variable A on T:

lsmeans(mod, pairwise ~ A)

I receive the warning:

NOTE: Results may be misleading due to involvement in interactions *

How can I evaluate the effect of A, and of A only, while taking the interaction into account?

NB: I know I can use lsmeans(mod, pairwise ~ A:B, adjust = "tukey"), but then I obtain the effect of A for each level of B.

*I also carefully checked the documentation of the package lsmeans and there is only one example with interactions. However it turns out that the interaction did not influence the results and so how to include it in the results is not discussed.
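The note is raised because a main-effect comparison for A averages over the levels of B, which can hide the fact that the effect of A differs between those levels. Below is a minimal base-R sketch of that averaging, assuming simulated data and an ordinary `lm()` fit in place of the mixed model. The data, effect sizes and object names are all made up for illustration.

```r
set.seed(42)
# Hypothetical balanced design: A (2 levels), B (4 levels), C (2 levels)
dat <- expand.grid(A = c("a1", "a2"),
                   B = paste0("b", 1:4),
                   C = c("c1", "c2"),
                   r = 1:5)
# Effect of A is 1 in general, but 3 when B == "b4" (an interaction)
dat$y <- rnorm(nrow(dat)) +
  (dat$A == "a2") * 1 +
  (dat$A == "a2" & dat$B == "b4") * 2

fit <- lm(y ~ A * B + C, data = dat)

# Reference grid: one prediction per combination of factor levels
grid <- expand.grid(A = levels(dat$A), B = levels(dat$B), C = levels(dat$C))
grid$pred <- predict(fit, newdata = grid)

# Effect of A within each level of B (averaged over C)
per_B <- tapply(grid$pred, list(grid$B, grid$A), mean)
effect_by_B <- per_B[, "a2"] - per_B[, "a1"]

# "Main effect" of A: equal-weight average over all levels of B and C
marg <- tapply(grid$pred, grid$A, mean)
avg_effect <- as.numeric(marg["a2"] - marg["a1"])

effect_by_B
avg_effect
```

The averaged effect is simply the mean of the per-level effects. Whether that single number is meaningful depends on how much the per-level effects differ.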

Get this bounty!!!

#StackBounty: #r #matlab #quantiles #approximation Quantile approximation using Cornish-Fisher expansion

Bounty: 50

I am trying to approximate a set of quantiles from the estimated mean, variance, skewness and kurtosis of a random variable with unknown distribution. I tried to apply the Cornish-Fisher expansion of level 3; however, I ran into the problem that the estimated quantiles are not monotonic. According to V. Chernozhukov, I. Fernández-Val and A. Galichon (2010) (https://arxiv.org/abs/0708.1627), I should rearrange the expansion to correct for the approximation error and impose monotonicity on the estimated quantiles. However, I have not managed to write code that corrects the issue. Is there any code (MATLAB or R) that could help me with this problem?
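The rearrangement in the cited paper amounts to sorting the approximated quantiles computed on a grid of probability levels. Below is a minimal R sketch, assuming the common form of the Cornish-Fisher expansion in skewness and excess kurtosis. The function name `cornish_fisher` and the moment values are illustrative only, and the expansion order may differ from the one used in the question.

```r
# Cornish-Fisher approximation of quantiles from the first four moments
# (exkurt is the excess kurtosis, i.e. kurtosis minus 3)
cornish_fisher <- function(p, mu, sigma, skew, exkurt) {
  z <- qnorm(p)
  w <- z +
    (z^2 - 1) * skew / 6 +
    (z^3 - 3 * z) * exkurt / 24 -
    (2 * z^3 - 5 * z) * skew^2 / 36
  mu + sigma * w
}

p <- seq(0.01, 0.99, by = 0.01)

# Illustrative moments (strong right skew); the raw approximation
# is not monotone in the lower tail for these values
q_raw <- cornish_fisher(p, mu = 0, sigma = 1, skew = 2, exkurt = 6)

# Rearrangement: sort the approximated quantiles over the grid
q_rearranged <- sort(q_raw)

any(diff(q_raw) < 0)          # is the raw curve non-monotone?
all(diff(q_rearranged) >= 0)  # is the rearranged curve monotone?
```

Sorting keeps the same set of values and only reassigns them to probability levels, so it needs a reasonably fine grid of `p` to work well.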


#StackBounty: #r #vim #cygwin #tmux #vim-r cygwin + vim + tmux + R

Bounty: 50

I am trying to use Cygwin + vim + tmux and R together. I tried Nvim-R and ran into the following problem: I start tmux, then vim with an R file, and type \rf. I get the error message: /tmp/Nvim-R/libpaths Error in file, cannot open connection, calls:sink -> file

If there are other ways than Nvim-R to get Cygwin + vim + tmux + R working, please let me know.


#StackBounty: #r #confidence-interval #random-forest #caret #uncertainty Uncertainty in Binary Classification of New Data (via Random F…

Bounty: 100

We trained a binary classification RF and validated it with a test set of about N=300 entries. Here are the performance statistics:

[Image: performance statistics; not available]

We would now like to classify a production set of a couple of million items. If we simply do that, the output will be a proportion p of TRUE and 1-p of FALSE, where p lies within [0,1].

However, because the classifier has a certain false-positive and false-negative rate, this ratio comes with some uncertainty, right?

For that reason, we would like to quantify this uncertainty, e.g. with a 95% confidence interval, based on our performance measures. E.g. something like: With 95% probability, 20-25% of items in the production set are TRUE, 75-80% are FALSE. These numbers probably don’t even add up; the range for TRUE would be sufficient.

Also, all the performance measures displayed above come with their own confidence intervals. Could these be considered in the uncertainty calculation too?

Bonus-Question: We used the caret package in R. Is there some sort of a function (in caret or in another R package) to do this automatically?
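One way to propagate the test-set error rates into the production estimate is to correct the observed share of TRUE predictions for sensitivity and specificity (the Rogan-Gladen estimator) and to bootstrap the test-set counts. The sketch below is only an illustration: the confusion-matrix counts and the observed share are invented, since the original performance table is not reproduced here, and it assumes the production data have the same error rates as the test set.

```r
set.seed(123)

# Hypothetical test-set confusion matrix (N = 300); replace with real counts
tp <- 90;  fn <- 10   # items that are actually TRUE
tn <- 160; fp <- 40   # items that are actually FALSE
p_obs <- 0.30         # hypothetical share predicted TRUE in production

# Rogan-Gladen correction of an observed share for imperfect classification
corrected_share <- function(p_obs, sens, spec) {
  est <- (p_obs + spec - 1) / (sens + spec - 1)
  pmin(pmax(est, 0), 1)  # clamp to [0, 1]
}

sens <- tp / (tp + fn)
spec <- tn / (tn + fp)
est  <- corrected_share(p_obs, sens, spec)

# Parametric bootstrap of the test-set error rates; with millions of
# production items the sampling error of p_obs itself is ignored here
n_boot <- 5000
sens_b <- rbinom(n_boot, tp + fn, sens) / (tp + fn)
spec_b <- rbinom(n_boot, tn + fp, spec) / (tn + fp)
ci <- quantile(corrected_share(p_obs, sens_b, spec_b), c(0.025, 0.975))

est
ci
```

With a test set of only about 300 items, most of the interval width comes from the uncertainty in sensitivity and specificity rather than from the production data.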


#StackBounty: #r #distributions #maximum-likelihood #fitting #extreme-value Fitting custom distributions by MLE

Bounty: 50

My question relates to fitting custom distributions in R but I feel it has enough of a probability element to remain on CV.

I have an interesting set of data which has the following characteristics:

  • Large mass at zero
  • Sizeable mass below a threshold that fits a right-skewed parametric distribution very well
  • Small amount of mass at extreme values
  • A number of covariates that should drive the variable of interest

I was hoping to model this using a zero-inflated distribution approach, which is widely explored in the literature. Essentially, the density is:

$$f(y)=\begin{cases}\pi & y=0 \\ (1-\pi)\,f_{X}(y) & y>0\end{cases}$$

This is easy enough to fit as is. However, I would like the mixing parameter $\pi$ to be dependent on the covariates $Z$:

$$\text{logit}(\pi)=f(\beta Z)$$

Furthermore, because of the extreme-tail nature of my data, my distribution $f_{X}(y)$ fits best with an extreme-value approach:

$$f_{X}(y)=\begin{cases}f_{A}(y;a,b) & y\leq \mu \\ \text{GPD}(y;\mu,\sigma,\xi) & y>\mu\end{cases}$$

where $\text{GPD}(y;\mu,\sigma,\xi)$ refers to the Generalized Pareto distribution, modelling the excess above a certain threshold $\mu$, and $f_{A}(y;a,b)$ is a given right-skewed distribution with scale and shape parameters $a$ and $b$, respectively.

In addition, I would ideally want the parameters of the above distributions to also depend on covariates:

$$f_{A}(y;a,b,\beta Z)$$
$$\text{GPD}(y;\mu,\sigma,\xi,\beta Z)$$

I realize that the above setup is quite complex but I was wondering if there is a way to derive the MLE estimates of each of the desired parameters by maximizing the likelihood function i.e. to obtain:

$$\hat{\mu}, \hat{\sigma}, \hat{\xi}, \hat{a}, \hat{b}, \hat{\beta}$$

Is there a feasible/ideal way to go about this in R, both for my specific problem and for fitting custom distributions more generally?
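For custom likelihoods in general, a common pattern is to write the negative log-likelihood as an R function and hand it to `optim()`. The sketch below is a deliberately simplified version of the model described above: a zero-inflated gamma with $\text{logit}(\pi)=\beta_0+\beta_1 z$ and no GPD tail. The simulated data, the gamma choice for $f_A$ and all object names are assumptions made for illustration.

```r
set.seed(1)

# Simulated data: zero with probability pi(z), otherwise gamma-distributed
n <- 2000
z <- rnorm(n)                          # a single covariate
pi_true <- plogis(-0.5 + 1.0 * z)      # P(y = 0 | z)
is_zero <- rbinom(n, 1, pi_true) == 1
y <- ifelse(is_zero, 0, rgamma(n, shape = 2, rate = 1))

# Negative log-likelihood; shape and rate are optimised on the log scale
# so that they stay positive
negloglik <- function(par, y, z) {
  pi_i  <- plogis(par[1] + par[2] * z)
  shape <- exp(par[3])
  rate  <- exp(par[4])
  zero  <- y == 0
  ll <- numeric(length(y))
  ll[zero]  <- log(pi_i[zero])
  ll[!zero] <- log1p(-pi_i[!zero]) +
    dgamma(y[!zero], shape = shape, rate = rate, log = TRUE)
  -sum(ll)
}

start <- c(qlogis(mean(y == 0)), 0, 0, 0)
fit <- optim(start, negloglik, y = y, z = z,
             method = "BFGS", hessian = TRUE,
             control = list(maxit = 500))

est <- c(beta0 = fit$par[1], beta1 = fit$par[2],
         shape = exp(fit$par[3]), rate = exp(fit$par[4]))
# Approximate standard errors on the optimisation scale (log scale for
# shape and rate), from the inverse Hessian
se <- sqrt(diag(solve(fit$hessian)))

round(est, 3)
```

Extending this to the spliced gamma/GPD density would mean replacing the `dgamma()` term with a function that switches density at the threshold. The threshold is usually the hardest parameter to estimate this way, because the likelihood is not smooth in it.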


#StackBounty: #r #categorical-data #chi-squared #binomial #contingency-tables Test individual categories in a contingency table for sig…

Bounty: 50

I have asked people which food they prefer:

group apple orange pizza beer
    A   374     63   216  101
    B   510     65   125   76

Apparently group B prefers fruit and group A prefers pizza and beer, and a chi-square test shows that the overall differences between groups are significant. But how can I test for which individual choice there is a significant difference between groups?

For example, I want to know whether there is a significant difference in the preference for oranges. But I cannot, I believe, just subset the orange choices, because that way I wouldn’t consider the total number of participants per group. I mean, a difference between 1 from A and 2 from B will be significant if I have only sampled three people, but not if those are three in a million.

Participants were asked to choose one from the four foods. They could not select multiple answers.

How can I test this?

My hunch would be to either add up the non-orange answers and test the resulting 2×2 table with a chi-square test:

group orange not orange
    A     63        691
    B     65        711

orange <- matrix(c(63, 691, 65, 711), 2, 2, TRUE,
                 list(group = c("A", "B"), choice = c("orange", "not orange")))

chisq.test(orange, correct = FALSE)
# p = .9883

or to calculate the percentage of orange answers in each group, consider the two numbers as counts in a binomial distribution and test that with a binomial test:

a <- 63 / (63 + 691)
b <- 65 / (65 + 711)
all <- 63 + 691 + 65 + 711

binom.test(c(round(a * all / (a + b)), round(b * all / (a + b))))
# p = .9796

# just checkin'
all == sum(c(round(a * all / (a + b)), round(b * all / (a + b))))
[1] TRUE

Or is there a better, maybe more common way?
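One frequently used approach is to test each food against all others combined (as in the 2×2 table above) and then adjust the p-values for the four comparisons. The standardized residuals from the overall chi-square test give a complementary view. Below is a sketch using the counts from the question. The Holm adjustment is one choice among several.

```r
tab <- matrix(c(374, 63, 216, 101,
                510, 65, 125,  76),
              nrow = 2, byrow = TRUE,
              dimnames = list(group  = c("A", "B"),
                              choice = c("apple", "orange", "pizza", "beer")))

# Overall test; standardized residuals show which cells deviate most
overall <- chisq.test(tab)
round(overall$stdres, 2)

# One two-sample proportion test per food (food vs. everything else),
# which is equivalent to the uncorrected chi-square test on the 2x2 table
n_group <- rowSums(tab)
p_raw <- sapply(colnames(tab), function(food) {
  prop.test(tab[, food], n_group, correct = FALSE)$p.value
})

# Adjust for having run four tests
p_adj <- p.adjust(p_raw, method = "holm")
round(cbind(p_raw, p_adj), 4)
```

For oranges this reproduces the uncorrected chi-square result from the 2×2 table.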

Sample data

food <- c("apple", "orange", "pizza", "beer")
dat <- data.frame(
                  group  = rep(c("A", "B"), c(754, 776)),
                  choice = c(
                             rep(food, c(374, 63, 216, 101)),
                             rep(food, c(510, 65, 125, 76))
                            )
                 )
tab <- table(dat)

Explanation of second procedure

We want to compare the orange answers between groups. But if we only look at the orange answers themselves, we disregard the fact that other answers could be given. So instead of comparing the absolute numbers of orange answers, what we do is weigh the absolute number of orange answers by their proportion within all the answers in each group. Or in other words, we test if there is a significant difference between the percentages of orange answers in both groups.

Given this contingency table:

group orange not orange
    A     63        691
    B     65        711

for group A, the percentage of orange answers is:

a <- 63 / (63 + 691)  # 0.08355438 * 100 = 8.36%

and for group B it is:

b <- 65 / (65 + 711)  # 0.08376289 * 100 = 8.38%

We can already tell that the difference in percentages is minimal, but this is only an example, so let’s continue.

To compare the percentages, we are going to consider them as two categories (A and B) in a binomial distribution. For a binomial test, we need a vector of the same length as the overall number of answers. The overall number of answers in my study is:

all <- 63 + 691 + 65 + 711

To calculate the proportion of the binomial distribution that corresponds to the percentages of orange answers in each group, we simply “scale” (i.e. multiply by the same factor) both percentages so that they add up to 100% (of all observations); that is, we solve the equation:

a * x + b * x = all

The solution, of course, is:

x = all / (a + b)

Now we can calculate the number of observations for each category:

# for A:
a * all / (a + b)

# for B:
b * all / (a + b)

Finally we round the possibly fractional numbers to integers and perform the binomial test:

binom.test(c(round(a * all / (a + b)), round(b * all / (a + b))))

which returns:

number of successes = 764, number of trials = 1530, p-value = 0.9796
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.4739876 0.5247077
sample estimates:
probability of success 


#StackBounty: #r #normality Applying the Shapiro Wilk test to a large number of numeric variables to test for normality

Bounty: 50

I am working with data from a cellular network and I have 5000+ variables (cells in the network) and for each of these variables I have measurements of noise levels. Each variable has 175 samples and the data is continuous in nature but has been sampled every 15 mins. This is my dataset, 175 x 5000+ values all measuring the same metric, noise levels in dBm.

Since this data is timeseries data, I was looking at the Hurst exponent to explain the long term memory for each of the variables (5000+). I then plotted the distribution of the H values and the graph in green is the outcome. As you can see below the distribution of H values is approximately normal.

It has been suggested to me in conversation that since the Hurst exponent is approximately normally distributed then the underlying data is also normally distributed. But I know from working with the data that for each variable the distribution is right skewed rather than normally distributed. I would like to either prove or disprove this statement and I am looking for a procedure to test this.

I came across a paper in which the authors randomly generated a couple of hundred time series, and the distribution of the Hurst exponents of those series was also approximately normal. So I want to disprove the belief that the underlying data must therefore be normal as well.

I hope what I am trying to do is a bit clearer.
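Running the test on every variable is mechanically simple: apply `shapiro.test()` column by column and adjust the p-values for the number of tests. The sketch below uses simulated right-skewed data (log-normal, 175 rows by 50 columns) as a stand-in for the real measurements. It ignores serial correlation, whereas the Shapiro-Wilk test assumes independent observations, so results on real time series should be read with caution.

```r
set.seed(2024)

n_samples <- 175
n_cells   <- 50   # stand-in for the 5000+ cells
noise <- matrix(rlnorm(n_samples * n_cells), nrow = n_samples)
colnames(noise) <- paste0("cell", seq_len(n_cells))

# One Shapiro-Wilk test per column (per cell)
p_values <- apply(noise, 2, function(x) shapiro.test(x)$p.value)

# Control the false discovery rate across all cells
p_adjusted <- p.adjust(p_values, method = "BH")

# Share of cells for which normality is rejected at the 5% level
share_rejected <- mean(p_adjusted < 0.05)
share_rejected
```

A histogram of the per-cell skewness alongside the histogram of H values would make the same point visually.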

[Three figures, including the distribution of H values; images not available]


#StackBounty: #r #time-series #correlation #wavelet #rollapply Wavelet correlation using rolling window in R

Bounty: 200

I have 3 time series to which I can apply the wavelet transform using a rolling window. The rolling window takes a single time series of length 200 and applies the waveslim::modwt function to its first 30 samples. This outputs a list of 5 components, of which I am only interested in four (d1, d2, d3, d4), each of length 30. A simple example can be found here:

J <- 4 #no. of levels in decomposition
ar1.modwt <- modwt(ar1, "la8", J)

@G. Grothendieck has kindly provided a neat piece of code for the rolling window approach for a single time series here.

The rolling window then advances by 1 and the procedure repeats, producing another list of 5 components of which I only care about d1 to d4, and so on until the window has rolled over the full length of the time series.

The next step is to apply the waveslim::brick.wall function to the output of the rolling window lists. The brick.wall function looks at the output of modwt for the first window across the 4 levels and replaces some of the values with NAs.

I believe I have covered this by modifying @G. Grothendieck's answer as follows (I hope I am right):

library(zoo)       # rollapplyr
library(waveslim)  # modwt, brick.wall
# brick.wall() needs the wavelet filter name, so wf is passed to it explicitly
modwt2 <- function(x, wf, ...) unlist(head(brick.wall(modwt(x, wf = wf, ...), wf), 4))
rollr <- rollapplyr(ar1, 30, FUN = modwt2, wf = "la8", n.levels = 4, boundary = "periodic")
L <- lapply(1:nrow(rollr), function(i) matrix(rollr[i,], , 4))

The final piece is to construct correlation matrices for the outputs of the brick.wall function which is L above over the 4 levels of interest.

There is a function called waveslim::wave.correlation which takes two brick.wall outputs X and Y and computes the wave.correlation over the various levels.

returns <- diff(log(as.matrix(exchange)))
returns <- ts(returns, start=1970, freq=12)
wf <- "la8"
J <- 4
demusd.modwt <- modwt(returns[,"DEM.USD"], wf, J)
demusd.modwt.bw <- brick.wall(demusd.modwt, wf)
jpyusd.modwt <- modwt(returns[,"JPY.USD"], wf, J)
jpyusd.modwt.bw <- brick.wall(jpyusd.modwt, wf)
returns.modwt.cor <- wave.correlation(demusd.modwt.bw, jpyusd.modwt.bw,
                                      N = dim(returns)[1])

I wish to expand on this and compute the full correlation matrix for my 3 time series. Please note that the exchange-rate example above uses the full length of the time series rather than the rolling window I would like to use, and it produces only a single value for the correlation between two time series. It does not construct the full correlation matrix, which I need because I am interested in the eigenvalues of these correlation matrices over time.

I hope this makes sense because it has caused me a lot of heartache over the last few weeks and days.

So in summary:

  1. Take 3 time series
  2. Apply modwt function using rolling window
  3. Apply brick.wall function to each output of the rolling window in 2 above
  4. Create full 3×3 correlation matrix for the 4 levels using outputs of 3 above over time

    Many thanks.
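Step 4 can be sketched independently of the wavelet step. Assume that, for one decomposition level, each window yields one coefficient vector per series (with NAs where brick.wall removed boundary coefficients). The 3×3 correlation matrix and its eigenvalues per window could then be computed as below. The simulated series and the placeholder `level_coefs()` (which simply returns the raw window) are hypothetical stand-ins for the modwt/brick.wall output, and the sketch uses plain `cor()`, not `waveslim::wave.correlation`.

```r
set.seed(7)
n <- 200
width <- 30

# Three simulated series; s1 and s2 share a common component
common <- rnorm(n)
series <- cbind(s1 = common + rnorm(n),
                s2 = common + rnorm(n),
                s3 = rnorm(n))

# Placeholder: in the real pipeline this would return the brick.wall-ed
# MODWT coefficients of one level for the window x
level_coefs <- function(x) x

starts <- seq_len(n - width + 1)
eigs <- t(sapply(starts, function(s) {
  win   <- series[s:(s + width - 1), , drop = FALSE]
  coefs <- apply(win, 2, level_coefs)
  # pairwise.complete.obs skips the NAs that brick.wall would introduce
  cm <- cor(coefs, use = "pairwise.complete.obs")
  eigen(cm, symmetric = TRUE, only.values = TRUE)$values
}))

dim(eigs)   # one row of three eigenvalues per window
head(eigs)
```

Repeating this for each of the four levels gives four eigenvalue series. If the window is short relative to the filter length, few or no non-boundary coefficients remain at the coarser levels, which is worth checking before interpreting the results.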


#StackBounty: #r #regression #circular-statistics Interpreting circular-linear regression coefficient

Bounty: 50

I’m trying to use the circular package in R to perform regression of a circular response variable and linear predictor, and I do not understand the coefficient value I’m getting. I’ve spent considerable time searching in vain for an explanation that I can understand, so I’m hoping somebody here may be able to help.

Here’s an example:


# simulate data
x <- 1:100
y <- circular(seq(0, pi, pi/99) + rnorm(100, 0, .1))

# fit model
m <- lm.circular(y, x, type="c-l", init=0)

> coef(m)
[1] 0.02234385

I don’t understand this coefficient of 0.02 — I would expect the slope of the regression line to be very close to pi/100, as it is in garden variety linear regression:

> coef(lm(y~x))[2]

Does the circular regression coefficient not represent the change in response angle per unit change in the predictor variable? Perhaps the coefficient needs to be transformed via some link function to be interpretable in radians? Or am I thinking about this all wrong? Thanks for any help you can offer.
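The documentation of the circular package indicates that `type = "c-l"` follows Fisher and Lee (1992), with mean direction mu + g(beta * x) and link g(u) = 2 * atan(u). If that is the case, the coefficient is not a slope in radians per unit of x. Below is a small base-R sketch of that link, using the coefficient reported above. The link form is an assumption to check against `?lm.circular`, and mu is taken as 0 purely for illustration.

```r
beta <- 0.02234385                # coefficient reported by coef(m)
link <- function(u) 2 * atan(u)   # assumed link function

x <- 1:100

# Implied change in mean direction (radians) relative to x = 0
fitted_change <- link(beta * x) - link(0)

# Local slope near x = 0 is about 2 * beta, not beta
slope_at_zero <- (link(beta * 1e-6) - link(0)) / 1e-6
slope_at_zero

# The implied change per unit of x is not constant along x
range(diff(fitted_change))
```

Under this link the response angle changes nonlinearly in x, so the coefficient has no single "radians per unit" interpretation.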


#StackBounty: #r #machine-learning #classification #caret comparing caret models with mean or median?

Bounty: 50

I am using caret to evaluate the classification performance of several models on a small dataset (190 obs) with two classes and just a handful of features.

When I inspect the train() object for one of the models, I get what looks to be the mean metric values (ROC, Sens, and Spec).

Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 171, 171, 171, 171, 171, 171, ... 
Resampling results across tuning parameters:

  nIter  method         ROC        Sens       Spec
   50    Adaboost.M1    0.8866667  0.9866667  0.58
   50    Real adaboost  0.5566667  0.9844444  0.50
  100    Adaboost.M1    0.8844444  0.9877778  0.58
  100    Real adaboost  0.5738889  0.9833333  0.52
  150    Adaboost.M1    0.8800000  0.9877778  0.60
  150    Real adaboost  0.5994444  0.9833333  0.52

When I use the resamples() function and put all of the models in a list, I get the means again, but also the median values. (Other model results omitted for clarity.)

Number of resamples: 50 

ROC
            Min. 1st Qu. Median   Mean 3rd Qu. Max. NA's
ADABOOST 0.25000  0.8958 0.9444 0.8867       1    1    0

Sens
           Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
ADABOOST 0.8889  1.0000 1.0000 0.9867  1.0000 1.0000    0

Spec
         Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
ADABOOST    0       0      1 0.58       1    1    0

The bwplot() function appears to display the median values as the point estimates.

[Figure: bwplot() of the resampled metrics; image not available]

It seems to me like the train() output wants me to evaluate the models based on the means. bwplot() focuses on the median. My first thought was that the median would be a better metric with such spread.

Which would you use, and why?
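The gap between the two summaries is easy to reproduce. The vector below is hypothetical: 50 resampled specificities constructed only to match the summary shown above (minimum 0, first quartile 0, median 1, mean 0.58). With real caret output, the per-resample values are stored in the `resample` element of the train object.

```r
# Hypothetical per-resample specificities (not the actual values)
spec <- c(rep(0, 18), rep(0.5, 6), rep(1, 26))

spec_mean   <- mean(spec)
spec_median <- median(spec)

summary(spec)
table(spec)   # only three distinct values occur

c(mean = spec_mean, median = spec_median, sd = sd(spec))
```

When a metric takes only a few distinct values per fold, as it does here, the median can jump between extremes while the mean changes gradually. That is one argument for looking at the full distribution rather than at either summary alone.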
