#StackBounty: #r #shiny #reactive-programming #dt Dynamically change column values based on selectinput from another column in R shiny …

Bounty: 50

I am trying to create a table (with DT; please don't suggest rhandsontable) which has a few existing columns, one selectInput column (where each row has options to choose from) and, finally, another column that is populated based on what the user selects from the selectInput dropdown in each row.

In my example here, the 'Feedback' column is the user dropdown selection column. I am not able to update the 'Score' column, which should be based on the selection from the 'Feedback' column dropdown.

  library(shiny)
  library(DT)

  tbl1 <- data.frame(A = c(1:10), B = LETTERS[1:10], C = c(11:20), D = LETTERS[1:10])

  ui <- fluidPage(
    DT::dataTableOutput(outputId = 'my_table')
  )

  server <- function(input, output, session) {
    rv <- reactiveValues(tbl = tbl1)

    observe({
      for (i in 1:nrow(rv$tbl)) {
        rv$tbl$Feedback[i] <- as.character(selectInput(paste0("sel", i), "",
                                                       choices = c(1, 2, 3, 4)))
        if (!is.null(input[[paste0("sel", i)]])) {
          if (input[[paste0("sel", i)]] == 1) {
            rv$tbl$Score[i] <- 10
          } else if (input[[paste0("sel", i)]] == 2) {
            rv$tbl$Score[i] <- 20
          } else if (input[[paste0("sel", i)]] == 3) {
            rv$tbl$Score[i] <- 25
          } else if (input[[paste0("sel", i)]] == 4) {
            rv$tbl$Score[i] <- 30
          }
        }
      }
    })

    output$my_table <- DT::renderDataTable({
      DT::datatable(rv$tbl, escape = FALSE, selection = 'none', rownames = FALSE,
                    options = list(paging = FALSE, ordering = FALSE, scrollX = TRUE, dom = "t",
                                   preDrawCallback = JS('function() { Shiny.unbindAll(this.api().table().node()); }'),
                                   drawCallback = JS('function() { Shiny.bindAll(this.api().table().node()); }')))
    }, server = FALSE)
  }

  shinyApp(ui = ui, server = server)
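As an aside, the chain of if/else branches that maps a selection to a score can be collapsed into a named lookup vector, which is easier to extend; a minimal sketch (the choice-to-score mapping is taken from the question):

```r
# named lookup vector: choice (as a string) -> score
score_map <- c("1" = 10, "2" = 20, "3" = 25, "4" = 30)

# inside the observer, the whole if/else chain then becomes one line:
# rv$tbl$Score[i] <- score_map[[input[[paste0("sel", i)]]]]
score_map[["3"]]  # 25
```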

Get this bounty!!!

#StackBounty: #r #mixed-model #anova #repeated-measures Running a simple anova test in R (repeated measures)

Bounty: 50

We are studying 3 different proteins, each under 9 different conditions at 3 different timepoints (Day 1,2,3). For these we have 3 biological replicates. So we have 81 different experiments (3 proteins * 9 conditions * 3 replicates) ‒ and for each experiment we have data at three different timepoint readings on consecutive days. This gives us 243 observations in a balanced design.
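The stated design size can be sanity-checked in R by enumerating the factor combinations:

```r
# one row per experiment: 3 proteins x 9 conditions x 3 replicates
experiments <- expand.grid(protein   = 1:3,
                           condition = 1:9,
                           replicate = 1:3)
nrow(experiments)      # 81 experiments
nrow(experiments) * 3  # 243 observations (3 timepoints each)
```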

We would like to show which of these proteins and conditions are statistically different from each other. We would like a comparison between proteins, and a comparison of the conditions within each protein. For this we were thinking of using a repeated-measures ANOVA (in R).

I replicated a MWE of the dataset and example here:


library(RCurl)  # for getURL()

raw.data <- getURL("https://gist.githubusercontent.com/jp-um/1849ac4ac61411d0751cdbec4406e0cd/raw/4b014f986085665e75806c38a25f39093b2d19df/anon.csv")
exp.data <- read.csv(text = raw.data, colClasses = c("experiment" = "factor"))
summary(exp.data, maxsum = 10)

aov.model <- aov(density ~ protein*condition*day + Error(experiment/day), data = exp.data)

The output is:

Error: experiment
                  Df   Sum Sq Mean Sq F value Pr(>F)    
protein            2 10729989 5364994  1166.3 <2e-16 ***
condition          8 16430568 2053821   446.5 <2e-16 ***
protein:condition 16 29649758 1853110   402.8 <2e-16 ***
Residuals         54   248404    4600                   
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Error: experiment:day
                       Df  Sum Sq Mean Sq F value  Pr(>F)    
day                     2   92976   46488   13.24 7.2e-06 ***
protein:day             4 1776592  444148  126.49 < 2e-16 ***
condition:day          16 3419459  213716   60.87 < 2e-16 ***
protein:condition:day  32 7415908  231747   66.00 < 2e-16 ***
Residuals             108  379221    3511                    
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

I have a few questions, please:

  1. Is repeated-measures ANOVA the way to go here (as opposed to mixed models)?
  2. Is this the correct way to specify the formula? Will this give me repeated measures over the timepoints (Day)? Specifically, what does Error(experiment/day) mean? My interpretation is that we have a random effect based on the biological replicate (experiment in my case) and repeated readings over the timepoints (Day).
  3. What is the ezANOVA-equivalent way to write this?
  4. Is the above ANOVA equivalent to the linear mixed-effects model lme(density ~ protein*condition*day, random = ~1 | experiment/day, data = exp.data)?
  5. The output tells me that there are differences, but I would like to know which combinations give the differences. I know I can use a post-hoc test for this, and I found that TukeyHSD does not work on repeated measures. I have found I can use glht for this, but I am unable to interpret its output. (I tried glht(lme.model, linfct = mcp(protein = "Tukey", condition = "Tukey")) but I am not sure this is correct.)

Apologies for the (many, and rather basic I am afraid) questions. I would appreciate your time and help.

Many thanks,

Get this bounty!!!

#StackBounty: #r #function #integer #numeric #interactive Benefits of using of integer values for constants rather than numeric values …

Bounty: 50

In R source code, most (but not all) functions use integer values for constants:

colnames <- function(x, do.NULL = TRUE, prefix = "col")
{
    if(is.data.frame(x) && do.NULL)
        names(x)
    else {
        dn <- dimnames(x)
        if(!is.null(dn[[2L]]))
            dn[[2L]]
        else {
            nc <- NCOL(x)
            if(do.NULL) NULL
            else if(nc > 0L) paste0(prefix, seq_len(nc))
            else character()
        }
    }
}

R Language Definition says:

In most cases, the difference between an integer and a numeric value will be unimportant as R will do the right thing when using the numbers. There are, however, times when we would like to explicitly create an integer value for a constant.

  • What are these cases where there is a need to force integer values for constants instead of simply using numeric values? Examples where e.g., 1 would fail but e.g., 1L would not are welcome.
  • Conversely, in which cases using integer values is not necessary (e.g., interactive use vs programming, indexing with constants, etc.)?

The questions are about good practice and the rationale, not about e.g. the "L" notation itself, the difference between integer class and numeric class, or comparing numbers.
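A couple of concrete cases where the distinction is visible: `identical()` distinguishes the two types even when the values compare equal, and integer arithmetic can overflow to NA where the same computation in doubles does not:

```r
# 1 is a double, 1L is an integer: equal in value, not identical in type
identical(1, 1L)  # FALSE
1 == 1L           # TRUE

# integer arithmetic overflows past .Machine$integer.max (with a warning);
# the same sum computed in doubles is fine here
big <- .Machine$integer.max
is.na(big + 1L)   # TRUE (integer overflow produces NA)
big + 1           # 2147483648, computed in double precision
```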

Get this bounty!!!

#StackBounty: #r #mathematical-statistics #variance #sampling #mean Right way to compute mean and variance

Bounty: 50

1. If I take as definition that $a_{\ell m}$ follows a normal distribution with mean equal to zero and $C_\ell = \langle a_{\ell m}^2 \rangle = \operatorname{Var}(a_{\ell m})$, and taking the following random variable $Z$ defined by this expression:

$$Z = \sum_{\ell=\ell_{min}}^{\ell_{max}} \sum_{m=-\ell}^{\ell} a_{\ell m}^{2}$$

Then, the goal is to compute $\langle Z\rangle$:

If I consider the random variable $Y=\sum_{m=-\ell}^{\ell} C_\ell \left(\dfrac{a_{\ell m}}{\sqrt{C_\ell}}\right)^{2}$, this random variable $Y$ follows a $\chi^2(1)$ distribution weighted by $C_\ell$.

  1. Can I write from this that the mean of $Y$ is equal to:

$$\langle Y\rangle = \left\langle\sum_{m=-\ell}^{\ell} a_{\ell m}^{2}\right\rangle = (2\ell+1)\,C_\ell$$


and so, we would have:

$$\langle Z\rangle = \sum_{\ell=\ell_{min}}^{\ell_{max}}\,C_\ell\,(2\ell+1)$$

?? I have serious doubts, since the $a_{\ell m}$ don't follow a standard normal distribution $\mathcal{N}(0,1)$.

Shouldn’t it rather be:

$$\begin{aligned}
Z &\equiv \sum_{\ell=\ell_{min}}^{\ell_{max}} \sum_{m=-\ell}^{\ell} a_{\ell m}^2 \\[6pt]
&= \sum_{\ell=\ell_{min}}^{\ell_{max}} \sum_{m=-\ell}^{\ell} C_\ell \cdot \left(\frac{a_{\ell m}}{\sqrt{C_\ell}}\right)^2 \\[6pt]
&\sim \sum_{\ell=\ell_{min}}^{\ell_{max}} \sum_{m=-\ell}^{\ell} C_\ell \cdot \chi^2(1) \\[6pt]
&= \sum_{\ell=\ell_{min}}^{\ell_{max}} C_\ell \sum_{m=-\ell}^{\ell} \chi^2(1) \\[6pt]
&= \sum_{\ell=\ell_{min}}^{\ell_{max}} C_\ell \cdot \chi^2(2\ell+1).
\end{aligned}$$

  2. Now, I want to calculate the mean $\langle Z\rangle$ of $Z$:

Do you agree that my case here is the computation of the mean of a weighted sum of $\chi^2$ variables?

So the computation is not trivial, is it? Maybe I could compute the mean by starting from the analytical expression:

$$\langle Z\rangle=\sum_{\ell=\ell_{min}}^{\ell_{max}} C_\ell\,(2\ell + 1)\quad(1)$$

and directly doing the numerical computation:

$$\langle Z\rangle=\sum_{i=1}^{N} C_{\ell_{i}}\,(2\ell_{i} + 1)\quad(2)$$

  3. What do you think about this direct computation, is it correct?

I confuse $(1)$ and $(2)$ above because each $C_\ell$ corresponds to one $\ell$ (I mean from a numerical point of view: each $C_{\ell_{i}}$ value is associated with an $\ell_{i}$ value).

  4. If the direct computation $\langle Z\rangle=\sum_{i=1}^{N} C_{\ell_{i}}\,(2\ell_{i} + 1)$ is not correct, then I have to treat the random variable $Z$ as a weighted sum of different chi-squared distributions:

I have tried this with the following R script, where nRed is one of the 5 bins considered and nRow is the number of values for $\ell$ (from $\ell_{min}$ to $\ell_{max}$); Cl_sp[,i] is the vector of nRow values of $C_\ell$ for each bin $i$ taken into account.

   # Number of bins
   nRed <- 5
   # Number of rows
   nRow <- 36
   # Size of sample
   nSample_var <- 1000
   # nRow values of degrees of freedom 2*l + 1 from the multipoles l
   L <- 2*(array_2D[,1]) + 1
   # Accumulate the weighted sum of chi-squared draws for each bin
   y3 <- matrix(0, nrow = nSample_var, ncol = nRed)
   for (i in 1:nRed) {
     for (j in 1:nRow) {
       y3[,i] <- y3[,i] + Cl_sp[j,i] * rchisq(nSample_var, df = L[j])
     }
   }
   # Print the mean of Z for each bin
   for (i in 1:nRed) {
     print(paste0('red=', i, ' mean_exp = ', mean(y3[,i])))
   }
  5. Is this the right thing to implement to compute the mean of $Z$, if I can't compute it analytically (see expression $(2)$ above)?

I would also like to compute the variance of $Z$; maybe a simple addition to my R script like:

# Print the variance of Z for each bin
for (i in 1:nRed) {
  print(paste0('red=', i, ' var_exp = ', var(y3[,i])))
}

should be enough. What do you think about this?
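As a numerical cross-check of expression $(1)$/$(2)$: since $\langle C_\ell\,\chi^2(2\ell+1)\rangle = C_\ell\,(2\ell+1)$, the simulated mean of a weighted sum of chi-squared draws should land on the analytical sum. The $C_\ell$ and $\ell$ values below are made up purely for illustration:

```r
set.seed(42)
Cl <- c(0.5, 1.2, 2.0)  # hypothetical C_l values
l  <- c(2, 3, 4)        # hypothetical multipoles
n  <- 1e5               # number of Monte Carlo draws

# draw Z = sum_l C_l * ChiSq(2l + 1)
Z <- rowSums(sapply(seq_along(l),
                    function(i) Cl[i] * rchisq(n, df = 2 * l[i] + 1)))

analytic <- sum(Cl * (2 * l + 1))  # 0.5*5 + 1.2*7 + 2.0*9 = 28.9
c(simulated = mean(Z), analytic = analytic)
```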

Get this bounty!!!

#StackBounty: #r #anova #lme4-nlme #ancova #lsmeans Different ways to include pre-test performance as a covariate in a linear mixed-eff…

Bounty: 50

I have a pre-post experimental design, where I have measured participants’ performance in three courses (tasks) at both pre and post-test. The participants were quasi-randomly assigned to two groups. The data structure looks like this:

subjectID  Course  Group  pre-test.score  post-test.score
1          A       ii     ###             #
1          B       ii     ###             #
1          C       ii     ###             #
2          A       b      ###             #
2          B       b      ###             #
2          C       b      ###             #

I have analysed these data using a linear mixed-effect regression model where I predict post-test performance, controlling for pre-test performance with the + sign:

# I fit these models with lmer in R
CI_post <- lmer(
  post.diff ~
    pre.diff +
    group * course +
    (1 | subjectID),
  data = dat
)
Using Satterthwaite’s method from the emmeans package I get:

CI_post_interaction_coursegroup <- emmeans(CI_post, specs = c("course", "group"),lmer.df = "satterthwaite")

course group       emmean    SE   df lower.CL upper.CL
 A      blocked      0.311 0.191 6.65  -0.1452    0.768
 B      blocked      0.649 0.180 5.38   0.1954    1.102
 C      blocked      1.141 0.195 7.28   0.6847    1.598
 A      interleaved  0.189 0.194 7.15  -0.2666    0.645
 B      interleaved  0.497 0.179 5.31   0.0451    0.949
 C      interleaved  1.046 0.191 6.72   0.5907    1.502

But perhaps I could also fit the same model with pre-test added as an interaction term, so that the model becomes pre.test * course * group:

CI_post <- lmer(
  post.diff ~
    pre.diff *
    group * course +
    (1 | subjectID),
  data = dat
)

which gives very different estimates:

 course group        emmean    SE    df lower.CL upper.CL
 A      blocked     -0.0669 0.188 11.10   -0.481    0.347
 B      blocked      0.6466 0.161  6.09    0.255    1.038
 C      blocked      1.1980 0.194 12.65    0.778    1.618
 A      interleaved -0.1520 0.211 16.76   -0.597    0.293
 B      interleaved  0.4872 0.160  6.12    0.098    0.876
 C      interleaved  1.0593 0.181  9.82    0.654    1.464

I am trying to understand the exact differences between these two models, and which of them is "correct". Long (see comments below) gave a helpful comment that "if you want the group:course interaction to vary depending on the value of pre.diff then you fit the 2nd model with the 3-way interaction". But since my pre-test score is a continuous variable that measured each participant's performance on course A, course B and course C, will the model use this data structure when I add an interaction term between the covariate and the factors in the model?
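One reason the adjusted means differ can be seen with plain lm on simulated data (a sketch, not the asker's lmer model; all values below are made up): with the covariate entered additively, every group shares a single covariate slope, whereas with the interaction each group gets its own slope, so predictions evaluated at the mean of the covariate can shift.

```r
set.seed(1)
n <- 100
pre   <- rnorm(n)
group <- factor(rep(c("blocked", "interleaved"), each = n / 2))
post  <- 0.5 * pre + (group == "interleaved") * 0.3 + rnorm(n)

m_add <- lm(post ~ pre + group)  # one common slope for the covariate
m_int <- lm(post ~ pre * group)  # group-specific covariate slopes

# adjusted means: predictions at the overall mean of the covariate
newd <- data.frame(pre = mean(pre), group = levels(group))
adj_add <- predict(m_add, newd)
adj_int <- predict(m_int, newd)
```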

Get this bounty!!!

#StackBounty: #r #ordered-logit Need help understanding ordinal logistic regression output, R

Bounty: 50

I THINK this will boil down to a simple yes/no answer, but allow me to walk you through my thought process.

I have data like this:

    df<-structure(list(Survey = structure(c(1L, 1L, 3L, 3L, 2L, 2L, 1L, 
5L, 1L, 2L, NA, 1L, 4L, 4L, 3L, 2L, 2L, 5L, 1L, 4L, 1L, 1L, 1L, 
1L, 1L, 4L, 3L, 4L), .Label = c("Never", "Rarely", "Sometimes", 
"Often", "Always"), class = "factor"), Score = c(2.27272727272727, 
18.1818181818182, 29.5454545454545, 22.7272727272727, 43.1818181818182, 
25, 75, 61.3636363636364, 29.5454545454545, 6.81818181818182, 
54.5454545454545, 36.3636363636364, 36.3636363636364, 70.4545454545455, 
22.7272727272727, 45.4545454545455, 31.8181818181818, 61.3636363636364, 
15.9090909090909, 47.5, 11.3636363636364, 52.5, 22.7272727272727, 
22.7272727272727, 4.54545454545455, 37.5, 35, 65.9090909090909
)), row.names = c(NA, -28L), groups = structure(list(.rows = structure(list(
    1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 
    15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 
    27L, 28L), ptype = integer(0), class = c("vctrs_list_of", 
"vctrs_vctr", "list"))), row.names = c(NA, -28L), class = c("tbl_df", 
"tbl", "data.frame")), class = c("rowwise_df", "tbl_df", "tbl", 

This consists of a survey question that we gave participants, with answers on a Likert-type scale of never, rarely, sometimes, often, and always. Those participants also had a "score" from a previous instrument (a continuous variable). What we're trying to examine is whether the continuous "score" predicted the survey response.

Given my survey results are ordinal, and given my score variable is continuous, I figured that an ordinal logistic regression like shown in these examples was my best bet.

So I ran this code:

library(MASS)  # polr()

m <- polr(Survey ~ Score, data = df, Hess = TRUE)
(ctable <- coef(summary(m)))
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
(ctable <- cbind(ctable, "p value" = p))

Like in that UCLA example and I received this output:

(screenshots of the polr output ‒ coefficient table, p values, odds ratios and confidence intervals ‒ omitted)

When sharing this output with my colleagues, I want to make sure I am interpreting correctly (and it was the right test).

So two part question:

-Given I am trying to see whether "score" predicted the "Survey" response, was this the correct statistical test to choose?
-If it was the correct test, I interpret the results as: "yes, there is a significant relationship (basing that on the confidence interval not crossing zero), and for a one-unit increase in 'score' the odds of answering higher on the survey multiply by 1.065". Is that correct?
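On the 1.065 figure: in a proportional-odds model, the odds ratio comes from exponentiating the coefficient, so a multiplier of 1.065 corresponds to a slope of about 0.063 on the log-odds scale (the 0.063 here is back-calculated for illustration, not read from the question's output):

```r
# a coefficient b on the log-odds scale multiplies the odds by exp(b)
b <- log(1.065)   # back out the slope implied by the quoted odds ratio
round(b, 3)       # ~0.063
round(exp(b), 3)  # 1.065: odds multiplier per one-unit increase in Score
```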

Get this bounty!!!

#StackBounty: #r #time-series #hypothesis-testing #statistical-significance #granger-causality How to better use and interpret granger …

Bounty: 50

I have the following code, and I want to show the connection of two different factors with a specific one. I want to use grangertest in R, and I have the following questions:

  1. how can I interpret the results based on different levels of significance?
  2. how can I interpret non-significant results?
  3. is there a way to visualise the results?
library(lmtest)  # grangertest() comes from the lmtest package

my_example <- data.frame(matrix(ncol = 3, nrow = 10))
my_example$X1 <- c(0.8619616, 1.1818621, 0.5530410, 0.6255634, 
       0.9971764, 1.3464298, 2.0889985, 1.5303893, 2.9503790, 
my_example$X2 <- c( -5.7333332, -4.7000000, -7.7000000, 
     -2.5000000,  1.5666667,  0.2666667, -2.7000000, -6.2000000, 
      0.2333333  ,0.5333333)
my_example$X3 <- c( 0.2200000, 0.3625000, 0.2100000, 0.3750000, 
      0.4966667, 0.4133333, 0.3800000, 0.2133333, 0.3733333, 

grangertest(X1 ~ X2, order = 2, data = my_example)

grangertest(X1 ~ X3, order = 2, data = my_example)
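On interpretation, it may help to see what grangertest computes: an F-test comparing a model of X1 on its own lags against one that also includes lags of X2. A small p-value at your chosen significance level means the lags of X2 add predictive power; a non-significant result means no evidence of incremental predictive power (not proof of no relationship). A base-R sketch of the lag-1 version, on simulated data where the effect is real by construction:

```r
set.seed(123)
n <- 200
x <- rnorm(n)
y <- 0.8 * c(0, head(x, -1)) + rnorm(n, sd = 0.5)  # y depends on lagged x

# lag-1 Granger test by hand: does lagged x improve prediction of y
# beyond lagged y alone?
d <- data.frame(y = y[-1], y_lag = head(y, -1), x_lag = head(x, -1))
restricted   <- lm(y ~ y_lag, data = d)
unrestricted <- lm(y ~ y_lag + x_lag, data = d)
p_value <- anova(restricted, unrestricted)$`Pr(>F)`[2]
```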

Get this bounty!!!

#StackBounty: #r #lme4-nlme #multilevel-analysis #glmm GLMM model formulation with a partial "subcondition"

Bounty: 50

I am modeling reaction times in a GLMM using the lme4 package. My data have the following structure:

  • Subject ID
  • Reaction times (RT)
  • Distractor type (Type): (3 levels): moving – static – no distractor
  • Distractor content (Content): (3 levels): neutral – positive – negative, existing only as a "subcondition" for trials with a moving or static distractor.

Subjects were randomly assigned to three groups (corresponding to distractor type, manipulated between subjects).
Looking at the subjects in the conditions with distractors only, this model gives a good fit and interpretable results:

 model_1 <- glmer(RT ~ Content * Type + (1 + Content | SubjectID), data = InputData, family = inverse.gaussian(link = "identity"))

Comparing all three distractor types across all three groups, ignoring content, is straightforward:

  model_2 <- glmer(RT ~ Type + (1  | SubjectID), data = InputData, family = inverse.gaussian(link = "identity"))

Model_2 clearly shows a large difference between the "no distractor" condition and both "distractor" conditions, but of course this disregards the distractor subconditions (e.g., it could very well be that the overall difference is driven by a difference between, say, negative distractors and no distractors only).

Thus, I’m trying to come up with a model formulation incorporating all three groups, enabling investigation of fixed main and interaction effects as in model_1, bearing in mind that Content only exists as a subgroup in two levels of the Type-factor. I’ve tried working from the example under "partially nested models" as mentioned here, but due to the different factor hierarchy I can’t seem to transpose the solution to this context.
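One common device for partially nested designs like this (a sketch of a standard recoding trick, not a confirmed solution for this dataset; the toy values below are hypothetical) is to give the missing subcondition its own explicit level, so that Content is defined on every trial and can enter the fixed effects:

```r
# toy stand-in for InputData
InputData <- data.frame(
  Type    = c("moving", "static", "no distractor"),
  Content = c("neutral", "positive", NA)
)

# recode: trials without a distractor get their own Content level
InputData$Content2 <- ifelse(InputData$Type == "no distractor",
                             "none", as.character(InputData$Content))
InputData$Content2 <- factor(InputData$Content2)

# Content2 has no NAs, so a formula like RT ~ Content2 * Type is definable
levels(InputData$Content2)
```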

I would be really grateful if anybody could help out!

Get this bounty!!!

#StackBounty: #r #machine-learning #caret Best model (set of predictors) for my data?

Bounty: 50

I’m exploring some ML strategies using the caret package. My goal is to select the best predictors and to obtain an optimal model for further predictions. My dataset is:

  • 75 observations (39 S and 36 F ‒ dependent variable named ‘group’) ‒ the dataset is well balanced
  • 13 independent variables (predictors), continuous values from 0 to 1, without any NAs, named:

    A_1, A_2, A_3, A_4, A_5, B_1, B_2, C_1, C_2, C_3, D_1, D_2, E_1

Moreover, the values of each predictor differ significantly between F and S (Wilcoxon test).

I started with dividing the data and setting up repeated 10-fold cross-validation:

trainIndex <- createDataPartition(data$group, p = 0.7, list = FALSE)
trainingSet <- data[trainIndex,]
testSet <- data[-trainIndex,]

methodCtrl <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 5,
  savePredictions = "final",
  classProbs = TRUE,
  summaryFunction = twoClassSummary
)
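
As an aside on the split itself: with only 75 observations, which 30% land in the test set strongly affects the accuracy estimates, so it can be worth repeating the partition over several seeds. A caret-free sketch of a stratified 70/30 split (plain base R, approximating what createDataPartition does rather than reproducing its exact algorithm):

```r
set.seed(7)
group <- factor(rep(c("S", "F"), times = c(39, 36)))  # class sizes from the question

# sample 70% within each class so train and test keep the class balance
idx <- unlist(lapply(split(seq_along(group), group),
                     function(i) sample(i, round(0.7 * length(i)))))
train_g <- group[idx]
test_g  <- group[-idx]

table(train_g)  # F: 25, S: 27
table(test_g)   # F: 11, S: 12
```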

Then, based on several articles and tutorials I selected six ML methods to obtain some models with all predictor variables:

rff <- train(group ~., data = trainingSet, method = "rf", metric = "ROC", trControl = methodCtrl)
nbb <- train(group ~., data = trainingSet, method = "nb", metric = "ROC", trControl = methodCtrl)
glmm <- train(group ~., data = trainingSet, method = "glm", metric = "ROC", trControl = methodCtrl)
nnett <- train(group ~., data = trainingSet, method = "nnet", metric = "ROC", trControl = methodCtrl)
glmnett <- train(group ~., data = trainingSet, method = "glmnet", metric = "ROC", trControl = methodCtrl)
svmRadiall <- train(group ~., data = trainingSet, method = "svmRadial", metric = "ROC", trControl = methodCtrl)

How accurate are the models?

fitted <- predict(rff, testSet)
confusionMatrix(reference = factor(testSet$group), data = fitted, mode = "everything", positive = "F")#61 #61
fitted <- predict(nbb, testSet)
confusionMatrix(reference = factor(testSet$group), data = fitted, mode = "everything", positive = "F")#66 #66
fitted <- predict(glmm, testSet)
confusionMatrix(reference = factor(testSet$group), data = fitted, mode = "everything", positive = "F")#57 #66
fitted <- predict(nnett, testSet)
confusionMatrix(reference = factor(testSet$group), data = fitted, mode = "everything", positive = "F")#42 #66
fitted <- predict(glmnett, testSet)
confusionMatrix(reference = factor(testSet$group), data = fitted, mode = "everything", positive = "F")#61 #57
fitted <- predict(svmRadiall, testSet)
confusionMatrix(reference = factor(testSet$group), data = fitted, mode = "everything", positive = "F")#66 #66

After the first # I put the % accuracy of prediction for each model. I also drew a simple ROC comparison of all models:

(ROC comparison plot omitted)

Now I’d like to improve my model, so I used glmStepAIC to keep only the best (most important) predictors; here’s what I got:

aic <- train(group ~., data = trainingSet,  method = "glmStepAIC", trControl = methodCtrl, metric = "ROC", trace = FALSE)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0191  -0.6077   0.3584   0.6991   2.5416  

            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  0.39809    1.01733   0.391  0.69557   
A_5          0.11726    0.04701   2.494  0.01263 * 
C_2          0.17789    0.11084   1.605  0.10852   
C_3         -0.18231    0.11027  -1.653  0.09828 . 
E_1         -0.14176    0.05260  -2.695  0.00704 **
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 74.786  on 53  degrees of freedom
Residual deviance: 48.326  on 49  degrees of freedom
AIC: 58.326

Number of Fisher Scoring iterations: 5

Based on this result I chose these 4 predictor variables:

data <- read.table("data.txt", sep = '\t', header = TRUE, dec = ',')
data <- data[, c('group', 'A_5', 'C_2', 'C_3', 'E_1')]

And I repeated everything ‒ train/test division, model fitting, model testing etc. ‒ only with these 4 predictors instead of all 13. Unfortunately the accuracy is still low; take a look at the % after the second # in the confusionMatrix part. Moreover, the ROC comparison is even worse:

(second ROC comparison plot omitted)

I'm new to this kind of analysis, so could you please tell me if I'm making some bad mistake in my analysis, or maybe my dataset is too small / my data is rubbish?
How can I choose optimal predictors to get the best model? Which ML methods should I pick?

Best Regards,

Get this bounty!!!

#StackBounty: #r #regression #repeated-measures #panel-data How to analyze longitudinal data in R?

Bounty: 50

I am trying to examine how an athlete’s performance influences their articulations on Twitter on specific dimensions of research interest (e.g., use of ‘we’ personal pronoun).

I have all the tweets of over 100 athletes along with timestamps. I have a frequency count of ‘we’ for every tweet. For every athlete, I also have their performance track record ‒ i.e., the contests they participated in, the date of each contest, and the result (win/loss/draw). I would like to statistically analyze the effect of performance on the use of ‘we’ and test hypotheses such as the following:

Hypothesis. Win (loss) decreases (increases) the likelihood of using ‘we’ in tweets.

I would like to understand how to analyze such time series data. How should I structure the data for such an analysis in R or Python? What regression models are most appropriate for such an analysis in R?
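One common way to structure this (a hypothetical sketch; all column names and values are illustrative) is a long-format table with one row per tweet, each tweet joined to the athlete's most recent contest result. That layout sets up a mixed-effects count model, e.g. a Poisson GLMM with a random intercept per athlete:

```r
# hypothetical long-format layout: one row per tweet
tweets <- data.frame(
  athlete     = c("a1", "a1", "a2", "a2"),
  date        = as.Date(c("2020-01-02", "2020-01-05",
                          "2020-01-03", "2020-01-06")),
  we_count    = c(2, 0, 1, 3),
  last_result = factor(c("win", "loss", "draw", "win"))
)

# a model of the kind that fits this structure (sketch, not run here):
# lme4::glmer(we_count ~ last_result + (1 | athlete),
#             data = tweets, family = poisson)

# quick descriptive check: mean 'we' count by preceding result
aggregate(we_count ~ last_result, data = tweets, FUN = mean)
```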

Get this bounty!!!