#StackBounty: #mixed-model #residuals #normality-assumption #heavy-tailed Heavy-tailed errors in mixed-effects model

Bounty: 50

I’m relatively new to statistical modelling and `R`, so please let me know if I should provide any further information or plots. I did originally post this question here, but unfortunately have not received any responses yet.

I am using the lme() function from nlme in R to test the significance of the fixed effects of a repeated measures design. My experiment involves subjects listening to a pair of sounds and adjusting the sound level (in decibels) until both are equally loud. This is done for 40 different pairs of stimuli, with both orders tested (A/B and B/A). There are a total of 160 observations per subject (this includes 1 replication of every condition), and 14 participants in all.

The model:

LevelDifference ~ pairOfSounds*order, random = ~ 1 | Subject/pairOfSounds/order

I have built the model up by AIC/BIC and likelihood ratio tests (method = "ML"). Residual plots for the within-group errors are shown below:

[figure: residual plots for the mixed-effects model]

The top left plot shows standardised residuals vs fitted values. I don’t see any systematic pattern in the residuals, so I assume the constant-variance assumption is valid, although further inspection of the subject-by-subject residuals does show some unevenness. In conjunction with the top right plot, I have no reason to suspect non-linearities.

My main concern is the lower left Q-Q plot, which reveals that the residuals are heavy-tailed. I’m not sure where to go from here. From reading Pinheiro and Bates (2000, p. 180), the fixed-effects tests tend to be more conservative when the tails are symmetrically distributed. So perhaps I’m OK if the p-values are very low?

The level two and three random effects show a similar departure from normality.

Basically:

  1. How do such heavy-tailed residuals affect the inference for the fixed effects?
  2. Would robust regression be an appropriate alternative to check? (See the sketch below.)
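
If robust regression is worth checking, one readily available option is the robustlmm package, which fits the same kind of model with robustness weights that downweight outlying observations. A minimal sketch, assuming a long-format data frame dat (hypothetical name) and translating the nlme nesting into lme4-style terms:

library(robustlmm)
# Sensitivity check: a robust LMM that downweights heavy-tailed residuals.
# If the fixed-effect estimates and standard errors barely move relative
# to the lme() fit, the heavy tails are probably not driving the inference.
fit_rob <- rlmer(LevelDifference ~ pairOfSounds * order +
                   (1 | Subject) + (1 | Subject:pairOfSounds) +
                   (1 | Subject:pairOfSounds:order),
                 data = dat)
summary(fit_rob)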


Get this bounty!!!

#StackBounty: #mixed-model #lme4-nlme #glmm #gamma-distribution What can I do if I get patterns in residuals vs predicted values using `…

Bounty: 50

I want to model a response variable (y) as a function of two explanatory variables (x and z). y consists of concentration measurements of a physiological parameter of the human body obtained with a worldwide accepted method (method A). x is a novel method (method B) that might be an alternative, since it is much cheaper. However, method B is thought to be more or less accurate depending on the time (z) that the required equipment has been in use. So my goal is to test the significance of x and z for accurately measuring y. I have 6 individuals (ID: A, B, C, D, E, F). Below I show the relationship between x and y:

[figure: scatterplot of y against x]

Given that my response variable has a sharply non-normal distribution, that the variance increases as x increases, and that I have several individuals but am not interested in differences among individuals, I thought of a GLMM with a Gamma distribution to test the effects of x and z on y.

[figure: distribution of the response variable]

First, I scaled x and z, since I have read that when fitting interactions it is advisable to scale the predictors to reduce collinearity between x and z. Then I ran GLMMs with a Gamma distribution and a log link.

library(lme4)   # glmer()
library(MuMIn)  # r.squaredGLMM(), model.sel()

df$X <- scale(df$X)
df$Z <- scale(df$Z) # the text says x and z are scaled; scaling the Gamma response Y would produce negative values

## Setting random structure ####
mod1<-glmer(Y~ 1 + (1|ID),data = df, family=Gamma(link=log)) # random intercept only
mod2<-glmer(Y~ 1 + (X|ID),data = df, family=Gamma(link=log)) # random slope for X
AIC(mod1,mod2)

## Setting fixed structure ####
mod1<-glmer(Y~ 1 + (X|ID),data = df, family=Gamma(link=log), control = glmerControl(optimizer ="Nelder_Mead")) 
mod2<-glmer(Y~X + (X|ID),data = df, family=Gamma(link=log), control = glmerControl(optimizer ="Nelder_Mead")) 
mod3<-glmer(Y~X + Z + (X|ID),data = df, family=Gamma(link=log), control = glmerControl(optimizer ="Nelder_Mead")) 
mod4<-glmer(Y~X + X:Z +  (X|ID),data = df, family=Gamma(link=log), control = glmerControl(optimizer ="Nelder_Mead")) 

r.squaredGLMM(mod4)[1,c(1,2)]

mod.list <- list(mod1,mod2,mod3,mod4)
model.sel<-model.sel(mod.list, rank="AIC",extra=c(r2=function(x) round(r.squaredGLMM(x)[1,c(1,2)],2)))

To test if x and z are significant I compared models by AIC.

model.sel

Model selection table 
   (Int)      X       Z     X:Z r2.R2m r2.R2c df   logLik     AIC delta weight
4 -1.300 0.7530         0.04838   0.75   0.78  7 1014.235 -2014.5  0.00  0.923
3 -1.293 0.7341 0.02748           0.74   0.77  7 1011.364 -2008.7  5.74  0.052
2 -1.297 0.7390                   0.74   0.77  6 1009.596 -2007.2  7.28  0.024
1 -1.627                          0.00   0.46  5 1003.959 -1997.9 16.55  0.000
Models ranked by AIC(x) 
Random terms (all models): 
‘X | ID’

My problem arises when I plot the residuals against the predicted values: there are clear patterns. I don’t show the distribution of the residuals because, if I understand correctly, normality of the residuals is not required here. I also haven’t transformed x or z, since, if I understood correctly, transformations make little sense when using a GLMM.

[figure: residuals vs predicted values]
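
One way to put the residuals of a Gamma GLMM on a common scale is simulation-based quantile residuals, e.g. with the DHARMa package. A minimal sketch, assuming mod4 from the code above:

library(DHARMa)
# For a well-specified model these residuals are approximately uniform,
# with no trend against the predicted values
sim <- simulateResiduals(fittedModel = mod4, n = 1000)
plot(sim)                        # QQ plot plus residuals-vs-predicted panel
plotResiduals(sim, form = df$Z)  # any leftover structure in Z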

Does anyone know what I could do? Any proposal? I did not share the full data because it is too long (n = 1079).

Thanks in advance

Head of my dataframe:

head <- read.table(header=TRUE, text="
   ID           Z          X          Y
1   A -2.06857299  0.0999264 0.10303660
2   A  0.47766952 -1.5716850 0.06905438
3   A -0.08816215 -1.5873427 0.04125269
4   A -1.36128341  0.2356783 0.27105239
5   A -2.21003091  2.7351100 0.13510265
6   A -2.06857299 -1.5503906 0.04370665
7   A -1.92711507 -1.0819135 0.16452563
8   A  0.19475368 -1.2214479 0.08038283
9   A  1.18495910 -1.4283572 0.04704274
10  A  0.76058535 -1.3846645 0.07481843
11  A  1.46787493 -1.4024705 0.07301108
12  A  0.19475368 -0.9608716 0.12025389
13  A  1.46787493 -1.5005769 0.05606175
14  A  1.18495910 -0.6408735 0.16091346
15  A  0.90204327 -0.4098086 0.25047437
16  A  0.76058535 -1.4679905 0.06739064
17  A  0.61912743 -1.4173005 0.05596254
18  A  0.76058535 -1.3374465 0.07334454
19  A  0.90204327 -0.8583222 0.16164041
20  A  0.47766952  0.9304085 0.46551506
")

Tail of my dataframe:

tail <- read.table(header=TRUE, text="
    ID           Z          X          Y
269  F  0.05329576 -1.5734595 0.03550929
270  F -0.93690965 -1.1399892 0.03945609
271  F -0.08816215 -1.5723113 0.03062313
272  F  0.19475368 -1.5845737 0.03012372
273  F -0.93690965 -1.5058659 0.05277640
274  F  0.61912743 -1.3621286 0.04939641
275  F  0.05329576 -0.8920383 0.13336570
276  F  0.19475368 -1.4702564 0.11916083
277  F  0.19475368 -1.4853866 0.11188576
278  F -0.08816215 -1.3167784 0.12943442
279  F -0.08816215 -1.3261729 0.17835805
280  F -0.08816215 -1.5647956 0.05257086
281  F -1.07836757 -1.0531661 0.10717841
282  F -1.07836757 -1.5897853 0.06021677
283  F -1.64419924 -1.1298259 0.09429196
284  F -1.36128341 -1.5770086 0.03486231
285  F  0.19475368 -1.4635318 0.09377946
286  F  0.76058535 -1.5638153 0.04056188
287  F  0.19475368 -0.6263252 0.23290781
288  F -2.35148882  0.1957513 0.41243622
")

Procedure followed according to JTH’s proposal:

library(splines) # ns()

mod1<-glmer(Y~ 1 + (X|ID),data = df, family=Gamma(link=log), control = glmerControl(optimizer ="Nelder_Mead")) 
mod2<-glmer(Y~ ns(X, 4) + (X|ID),data = df, family=Gamma(link=log), control = glmerControl(optimizer ="Nelder_Mead")) 
mod3<-glmer(Y~ ns(X, 4) + Z + (X|ID),data = df, family=Gamma(link=log), control = glmerControl(optimizer ="Nelder_Mead")) 
mod4<-glmer(Y~ X:Z + ns(X, 4) + (X|ID),data = df, family=Gamma(link=log), control = glmerControl(optimizer ="Nelder_Mead")) 
mod5<-glmer(Y~ ns(X, 4):Z + (X|ID),data = df, family=Gamma(link=log), control = glmerControl(optimizer ="Nelder_Mead")) 
AIC(mod1,mod2,mod3,mod4,mod5)

The results are as follows:

r.squaredGLMM(mod4)[1,c(1,2)]

mod.list <- list(mod1,mod2,mod3,mod4,mod5) # all five models fitted above (the table below ranks five)
model.sel<-model.sel(mod.list, rank="AIC",extra=c(r2=function(x) round(r.squaredGLMM(x)[1,c(1,2)],2)))

model.sel

        Int ns(X)            Z       Z:X Z:ns(X)  r2m  r2c df   logLik       AIC     delta       weight
4 -2.827029     +              0.1871558         0.65 0.69 10 320.9342 -621.8684   0.00000 1.000000e+00
2 -2.660326     +                                0.48 0.54  9 289.8442 -561.6884  60.17996 8.552394e-14
3 -2.660204     + 0.0001000814                   0.48 0.54 10 289.8443 -559.6886  62.17976 3.146559e-14
5 -2.220731                                    + 0.04 0.72  9 259.3610 -500.7219 121.14643 4.936132e-27
1 -1.377842                                      0.00 0.27  5 246.9520 -483.9039 137.96446 1.100015e-30

The plot of my residuals vs predicted values is this now:

enter image description here

Although the patterns have been sharply reduced, I suspect there is still some heteroscedasticity. However, I can’t remove it.
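
If the remaining pattern is heteroscedasticity beyond what the Gamma mean-variance relationship allows, one option is to model the dispersion explicitly. A sketch with glmmTMB, under the assumption that the dispersion varies with X:

library(glmmTMB)
library(splines)
# Gamma GLMM with a dispersion submodel: the shape parameter may change
# with X instead of being constant over the fitted range
mod_disp <- glmmTMB(Y ~ ns(X, 4) + X:Z + (X | ID),
                    dispformula = ~ X,
                    family = Gamma(link = "log"),
                    data = df)
AIC(mod4, mod_disp)  # crude comparison; re-check the residual plot as well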


Get this bounty!!!

#StackBounty: #regression #mixed-model #multilevel-analysis #bias How to explain a multilevel model bias towards one of the levels?

Bounty: 50

My multilevel model’s 2 levels consist of land-surface images and meteorological readings from different places. The readings are level 2, with 7 variables and 1 observation per group. The images are level 1, with 5 variables (different images), where each pixel is an observation. The images are 1000 × 1000 pixels, which means there are 1 million level-1 observations per group.

[figure: predictand image (left) and model output (right)]

The predictand (left) is taken by a thermal camera, while the predictor images are all taken by a more accurate conventional camera, and some of them show features (like trees) vividly. The results of the model (on the right) after cross-validation are wildly inaccurate numerically, but perfectly capture every spatial detail in the image, which suggests a massive bias toward the first level.

My question is 2-fold:

  1. How do I show the model’s bias towards one level?
  2. How many observations is, in fact, too many? Should I cut the images down in size? (See the thinning sketch below.)
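
On question 2, a cheap first experiment is to thin each image to a regular subgrid and check whether the cross-validated accuracy recovers. A minimal sketch, assuming the pixels are stacked in a long data frame px with row, col, and group columns (all hypothetical names):

library(dplyr)
# Keep every 10th pixel in each direction: 1e6 -> 1e4 level-1 observations
# per group, so the 7 level-2 readings are no longer swamped
px_thin <- px %>% filter(row %% 10 == 0, col %% 10 == 0)
# refit the multilevel model on px_thin and compare the cross-validated error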


Get this bounty!!!

#StackBounty: #mixed-model #experiment-design #power-analysis Nested Experimental Design: Intuition about power?

Bounty: 50

I am planning a nested experimental design in which 1000 participants each respond to one policy proposal from a 2^3 design (125 participants for each condition of A × B × C).

However, the policy proposal can come from one of 25 policy categories. I am not particularly interested in their specific effects; I just want to be able to make claims across these categories. Therefore I plan to model them as random effects in a mixed model (each policy category would receive 40 participants).

Ideally, the model would be:

lmer(response ~ A*B*C + (1 + A*B*C|category)) 

Would such a model come even close to being adequately powered? Or does the inclusion of the 25 policy categories in the experimental design increase the number of experimental conditions to 200, therefore requiring an extremely large number of participants?

Is there a benefit, when it comes to statistical power, to model a categorical treatment variable as random?

Edit:

Some info about likely parameter values:

Y = values from 1 to 7 (let’s assume it is metric and Gaussian)
Ymean = 4
BetaA = 0.0
BetaB = -0.4
BetaC = 0.4
BetaAxB = 0.2
BetaAxC = -0.2
BetaBxC = -0.1
BetaAxBxC = -0.1
Residual standard deviation = 1
Within-category standard deviation = 0.5
alpha = 0.05
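
With these values, power can be estimated directly by simulation. A simplified sketch with the simr package, using a random intercept for category and effect-coded factors (the full random-slope structure from the lmer formula above would need a larger VarCorr specification; the ±0.5 coding of the factors and the test method are assumptions):

library(simr)
# 2x2x2 cells x 25 categories x 5 replicates = 1000 participants,
# i.e. 40 per category, as planned
dat <- expand.grid(A = c(-0.5, 0.5), B = c(-0.5, 0.5), C = c(-0.5, 0.5),
                   category = factor(1:25), rep = 1:5)
fit <- makeLmer(y ~ A * B * C + (1 | category),
                fixef   = c(4, 0, -0.4, 0.4, 0.2, -0.2, -0.1, -0.1),  # values above
                VarCorr = 0.25,  # category intercept variance (SD = 0.5)
                sigma   = 1,     # residual SD
                data    = dat)
powerSim(fit, test = fixed("A:B:C", "t"), nsim = 200)  # power for the 3-way term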


Get this bounty!!!

#StackBounty: #r #bayesian #mixed-model #circular-statistics #stan Interpretation of coefficients in mixed-effects model with circular …

Bounty: 50

I have a dataset from an experiment where wild ants were surveyed continuously for 24 hours under a number of temperature treatments (chambers). Whenever an ant was observed, the species of the ant and the time, rounded to the nearest hour, were recorded. This is circular data because the observations cover the entire 24-hour period (at least some ants are active at any time of day or night). I calculated the circular median time within each species and chamber. The null hypothesis is that an individual species does not change its median time with a change in temperature.

I fit a mixed-effects model with the R package brms (a wrapper for Stan) using a von Mises distribution (with default link functions) for the response, with temperature as a fixed effect and species as a random effect (each species has both a random slope and a random intercept). I had to transform the hour values to radians such that 0:00 maps to $-\pi$, 12:00 maps to $0$, and 24:00 maps to $\pi$.

I am confused about how to interpret the species-level coefficients. I see the highest coefficient on a species that basically shows no change in response to temperature treatment but where the median time crosses midnight. I am concerned that I set up the model wrong or that I am interpreting the coefficients wrong.

data

library(circular)
library(brms)

dat <- structure(list(sp = c("apla", "apla", "apla", "apla", "apla", 
"apla", "apla", "apru", "apru", "apru", "apru", "apru", "apru", 
"apru", "apru", "apru", "apru", "apru", "apru", "caca", "caca", 
"caca", "caca", "caca", "caca", "caca", "caca", "caca", "caca", 
"caca", "cape", "cape", "cape", "cape", "cape", "cape", "cape", 
"cape", "cape", "cape", "cape", "cape", "crli", "crli", "crli", 
"crli", "crli", "crli", "crli", "crli", "crli", "crli", "crli", 
"crli", "fosu", "fosu", "fosu", "fosu", "fosu", "fosu", "fosu", 
"fosu", "fosu", "fosu", "fosu", "prim", "prim", "prim", "prim", 
"prim", "prim", "prim", "prim", "prim", "prim", "prim", "prim"
), chamber = c(1, 2, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 
10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 1, 2, 3, 4, 5, 
6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 
1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 
10, 11, 12), temperature = c(3.5, 0, 2, 0, 1.5, 3, 5, 3.5, 0, 
4.5, 2, 0, 1.5, 3, 5, 5.5, 2.5, 0, 4, 3.5, 0, 4.5, 2, 0, 1.5, 
3, 5, 5.5, 0, 4, 3.5, 0, 4.5, 2, 0, 1.5, 3, 5, 5.5, 2.5, 0, 4, 
3.5, 0, 4.5, 2, 0, 1.5, 3, 5, 5.5, 2.5, 0, 4, 3.5, 0, 4.5, 2, 
1.5, 3, 5, 5.5, 2.5, 0, 4, 3.5, 0, 4.5, 2, 0, 1.5, 3, 5, 5.5, 
2.5, 0, 4), median_time = structure(c(11, 8, 14, 17.5, 16, 9, 
8, 20, 9, 13, 11, 9, 7, 9, 14, 6, 22, 7, 19, 23, 1, 23, 23, 2, 
0, 1, 23, 2, 1, 2, 15, 19.508716014162, 21, 20, 3, 12, 22, 21, 
1, 23, 0.999999999999999, 12, 23, 0.999999999999999, 0.999999999999999, 
17, 2, 3, 17, 0.999999999999999, 0.999999999999999, 16, 14, 0, 
12.3324823150422, 14, 13, 12, 10, 12, 18, 15, 9.65973937593219, 
15, 13, 0.999999999999999, 23, 0.999999999999999, 6, 21, 17, 
4, 0.999999999999999, 4, 4, 2, 3), medians = 11, circularp = list(
    type = "angles", units = "hours", template = "none", modulo = "2pi", 
    zero = 0, rotation = "counter"), class = c("circular", "numeric"
)), median_time_radians = c(-0.26179938779915, -1.0471975511966, 
0.523598775598299, 1.43989663289532, 1.0471975511966, -0.785398163397448, 
-1.0471975511966, 2.0943951023932, -0.785398163397448, 0.261799387799149, 
-0.26179938779915, -0.785398163397448, -1.30899693899575, -0.785398163397448, 
0.523598775598299, -1.5707963267949, 2.61799387799149, -1.30899693899575, 
1.83259571459405, 2.87979326579064, -2.87979326579064, 2.87979326579064, 
2.87979326579064, -2.61799387799149, -3.14159265358979, -2.87979326579064, 
2.87979326579064, -2.61799387799149, -2.87979326579064, -2.61799387799149, 
0.785398163397447, 1.96577725566528, 2.35619449019234, 2.0943951023932, 
-2.35619449019234, 0, 2.61799387799149, 2.35619449019234, -2.87979326579064, 
2.87979326579064, -2.87979326579064, 0, 2.87979326579064, -2.87979326579064, 
-2.87979326579064, 1.30899693899575, -2.61799387799149, -2.35619449019234, 
1.30899693899575, -2.87979326579064, -2.87979326579064, 1.0471975511966, 
0.523598775598298, -3.14159265358979, 0.0870436665320824, 0.523598775598299, 
0.261799387799149, 0, -0.523598775598299, -4.44089209850063e-16, 
1.5707963267949, 0.785398163397448, -0.612678798671407, 0.785398163397447, 
0.261799387799149, -2.87979326579064, 2.87979326579064, -2.87979326579064, 
-1.5707963267949, 2.35619449019234, 1.30899693899575, -2.09439510239319, 
-2.87979326579064, -2.0943951023932, -2.0943951023932, -2.61799387799149, 
-2.35619449019234)), class = "data.frame", row.names = c(NA, 
-77L))

model

priors <- prior_string("student_t(3, 0, 5)", class = "sd")

fit <- brm(median_time_radians ~ temperature + (temperature | sp), 
           family = von_mises(), 
           prior = priors,
           data = dat,  # the data frame constructed above
           control = list(adapt_delta = 0.9),
           chains = 2, iter = 7500, warmup = 5000, seed = 12345)

species-level coefficients

coef(fit)$sp[,,'temperature']

      Estimate  Est.Error          Q2.5      Q97.5
apla -0.3153341 0.23798523  -0.892426917  0.0289234
apru  0.2865710 0.27866258   0.002069992  0.8184251
caca -6.5935606 3.15748526 -14.064381326 -2.5290273
cape  3.0701637 2.21674069  -0.253182098  7.5921491
crli  3.2702919 1.82584857   1.068027298  7.8987657
fosu  0.0571131 0.08858313  -0.101666321  0.2462271
prim -3.3404271 1.61870242  -7.440654851 -1.3915963

I’m confused why the species caca has the largest absolute coefficient even though its median time barely changes: all its median values are between 23:00 and 2:00, so its trend crosses midnight but the times themselves change little. I would appreciate any help interpreting these coefficients, or coefficients from a mixed-effects model with a circular response more generally.
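
One possible explanation: brms’ von_mises family uses the tan-half link for the mean by default, which is extremely non-linear near $\pm\pi$ (i.e. around midnight on this scale), so a small shift in median time there requires a very large change in the linear predictor. That alone could produce caca’s large coefficient. A sketch mapping the species-level coefficients back to clock time (assuming the default link):

# 2*atan() inverts the tan-half link; radians are then mapped back to the
# 0-24 h clock used before the transformation
co    <- coef(fit)$sp
eta   <- co[, "Estimate", "Intercept"] + co[, "Estimate", "temperature"] %o% c(0, 5.5)
hours <- (2 * atan(eta) + pi) * 24 / (2 * pi)
round(hours, 1)  # rows: species; columns: temperature 0 and 5.5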


Get this bounty!!!

#StackBounty: #mixed-model #repeated-measures #lme4-nlme #nested-data Repeated measures analysis with LME: Why nest experimental factor…

Bounty: 50

Consider a pure repeated measures design, with (let’s say) 3 experimental within-subject factors A, B, and C, and (for simplicity) 2 levels per factor. So we have 2 × 2 × 2 = 8 measurements per subject.

Now I would like to test the fixed effects with a linear mixed effects model. I have read in several sources (for example Andy Field’s book “Discovering Statistics Using R”, and this site: http://www.jason-french.com/tutorials/repeatedmeasures.html) that with lme, one should use the following syntax:

model <- lme(dv ~ A*B*C, random = ~1|id/A/B/C)

However, I do not understand why you would “nest” the factors within the subject in the random part of the model, and not just use ~1|id. What is the point of this, and what does it do?

Conceptually, I don’t understand why one would nest the experimental fixed factors within the random subject factor. The way I understood nesting until now, you would only use it to account for the fact that certain lower factor levels only exist within certain higher factor levels – like pupils within classes within schools within cities, etc. How does this apply to a repeated measures design with fully crossed within-subject factors?

Mathematically, the way I understand this, such a model first estimates a random intercept for each subject, capturing random differences in the average value of the dependent variable between subjects. So in the case of (let’s say) 20 subjects, we get 20 different random intercepts. Then, apparently, the model estimates random intercepts for each combination of subject and level of factor A (resulting in 40 random intercepts), then for each combination of subject, factor A, and factor B (80 random intercepts), all the way down to the most specific level, where we get as many estimated random intercepts as we have measurements (160). What is the point of this, and why would we not estimate only one random intercept per subject (~1|subject)? Also, wouldn’t all of these random intercepts together explain the dependent variable nearly perfectly, leaving little to nothing to be explained by the fixed effects?
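
To make the expansion concrete, in lme4 notation the nesting id/A/B/C is shorthand for a set of intercept terms, one per stratum. A sketch, assuming a long-format data frame dat (hypothetical name); with a single replicate per cell the innermost id:A:B:C stratum coincides with the residual, so it is omitted here:

library(lme4)
# Variance components for subjects, subject-by-A cells, and
# subject-by-A-by-B cells
m <- lmer(dv ~ A * B * C +
            (1 | id) + (1 | id:A) + (1 | id:A:B),
          data = dat)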

Lastly, my intuition tells me that these random intercepts should at least partially explain the same information as would be captured by entering random slopes of the experimental factors into the model. Is that correct?


Get this bounty!!!

#StackBounty: #r #mixed-model #fixed-effects-model #categorical-encoding The difference in interpretation between a country and a year …

Bounty: 50

I am trying to expand my knowledge about the different interpretations of combinations of fixed effects.

I am using a pooled cross section dataset with observations at the firm level. The dataset spans multiple countries over 2 years (2005-2010).

EDIT: Please see sample data below

My question is very simple. In this scenario, what is the difference in interpretation between including country and year fixed effects, country-year fixed effects, or both?

Is there a case to be made for each option when taking into account my dataset?

I read the following on another site:

When you interact state and year dummies (i.e. when you include state, year, and state*year in the regression, which by the way is the same as creating state-year dummies and including them in the regression), you are assuming that the unobserved state-level heterogeneity varies over time. Also, you are assuming the time effect to vary by state. If you include state and year separately and no interaction, you are assuming that the unobserved state-level heterogeneity is constant over time.

If I read this, I get the feeling that it is always better to include the interactions. But in statistics, nothing seems to come for free. So what is the downside here?
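
One concrete downside: every country-year cell spends a parameter, and any regressor that varies only at the country-year level (like taxrate here) is then collinear with the dummies. A quick way to compare how many independent parameters each option spends (a sketch using the simulated DT from the Data section below):

# country:year and country*year span the same space of country-year cell
# means (surplus columns are aliased and dropped by the fitting function);
# the additive version is much cheaper in degrees of freedom
X1 <- model.matrix(~ country + factor(year), DT)  # additive FE
X2 <- model.matrix(~ country:factor(year), DT)    # country-year dummies only
X3 <- model.matrix(~ country * factor(year), DT)  # mains plus interaction
sapply(list(additive = X1, cells = X2, both = X3), function(m) qr(m)$rank)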

Please see my thought process below (and please correct me if I am wrong):

  1. If I add a country dummy, I account for static differences per country. In other words, I control for time-constant omitted variable bias.
  2. If I add a year dummy I account for trends (I would say world trends in this case).
  3. If I add a country-year dummy, I am controlling for trends that are country specific.
  4. If I add a country dummy, a year dummy and a country-year dummy, I am doing all of this at once?

For 1 and 2 I am pretty much okay.

By point 3 I begin to wonder: do I need to? Should I always include this? At what cost do I include this country-year dummy?

If I have a country-year dummy without a country and year dummy, does that make sense?

Or should I therefore put them all in? That brings me to point 4.

Data

panelID= c(1:50)
year= c(2005, 2010)
country = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
urban = c("A", "B", "C")
indust = c("D", "E", "F")
sizes = c(1,2,3,4,5)
n <- 2
library(data.table)
library(dplyr)
set.seed(123)
DT <- data.table(   country = rep(sample(country, length(panelID), replace = T), each = n),
                    year = c(replicate(length(panelID), sample(year, n))),
                    sales= round(rnorm(10,10,10),2),
                    industry = rep(sample(indust, length(panelID), replace = T), each = n),
                    urbanisation = rep(sample(urban, length(panelID), replace = T), each = n),
                    size = rep(sample(sizes, length(panelID), replace = T), each = n))
DT <- DT %>%
group_by(country) %>%
mutate(base_rate = as.integer(runif(1, 12.5, 37.5))) %>%
group_by(country, year) %>%
mutate(taxrate = base_rate + as.integer(runif(1,-2.5,+2.5)))
DT <- DT %>%
group_by(country, year) %>%
mutate(vote = sample(c(0,1),1), 
votewon = ifelse(vote==1, sample(c(0,1),1),0))

library(AER) # provides ivreg()

# No interaction

summary(ivreg(sales ~ taxrate + as.factor(size) + as.factor(urbanisation) + country + as.factor(year) | 
as.factor(votewon) + as.factor(size) + as.factor(urbanisation) + country + as.factor(year), data=DT))

summary(ivreg(sales ~ taxrate + as.factor(size) + as.factor(urbanisation) + as.factor(vote) + country + as.factor(year) 
| as.factor(votewon) + as.factor(size) + as.factor(urbanisation) + as.factor(vote) + country + as.factor(year), data=DT))

# Interaction

summary(ivreg(sales ~ taxrate + as.factor(size) + as.factor(urbanisation) + country:as.factor(year) | 
as.factor(votewon) + as.factor(size) + as.factor(urbanisation) + country:as.factor(year), data=DT))

summary(ivreg(sales ~ taxrate + as.factor(size) + as.factor(urbanisation) + as.factor(vote) + country:as.factor(year) 
| as.factor(votewon) + as.factor(size) + as.factor(urbanisation) + as.factor(vote) + country:as.factor(year), data=DT))

# Both

summary(ivreg(sales ~ taxrate + as.factor(size) + as.factor(urbanisation) + country*as.factor(year) | 
as.factor(votewon) + as.factor(size) + as.factor(urbanisation) + country*as.factor(year), data=DT))

summary(ivreg(sales ~ taxrate + as.factor(size) + as.factor(urbanisation) + as.factor(vote) + country*as.factor(year) 
| as.factor(votewon) + as.factor(size) + as.factor(urbanisation) + as.factor(vote) + country*as.factor(year), data=DT))


Get this bounty!!!