#StackBounty: #r #time-series #mixed-model #lme4-nlme #linear-model Linear Mixed Model: how to test interactions within and across blocks

Bounty: 50

I know that there are many related questions and answers, but each setup, including my case, is so specific in LMM that it is hard to draw reliable analogies.

The data (available here):

  • subject_id: Participant ID.
  • rt_start: Response time (RT) to stimulus (between 150 and 1000 ms).
  • stim_type: Either of two types of the displayed stimulus, probe or control.
  • block_number: Block (1, 2, 3, or 4). (Different blocks contain different specific stimulus words, but their types are always either probe or control.)
  • trial_number: Trial, numbered 1-648, with ca. 46% of trials "missing" at random but relatively evenly distributed throughout (too-slow and incorrect responses, as well as other, irrelevant trial types, were excluded from the data; these were originally mixed with the probe and control stimuli in a balanced sequence).

Importantly, the stim_type effect is large and expected: probe RT is on average much larger than control RT. What I want to test is how much this difference is influenced by learning (practice/habituation) throughout the task, expecting the difference to decrease with time. Crucially, I want to see both (a) whether there is a within-block effect (increasing trial number leads to a decreasing probe-control difference; stim_type x trial_number interaction), and (b) whether there is a between-block effect (increasing block number leads to a decreasing probe-control difference; stim_type x block_number interaction).

I’m pretty sure that I need LMM to do this, but I’m not sure how exactly.

Some general relevant points:

  • I expect an overall learning effect (generally faster responses), hence both block_number and trial_number main fixed effects are included, but otherwise they are not variables of interest in themselves.
  • Since each trial position occurs in each block (in principle), one could perhaps argue that trials in the first block (1-162) are no different from, e.g., those in the fourth block (1-162 within that block, but 487-648 counting over the entire test). However, for now this is not taken into account.
  • I assume that the change per trial and the change per block are both linear (constant per unit), hence I treat both as numeric (continuous) variables.
  • The general RT as well as the stim_type effect (probe vs. control) differs by subject, hence I include (stim_type | subject_id), which should, if I understand correctly, account for both the general RT baseline (intercept) and the probe-control difference (the "slope" of the dummy variable stim_type) per individual.
  • I’m using glmer with Gamma(link = "identity"), as this is recommended for RTs, and it also seems to improve the residual problem a little (see below), though not by much. I had no success with robust alternatives (e.g. robustlmm and similar), mostly because they run into errors due to the large sample.
  • No matter how I vary the approach (e.g. any of the parameters mentioned above, or even block-wise aggregation with a simple ANOVA), the overall learning effect (e.g., the stim_type x trial_number interaction) seems robust (at least p < .015 for everything I have tried so far), so there seems to be little doubt about it. I even tried robustlmm::rlmer with decimated data (to avoid the large-data error), and the effects were still significant with similar estimates. What’s more, I got the same results (again very low p values) with another, very similar independent dataset (omitted here for brevity), where the residuals look somewhat better.

library('lme4')

# fit the full model
mlm_full = glmer(
  rt_start ~ stim_type + block_number + trial_number +
    stim_type:block_number + stim_type:trial_number +
    (stim_type | subject_id),
  data = lgcit_dat,
  family = Gamma(link = "identity")
)


> Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
> Family: Gamma  ( identity )
> Formula: rt_start ~ stim_type + block_number + trial_number + stim_type:block_number +  
>     stim_type:trial_number + (stim_type | subject_id)
> Data: lgcit_dat
>     AIC       BIC    logLik  deviance  df.resid 
> 910966.3  911058.7 -455473.1  910946.3     76637 
> Random effects:
> Groups     Name           Std.Dev. Corr 
> subject_id (Intercept)    20.1051       
>             stim_typeprobe 28.2413  -0.07
> Residual                   0.1904       
> Number of obs: 76647, groups:  subject_id, 219
> Fixed Effects:
>                 (Intercept)               stim_typeprobe                 block_number  
>                 514.31118                    115.23858                     18.92389  
>             trial_number  stim_typeprobe:block_number  stim_typeprobe:trial_number  
>                 -0.19776                     -3.21046                     -0.05767  


# without stim_type:trial_number
mlm_xtrial = glmer(
  rt_start ~ stim_type + block_number + trial_number +
    stim_type:block_number +
    (stim_type | subject_id),
  data = lgcit_dat,
  family = Gamma(link = "identity")
)

# without stim_type:block_number
mlm_xblock = glmer(
  rt_start ~ stim_type + block_number + trial_number +
    stim_type:trial_number +
    (stim_type | subject_id),
  data = lgcit_dat,
  family = Gamma(link = "identity")
)

# test whether stim_type:trial_number is significant contributor
aov_trials = anova(mlm_full, mlm_xtrial)

>         npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)   
> mlm_xtrial    9 910973 911056 -455477   910955                        
> mlm_full     10 910966 911059 -455473   910946 8.3873  1   0.003779 **



# test whether stim_type:block_number is significant contributor
aov_blocks = anova(mlm_full, mlm_xblock)

>         npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)
> mlm_xblock    9 910965 911048 -455474   910947                     
> mlm_full     10 910966 911059 -455473   910946 0.9347  1     0.3336

As noted above, the residuals are (still) problematic (via ggpubr::ggqqplot(resid(mlm_full), shape = 1)):

[QQ plot of resid(mlm_full)]

This is strange because the underlying data seem normally distributed; see e.g. the profile plot (via lmer/profile/xyplot):

[profile xyplot]

Questions:

  1. Is this approach generally correct? In particular: is this how I should include and test the significance of the two critical interactions?

  2. Given the very strong and repeated (replicated) evidence, is the assumption violation "kind of" permissible? If not, how can I do it better (in practice, not in theory)?

  3. Looking at the full model, would it be correct to say that the probe-control difference decreases significantly by 0.06 ms per trial, and nominally (but without statistical significance) by an additional 3 ms per block? (Note that these coefficients too hardly change depending on the settings.)
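Regarding question 3, here is a quick back-of-envelope check of what the fixed-effect estimates in the output above imply for the probe-control difference (assuming trial_number runs 1-648 over the whole task, as described):

```python
# fixed-effect estimates copied from the glmer output above
intercept_diff = 115.23858   # stim_typeprobe: probe-control difference at block 0, trial 0
per_block = -3.21046         # stim_typeprobe:block_number
per_trial = -0.05767         # stim_typeprobe:trial_number

def probe_control_diff(block, trial):
    """Model-implied probe-control RT difference (ms) at a given block and trial."""
    return intercept_diff + per_block * block + per_trial * trial

# implied difference at the start and at the very end of the task
start = probe_control_diff(1, 1)     # block 1, trial 1
end = probe_control_diff(4, 648)     # block 4, trial 648
```

Taken at face value, the implied difference shrinks from roughly 112 ms at the start to roughly 65 ms by the last trial of block 4; per the likelihood-ratio tests above, only the per-trial component of that shrinkage is statistically supported.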


Get this bounty!!!

#StackBounty: #regression #mathematical-statistics #correlation #linear-model #multiple-comparisons Correlation of t-statistics for two…

Bounty: 50

Say we run two simple linear regressions with common $y$ but different and correlated predictors $x_1, x_2$:
$$y = \beta_{01} + \beta_1 x_1 + \epsilon_1,$$
$$y = \beta_{02} + \beta_2 x_2 + \epsilon_2.$$

We obtain the t-statistics $t_1 = \hat\beta_1/\widehat{\mathrm{se}}(\hat\beta_1)$, $t_2 = \hat\beta_2/\widehat{\mathrm{se}}(\hat\beta_2)$. The t-statistic can be expressed as a function of the sample correlation $r$: $t = \sqrt{(n-2)/(r^{-2}-1)}$. My questions are:

  1. Is $\rho(t_1, t_2) = \rho(x_1, x_2)$ approximately true? Here $\rho$ denotes the population correlation. A simple simulation shows that the two correlations are pretty close. Normality can be assumed if needed.
  2. If Q1 is untrue, would $\rho(t_1, t_2)$ be unrelated to $y$?

Any suggestions would be greatly appreciated.
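For reference, here is a minimal version of the kind of simulation alluded to in Q1, under the simplifying null that $y$ is independent of both predictors (the sample size, correlation, and replication count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, rho = 50, 5000, 0.6

def t_from_r(r):
    # signed version of the question's t = sqrt((n-2)/(r^-2 - 1))
    return r * np.sqrt((n - 2) / (1 - r**2))

t1 = np.empty(reps)
t2 = np.empty(reps)
for i in range(reps):
    # (x1, x2) correlated with rho; y independent of both (null model)
    z1, z2 = rng.standard_normal(n), rng.standard_normal(n)
    x1 = z1
    x2 = rho * z1 + np.sqrt(1 - rho**2) * z2
    y = rng.standard_normal(n)
    t1[i] = t_from_r(np.corrcoef(x1, y)[0, 1])
    t2[i] = t_from_r(np.corrcoef(x2, y)[0, 1])

emp_corr = np.corrcoef(t1, t2)[0, 1]   # lands near rho = 0.6
```

Under this null each $t_j$ is a monotone, nearly linear function of the sample correlation $r_j = \mathrm{corr}(x_j, y)$, and $\mathrm{corr}(r_1, r_2) \approx \rho(x_1, x_2)$ for moderate $n$, which is why the empirical correlation lands near $0.6$; whether the approximation survives when $y$ genuinely depends on the predictors is exactly what Q1/Q2 ask.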


#StackBounty: #machine-learning #linear-model #interpretation #intercept #lime Using LIME without intercept term?

Bounty: 50

I’m playing with LIME to explain the prediction of a machine learning model.

LIME trains a (locally weighted) linear surrogate model around a point of interest. The weights of that surrogate model are the feature importances of your model’s prediction at that point.

However, it’s not clear to me how to interpret the intercept term. It’s the "base" prediction if all features are zero, which seems meaningless.

(Compare this to SHAP, where the "base" value is the average prediction across the training set.)

So, does it make sense to run LIME without using an intercept term? This way, the feature importances will directly sum up to the prediction.

If not (which I suspect), how do I interpret the intercept term?
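For intuition, here is a toy surrogate in the spirit of LIME, not the `lime` package's actual implementation; the black-box function, kernel width, and sample count are all invented for the sketch. It compares weighted least squares fits with and without an intercept:

```python
import numpy as np

rng = np.random.default_rng(1)

def black_box(X):
    # stand-in for an arbitrary trained model (invented for this sketch)
    return np.sin(X[:, 0]) + X[:, 1] ** 2

x0 = np.array([0.5, 1.0])                     # point to explain
Z = x0 + 0.3 * rng.standard_normal((500, 2))  # local perturbations around x0
f = black_box(Z)
w = np.exp(-np.sum((Z - x0) ** 2, axis=1) / (2 * 0.3 ** 2))  # proximity kernel
sw = np.sqrt(w)

# weighted least squares WITH an intercept: design [1, z1, z2]
A = np.column_stack([np.ones(len(Z)), Z])
coef_int, *_ = np.linalg.lstsq(A * sw[:, None], f * sw, rcond=None)

# weighted least squares WITHOUT an intercept: design [z1, z2]
coef_noint, *_ = np.linalg.lstsq(Z * sw[:, None], f * sw, rcond=None)

pred_int = coef_int[0] + coef_int[1:] @ x0  # intercept + feature contributions
pred_noint = coef_noint @ x0                # contributions alone sum to the prediction
```

Dropping the intercept does make the contributions sum exactly to the surrogate's prediction, but it forces the local plane through the origin, which distorts the slopes whenever x0 is far from zero. That is the usual argument for keeping the intercept and reading it, loosely, as the local "base value" against which the feature effects are measured.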


#StackBounty: #regression #variance #econometrics #linear-model #estimators Compare the variances of restricted and unrestricted estima…

Bounty: 50

Problem
Given the linear model $y_i = \beta_1 + \beta_2 x_i + \epsilon_i, \quad i = 1, \dots, n$,
I need to compare the variance of the ordinary least squares estimator of $\beta_2$ without restrictions with the variance of the ordinary least squares estimator of $\beta_2$ under the linear restriction $\beta_1 = 0$ (i.e. $\operatorname{Var}(\beta_2^R)$ and $\operatorname{Var}(\beta_2^U)$).
Is $\beta_2^R$ unbiased, and are any of the Gauss–Markov assumptions violated?

My ideas
Intuitively, the restricted variance should be smaller than the unrestricted variance, because restrictions reduce the flexibility of a model and hence its variance.
However, I need a mathematical proof for the problem above.
UPD
With help from the comments I have obtained that $$ \operatorname{Var}(\beta_2^{UR}) = \frac{\sum(y_i - \beta_1 - \beta_2 x_i)^2/(n-2)}{\sum(x_i - \bar{x})^2}$$ and $$ \operatorname{Var}(\beta_2^{R}) = \frac{\sum(y_i - \beta_2 x_i)^2/(n-2)}{\sum x_i^2}$$
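A sketch of the variance comparison the problem asks for, using the standard conditional variances (with $\sigma^2$ the error variance):
$$\operatorname{Var}(\beta_2^{U}) = \frac{\sigma^2}{\sum_i (x_i-\bar x)^2}, \qquad \operatorname{Var}(\beta_2^{R}) = \frac{\sigma^2}{\sum_i x_i^2}.$$
Since $\sum_i x_i^2 = \sum_i (x_i-\bar x)^2 + n\bar x^2 \ge \sum_i (x_i-\bar x)^2$, it follows that $\operatorname{Var}(\beta_2^{R}) \le \operatorname{Var}(\beta_2^{U})$, with equality iff $\bar x = 0$. Also, since $\beta_2^R = \sum_i x_i y_i / \sum_i x_i^2$,
$$\operatorname{E}[\beta_2^R] = \beta_2 + \beta_1 \frac{\sum_i x_i}{\sum_i x_i^2},$$
so the restricted estimator is unbiased only if the restriction $\beta_1 = 0$ actually holds (or $\sum_i x_i = 0$). None of this contradicts Gauss–Markov, which only says $\beta_2^U$ is best among linear estimators that are unbiased under the unrestricted model.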



#StackBounty: #self-study #pca #linear-model #functional-data-analysis Asymptotic properties of functional models

Bounty: 100

When working in Functional Data Analysis, a classical "preprocessing" step is to represent the "observations" using a B-spline expansion:

$$
X_i(t) \approx \sum_{j=1}^J \lambda_{ij} f_j(t) \qquad i=1, \ldots, n
$$

where $J$ is the number of elements in the basis and $f_1, \ldots, f_J$ are suitably defined B-spline functions.
Then, statistical methods are performed by working directly on the coefficients $\{\lambda_{ij}\}$.

My question is whether there are asymptotic guarantees that, as the number of observations $n$ and the truncation level $J$ increase to $+\infty$, the statistical methods converge to a "true" idealized solution.

In particular, I’m interested in function-on-function regression and functional PCA.
I know the literature is huge, but it would be great to have some papers to start from!
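Tangential to the asymptotics question, but for concreteness, here is a minimal sketch of the preprocessing step itself: computing the coefficients $\lambda_{ij}$ for one curve by least squares with scipy. The grid, knot placement, and test curve are arbitrary choices of mine:

```python
import numpy as np
from scipy.interpolate import make_lsq_spline

t_grid = np.linspace(0.0, 1.0, 200)      # observation grid for one curve X_i(t)
curve = np.sin(2 * np.pi * t_grid)       # the "observed" curve

k = 3                                    # cubic B-splines
interior = np.linspace(0.0, 1.0, 10)[1:-1]   # 8 interior knots
# full knot vector with (k+1)-fold boundary knots
knots = np.r_[[0.0] * (k + 1), interior, [1.0] * (k + 1)]

spl = make_lsq_spline(t_grid, curve, knots, k=k)
lam = spl.c                              # the coefficients lambda_{ij}; J = len(lam)
recon = spl(t_grid)                      # reconstruction sum_j lam_j f_j(t)
max_err = np.max(np.abs(recon - curve))  # small for a smooth curve
```

Statistical methods (regression, PCA) then operate on the vector `lam` for each curve; the asymptotic question is what happens as both the number of curves and `len(lam)` grow.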



#StackBounty: #regression #inference #linear-model #matlab #standard-error How to best find standard error *across* linear regression f…

Bounty: 50

So I have a scenario with n = 8 subjects, each of which has a different noise level in its observations $y$. For example, consider the following:

num_datasets = 8;

x = [1:20]';

% define matrix for the response for 8 different datasets
Y = nan(length(x), num_datasets);

for i = 1:num_datasets
    Y(:,i) = 2*x + unifrnd(3,8)*randn(size(x));
end

So clearly each observation/subject has the same true slope but different amounts/sources of noise. Now, I know that the standard error for the linear regression fit has the form:

$\sigma\sqrt{\dfrac{1}{n} + \dfrac{(x^*-\bar x)^2}{\sum_{i=1}^n (x_i-\bar{x})^2}}$

where $\sigma$ represents the standard deviation of the residuals of the fit, $n$ represents the number of samples in the observation (in my example above this would be 20, not 8), $(x^* - \bar x)$ represents the distance of the point of interest $x^*$ from the mean (which is why the standard error widens hyperbolically as you move away from the mean), and $\sum_{i=1}^n (x_i-\bar{x})^2$ is the sum of squared deviations of $x$ (proportional to its variance).

However, if I interpret this equation correctly, I think this gives the standard error across the dimension of $x$, and doesn’t directly tell me the standard error across subjects. In other words, I suspect it wouldn’t be a good idea to use this formula for each subject and then take the mean standard error (please correct me if I am wrong). So I have 2 questions:

  1. What would be the best way to calculate the standard error across subjects? Would it simply be to perform the fit for each subject, and then take the standard deviation of the fits?
  2. What would the shape of the standard error of the fit look like, and what is the intuition behind that? Would it still be hyperbolic? I don’t think it would, but actually really not sure.
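For question 1, the "fit each subject, then summarize the fits" route can be sketched as follows (a numpy translation of the MATLAB setup above; the variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(2)
n_subjects, true_slope = 8, 2.0
x = np.arange(1, 21, dtype=float)

# each subject: same true slope, different noise level (cf. unifrnd(3, 8))
Y = np.empty((len(x), n_subjects))
for i in range(n_subjects):
    sigma_i = rng.uniform(3, 8)
    Y[:, i] = true_slope * x + sigma_i * rng.standard_normal(len(x))

# fit a straight line per subject, keep the slopes
slopes = np.array([np.polyfit(x, Y[:, i], 1)[0] for i in range(n_subjects)])

# between-subject summary: SD of the slopes, and SE of the mean slope
slope_sd = slopes.std(ddof=1)
slope_se = slope_sd / np.sqrt(n_subjects)
```

Here `slope_sd` describes the spread of the subject-level slopes and `slope_se` the uncertainty of their mean. Note that this between-subject summary is a single number per coefficient, whereas the hyperbolic band from the formula above describes the uncertainty of one subject's fitted line as a function of $x$; these are different quantities, so averaging the per-subject standard errors would indeed answer a different question.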



#StackBounty: #regression #mixed-model #linear-model Appropriate linear mixed-effects model for hierarchical data?

Bounty: 50

We have some hypothetical data for the abundance of a protein that accumulates in cells with age.

The goal is to run a linear mixed model to test how much of the variation in the rate at which cells accumulate this protein is explained by differences in lifespan between species. For example, do the cells of long-lived elephants accumulate the protein at a slower rate than those of short-lived mice?

For multiple animals of different ages we have measured the abundance of the protein from multiple independent cells. I’ve run a basic linear mixed model but am not sure how to correctly incorporate the nested hierarchical nature of the data (i.e. that we have cells sampled from animals that are sampled from different species).

The data:

Protein_count   Sample  Age_years   Species Lifespan_years  Protein_count_per_year
55  mouse_1 0.5 mouse   2   110
60  mouse_1 0.5 mouse   2   120
50  mouse_1 0.5 mouse   2   100
49  mouse_1 0.5 mouse   2   98
101 mouse_2 1   mouse   2   101
110 mouse_2 1   mouse   2   110
98  mouse_2 1   mouse   2   98
104 mouse_2 1   mouse   2   104
206 mouse_3 2   mouse   2   103
190 mouse_3 2   mouse   2   95
230 mouse_3 2   mouse   2   115
212 mouse_3 2   mouse   2   106
80  mink_1  4   mink    8   20
88  mink_1  4   mink    8   22
90  mink_1  4   mink    8   22.5
108 mink_1  4   mink    8   27
180 mink_2  7   mink    8   25.71428571
184 mink_2  7   mink    8   26.28571429
175 mink_2  7   mink    8   25
38  tiger_1 5   tiger   20  7.6
44  tiger_1 5   tiger   20  8.8
55  tiger_1 5   tiger   20  11
50  tiger_1 5   tiger   20  10
163 tiger_2 15  tiger   20  10.86666667
145 tiger_2 15  tiger   20  9.666666667
155 tiger_2 15  tiger   20  10.33333333
152 tiger_2 15  tiger   20  10.13333333
89  elephant_1  30  elephant    60  2.966666667
97  elephant_1  30  elephant    60  3.233333333
112 elephant_1  30  elephant    60  3.733333333
104 elephant_1  30  elephant    60  3.466666667
204 elephant_2  58  elephant    60  3.517241379
187 elephant_2  58  elephant    60  3.224137931
199 elephant_2  58  elephant    60  3.431034483
184 elephant_2  58  elephant    60  3.172413793
180 elephant_2  58  elephant    60  3.103448276

Definition of headers:

Protein_count = abundance of the protein measured in a cell
Sample = ID of animal from which cells are taken
Age_years = age of the sampled animal
Species = the species of the animal
Lifespan_years = lifespan of the species
Protein_count_per_year = protein abundance adjusted by sample age to give a rate of protein accumulation per year (Protein_count_per_year = Protein_count/Age_years)

I’ve done a simple linear mixed-effect model like this:

mixed.lmer <- lmer(Protein_count_per_year ~ Lifespan_years + (1|Species), data = species_data)

However I think I should be taking a few more things into account:

1) There are hierarchical relationships in the data. For example there is hierarchical nesting between sample and species. Should I be doing something like this to take that into account:

  + (1|Species/Sample)

or

+ (1|Species) + (1|Species:Sample)

I’m not sure which formulation is correct.

2) We think that the protein abundance starts at 0 when each animal is born and increases linearly with age, but we want to know if the rate of accumulation varies with species and is explained by lifespan. So I think we would want to fix the intercept at 0 and add a random slope?

Please let me know if I can be more clear in my question. I really appreciate any feedback or suggestions. Thanks in advance!



#StackBounty: #regression #hypothesis-testing #multiple-regression #inference #linear-model Working with Covid-19 Dataset in R, Checkin…

Bounty: 100

I'm working on a dataset of individuals with Covid-19 infection. The dataset consists of 2010 individuals and four columns:

  • deceased: if the person died of corona (1:yes, 0:no)
  • sex: male/female
  • age: age of the person (ranging from 2 to 99 years old)
  • country: which country the person is from (France, Japan, Korea, Indonesia)

The dataset can be loaded with the following code:

id = "1CA1RPRYqU9oTIaHfSroitnWrI6WpUeBw"
d.corona = read.csv(sprintf("https://docs.google.com/uc?id=%s&export=download",id),header = T)

I'm trying to answer the following questions:

  • Is age a greater risk factor for males than for females?
  • Is age a greater risk factor for the French population than for the Korean population?
  • It seems like the French population is at much higher risk of dying from Covid-19 than the other countries (just from inspecting data). Does this seem like a valid result, or could it be influenced by the way the data were collected?

I was thinking it would be a good idea to fit some appropriate models to answer the questions. I have already fit a logistic regression model with glm, but I don't see how I could use that to answer these specific questions. For the last question, I believe the mean age in France is significantly higher than in the other countries, which would explain why the probability of dying from the virus seems higher in France based on the data. But I'm sure there could be a better explanation.
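The first two questions map naturally onto interaction terms in the logistic model (age x sex and age x country). Here is a minimal pure-numpy illustration on simulated data (not the linked dataset, and all coefficients below are invented) fitting a logistic regression with an age x sex interaction by Newton-Raphson:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000
age = rng.uniform(20, 80, n)
male = rng.integers(0, 2, n).astype(float)

# simulated truth: age matters more for males (positive interaction)
eta = -6.0 + 0.05 * age + 0.3 * male + 0.02 * age * male
p = 1 / (1 + np.exp(-eta))
died = rng.binomial(1, p).astype(float)

# design matrix with the interaction term
X = np.column_stack([np.ones(n), age, male, age * male])

# logistic regression via Newton-Raphson (IRLS)
beta = np.zeros(4)
for _ in range(25):
    mu = 1 / (1 + np.exp(-X @ beta))
    W = mu * (1 - mu)
    grad = X.T @ (died - mu)          # score
    H = (X * W[:, None]).T @ X        # observed information
    beta += np.linalg.solve(H, grad)

interaction = beta[3]   # estimated age:male coefficient
```

A significantly positive age x male coefficient would support "age is a greater risk factor for males"; the same logic applies with country dummies for the French vs. Korean comparison. For the third question, comparing the age distributions between countries, or adding age as a covariate and seeing whether the France effect shrinks, is the natural way to check whether the apparent French excess risk is an artifact of an older sample.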



#StackBounty: #regression #hypothesis-testing #multiple-regression #estimation #linear-model Multiple Linear Regression Coefficient Est…

Bounty: 100

A multiple linear regression model is considered. It is assumed that $$ Y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \epsilon_i$$ where the $\epsilon_i$ are independent and have the same normal distribution with zero expectation and unknown variance $\sigma^2$. 100 measurements are made, i.e. $i = 1, 2, \ldots, 100$. The explanatory variables take the following values: $x_{i1} = 2$ for $1 \leq i \leq 25$ and $0$ otherwise, $x_{i2} = \sqrt{2}$ for $26 \leq i \leq 75$ and $0$ otherwise, $x_{i3} = 2$ for $76 \leq i \leq 100$ and $0$ otherwise.

a) Let $\hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3$ be the least squares estimators of $\beta_1, \beta_2, \beta_3$. Prove that in the considered case $\hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3$ are independent, and $$\operatorname{Var}(\hat{\beta}_1) = \operatorname{Var}(\hat{\beta}_2) = \operatorname{Var}(\hat{\beta}_3).$$ Do these properties hold in the general case? If not, give counterexamples.

b) Perform a test of $$H_0: \beta_1 + \beta_3 = 2\beta_2$$ vs. $$H_1: \beta_1 + \beta_3 \neq 2\beta_2.$$ The significance level is 0.05. The least squares estimates of $\beta_1, \beta_2$ and $\beta_3$ are $0.9812, 1.8851$ and $3.4406$, respectively. The unbiased estimate of the variance $\sigma^2$ is $3.27$.

For a) I know the OLS estimator is $\hat{\beta} = (X^TX)^{-1}X^Ty$, with $\operatorname{Var}(\hat{\beta}) = \sigma^2 (X^TX)^{-1}$. But I don't know how to obtain explicit expressions for each of the coefficients from this. It seems quite clear that the estimators are independent, for instance $P(\hat{\beta}_3 = \beta_3, \hat{\beta}_1 = 0, \hat{\beta}_2 = 0) = P(\hat{\beta}_3 = \beta_3)$, but I don't know how to write a proper proof. I believe the estimators are generally dependent and have unequal variances, but I can't come up with any particular counterexamples.

For b), I'm not sure which test statistic to use (t or F) or how to set it up. I also don't know the standard errors of the coefficients.
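Not a full solution, but a numeric sketch of part b) under the standard result for this design: the three columns of $X$ have disjoint support and each has squared norm $100$, so $X^TX = 100 I_3$, the estimators are independent (jointly normal with diagonal covariance), each with variance $\sigma^2/100$, and $H_0$ is the linear hypothesis $c^T\beta = 0$ with $c = (1, -2, 1)^T$, testable with a t-test on $100 - 3 = 97$ degrees of freedom:

```python
import numpy as np
from scipy import stats

beta_hat = np.array([0.9812, 1.8851, 3.4406])
s2 = 3.27                       # unbiased estimate of sigma^2
c = np.array([1.0, -2.0, 1.0])  # H0: c' beta = 0
var_beta = s2 / 100             # since (X'X)^{-1} = I/100 in this design

estimate = c @ beta_hat                    # beta1_hat + beta3_hat - 2*beta2_hat
se = np.sqrt(var_beta * np.sum(c ** 2))    # sqrt(6 * s2 / 100)
t_stat = estimate / se                     # about 1.47
t_crit = stats.t.ppf(0.975, df=97)         # two-sided, alpha = 0.05
reject = abs(t_stat) > t_crit              # False: H0 is not rejected
```

So at the 5% level $H_0$ is not rejected. For the general-case part of a), independence fails as soon as the columns of $X$ overlap (making $(X^TX)^{-1}$ non-diagonal), which supplies the counterexample.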

