#StackBounty: #r #lme4-nlme #panel-data #cross-section Combining cross-sectional data with panel data

Bounty: 50

Let’s say I want to regress crop yield as a function of rainfall and temperature.

I collected data across 20 locations and 10 years and constructed a model in R regressing yield on rainfall and temperature, with location and year as random effects.

  mdl <- lmer(yield ~ rainfall + temperature + (1|location) + (1|year))

However, I know irrigation is quite important for crop yield, yet I have irrigation data for only one year across all 20 locations. So I can construct a model for that single year using this:

mdl.yld <- lmer(yield ~ rainfall + temperature + irrigation + (1|location)) # for the single year which has irrigation data

Can I take the regression coefficient of irrigation from mdl.yld and add it to the regression equation in mdl, so that I have an equation that predicts yield as a function of rainfall, temperature, and irrigation?
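One hedged sketch of what such a combined equation could look like (a hypothetical illustration, not established practice: `dat` and `dat_year1` are placeholder data frames, and treating the borrowed coefficient as known ignores its estimation uncertainty):

```r
library(lme4)

## Hypothetical sketch; 'dat' (all years) and 'dat_year1' (the single year
## with irrigation data) are placeholder data frames, not objects from the
## question.
mdl     <- lmer(yield ~ rainfall + temperature + (1 | location) + (1 | year),
                data = dat)
mdl.yld <- lmer(yield ~ rainfall + temperature + irrigation + (1 | location),
                data = dat_year1)

b     <- fixef(mdl)                    # intercept, rainfall, temperature
b_irr <- fixef(mdl.yld)["irrigation"]  # borrowed irrigation coefficient

## Combined population-level prediction equation (random effects set to 0).
## Caveat: b_irr is treated as known, and mdl's coefficients were estimated
## without adjusting for irrigation, so this is only a rough approximation.
predict_yield <- function(rainfall, temperature, irrigation) {
  unname(b["(Intercept)"] + b["rainfall"] * rainfall +
           b["temperature"] * temperature + b_irr * irrigation)
}
```

An alternative, if irrigation values were available for all years, would be an offset term such as `offset(b_irr * irrigation)` inside the lmer formula, which likewise fixes the irrigation coefficient at the borrowed value.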


Get this bounty!!!

#StackBounty: #html #r #knitr #bookdown #bibliography Gitbook chapter bibliography not in alphabetical order

Bounty: 50

I’m using bookdown to create an HTML gitbook from R markdown files (i.e. .Rmd), with the default option of split_bib = TRUE resulting in a bibliography at the end of each chapter, as well as a complete bibliography at the end of the book.

The end-of-book bibliography is in alphabetical order, but the end-of-chapter bibliographies are not. (Here’s an example).

How can I arrange all reference lists alphabetically?
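One thing worth trying (an assumption on my part, not a confirmed fix for the gitbook split bibliographies: `apa.csl` is a placeholder for any CSL style whose bibliography sorting is alphabetical) is to let pandoc's citeproc control the ordering via a CSL file declared in the index.Rmd YAML header:

```yaml
# index.Rmd YAML header (sketch; file names are placeholders)
bibliography: book.bib
csl: apa.csl   # CSL styles carry their own bibliography sort rules
```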



#StackBounty: #r #save #environment #glm #rdata Saving a single object within a function in R: RData file size is very large

Bounty: 50

I am trying to save trimmed-down GLM objects in R (i.e. with all the “non-essential” characteristics set to NULL e.g. residuals, prior.weights, qr$qr).

As an example, looking at the smallest object that I need to do this with:

print(object.size(glmObject))
168992 bytes
save(glmObject, file = "FileName.RData")

Assigning this object in the global environment and saving leads to an RData file of about 6KB.

However, I effectively need to create and save the glm object within a function, which is in itself within a function. So the code looks something like:

subFn <- function(DT, otherArg, ...){
                 glmObject <- glm(...)
                 save(glmObject, file = "FileName.RData")
}

mainFn <- function(DT, ...){ 
             subFn(DT, otherArg, ...)
}

mainFn(DT, ...)

This leads to much, much larger RData files of about 20 MB, despite the object itself being the same size.

So I understand this to be an environment issue, but I’m struggling to pinpoint exactly how and why it’s happening. The resulting file size seems to vary quite a lot. I have tried using saveRDS, and equally I have tried assigning the glmObject via <<- to make it global, but nothing seems to help.

My understanding of environments in R clearly isn’t very good, and I would really appreciate it if anyone could suggest a way around this. Thanks.
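A plausible explanation (an assumption about the cause, since the poster's full code isn't shown) is that formula and terms objects keep a reference to the environment they were created in, and save()/saveRDS() serialise everything reachable from that environment, including large local variables of the enclosing functions. A self-contained sketch of the workaround:

```r
## Sketch: a glm fitted inside a function captures that function's frame via
## the formula/terms environments, dragging local objects into the file.
## Resetting those environments before saving keeps the file small.
subFn <- function(n) {
  bigJunk <- rnorm(1e6)  # large local object we do NOT want in the file
  d <- data.frame(y = rbinom(n, 1, 0.5), x = rnorm(n))
  glmObject <- glm(y ~ x, family = binomial, data = d)

  saveRDS(glmObject, "glm_with_env.rds")    # drags bigJunk along

  ## Point the captured environments at something small instead:
  environment(glmObject$formula) <- globalenv()
  attr(glmObject$terms, ".Environment") <- globalenv()
  attr(attr(glmObject$model, "terms"), ".Environment") <- globalenv()
  saveRDS(glmObject, "glm_clean_env.rds")   # much smaller
}
subFn(100)
file.size("glm_with_env.rds") > file.size("glm_clean_env.rds")  # TRUE
```

The same idea applies to save(): it is the environments reachable from the object, not the calling convention (`<<-`, nesting depth), that determine the file size.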



#StackBounty: #r #mixed-model #lme4-nlme #simulation Simulating linear mixed model data for factorial design with 3 levels

Bounty: 50

I struggle to simulate linear mixed model data for a within-subject within-item factorial design (with crossed participant and item effects) where the categorical independent variable has three levels.

The default parameters correspond to the estimates I got from an analysis of previous data.

As I am using the mvrnorm() function with empirical = TRUE I expected that the estimated parameters from lmer and my specified parameters for the simulation would match closely.

However, they differ more than I expected:

  • effect1: should be 12, estimated as 13.771 (Fixed effects table: contrast1 in summary below)
  • item_cor_s2_i: should be 0.3, estimated as 0.11 (Random effects table: Corr(item, contrast2) in summary below)
  • item_cor_s1_s2: should be -0.7, estimated as -0.88 (Random effects table: Corr(contrast1, contrast2) in summary below)

Since this is my first time programming a simulation for mixed models, I’m not sure if the (for me) unexpected results are caused by a programming error or if the default parameters are difficult to estimate for lmer.

I would be grateful if somebody more experienced in mixed model simulations could comment on this.

Running this code …

library("MASS")
library("lme4")    
set.seed(123)

simulate_data <- function(n_items = 60, n_subj = 27, n_subj_item_rep = 30,
                          mu = 308, 
                          contrast1 = c(2/3, -1/3, -1/3), contrast2 = c(-1/3, 2/3, -1/3), 
                          effect1 = 12, effect2 = -4.5, 
                          subj_interc_sd = 43, 
                          subj_slope1_sd = 17, 
                          subj_slope2_sd = 112, 
                          subj_cor_s1_i = 0.4,
                          subj_cor_s2_i = -0.65,
                          subj_cor_s1_s2 = -0.55,
                          item_interc_sd = 18, 
                          item_slope1_sd = 6, 
                          item_slope2_sd = 6, 
                          item_cor_s1_i = 0.4,
                          item_cor_s2_i = 0.3,
                          item_cor_s1_s2 = -0.7,
                          subj_item_interc_sd = 20,
                          residual_sd = 180) {

  # generate design
  dat <- expand.grid(item = seq_len(n_items), subject = seq_len(n_subj))
  dat$item_group <- dat$item %% 3 + 1    
  dat$subject_group <- dat$subject %% 3 + 1
  dat$condition <- (dat$item_group + dat$subject_group) %% 3 + 1

  # generate appropriate contrasts
  # R requires the generalized inverse of the contrast matrix
  contrast_mat <- ginv(rbind(contrast1, contrast2))
  dat$contrast1 <- contrast_mat[, 1][dat$condition]
  dat$contrast2 <- contrast_mat[, 2][dat$condition]

  # fixed grand mean
  dat$mean_response <- mu

  # condition effects
  dat$contrast_effect1 <- effect1 * dat$contrast1
  dat$contrast_effect2 <- effect2 * dat$contrast2

  # random effects subjects: intercept, slope1, slope2
  subject_effect <- mvrnorm(n_subj, 
                            mu = c(0, 0, 0), 
                            Sigma = matrix(c(subj_interc_sd^2, 
                                             subj_cor_s1_i * subj_slope1_sd * subj_interc_sd,
                                             subj_cor_s2_i * subj_slope2_sd * subj_interc_sd,
                                             subj_cor_s1_i * subj_slope1_sd * subj_interc_sd,
                                             subj_slope1_sd^2,
                                             subj_cor_s1_s2 * subj_slope1_sd * subj_slope2_sd,
                                             subj_cor_s2_i * subj_slope2_sd * subj_interc_sd,
                                             subj_cor_s1_s2 * subj_slope1_sd * subj_slope2_sd,
                                             subj_slope2_sd^2), nrow = 3), 
                            empirical = TRUE)  # mu & sigma as empirical (not population!) parameters for testing

  dat$subject_intercept <- subject_effect[dat$subject, 1] 
  dat$subject_slope1 <- subject_effect[dat$subject, 2] * dat$contrast1
  dat$subject_slope2 <- subject_effect[dat$subject, 3] * dat$contrast2

  # random effects items: intercept, slope1, slope2
  item_effect <- mvrnorm(n_items, 
                         mu = c(0, 0, 0), 
                         Sigma = matrix(c(item_interc_sd^2, 
                                          item_cor_s1_i * item_slope1_sd * item_interc_sd,
                                          item_cor_s2_i * item_slope2_sd * item_interc_sd,
                                          item_cor_s1_i * item_slope1_sd * item_interc_sd,
                                          item_slope1_sd^2,
                                          item_cor_s1_s2 * item_slope1_sd * item_slope2_sd,
                                          item_cor_s2_i * item_slope2_sd * item_interc_sd,
                                          item_cor_s1_s2 * item_slope1_sd * item_slope2_sd,
                                          item_slope2_sd^2), nrow = 3), 
                         empirical = TRUE)

  dat$item_intercept <- item_effect[dat$item, 1] 
  dat$item_slope1 <- item_effect[dat$item, 2] * dat$contrast1
  dat$item_slope2 <- item_effect[dat$item, 3] * dat$contrast2

  # random effects subject-item combinations
  dat$subj_item_intercept <- 0 
  if (n_subj_item_rep > 1) {  # if more than 1 observations of the same subject-item combination
    dat$subj_item_intercept <- mvrnorm(n_subj * n_items, mu = 0, Sigma = subj_item_interc_sd^2, empirical = TRUE)[, 1]
  }

  # generate subject-item replications according to number of observations of the same subject-item combination
  dat <- do.call("rbind", replicate(n_subj_item_rep, dat, simplify = FALSE))

  # residual variation
  dat$residual_sd <- mvrnorm(nrow(dat), mu = 0, Sigma = residual_sd^2, empirical = TRUE)[, 1] 

  # calculate response
  dat$response <- dat$mean_response + dat$contrast_effect1 + dat$contrast_effect2 + 
    dat$item_intercept + dat$item_slope1 + dat$item_slope2 + 
    dat$subject_intercept + dat$subject_slope1 + dat$subject_slope2 + 
    dat$subj_item_intercept +
    dat$residual_sd

  # convert categorical variables to factors
  dat[, 1:5] <- lapply(dat[, 1:5], factor)

  # set specified contrasts
  contrasts(dat$condition) <- contrast_mat

  dat
}

d <- simulate_data()

# if more than 1 observations of the same subject-item combination ..
summary(lmer(response ~ contrast1 + contrast2 + (contrast1 + contrast2|item) +
               (contrast1 + contrast2|subject) + (1|subject:item), d))
# .. otherwise
# summary(lmer(response ~ contrast1 + contrast2 + (contrast1 + contrast2|item) +
#                (contrast1 + contrast2|subject), d))

… I get the following output

Linear mixed model fit by REML ['lmerMod']
Formula: response ~ contrast1 + contrast2 + (contrast1 + contrast2 | item) +  
    (contrast1 + contrast2 | subject) + (1 | subject:item)
   Data: d

REML criterion at convergence: 643678

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-4.1695 -0.6709 -0.0006  0.6728  4.2132 

Random effects:
 Groups       Name        Variance Std.Dev. Corr       
 subject:item (Intercept)   423.44  20.578             
 item         (Intercept)   337.69  18.376             
              contrast1      36.02   6.002   0.38      
              contrast2      66.85   8.176   0.11 -0.88
 subject      (Intercept)  1819.74  42.658             
              contrast1     270.89  16.459   0.45      
              contrast2   12677.59 112.595  -0.65 -0.60
 Residual                 32431.73 180.088             
Number of obs: 48600, groups:  subject:item, 1620; item, 60; subject, 27

Fixed effects:
            Estimate Std. Error t value
(Intercept)  308.000      8.600   35.82
contrast1     13.771      3.534    3.90
contrast2     -4.708     21.737   -0.22

Correlation of Fixed Effects:
          (Intr) cntrs1
contrast1  0.410       
contrast2 -0.619 -0.557
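Before digging into the model, it may help to separate generation error from estimation error. With empirical = TRUE, mvrnorm guarantees the sample moments of the draws themselves, which can be checked in isolation (a sketch, using the same function as the simulation above):

```r
library(MASS)
set.seed(123)

## empirical = TRUE makes the SAMPLE covariance of the draws equal Sigma
## exactly, so the generated random effects always carry the requested
## correlations; any discrepancy in the lmer summary must therefore come
## from the estimation step (shrinkage, the crossed design, residual noise),
## not from the random-effect generation.
Sigma <- matrix(c(1, -0.7, -0.7, 1), nrow = 2)
x <- mvrnorm(60, mu = c(0, 0), Sigma = Sigma, empirical = TRUE)
cor(x)[1, 2]  # exactly -0.7, up to floating point
```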



#StackBounty: #r #time-series #model #bioinformatics How can I analyze multi-dimensional (concentration, time, …) dose-response data?

Bounty: 100

I have data where different compounds were applied to cells and a numerical readout was measured.

  • Each compound was used at multiple concentrations
  • The readout was measured at multiple points in time.
  • Compound-dose combinations were applied to two types of cells.

The goal is to identify compounds which affect the cell types in very different ways.

My current approach is to use the R package drc to fit 3-parameter log-logistic models to estimate an EC50 for each (cell type, time point) combination. Afterward, I use ad hoc methods like examining differences of EC50s to identify compounds which affect the two cell types differently.

Is there a way to fit a dose-response model which accounts for the time variable as well, or ideally both the time variable and the cell type variable? A solution in R is preferred.
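One direction to sketch (hypothetical column names response, dose, celltype, time; `d` is a placeholder data frame, and LL.3 follows the question's current approach): drc's curveid argument fits all (cell type, time point) curves in a single model, so EC50s can then be compared with model-based tests rather than ad hoc differences:

```r
library(drc)

## 'd' and its columns are placeholders for the poster's data.
d$group <- interaction(d$celltype, d$time)

## One joint 3-parameter log-logistic fit, one curve per (cell type, time):
fit <- drm(response ~ dose, curveid = group, data = d, fct = LL.3())

## EDcomp() tests ratios of EC50s between curves within the one model,
## e.g. cellA at t1 vs cellB at t1 (labels depend on the factor levels).
EDcomp(fit, percVec = c(50, 50))
```

This still treats time as a grouping factor rather than a continuous covariate; a fully joint model in time would need something like nlme with the log-logistic parameters written as functions of time.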



#StackBounty: #r #residuals #prediction-interval #lm Is the variation in the residual standard deviation (on sample) accounted for when…

Bounty: 50

This question is somewhat related to: Is the residual, $e$, an estimator of the error, $\epsilon$?

I also found some information here: Confidence interval of RMSE

Let’s say, I got a model that explains sampled X by mean(X).

Using suggestions from Mr. @whuber I reproduced calculations of the PI from predict.lm in R.

## lm PI

x_dat <- data.frame(x = rnorm(100, 0, 1))

lm_model <- lm(data = x_dat, x ~ 1)

summary(lm_model)

lm_preds <- predict.lm(lm_model
                       , x_dat
                       , interval = "prediction"
                       , level = 0.95
                       )

beta.hat <- lm_model$coefficients

# var-covariance matrix

V <- vcov(lm_model)

# mean prediction

Xp <- model.matrix(~ 1, x_dat)
pred <- as.numeric(Xp %*% beta.hat)[1]

# mean prediction variance:

pred_var <- unname(rowSums((Xp %*% V) * Xp))[1]

# confidence

t_stat <- qt((1 - 0.95)/2, df = lm_model$df.residual)

# residual MSE

res_mse <- sum(lm_model$residuals ^ 2) / lm_model$df.residual

# PI

PI <- pred + c(t_stat, -t_stat) * sqrt(pred_var + res_mse)

print(PI)

print(lm_preds[1, ])

> print(PI)
[1] -2.043175  2.572508
> 
> print(lm_preds[1, ])
       fit        lwr        upr 
 0.2646666 -2.0431747  2.5725079 

I only have 2 questions.

Is it right that we assume the true model error variance equals the sample residual variance in order to make an unbiased estimate?

Given that we don’t know the true error variance, does it mean we make a biased estimate of the PI, in particular by not adjusting for the additional dispersion of the residual variance? If so, can we assume that the residual variance follows a (scaled) chi-square distribution in order to get an upper quantile that can be supplied to predict.lm for an exact calculation?

predict.lm(...,
        pred.var = 
)
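As a concrete sketch of that chi-square idea (an illustration, not a recommendation): under normal errors, df·s²/σ² follows a chi-square distribution with df degrees of freedom, so a one-sided upper confidence bound on σ² can be built and supplied via pred.var:

```r
set.seed(1)
x_dat <- data.frame(x = rnorm(100, 0, 1))
lm_model <- lm(x ~ 1, data = x_dat)

## Under normal errors, df * s^2 / sigma^2 ~ chi-square(df), so a one-sided
## upper 95% confidence bound on the error variance sigma^2 is:
df <- lm_model$df.residual
s2 <- sum(residuals(lm_model)^2) / df
sigma2_upper <- df * s2 / qchisq(0.05, df)  # lower chi-sq quantile -> upper bound

## Supplying the inflated variance via pred.var widens the interval:
pi_plain <- predict(lm_model, x_dat, interval = "prediction")[1, ]
pi_wide  <- predict(lm_model, x_dat, interval = "prediction",
                    pred.var = sigma2_upper)[1, ]
unname(pi_wide["upr"] - pi_wide["lwr"]) >
  unname(pi_plain["upr"] - pi_plain["lwr"])  # TRUE
```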



#StackBounty: #r #hypothesis-testing #chi-squared #garch #restrictions Restriction test (H0: alpha1+beta1 = 1, H1:alpha1 + beta1 ≠ 1) o…

Bounty: 50

I am trying to perform a restriction test for a GARCH model (fitted with ugarchfit from the ‘rugarch’ package) using the following hypotheses:

 H0: alpha1 + beta1 = 1

 H1: alpha1 + beta1 ≠ 1 

So I am trying to follow the advice from
Testing the sum of GARCH(1,1) parameters

1. Specify the unrestricted model using ugarchspec with option variance.model = list(model = "sGARCH") and estimate it using ugarchfit. Obtain the log-likelihood from the slot fit, sub-slot likelihood.

2. Specify the restricted model using ugarchspec with option variance.model = list(model = "iGARCH") and estimate it using ugarchfit. Obtain the log-likelihood as above.

3. Calculate LR = 2*(log-likelihood of unrestricted model − log-likelihood of restricted model) and obtain the p-value as pchisq(q = LR, df = 1, lower.tail = FALSE).
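These three steps can be sketched in plain R (the rugarch fits themselves are omitted; the two log-likelihood values below are placeholders standing in for likelihood(fit) of each fitted model):

```r
## Placeholder log-likelihoods standing in for likelihood(fit_sGARCH) and
## likelihood(fit_iGARCH) from rugarch; the numbers are hypothetical.
ll_unrestricted <- 5415.3  # sGARCH (unrestricted)
ll_restricted   <- 5412.1  # iGARCH (restricted: alpha1 + beta1 = 1)

LR <- 2 * (ll_unrestricted - ll_restricted)   # likelihood-ratio statistic
p  <- pchisq(LR, df = 1, lower.tail = FALSE)  # upper-tail chi-square p-value
c(LR = LR, p.value = p)
```

A valid test requires LR ≥ 0, i.e. the unrestricted model must attain at least the restricted model's likelihood; if it does not, the unrestricted optimisation has not converged to its maximum.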

I am using the following ‘sGARCH’ and ‘iGARCH’ models from the ‘rugarch’ package.

(A) sGARCH (unrestricted model):

 speccR = ugarchspec(variance.model=list(model = "sGARCH",garchOrder=c(1,1)),mean.model=list(armaOrder=c(0,0), include.mean=TRUE,archm = TRUE, archpow = 1))

 ugarchfit(speccR, data=data.matrix(P),fit.control = list(scale = 1))

And the following is this sGARCH output:

    *---------------------------------*
    *          GARCH Model Fit        *
    *---------------------------------*

    Conditional Variance Dynamics   
    -----------------------------------
    GARCH Model     : sGARCH(1,1)
    Mean Model      : ARFIMA(0,0,0)
    Distribution    : norm 

    Optimal Parameters
    ------------------------------------
            Estimate  Std. Error  t value Pr(>|t|)
    mu     -0.000355    0.001004 -0.35377 0.723508
    archm   0.096364    0.039646  2.43059 0.015074
    omega   0.000049    0.000010  4.91096 0.000001
    alpha1  0.289964    0.021866 13.26117 0.000000
    beta1   0.709036    0.023200 30.56156 0.000000

    Robust Standard Errors:
            Estimate  Std. Error  t value Pr(>|t|)
    mu     -0.000355    0.001580 -0.22482 0.822122
    archm   0.096364    0.056352  1.71002 0.087262
    omega   0.000049    0.000051  0.96346 0.335316
    alpha1  0.289964    0.078078  3.71375 0.000204
    beta1   0.709036    0.111629  6.35173 0.000000

    LogLikelihood : 5411.828 

    Information Criteria
    ------------------------------------

    Akaike       -3.9180
    Bayes        -3.9073
    Shibata      -3.9180
    Hannan-Quinn -3.9141

    Weighted Ljung-Box Test on Standardized Residuals
    ------------------------------------
                            statistic p-value
    Lag[1]                      233.2       0
    Lag[2*(p+q)+(p+q)-1][2]     239.1       0
    Lag[4*(p+q)+(p+q)-1][5]     247.4       0
    d.o.f=0
    H0 : No serial correlation

    Weighted Ljung-Box Test on Standardized Squared Residuals
    ------------------------------------
                            statistic p-value
    Lag[1]                      4.695 0.03025
    Lag[2*(p+q)+(p+q)-1][5]     5.941 0.09286
    Lag[4*(p+q)+(p+q)-1][9]     7.865 0.13694
    d.o.f=2

    Weighted ARCH LM Tests
    ------------------------------------
                Statistic Shape Scale P-Value
    ARCH Lag[3]     0.556 0.500 2.000  0.4559
    ARCH Lag[5]     1.911 1.440 1.667  0.4914
    ARCH Lag[7]     3.532 2.315 1.543  0.4190

    Nyblom stability test
    ------------------------------------
    Joint Statistic:  5.5144
    Individual Statistics:             
    mu     0.5318
    archm  0.4451
    omega  1.3455
    alpha1 4.1443
    beta1  2.2202

    Asymptotic Critical Values (10% 5% 1%)
    Joint Statistic:         1.28 1.47 1.88
    Individual Statistic:    0.35 0.47 0.75

    Sign Bias Test
    ------------------------------------
                       t-value   prob sig
    Sign Bias           0.2384 0.8116    
    Negative Sign Bias  1.1799 0.2381    
    Positive Sign Bias  1.1992 0.2305    
    Joint Effect        2.9540 0.3988    


    Adjusted Pearson Goodness-of-Fit Test:
    ------------------------------------
      group statistic p-value(g-1)
    1    20     272.1    9.968e-47
    2    30     296.9    3.281e-46
    3    40     313.3    1.529e-44
    4    50     337.4    1.091e-44


    Elapsed time : 0.4910491 

(B) iGARCH (restricted model):

 speccRR = ugarchspec(variance.model=list(model = "iGARCH",garchOrder=c(1,1)),mean.model=list(armaOrder=c(0,0), include.mean=TRUE,archm = TRUE, archpow = 1))

 ugarchfit(speccRR, data=data.matrix(P),fit.control = list(scale = 1))

However, in the output beta1 has NA for its standard error, t-value, and p-value.

The following is the iGARCH output:

    *---------------------------------*
    *          GARCH Model Fit        *
    *---------------------------------*

    Conditional Variance Dynamics   
    -----------------------------------
    GARCH Model     : iGARCH(1,1)
    Mean Model      : ARFIMA(0,0,0)
    Distribution    : norm 

    Optimal Parameters
    ------------------------------------
            Estimate  Std. Error  t value Pr(>|t|)
    mu     -0.000355    0.001001 -0.35485 0.722700
    archm   0.096303    0.039514  2.43718 0.014802
    omega   0.000049    0.000008  6.42826 0.000000
    alpha1  0.290304    0.021314 13.62022 0.000000
    beta1   0.709696          NA       NA       NA

    Robust Standard Errors:
            Estimate  Std. Error  t value Pr(>|t|)
    mu     -0.000355    0.001488  -0.2386 0.811415
    archm   0.096303    0.054471   1.7680 0.077066
    omega   0.000049    0.000032   1.5133 0.130215
    alpha1  0.290304    0.091133   3.1855 0.001445
    beta1   0.709696          NA       NA       NA

    LogLikelihood : 5412.268 

    Information Criteria
    ------------------------------------

    Akaike       -3.9190
    Bayes        -3.9105
    Shibata      -3.9190
    Hannan-Quinn -3.9159

    Weighted Ljung-Box Test on Standardized Residuals
    ------------------------------------
                            statistic p-value
    Lag[1]                      233.2       0
    Lag[2*(p+q)+(p+q)-1][2]     239.1       0
    Lag[4*(p+q)+(p+q)-1][5]     247.5       0
    d.o.f=0
    H0 : No serial correlation

    Weighted Ljung-Box Test on Standardized Squared Residuals
    ------------------------------------
                            statistic p-value
    Lag[1]                      4.674 0.03063
    Lag[2*(p+q)+(p+q)-1][5]     5.926 0.09364
    Lag[4*(p+q)+(p+q)-1][9]     7.860 0.13725
    d.o.f=2

    Weighted ARCH LM Tests
    ------------------------------------
                Statistic Shape Scale P-Value
    ARCH Lag[3]    0.5613 0.500 2.000  0.4538
    ARCH Lag[5]    1.9248 1.440 1.667  0.4881
    ARCH Lag[7]    3.5532 2.315 1.543  0.4156

    Nyblom stability test
    ------------------------------------
    Joint Statistic:  1.8138
    Individual Statistics:             
    mu     0.5301
    archm  0.4444
    omega  1.3355
    alpha1 0.4610

    Asymptotic Critical Values (10% 5% 1%)
    Joint Statistic:         1.07 1.24 1.6
    Individual Statistic:    0.35 0.47 0.75

    Sign Bias Test
    ------------------------------------
                       t-value   prob sig
    Sign Bias           0.2252 0.8218    
    Negative Sign Bias  1.1672 0.2432    
    Positive Sign Bias  1.1966 0.2316    
    Joint Effect        2.9091 0.4059    


    Adjusted Pearson Goodness-of-Fit Test:
    ------------------------------------
      group statistic p-value(g-1)
    1    20     273.4    5.443e-47
    2    30     300.4    6.873e-47
    3    40     313.7    1.312e-44
    4    50     337.0    1.275e-44


    Elapsed time : 0.365 

If I calculate the log-likelihood difference to derive the chi-square value as suggested, I get a negative value:

 2*(5411.828-5412.268)=-0.88

The log-likelihood of the restricted model ‘iGARCH’ is 5412.268, which is higher than the log-likelihood of the unrestricted model ‘sGARCH’ (5411.828); this should not happen.

The data I use are as follows, as a time series (I am only posting the first 100 observations due to space limits):

   Time      P
    1   0.454213593
    2   0.10713195
    3   -0.106819399
    4   -0.101610699
    5   -0.094327846
    6   -0.037176107
    7   -0.101550977
    8   -0.016309894
    9   -0.041889484
    10  0.103384357
    11  -0.011746377
    12  0.063304432
    13  0.059539249
    14  -0.049946177
    15  -0.023251656
    16  0.013989353
    17  -0.002815588
    18  -0.009678745
    19  -0.011139779
    20  0.031592303
    21  -0.02348106
    22  -0.007206591
    23  0.077422089
    24  0.064632768
    25  -0.003396734
    26  -0.025524166
    27  -0.026632474
    28  0.014614485
    29  -0.012380888
    30  -0.007463018
    31  0.022759969
    32  0.038667465
    33  -0.028619484
    34  -0.021995984
    35  -0.006162809
    36  -0.031187399
    37  0.022455611
    38  0.011419264
    39  -0.005700445
    40  -0.010106343
    41  -0.004310162
    42  0.00513715
    43  -0.00498106
    44  -0.021382251
    45  -0.000694252
    46  -0.033326085
    47  0.002596086
    48  0.011008057
    49  -0.004754233
    50  0.008969559
    51  -0.00354088
    52  -0.007213115
    53  -0.003101495
    54  0.005016228
    55  -0.010762641
    56  0.030770993
    57  -0.015636325
    58  0.000875417
    59  0.03975863
    60  -0.050207219
    61  0.011308261
    62  -0.021453315
    63  -0.003309127
    64  0.025687191
    65  0.009467306
    66  0.005519485
    67  -0.011473758
    68  0.00223934
    69  -0.000913651
    70  -0.003055385
    71  0.000974694
    72  0.000288611
    73  -0.002432251
    74  -0.0016975
    75  -0.001565034
    76  0.003332848
    77  -0.008007295
    78  -0.003086435
    79  -0.00160435
    80  0.005825885
    81  0.020078093
    82  0.018055453
    83  0.181098137
    84  0.102698818
    85  0.128786594
    86  -0.013587077
    87  -0.038429879
    88  0.043637258
    89  0.042741709
    90  0.016384872
    91  0.000216317
    92  0.009275681
    93  -0.008595197
    94  -0.016323335
    95  -0.024083247
    96  0.035922206
    97  0.034863621
    98  0.032401779
    99  0.126333922
    100 0.054751935

How can I fix this problem so that I can perform the restriction test of my H0 against H1?



#StackBounty: #r #residuals #lm Is the variation in the residual standard deviation (on sample) accounted for when one builds a predict…

Bounty: 50

This question is somewhat related to: Is the residual, $e$, an estimator of the error, $\epsilon$?

I also found some information here: Confidence interval of RMSE

Let’s say, I got a model that explains y by the mean of y:

> x <- data.table(x = rnorm(100, 0, 1), y = rnorm(100, 0, 1)); summary(lm(y ~ 1, data = x))

Call:
lm(formula = y ~ 1, data = x)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.41639 -0.84908  0.05192  0.79689  2.82043 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.05278    0.10635   0.496    0.621

Residual standard error: 1.063 on 99 degrees of freedom

That is how I get the so-called residual standard error (a better term would be the sample standard deviation of the residuals, or RMSE):

> sqrt(sum((x[, y] - x[, mean(y)]) ^ 2) / 99)
[1] 1.06346

That is how I get the standard error of the sample mean:

> sd(x[, y]) / sqrt(nrow(x))
[1] 0.106346

So now what I don’t completely understand is whether the estimation of the prediction confidence interval for y-hat includes all three things:

  • variation in the sample mean
  • residual standard deviation (as an estimate of model error standard deviation)
  • variation in the residuals’ standard deviation (is the residual variance distributed following a chi-square?)

    lm_model <- lm(y ~ 1, data = x)
    predict.lm(lm_model, newdata = x, interval = "prediction")[1]
    [1] 0.05277586
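The decomposition can be checked directly by reconstructing predict.lm's interval by hand (a sketch using a plain data.frame rather than data.table): the interval contains the variance of the sample mean and the residual variance, but no third term for the dispersion of the residual variance itself.

```r
set.seed(42)
x <- data.frame(y = rnorm(100, 0, 1))
lm_model <- lm(y ~ 1, data = x)

n  <- nrow(x)
s2 <- sum(residuals(lm_model)^2) / lm_model$df.residual  # residual variance
se_mean2 <- s2 / n                                       # variance of the sample mean

## predict.lm's PI is fit +/- t * sqrt(var(fit) + residual variance);
## note there is no term for the uncertainty of s2 itself.
tq <- qt(0.975, lm_model$df.residual)
manual <- coef(lm_model)[1] + c(-1, 1) * tq * sqrt(se_mean2 + s2)

builtin <- predict(lm_model, newdata = x[1, , drop = FALSE],
                   interval = "prediction")
all.equal(unname(manual), unname(builtin[1, c("lwr", "upr")]))  # TRUE
```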



#StackBounty: #color #listings #minted #r #knitr colour R code to match knitr theme using listings, minted, or other

Bounty: 50

With the listings package, and with the help of this answer and this answer, I can include R code in my LaTeX document that looks almost like what knitr produces natively.

However, as I am combining a LaTeX document (.tex) with a document made by knitr (a .Rnw document), both containing some R code, I would like to match knitr's style exactly.

Is there a way to do that, i.e. to adopt knitr's code style definitions exactly? Possibly using minted or some other package? Below is an illustration of the issue using listings.

Here’s my style emulating knitr manually:

(screenshot: what I have)

and here’s how the same code looks made by knitr:

(screenshot: knitr output)

My question is whether there’s an automated way to tweak my \lstset{} definitions to match knitr exactly, or whether I can extract the style from knitr somehow?

Here’s the code used for my style,

\documentclass[11pt, oneside]{article}
\usepackage{geometry}
\geometry{verbose,tmargin=2.5cm,bmargin=2.5cm,lmargin=2.5cm,rmargin=2.5cm}
\usepackage{listings} % to include R code with \begin{lstlisting}[language=R] etc.
\usepackage[usenames,dvipsnames]{xcolor}
\definecolor{backgroundCol}{rgb}{.97, .97, .97}
\definecolor{commentstyleCol}{rgb}{0.678,0.584,0.686}
\definecolor{keywordstyleCol}{rgb}{0.737,0.353,0.396}
\definecolor{stringstyleCol}{rgb}{0.192,0.494,0.8}
\definecolor{NumCol}{rgb}{0.686,0.059,0.569}
\lstset{
  language=R,                     % the language of the code
  basicstyle=\small\ttfamily,     % the size of the fonts that are used for the code
  %numbers=left,                  % where to put the line-numbers
  %numberstyle=\color{green},     % the style that is used for the line-numbers
  stepnumber=1,                   % the step between two line-numbers. If it is 1, each line
                                  % will be numbered
  numbersep=5pt,                  % how far the line-numbers are from the code
  backgroundcolor=\color{backgroundCol}, % choose the background color. You must add \usepackage{color}
  showspaces=false,               % show spaces adding particular underscores
  showstringspaces=false,         % underline spaces within strings
  showtabs=false,                 % show tabs within strings adding particular underscores
  %frame=single,                  % adds a frame around the code
  %rulecolor=\color{white},       % if not set, the frame-color may be changed on line-breaks within not-black text (e.g. comments (green here))
  tabsize=2,                      % sets default tabsize to 2 spaces
  captionpos=b,                   % sets the caption-position to bottom
  breaklines=true,                % sets automatic line breaking
  breakatwhitespace=false,        % sets if automatic breaks should only happen at whitespace
  keywordstyle=\color{keywordstyleCol},  % keyword style
  commentstyle=\color{commentstyleCol},  % comment style
  stringstyle=\color{stringstyleCol},    % string literal style
  literate=%
   *{0}{{{\color{NumCol}0}}}1
    {1}{{{\color{NumCol}1}}}1
    {2}{{{\color{NumCol}2}}}1
    {3}{{{\color{NumCol}3}}}1
    {4}{{{\color{NumCol}4}}}1
    {5}{{{\color{NumCol}5}}}1
    {6}{{{\color{NumCol}6}}}1
    {7}{{{\color{NumCol}7}}}1
    {8}{{{\color{NumCol}8}}}1
    {9}{{{\color{NumCol}9}}}1
}

\begin{document}
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis cursus elementum massa, vitae fermentum dui blandit in.
\begin{lstlisting}[language=R]
require(graphics)

## Annette Dobson (1990) "An Introduction to Generalized Linear Models".
## Page 9: Plant Weight Data.
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
lm.D9 <- lm(weight ~ group)
lm.D90 <- lm(weight ~ group - 1) # omitting intercept
\end{lstlisting}
Morbi ipsum neque, auctor sit amet malesuada a, malesuada sit amet odio. Proin imperdiet ipsum ligula, at aliquet ante pretium eu.
\end{document}

and here’s the code for the knitr document, the .Rnw

\documentclass{article}
\usepackage[sc]{mathpazo}
\usepackage[T1]{fontenc}
\usepackage{geometry}
\geometry{verbose,tmargin=2.5cm,bmargin=2.5cm,lmargin=2.5cm,rmargin=2.5cm}
\setcounter{secnumdepth}{2}
\setcounter{tocdepth}{2}
\usepackage{url}
\usepackage[unicode=true,pdfusetitle,
 bookmarks=true,bookmarksnumbered=true,bookmarksopen=true,bookmarksopenlevel=2,
 breaklinks=false,pdfborder={0 0 1},backref=false,colorlinks=false]
 {hyperref}
\hypersetup{
 pdfstartview={XYZ null null 1}}
\usepackage{breakurl}
\begin{document}
<<setup, include=FALSE, cache=FALSE>>=
library(knitr)
# set global chunk options
opts_chunk$set(fig.path='figure/minimal-', fig.align='center', fig.show='hold')
options(formatR.arrow=TRUE,width=90)
@
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis cursus elementum massa, vitae fermentum dui blandit in.
<<boring-random, eval= FALSE>>=
require(graphics)

## Annette Dobson (1990) "An Introduction to Generalized Linear Models".
## Page 9: Plant Weight Data.
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
lm.D9 <- lm(weight ~ group)
lm.D90 <- lm(weight ~ group - 1) # omitting intercept
@
Morbi ipsum neque, auctor sit amet malesuada a, malesuada sit amet odio. Proin imperdiet ipsum ligula, at aliquet ante pretium eu.
\end{document}



#StackBounty: #r #predictive-models #survival #cox-model #hazard Does centring the baseline hazard and linear predictor after cox regre…

Bounty: 50

I am trying to derive risk scores from a cox regression. For reasons not relevant to this topic, I must do it manually, as opposed to using the predict function in R.

My manual method uses an un-centred baseline hazard and linear predictor. The built-in R function predict uses a centred linear predictor, and so you must also use the centred baseline hazard from survfit.

To my surprise, when using the two methods the risk scores match up, but the confidence intervals around the risk scores do not. I see no reason why the confidence interval around a risk score should change size depending on whether the baseline hazard was centred or not.

I provide a reproducible example here, where the risks match up, but the confidence intervals do not:

data(ovarian)
library(survival)

## Fit cox model
fit <- coxph( Surv(futime,fustat)~age + rx,data=ovarian)


### Calculate risks and associated confidence interval manually, using non centred baseline hazard and linear predictor

## Extract design matrix
test.model.matrix<-model.matrix(Surv(futime,fustat)~age + rx,data=ovarian)
S<-test.model.matrix
## Remove intercept from design matrix
S<-S[,-c(1)]


## Extract coefficients and covariance matrix
B<-fit$coefficients
C<-fit$var

## Calculate linear predictor
p<-(S%*%B)[,1]
head(p)

## Calculate standard error and CI
std.er.p1<-(S%*%C)

std.er.p2<-std.er.p1*S

std.er.p2<-data.frame(std.er.p2)
head(std.er.p2)

se<-sqrt(rowSums(std.er.p2))

## Calculate the relevant t statistic, degrees of freedom = n-p = 26-2
t<-qt(0.975,24)


## Create dataset with linear predictor, upper and lower limit
lp<-data.frame(risk=p,low.lim=p-t*se,upp.lim=p+t*se)

## Now get cumulative baseline hazard using basehaz function, not centred
basehaz<-basehaz(fit,centered=FALSE)

## Extract hazard at relevant time
## Must be a time at which an event happened, choose time=156 (3rd event)
basehaz156<-basehaz[basehaz$time==156,]$hazard

## Generate survival function
surv<-exp(-basehaz156*exp(lp))

## generate risks
risks.uncent<-1-surv


### Now extract the linear predictor using the predict function (centred linear predictor)
### And use the centred baseline hazard

## Do it using predict function
fit.cent<-predict(fit,se.fit=TRUE,type="lp")

## Extract linear predictor and standard error
p.cent<-fit.cent[[1]]
se.cent<-fit.cent[[2]]

## Calculate the relevant t statistic
t<-qt(0.975,24)


# Create dataset with linear predictor, upper and lower limit
lp.cent<-data.frame(risk=p.cent,low.lim=p.cent-t*se.cent,upp.lim=p.cent+t*se.cent)


### Now get cumulative baseline hazard, centred
basehaz.cent<-basehaz(fit,centered=TRUE)
head(basehaz.cent)


## Extract hazard at relevant time
## Must be a time at which an event happened, choose time=156 (3rd event)
basehaz156.cent<-basehaz.cent[basehaz.cent$time==156,]$hazard

## Generate survival function
surv.cent<-exp(-basehaz156.cent*exp(lp.cent))

## generate risks
risks.cent<-1-surv.cent


### Compare results
head(risks.uncent)
head(risks.cent)

> head(risks.uncent)
        risk      low.lim   upp.lim
1 0.51509705 5.322379e-04 1.0000000
2 0.63037120 5.975637e-04 1.0000000
3 0.26288533 3.883639e-04 1.0000000
4 0.01961519 4.553868e-05 0.9998191
5 0.02794971 1.615327e-04 0.9930874
6 0.06717218 2.255081e-04 1.0000000

> head(risks.cent)
        risk    low.lim    upp.lim
1 0.51509705 0.13947540 0.96942843
2 0.63037120 0.15703534 0.99696746
3 0.26288533 0.09773654 0.59527761
4 0.01961519 0.01014882 0.03774142
5 0.02794971 0.01121265 0.06878584
6 0.06717218 0.03569621 0.12455084

I have also done the maths to see whether the confidence intervals should match up. According to my maths they should not, but this goes against my intuition that the confidence interval should not depend on whether the baseline hazard is centred.

Maths sheet 1:

Maths sheet 2:

Maths sheet 3:

So either:

1) my maths is wrong and my code is wrong, and the confidence intervals should match up regardless of how the baseline hazard is defined, or

2) my understanding that the confidence interval around the risk should not depend on whether the baseline hazard is centred is wrong.

