#StackBounty: #r #regression #estimation #piecewise-linear How to estimate parameters of a (almost) linear model from unpaired observat…

Bounty: 50

I have this model:

$a_i=mod(lfloor icdot T+Normal(0,sigma_a)rfloor,Q)$

$b_i=mod(lfloor a_i+D+Normal(0,sigma_b)rfloor,Q)$

with

$i=1..N$

$Ninmathbb{N}$

$Qinmathbb{N}$

$Tinmathbb{R}, Tgt0$

$Dinmathbb{R}, Dgt0$

$Dgt T$

$Normal(mu,sigma)$ is a random number drawn from a Normal distribution with mean $mu$ and variance $sigma^2$.

$mod(x,y)$ is the modulo operator, in some programming language it is x % y.

$lfloor x rfloor$ is the floor function.

Given the $N$ paired observations $(a_i,b_i)$ and $Q$, I need to find $T$ and $D$ and I would like to have an idea about $sigma_a$ and $sigma_b$.

My naïve solution is:

  1. Ignore $sigma_a$ and $sigma_b$ setting them to $0$;
  2. an estimate for $T$ is the median of the list composed by the $N-1$ values $a_i-a_{i-1}$;
  3. an estimate for $D$ is the median of the list composed by the $N$ values $b_i-a_{i}$.

This R code:

N=1000
T=200
D=3000
Q=2^16
sigma_a=1
sigma_b=2

i=seq(1,N)

a=floor(i*T+rnorm(N,0,sigma_a))%%Q

b=floor(a+D+rnorm(N,0,sigma_b))%%Q

print(sprintf("Estimate for T: %f",median(diff(a))))
print(sprintf("Estimate for D: %f",median(b-a)))

gives this output:

[1] "Estimate for T: 200.000000"
[1] "Estimate for D: 2999.000000"

Now, I would like to remove the assumption that the observations are paired, i.e., I will just have all the $a_j$ and all the $b_k$ but I will ignore the correspondences between them, so for example, with $N=3$, $T=200$, $D=3000$, $sigma_a=0$ and $sigma_b=0$ I will have $a={200,400,600}$ and $b={3400,3600,3200}$.

Is the problem still solvable? How?


Get this bounty!!!

#StackBounty: #r #shiny #dt #sparklines Add label to sparkline plot in datatable

Bounty: 50

Is it possible to add a custom label to a sparkline plot?

For example, in the code below, I would like to label each bar with the corresponding letter in the label column.

Building from a previous

require(sparkline)
require(DT)
require(shiny)
require(tibble)

# create data


spark_data1<-tribble(
  ~id,  ~label,~spark,
  "a", c("C,D,E"),c("1,2,3"),
  "b", c("C,D,E"),c("3,2,1")
)

ui <- fluidPage(
  sparklineOutput("test_spark"),
  DT::dataTableOutput("tbl")
)

server <- function(input, output) {

  output$tbl <- DT::renderDataTable({
    line_string <- "type: 'bar'"
    cd <- list(list(targets = 2, render = JS("function(data, type, full){ return '<span class=sparkSamples>' + data + '</span>' }")))
    cb = JS(paste0("function (oSettings, json) {n  $('.sparkSamples:not(:has(canvas))').sparkline('html', { ", 
                   line_string, " });n}"), collapse = "")
    dt <-  DT::datatable(as.data.frame(spark_data1),  rownames = FALSE, options = list(columnDefs = cd,fnDrawCallback = cb))

  })

}

shinyApp(ui = ui, server = server)


Get this bounty!!!

#StackBounty: #r #dynamic-loading #install.packages How to specify (non-R) library path for dynamic library loading in R?

Bounty: 200

I keep getting the following error when attempting to install readxl or haven in R (both dependencies of tidyverse) post-compilation, when the installer runs the loading test:

** testing if installed package can be loaded
Error in dyn.load(file, DLLpath = DLLpath, ...) :
  unable to load shared object '<my_lib_Path>/readxl/libs/readxl.so':
  <my_lib_path>/readxl/libs/readxl.so: undefined symbol: libiconv
Error loading failed

I have libiconv.so in a local lib path (not for R packages) that is included in LD_LIBRARY_PATH and I’ve verified in my R session that Sys.getenv("LD_LIBRARY_PATH") has that directory.
Why can’t R’s dynamic library loader find this shared object? Is there a different R-specific environment variable I need to define to have the dynamic library loader in R search my local lib path?

Please note that this is not an issue with an R library path, but instead for a non-R dependency that an R package has. If I were compiling and linking C++ code, gcc would use ld, and hence LD_LIBRARY_PATH to track down dynamic dependencies. R doesn’t appear to respect this rather common approach, and I can’t seem to find any documentation on how to manage these more fine-grained dependency issues.


Additional Details

!> sessionInfo()
 R version 3.3.3 (2017-03-06)
 Platform: x86_64-pc-linux-gnu (64-bit)
 Running under: CentOS Linux 7 (Core)

 locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
 [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

 attached base packages:
 [1] stats     graphics  grDevices utils     datasets  methods   base
 > 

I had previously compiled libiconv because it was a dependency for something else (don’t recall what now – likely not an R package given current problems). I tried reinstalling it, but made no difference.


Edit

I have also tried manually loading the library prior to installation:

> dyn.load(".local/lib/libiconv.so")
> is.loaded("libiconv")
[1] TRUE
> install.packages("tidyverse")

but it fails just as above.


Get this bounty!!!

#StackBounty: #r #mixed-model #multilevel-analysis #meta-analysis #lme4-nlme Specifying and interpreting multilevel nested model for me…

Bounty: 50

I’m looking at a metadataset of studies of the yields of genetically modified crops. For each study, I have one or more reported findings (increases in yield due to genetic modification), and some study-level metadata: crop species (one of maize, cotton, or soybeans), country type (developed or developing), GM event type (either insect-resistant or herbicide-resistant), whether the study used regression, study type (either experimental field trial or farmer survey), whether the study was sponsored by industry, and whether the study was published in a peer-reviewed journal.

For each of those study-level variables, I’d like to get an estimate of its effect on the findings, taking into account the fact that findings are nested into studies to control for study bias. This means I want a 3-level model, with findings on the bottom, study in between, and the variables of interest on the top.

Reading around about the syntax lme4::lmer uses for nesting, it seems like the following formula will do what I want:

endpoint ~ 0 + (1|crop/study) + (1|event/study) + 
                (1|regression/study) + (1|country/study) + 
                (1|study_type/study) + (1|industry/study) +
                (1|journal/study)

Specifically, I can hand this formula to either lme4::lmer or rstanarm::stan_lmer (the latter has a more convenient tidier in broom). The fitted model gives estimates for interaction terms between the variables of interest and the individual studies — which I believe I don’t care about — and the following — which I believe that I do care about:

         level      group        term    estimate std.error
1       cotton       crop (Intercept)  0.63136245  2.484556
2        maize       crop (Intercept) -0.01947010  2.284912
3      soybean       crop (Intercept) -0.07913682  2.425395
4        FALSE    journal (Intercept) -2.77214601  7.643130
5         TRUE    journal (Intercept)  9.37080482  8.372829
6        FALSE   industry (Intercept)  0.86931739  4.280512
7         TRUE   industry (Intercept)  0.23544865  4.232657
8       survey study_type (Intercept)  2.22262385  4.495824
9  field_trial study_type (Intercept) -0.32424589  4.108007
10   developed    country (Intercept) -2.31048095  7.121554
11  developing    country (Intercept)  7.69356622  8.594535
12       FALSE regression (Intercept) -0.35844890  3.650564
13        TRUE regression (Intercept)  2.03837270  4.111827
14          HT      event (Intercept) -0.04189164  3.508054
15          IR      event (Intercept)  1.17481409  3.732732

I want to interpret these rows as estimates and standard errors of the effect/group value for each level of my covariates. For instance, cotton is associated with a finding (yield increase due to genetic modification) of about 0.6 units (percentage points), with a standard error of about 2.5 units (percentage points).

I’ve pieced together several different tutorials to get to this point, so I don’t have very high confidence that I’m doing things correctly.


Get this bounty!!!

#StackBounty: #r #regression #lm R – Transformations of continuous predictors, using predicted values

Bounty: 50

In the past, I have been assessing the relationship between outcome and continuous predictors without taking other predictors into account. I have also been playing around with a way to determine that same relationship when taking other model predictors into account using the predict function…but can’t get my head around a couple of things. Probably not the best example, but I’ve replicated the problem with the IRIS dataset (using Sepal.Length as the outcome variable):

library(ggplot2)
irisdata <- iris 

Here is what I might use to explore the relationship between sepal.length and petal.width; and determine whether a transformation is required (in this case I might just keep as linear).

  ggplot(irisdata, aes(x=Sepal.Length, y=Petal.Width)) +
    geom_point(shape=1) +    
    stat_smooth(method = "loess", color = 'red', size = 1.3, span = 0.5)   +
    stat_smooth(method = "lm", formula = y ~ poly(x, 3), size = 1, color = 'magenta', se = FALSE) +
    geom_smooth(method = "lm", color = 'purple', se = FALSE)

I’m interested in whether that relationship will change when including my other model variables. Here’s the final model excluding petal.width:

irismodel <- lm(Sepal.Length~Sepal.Width+Petal.Length+Species, data=irisdata)
summary(irismodel)
irisdata$predictedlength <- predict(irismodel, irisdata, type = "response")

And here is what I may use to see if the relationship has changed (in this case, both relationships look similar):

  ggplot(irisdata, aes(x=predictedlength, y=Petal.Width)) +
    geom_point(shape=1) +    
    stat_smooth(method = "loess", color = 'red', size = 1.3, span = 0.5)   +
    stat_smooth(method = "lm", formula = y ~ poly(x, 3), size = 1, color = 'magenta', se = FALSE) +
    geom_smooth(method = "lm", color = 'purple', se = FALSE)

Finally, when I include petal.width in the final model pedal.width is a significant variable:

  irismodel2 <- lm(Sepal.Length ~ Petal.Width + Sepal.Width+Petal.Length+Species, data= irisdata)
  summary(irismodel2)

However, when I include it as a predictor with ‘predictedlength’, it becomes non-significant:

  irismodel3 <- lm(Sepal.Length ~ Petal.Width + predictedlength, data= irisdata)
  summary(irismodel3)

I guess there are two questions here:

  1. Why does petal.width ‘lose’ statistical significance when included in the model with the predictedvalue (ie. model 2)?
  2. What is a reasonable approach for determining a correct continuous transformation? When considering transformations, should the impact of other predictors be taken into account? In this example petal.width more or less looks the same but in more complex models I’ve built the transformation requirements have changed (maybe I need to add a degree to a polynomial etc). I guess this is mostly due to a collinearity issue.

Thanks

Output from code below:

summary(irismodel)

                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        2.39039    0.26227   9.114 5.94e-16 ***
Sepal.Width        0.43222    0.08139   5.310 4.03e-07 ***
Petal.Length       0.77563    0.06425  12.073  < 2e-16 ***
Speciesversicolor -0.95581    0.21520  -4.442 1.76e-05 ***
Speciesvirginica  -1.39410    0.28566  -4.880 2.76e-06 ***

Multiple R-squared:  0.8633,    Adjusted R-squared:  0.8595 

summary(irismodel2)

                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        2.17127    0.27979   7.760 1.43e-12 ***
Petal.Width       -0.31516    0.15120  -2.084  0.03889 *  
Sepal.Width        0.49589    0.08607   5.761 4.87e-08 ***
Petal.Length       0.82924    0.06853  12.101  < 2e-16 ***
Speciesversicolor -0.72356    0.24017  -3.013  0.00306 ** 
Speciesvirginica  -1.02350    0.33373  -3.067  0.00258 ** 

Multiple R-squared:  0.8673,    Adjusted R-squared:  0.8627 

summary(irismodel3)

                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -0.30055    0.35236  -0.853    0.395    
Petal.Width     -0.07546    0.07406  -1.019    0.310    
predictedlength  1.06692    0.07337  14.541   <2e-16 ***

Multiple R-squared:  0.8643,    Adjusted R-squared:  0.8624 

enter image description here

enter image description here


Get this bounty!!!

#StackBounty: #r #cox-model #time-varying-covariate How to make predictions with time-dependent covariates with Cox regression

Bounty: 50

I learned about time-dependent covariates in Cox regression in R using the function survSplit of the package survival.
I use this as an interaction term for covariates which do not follow the Cox proportionality assumption which works fine.

The problem is now, that I want to make predictions for new datapoints, but how can I do this as I do not know how long they will survive? How do I know which time-value to use in my calculations? If it helps, I wish to calculate the probability of one-year survival. Also, how can I calculate the calibration/discrimination since certain patients will be duplicated in the dataset?

Code:

rm(list = ls(all=T))
library("rms")
library("pec")
data(veteran)

vet2 <- survSplit(Surv(time, status) ~ ., data= veteran, cut=c(90, 180),
                  episode= "tgroup", id="id")
ddist <- datadist(vet2); options(datadist='ddist')
vfit2 <- cph(Surv(tstart, time, status) ~ trt + prior + karno*strat(tgroup), data=vet2,surv = T,X=T,Y=T)
predictSurvProb(vfit2,newdata=vet2[vet2$id==2,],times = c(121,190))


Get this bounty!!!

#StackBounty: #r #windows #directory #rstudio #network-drive RStudio: unexpected call to `dir.create()` with the first instruction with…

Bounty: 50

First, apologies for a lack of a reproducible example, but I cannot really provide one as I believe the problem lies within my network settings. Please treat this question as a call for help in debugging the issue…

After opening in RStudio a project stored on a network drive and running the very first instruction (being it a package load or even a <- 1) I am seeing a really weird output in the console:

> a <- 1
Warning message:
In dir.create(tempPath, recursive = TRUE) :
  cannot create dir 'F:Marketing', reason 'Permission denied'

I have all possible temp dirs set up in user environment variables (TEMP, TMP, TMPDIR) and Sys.getenv() is printing them correctly.

“F:Marketing” is a valid path on my network drive and it is a root directory of the project.

I have tried to debugonce(dir.create) in .RProfile to see what the tempPath is, but unfortunately this resulted in an “invalid ‘envir’ argument” error.

After copying the project to a local drive the problem disappears, so this is clearly a network drive/network setup problem, but I do not know where to dig more and my IT dept. is not really useful here…

Any ideas how to debug this warning?


Get this bounty!!!

#StackBounty: #mysql #r #jdbc #timeout #dbi Is there a way to timeout a MySql query when using DBI and dbGetQuery?

Bounty: 50

I realize that

dbGetQuery comes with a default implementation that calls dbSendQuery, then dbFetch, ensuring that the result is always freed by dbClearResult.

and

dbClearResult frees all resources (local and remote) associated with a result set. In some cases (e.g., very large result sets) this can be a critical step to avoid exhausting resources (memory, file descriptors, etc.)

But my team just experienced a locked table that we went into MySQL to kill pid and I’m wondering – is there a way to timeout a query submitted using the DBI package?

I’m looking for and can’t find the equivalent of

dbGetQuery(conn = connection, 'select stuff from that_table', timeout = 90)

I tried this, and profiled the function with and without the parameter set and it doesn’t appear it does anything; why would it, if dbClearResult is always in play?


Get this bounty!!!

#StackBounty: #r #logistic #lme4-nlme Problem fitting glmer model with factor and numeric predictors

Bounty: 50

I’m trying to fit logistic regression with formula like this:

mod <- glmer(response ~  factor1+factor2+numeric1+numeric2+numeric3+numeric4 +(1|factor3),
            data=myDataset,family = binomial,
            control=glmerControl(optimizer="bobyqa"))

Factor1 and factor2 are categorical variables with (5 and 2 categories). Factor3 is id of subject. The rest of predictors are numeric variables – all of them scaled with scale().

I am getting following error/warning:

Warning messages:
1: Some predictor variables are on very different scales: consider rescaling 
2: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model failed to converge with max|grad| = 0.406353 (tol = 0.001, component 1)
3: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model is nearly unidentifiable: very large eigenvalue
 - Rescale variables?;Model is nearly unidentifiable: large eigenvalue ratio
 - Rescale variables?

When I am trying to fit the model without factor1 and factor2 variables, model fits without complain, so I am assuming that problem is in my factor predictors. But I have no idea how to fix it. Should I recode my factor variables to dummy variables myself and then scale them? Will it help? Any idea will be very appreciated.


Get this bounty!!!

#StackBounty: #r #bigdata How to use biglm with more than 2^31 observations

Bounty: 50

I am working with a large set of data that contains more than 2^31 observations. The actual number of observations is close to 3.5 billion observations.

I am using the R package “biglm” to run a regression with approximately 70 predictors. I read in the data one million rows at a time and update the regression results. The data have been saved in the ffdf format using the R library “ffdf” to load quickly and avoid using up all my RAM.

Here is the basic outline of the code I am using:

library(ff,ffbase,biglm)
load.ffdf(dir='home')

dim(data) #the ffdf contains about 70 predictors and 3.5 billion rows

chunk_1 <- data[1:1000000,]
rest_of_data <- data[1000000:nrow(data),]

# Running biglm for first chunk
b <- biglm(y~x1+x2+...+x70, chunk_1)

chunks <- ceiling((nrow(rest_of_data)/1000000)

# Updating biglm results by iterating through the rest of the data chunks
for (i in seq(1,chunks)){
      start <- 1+((i-1))*1000000
      end <- min(i*1000000,nrow(d))
      d_chunk <- d[start:end,]
      b<-update(b,d_chunk)
}

The results look great and everything is running smoothly until the cumulative number of observations from updating the model with each chunk of the data exceeds 2^31 observations. Then, I get an error that reads

In object$n + NROW(mm) : NAs produced by integer overflow

How do I get around this overflow issue? Thanks in advance for your help!


Get this bounty!!!