#StackBounty: #r #self-study #bayesian #residuals #deviance Negative deviance and deviance residuals

Bounty: 50

I fitted a regression model with JAGS and need to calculate the deviance residuals multiple times to do a simulated envelope plot. The beta model is

regression_model = "
model{
  for(i in 1:n) {
    y[i] ~ dbeta(alpha[i], beta[i])
    alpha[i] <- mu[i] * phi[i]
    beta[i]  <- (1 - mu[i]) * phi[i]
    log(phi[i])    <- -inprod(X2[i,], delta[])
    cloglog(mu[i]) <-  inprod(X1[i,], B[])
  }

  for (j in 1:p){
    B[j] ~ dnorm(0, .001)
  }

  for(k in 1:s){
    delta[k] ~ dnorm(0, .001)
  }
}"

I tried to calculate the deviance residuals as $r^D_i=\operatorname{sign}(y_i-\hat{\mu}_i)\sqrt{2\,l_i}$, where $l_i$ is the log-likelihood contribution of observation $i$, but I’m getting square roots of negative numbers.
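
For reference, here is a minimal self-contained sketch (hypothetical values, not the actual JAGS fit) of the calculation that gives me the NaNs; mu_hat and phi_hat stand in for posterior-mean estimates:

set.seed(1)
n       <- 50
mu_hat  <- plogis(rnorm(n))              # hypothetical posterior-mean fitted means
phi_hat <- rexp(n, rate = 0.2) + 1       # hypothetical posterior-mean precisions
y       <- rbeta(n, mu_hat * phi_hat, (1 - mu_hat) * phi_hat)

l_i <- dbeta(y, mu_hat * phi_hat, (1 - mu_hat) * phi_hat, log = TRUE)  # log-likelihood contributions
r_D <- sign(y - mu_hat) * sqrt(2 * l_i)  # NaN wherever l_i < 0
sum(is.nan(r_D))                         # typically > 0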

How is this residual calculated?

EDIT: Since the deviance for this model is negative, I think that the deviance residuals can’t be calculated because
$$\text{Deviance}=\sum_i (r_i^D)^2$$

EDIT 2: Would it be wrong to take the absolute value of each likelihood contribution?


Get this bounty!!!

#StackBounty: #r #binary-data #distance #similarities #jaccard-similarity Similarity measures for more than 2 variables

Bounty: 50

If I have two binary variables, I can determine the similarity of these variables quite easily with different similarity measures, e.g. with the Jaccard similarity measure:

$J = \frac{M_{11}}{M_{01} + M_{10} + M_{11}}$

Example in R:

# Example data
N <- 1000
x1 <- rbinom(N, 1, 0.5)
x2 <- rbinom(N, 1, 0.5)

# Jaccard similarity measure
a <- sum(x1 == 1 & x2 == 1)
b <- sum(x1 == 1 & x2 == 0)
c <- sum(x1 == 0 & x2 == 1)

jacc <- a / (a + b + c)
jacc

However, I have a group of 1,000 binary variables and want to determine the similarity of the whole group.

Question: What is the best way to determine the similarity of more than 2 binary variables?

I had the idea to calculate a pairwise similarity measure for each combination of variables and then summarise them (the example below takes the median):

# Example data
N <- 1000 # Observations
N_vec <- 1000 # Amount of vectors
x <- rbinom(N * N_vec, 1, 0.5)
mat_x <- matrix(x, ncol = N_vec)
list_x <- split(mat_x, rep(1:ncol(mat_x), each = nrow(mat_x)))

# Function for calculation of Jaccard similarity
fun_jacc <- function(v1, v2) {

  a <- sum(v1 == 1 & v2 == 1)
  b <- sum(v1 == 1 & v2 == 0)
  c <- sum(v1 == 0 & v2 == 1)

  jacc <- a / (a + b + c)
  return(jacc)
}

# Apply function to all combinations (takes approx. 1 min to calculate)
mat_jacc <- sapply(list_x, function(x) sapply(list_x, function(y) fun_jacc(x,y)))
mat_jacc[upper.tri(mat_jacc)] <- NA
diag(mat_jacc) <- NA
vec_jacc <- as.vector(mat_jacc)
vec_jacc <- vec_jacc[!is.na(vec_jacc)]
median(vec_jacc)

This is very inefficient though and I am also not sure if this is theoretically the best way to measure the similarity of such a group of variables.
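
For what it’s worth, the pairwise part can be done in one shot with a matrix cross-product (a sketch reusing mat_x from the code above), which is considerably faster than the nested sapply:

n1  <- colSums(mat_x)                    # number of 1s per variable
M11 <- crossprod(mat_x)                  # pairwise co-occurrence counts, t(X) %*% X
J   <- M11 / (outer(n1, n1, "+") - M11)  # a / (a + b + c) for every pair of variables
median(J[lower.tri(J)])                  # same summary as above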


Get this bounty!!!

#StackBounty: #r #statistical-significance #repeated-measures #multiple-comparisons how to compare three groups (significant different …

Bounty: 50

I have a control group with two replicates and two treated groups, each with two replicates. I want to know how I can identify the samples that are significantly different between control and treated 1 (higher expression) and at the same time significantly different between control and treated 2 (lower expression).

Here is some example data:

df<-structure(list(C1 = c(0.003926348, 0.001642442, 6.72e-05, 0.000314789, 
0.00031372, 0.000196342, 0.01318432, 8.86e-05, 0.005671017, 0.003616196, 
0.026635645, 0.001136402, 0.000161111, 0.005777738, 0.000145104, 
0.000996546, 4.27e-05, 0.000114159, 0.001152384, 0.002860251, 
0.000284873), C2 = c(0.003901373, 0.001526195, 6.3e-05, 0.000387266, 
0.000312458, 0.000256647, 0.012489205, 0.00013071, 0.005196136, 
0.003059834, 0.024624562, 0.001025486, 0.000144964, 0.005659078, 
0.000105755, 0.000844871, 5.88e-05, 0.000118831, 0.000999354, 
0.002153167, 0.000214486), T1 = c(0.003646894, 0.001484503, 4.93e-05, 
0.00036715, 0.000333378, 0.000244199, 0.010286787, 6.48e-05, 
0.006180042, 0.00387491, 0.025428464, 0.001075376, 0.000122088, 
0.005448152, 0.000103301, 0.000974826, 4.32e-05, 0.000109876, 
0.001030364, 0.002777244, 0.000221169), T2 = c(0.00050388, 0.001135969, 
0.000113829, 2.14e-06, 0.00010293, 0.000315704, 0.01160593, 8.46e-05, 
0.004495437, 0.003062559, 0.018662663, 0.002096675, 0.000214814, 
0.002177849, 8.61e-05, 0.001057254, 3.27e-05, 0.000115822, 0.008133257, 
0.021657018, 0.000261339), G1 = c(0.001496712, 0.001640965, 0.000129124, 
3.02e-06, 0.000122839, 0.000305686, 0.01378774, 0.000199637, 
0.00534668, 0.00300097, 0.023290941, 0.002595433, 0.000262479, 
0.002926346, 0.000125655, 0.001302624, 4.89e-05, 0.000122862, 
0.009851791, 0.017621282, 0.000197561), G2 = c(0.00114337, 0.001285636, 
0.000122848, 2.46e-06, 9.1e-05, 0.000288897, 0.012288087, 0.000122286, 
0.002575368, 0.002158011, 0.022008677, 0.002017026, 0.000241754, 
0.003340175, 0.00013424, 0.001517655, 4.78e-05, 0.000110353, 
0.008293286, 0.018999466, 0.000191129)), .Names = c("C1", "C2", 
"T1", "T2", "G1", "G2"), row.names = c("A", "B", "C", "D", "E", 
"F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "PP", 
"TT", "EE", "FF", "AS"), class = "data.frame")

The first two columns (C1, C2) are the control, the next two columns (T1, T2) are treated 1, and the last two columns (G1, G2) are treated 2.
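
To make the grouping explicit, here is a small reshaping sketch (tidyverse; not part of my analysis, just to show the column-to-group mapping):

library(dplyr)
library(tidyr)
library(tibble)

df_long <- df %>%
  rownames_to_column("sample") %>%
  pivot_longer(-sample, names_to = "replicate", values_to = "expression") %>%
  mutate(group = case_when(
    replicate %in% c("C1", "C2") ~ "control",
    replicate %in% c("T1", "T2") ~ "treated1",
    replicate %in% c("G1", "G2") ~ "treated2"
  ))
head(df_long)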


Get this bounty!!!

#StackBounty: #r #hypothesis-testing #multiple-regression #regression-coefficients #car Probing effects in a multivariate multiple regr…

Bounty: 50

I’m trying to run a multivariate multiple regression in R, i.e. including multiple predictors and multiple outcome variables in the same linear regression model. Does anybody know how to pull out the coefficients and p-values for the relationship between each predictor and outcome pair in a multivariate multiple regression? I cannot seem to work out how to do that (I’ve been trying!).

Let me explain with an example dataset, if it helps:

#Create example dataset
df <- data.frame(pid=factor(501), y1=numeric(501), y2=numeric(501), 
y3=numeric(501), y4=numeric(501), x1=factor(501), x2=factor(501), 
x3=factor(501), x4=numeric(501), x5=numeric(501))
df$pid <- seq(1,501, by=1)
df$y1 <- seq(1,101, by=0.2)
df$y2 <- seq(401,201, by=-0.4)
df$y3 <- sqrt(rnorm(501, 7, 0.5))^3
df$x1 <- c(rep(c("sad","happy"), each=250), "sad")
df$x2 <- c(rep(c("human","vehicle","animal"), each=167))
df$x3 <- c(rep(seq(1,10, by=0.1), each=5), seq(1,46, by=1))
df$x4 <- rnorm(501, 3, .24)
df$x5 <- sqrt(rnorm(501, 23, 3.5))    

I then create the model using this:

#Specify the regression model
model <- lm(cbind(y1, y2, y3) ~ x1 + x2 + x3 + x4 + x5, data=df)

I can’t simply use summary(model), since that effectively runs separate regressions without accounting for familywise error, nor does it account for the dependent variables possibly being correlated.

To reiterate my question: does anybody know how to pull out the output so I can get the coefficients and p-values, but within the same model? For example, I want to work out the coefficients and p-values of:

x1, x2, x3, x4 and x5 on y1
x1 and x2 on y2
x1, x2, x3, x4 and x5 on y3
... etc etc

I tried the car package:

modelanova <- car::Anova(model)
summary(modelanova)

However, I couldn’t get it to break the output down to a particular outcome variable; it only produced overall results (as if a composite outcome variable had been created).
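
For context, here is what I can already extract (a sketch; coef() on the multi-response fit returns a coefficient matrix, and car::Anova() returns multivariate tests per predictor, but neither gives the per-outcome p-values I am after):

coef(model)    # coefficient matrix: one column per outcome (y1, y2, y3), no p-values
library(car)
Anova(model)   # multivariate tests (Pillai's trace) per predictor, not per outcome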

Any ideas would be wonderful. I know I could run several univariate multiple regressions but I am particularly interested in running a single multivariate multiple regression.


Get this bounty!!!

#StackBounty: #r #parallel-processing #gpu #statistics-bootstrap GPU computing for bootstrapping using "boot" package

Bounty: 100

I would like to do a large analysis using bootstrapping. I saw that the speed of bootstrapping can be increased using parallel computing, as in the following code:

# detect number of cpu
library(parallel)
detectCores()

library(boot)
# boot function --> mean
bt.mean <- function(dat, d){
  x <- dat[d]
  m <- mean(x)
  return(m)
}

# obtain confidence intervals
# use parallel computing with 4 cpus
x  <- mtcars$mpg
bt <- boot(x, bt.mean, R = 1000, parallel = "snow", ncpus = 4)
quantile(bt$t, probs = c(0.025, 0.975))

However, since the total number of calculations is large in my case (10^6 regressions with 10,000 bootstrap samples each), I read that there are ways to use GPU computing to increase the speed even more (link1, link2), but it seems to me that those packages can only handle some specific R functions.
My question is therefore:

Is it possible to use GPU computing for bootstrapping using the boot package and other R packages (e.g. quantreg)?


Get this bounty!!!

#StackBounty: #csv #postgresql #r #sql How to select on CSV files like SQL in R?

Bounty: 50

I know the thread How can I inner join two csv files in R, which suggests merge, but that is not what I want.
I have two CSV files and I am wondering how to query them like SQL with R.
I really like PostgreSQL, so that or R tools with similar syntax would work great here.
The two CSV files are linked by the data_id column (it is the primary key of log.csv).

data.csv, where it is OK to have IDs that are not found in log.csv (e.g. 4):

data_id, event_value
1, 777
1, 666
2, 111
4, 123 
3, 324
1, 245

log.csv, where there are no duplicates in the ID column, but duplicates can occur in name:

data_id, name
1, leo
2, leopold
3, lorem

Pseudocode for what I want to do:

  1. Let data_id=1
  2. Show name and event_value from data.csv and log.csv, respectively

The same idea as a partial PostgreSQL-style SELECT:

SELECT name, event_value 
    FROM data, log
    WHERE data_id=1;

Expected output

leo, 777
leo, 666 
leo, 245

R approach

file1 <- read.csv("data.csv", col.names = c("data_id", "event_value"))
file2 <- read.csv("log.csv",  col.names = c("data_id", "name"))

# TODO here something like the SQL query 
# http://stackoverflow.com/a/1307824/54964

Possible approaches (I think sqldf may be sufficient here; a sketch of the sqldf approach follows the list):

  1. sqldf
  2. data.table
  3. dplyr
  4. PostgreSQL database
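
A sketch of the sqldf approach (the data frame names data_df and log_df are my own choice; it assumes the files parse into the columns shown above):

library(sqldf)
data_df <- read.csv("data.csv", strip.white = TRUE)   # data_id, event_value
log_df  <- read.csv("log.csv",  strip.white = TRUE)   # data_id, name
sqldf("SELECT l.name, d.event_value
       FROM data_df d JOIN log_df l ON d.data_id = l.data_id
       WHERE d.data_id = 1")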

PostgreSQL thoughts

Schema

DROP TABLE IF EXISTS data, log;
CREATE TABLE log (
        data_id SERIAL PRIMARY KEY,
        name TEXT NOT NULL
);
CREATE TABLE data (
        data_id INTEGER NOT NULL REFERENCES log (data_id),
        event_value INTEGER NOT NULL
);

R: 3.3.3
OS: Debian 8.7


Get this bounty!!!

#StackBounty: #r Get row in paired sublists by ID

Bounty: 50

This might be a tricky one:

I need to restructure a list containing an unknown number of sublists (although 2 in the sample data). However, each sublist contains an ID column. For each ID in any of the sublists I now need to create a list containing the row whose ID matches that ID, BUT also the corresponding rows in its siblings.

This is my initial list:

> str(myList1)
List of 2
 $ 1:'data.frame':  2 obs. of  5 variables:
  ..$ ID     : num [1:2] 13369 13599
  ..$ subject: num [1:2] 2 2
  ..$ gender : num [1:2] 1 1
  ..$ age    : num [1:2] 18 18
  ..$ score  : num [1:2] 30 28
 $ 2:'data.frame':  2 obs. of  5 variables:
  ..$ ID     : num [1:2] 13370 14342
  ..$ subject: num [1:2] 3 3
  ..$ gender : num [1:2] 1 1
  ..$ age    : num [1:2] 28 28
  ..$ score  : num [1:2] 27 32

This is the result I’m hoping to get:

> str(myList2)
List of 4
 $ 13369:List of 2
  ..$ 1:'data.frame':   1 obs. of  5 variables:
  .. ..$ ID     : num 13369
  .. ..$ subject: num 2
  .. ..$ gender : num 1
  .. ..$ age    : num 18
  .. ..$ score  : num 30
  ..$ 2:'data.frame':   1 obs. of  5 variables:
  .. ..$ ID     : num 13599
  .. ..$ subject: num 2
  .. ..$ gender : num 1
  .. ..$ age    : num 18
  .. ..$ score  : num 28
 $ 13370:List of 2
  ..$ 1:'data.frame':   1 obs. of  5 variables:
  .. ..$ ID     : num 14342
  .. ..$ subject: num 3
  .. ..$ gender : num 1
  .. ..$ age    : num 28
  .. ..$ score  : num 27
  ..$ 2:'data.frame':   1 obs. of  5 variables:
  .. ..$ ID     : num 13370
  .. ..$ subject: num 3
  .. ..$ gender : num 1
  .. ..$ age    : num 28
  .. ..$ score  : num 32
 $ 13599:List of 2
  ..$ 1:'data.frame':   1 obs. of  5 variables:
  .. ..$ ID     : num 13369
  .. ..$ subject: num 2
  .. ..$ gender : num 1
  .. ..$ age    : num 18
  .. ..$ score  : num 30
  ..$ 2:'data.frame':   1 obs. of  5 variables:
  .. ..$ ID     : num 13599
  .. ..$ subject: num 2
  .. ..$ gender : num 1
  .. ..$ age    : num 18
  .. ..$ score  : num 28
 $ 14342:List of 2
  ..$ 1:'data.frame':   1 obs. of  5 variables:
  .. ..$ ID     : num 14342
  .. ..$ subject: num 3
  .. ..$ gender : num 1
  .. ..$ age    : num 28
  .. ..$ score  : num 27
  ..$ 2:'data.frame':   1 obs. of  5 variables:
  .. ..$ ID     : num 13370
  .. ..$ subject: num 3
  .. ..$ gender : num 1
  .. ..$ age    : num 28
  .. ..$ score  : num 32

I have absolutely no clue how to achieve this and don’t even know where to direct my research on this problem.

Reproducible code:

myList1 <- list(
    '1' = data.frame('ID' = c(13369,13599), 'subject' = c(2,2), 'gender' = c(1,1), 'age' = c(18,18), 'score' = c(30,28)),
    '2' = data.frame('ID' = c(13370,14342), 'subject' = c(3,3), 'gender' = c(1,1), 'age' = c(28,28), 'score' = c(27,32))
    )

Reproducible code for the outcome, if needed:

myList2 <- list(
    '13369' = list('1' = data.frame('ID' = 13369, 'subject' = 2, 'gender' = 1, 'age' = 18, 'score' = 30), '2' = data.frame('ID' = 13599, 'subject' = 2, 'gender' = 1, 'age' = 18, 'score' = 28)),
    '13370' = list('1' = data.frame('ID' = 14342, 'subject' = 3, 'gender' = 1, 'age' = 28, 'score' = 27), '2' = data.frame('ID' = 13370, 'subject' = 3, 'gender' = 1, 'age' = 28, 'score' = 32)),
    '13599' = list('1' = data.frame('ID' = 13369, 'subject' = 2, 'gender' = 1, 'age' = 18, 'score' = 30), '2' = data.frame('ID' = 13599, 'subject' = 2, 'gender' = 1, 'age' = 18, 'score' = 28)),
    '14342' = list('1' = data.frame('ID' = 14342, 'subject' = 3, 'gender' = 1, 'age' = 28, 'score' = 27), '2' = data.frame('ID' = 13370, 'subject' = 3, 'gender' = 1, 'age' = 28, 'score' = 32))
    )


Get this bounty!!!

#StackBounty: #r #hypothesis-testing #distributions #statistical-significance How to check if a value doesn't appear by chance with…

Bounty: 50

I have the following randomly generated distribution:

mean=100; sd=15
x <- seq(-4,4,length=100)*sd + mean
hx <- dnorm(x,mean,sd)

plot(x, hx, type="l", lty=2, xlab="x value",
     ylab="Density", main="Some random distribution")

[plot of the normal density curve]

And a “non-random” value

x <- seq(-4,4,length=100)*10 + mean
ux <- dunif(x = x, min=10, max=100)
non_random_value <- ux[1]
non_random_value
# [1] 0.01111111

I’d like a statistic that shows non_random_value is significant and doesn’t come up by chance with respect to hx.

What would be a reasonable statistic to check that?


Get this bounty!!!