#StackBounty: #correlation #generalized-linear-model #multivariate-analysis #binary-data #lme4-nlme Correlation between two binary vari…

Bounty: 50

The Problem:

I have measured two binary variables within one categorical variable with five levels.

Initially, I thought I’d be able to use Fisher’s exact test or some N x M x K version of it. However, I have only found references where the dimensions of the table are specifically 2 x 2 x 2 (see the “hypergea” package in R as an example). In my case it is 5 x 2 x 2.

I was wondering if there are other approaches that would be appropriate.
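For concreteness: if Var1_ and Var2_ had been observed on the same individuals, the 5 x 2 x 2 case could be analysed as five stratified 2 x 2 tables, e.g. with a Cochran-Mantel-Haenszel test plus a Breslow-Day test for homogeneity of the odds ratio across strata. Here is a minimal sketch using Python’s statsmodels with made-up counts, just to show the shape of that analysis (as I note below, my actual data don’t quite fit this mould, because the two variables come from different individuals):

import numpy as np
from statsmodels.stats.contingency_tables import StratifiedTable

# one 2 x 2 table of Var1_ vs Var2_ per GROUP level (counts invented for illustration)
tables = [np.array([[20, 10], [5, 15]]),
          np.array([[18, 12], [8, 12]]),
          np.array([[12, 13], [12, 13]]),
          np.array([[9, 16], [17, 8]]),
          np.array([[10, 15], [11, 14]])]

st = StratifiedTable(tables)
print(st.oddsratio_pooled)   # Mantel-Haenszel pooled odds ratio across the 5 groups
print(st.test_null_odds())   # Cochran-Mantel-Haenszel test of "common OR = 1"
print(st.test_equal_odds())  # Breslow-Day test: is the OR the same in every group?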

Here’s an example of what I’m working with in R.

set.seed(66)
group1<-as.data.frame(rbind(cbind(rep("GROUP_1",times=50),sample(1:50,size=50,replace=TRUE),rbinom(50,1,0.25),rep(NA,times=50)),
                            cbind(rep("GROUP_1",times=50),sample(51:100,size=50,replace=TRUE),rep(NA,times=50),rbinom(50,1,0.75))))
group2<-as.data.frame(rbind(cbind(rep("GROUP_2",times=50),sample(101:150,size=50,replace=TRUE),rbinom(50,1,0.05),rep(NA,times=50)),
                            cbind(rep("GROUP_2",times=50),sample(151:200,size=50,replace=TRUE),rep(NA,times=50),rbinom(50,1,0.95))))
group3<-as.data.frame(rbind(cbind(rep("GROUP_3",times=50),sample(201:250,size=50,replace=TRUE),rbinom(50,1,0.50),rep(NA,times=50)),
                            cbind(rep("GROUP_3",times=50),sample(251:300,size=50,replace=TRUE),rep(NA,times=50),rbinom(50,1,0.50))))
group4<-as.data.frame(rbind(cbind(rep("GROUP_4",times=50),sample(301:350,size=50,replace=TRUE),rbinom(50,1,0.67),rep(NA,times=50)),
                            cbind(rep("GROUP_4",times=50),sample(351:400,size=50,replace=TRUE),rep(NA,times=50),rbinom(50,1,0.33))))
group5<-as.data.frame(rbind(cbind(rep("GROUP_5",times=50),sample(401:450,size=50,replace=TRUE),rbinom(50,1,0.20),rep(NA,times=50)),
                            cbind(rep("GROUP_5",times=50),sample(451:500,size=50,replace=TRUE),rep(NA,times=50),rbinom(50,1,0.20))))

testdata<-rbind(group1,group2,group3,group4,group5)
names(testdata)<-c("GROUP","ID","Var1_","Var2_")

I’ll note here that Var1_ and Var2_ were measured on different individuals, though the individuals were sampled from the same population; that’s why I placed the NAs in the data frame. Also note that some of the data may be pseudoreplicated by ID – I’ll come back to that later.

I’m looking for a way of analyzing this problem.

Specifically I want to know:

1. Does the state of Var1_ significantly predict the state of Var2_? What is the relationship and how strong is it?

2. Is this relationship significantly different between GROUP levels? How different is it (effect size)?

My approach so far:

Some searching indicated that a GLM might be appropriate. However, as far as I can tell, glm() requires Var1_ and Var2_ to be observations from the same individuals. So, just to see how it looks, I’ll take the NAs out and run a GLM.

group1<-as.data.frame(cbind(rep("GROUP_1",times=50),rbinom(50,1,0.25),rbinom(50,1,0.75)))
group2<-as.data.frame(cbind(rep("GROUP_2",times=50),rbinom(50,1,0.05),rbinom(50,1,0.95)))
group3<-as.data.frame(cbind(rep("GROUP_3",times=50),rbinom(50,1,0.50),rbinom(50,1,0.50)))
group4<-as.data.frame(cbind(rep("GROUP_4",times=50),rbinom(50,1,0.67),rbinom(50,1,0.33)))
group5<-as.data.frame(cbind(rep("GROUP_5",times=50),rbinom(50,1,0.20),rbinom(50,1,0.20)))

testdata<-rbind(group1,group2,group3,group4,group5)
names(testdata)<-c("GROUP","Var1_","Var2_")

# cbind() coerces the 0/1 draws to character, so convert back before modelling
testdata$Var1_<-as.numeric(as.character(testdata$Var1_))
testdata$Var2_<-as.numeric(as.character(testdata$Var2_))

glm_obj_1 <- glm(Var1_ ~ Var2_, data=testdata, family=binomial(link="logit"))

summary(glm_obj_1)

Ok… I think that might be what I’m after, but I’d like to know if the relationship is different between GROUPs.

glm_obj_2 <- glm(Var1_ ~ Var2_ * GROUP, data=testdata, family=binomial(link="logit"))
summary(glm_obj_2)

Ok! Is this a reasonable approach? Removing the NAs seems questionable to me, but I’m not sure how else to deal with them in the GLM specification.

Extra Credit:

I mentioned the pseudoreplication before. I assume this could be controlled for with a random effect? I’ll try a GLMM implemented in lme4… This time I’ll use fewer ID levels to increase the odds of pseudoreplication.

library(lme4)
group1<-as.data.frame(cbind(rep("GROUP_1",times=50),sample(1:5,size=50,replace=TRUE),rbinom(50,1,0.25),rbinom(50,1,0.75)))
group2<-as.data.frame(cbind(rep("GROUP_2",times=50),sample(6:10,size=50,replace=TRUE),rbinom(50,1,0.05),rbinom(50,1,0.95)))
group3<-as.data.frame(cbind(rep("GROUP_3",times=50),sample(11:15,size=50,replace=TRUE),rbinom(50,1,0.50),rbinom(50,1,0.50)))
group4<-as.data.frame(cbind(rep("GROUP_4",times=50),sample(16:20,size=50,replace=TRUE),rbinom(50,1,0.67),rbinom(50,1,0.33)))
group5<-as.data.frame(cbind(rep("GROUP_5",times=50),sample(21:25,size=50,replace=TRUE),rbinom(50,1,0.20),rbinom(50,1,0.20)))

testdata<-rbind(group1,group2,group3,group4,group5)
names(testdata)<-c("GROUP","ID","Var1_","Var2_")
testdata$Var1_<-as.numeric(as.character(testdata$Var1_))
testdata$Var2_<-as.numeric(as.character(testdata$Var2_))

# random intercept for ID to absorb repeated measures on the same individual;
# if it fails to converge, try control=glmerControl(optimizer="bobyqa")
glmer_obj<-glmer(Var1_ ~ Var2_ * GROUP + (1|ID), data=testdata, family=binomial(link="logit"))

summary(glmer_obj)

Well, it fails to converge, but for illustrative purposes, I hope it suffices. Would this be an appropriate way to tackle this type of data?

I’m eager to hear your responses, suggestions, criticisms, etc.

Thanks in advance!



#StackBounty: #hypothesis-testing #correlation #statistical-significance #anova #multilevel-analysis Correlation or difference? appropr…

Bounty: 50

In a comparative study on human motion and perception in both reality and virtual reality, I am examining differences and relationships between specific measures.

In two separate experiments, TTG Estimation and Distance Estimation data were collected from the same set of test persons, who had the following tasks:

  • TTG Estimation task: estimation of a Time-To-Go in [seconds]. Test subjects are asked to signal the point in time (in Reality and in Virtual Reality) at which they feel there is just not enough time left for securely crossing the street before the arrival of a vehicle (at 3 different speeds).
  • Distance Estimation task: test subjects are asked to estimate the distance in [m] between themselves and a pylon (in Reality and in Virtual Reality).

The following table visualises the experiment design with the corresponding variables to be examined:

+-----------------+-----------------------------+---------------------------------+
|                 | TTG Estimation Experiment   | Distance Estimation Experiment  |
+                 +-----------------------------+---------------------------------+
|                 | 30 km/h | 35 km/h | 40 km/h | 8m | 9m | 10m | 11m | 12m | 13m |
+-----------------+---------+---------+---------+----+----+-----+-----+-----+-----+
| Reality         |         |         |         |    |    |     |     |     |     |
+-----------------+---------+---------+---------+----+----+-----+-----+-----+-----+
| Virtual Reality |         |         |         |    |    |     |     |     |     |
+-----------------+---------+---------+---------+----+----+-----+-----+-----+-----+
| Test Variables  |     abs(TTG_R - TTG_VR)     |      abs(Dist_R - Dist_VR)      |
+-----------------+-----------------------------+---------------------------------+

So in general I’m interested in the answer to the following question:
Is there a relation between the estimation ability/quality for TTG and the estimation ability/quality for Distance? Or in other words …
Do test persons with a good estimation for the TTG also give a good estimation for the Distance?

So my questions are:
a) Which hypothesis approach is more appropriate: a comparison between the two test variables, OR a correlation analysis?
I am confused by the fact that the test variables |TTG_R – TTG_VR| and |Dist_R – Dist_VR| have different numbers of levels (TTG: 3 vehicle speeds; Distance: 6 distances) and that they are measured in different units (TTG: s; Distance: m).

b) What exact method/test can I use to address my point of interest? Does a difference hypothesis make sense (e.g. an ANOVA with multiple factors)?
I have thought of normalising the test variables via z-scores and checking for a bivariate correlation (a sketch follows the software list below), but then I immediately realise there are different test levels for each variable (TTG: 3 vehicle speeds; Distance: 6 distances).
And regarding a difference hypothesis, I am not sure whether the test levels can be used as factors.

Available software for evaluation:

  • Matlab
  • SPSS
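
For what it’s worth, here is a minimal sketch (in Python with NumPy/SciPy; the same steps translate directly to Matlab or SPSS) of the z-score-and-correlate idea from (b). The arrays are simulated placeholders, one row per test person and one column per speed/distance level:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 25  # hypothetical number of test persons

# placeholder data: rows = subjects, columns = condition levels
ttg_r = rng.normal(4, 1.0, (n, 3))
ttg_vr = rng.normal(4, 1.2, (n, 3))
dist_r = rng.normal(10, 2.0, (n, 6))
dist_vr = rng.normal(10, 2.5, (n, 6))

# one estimation-quality score per subject and task: the mean |Reality - VR|
# discrepancy over that task's levels, which sidesteps the unequal level
# counts (3 speeds vs 6 distances)
ttg_err = np.abs(ttg_r - ttg_vr).mean(axis=1)
dist_err = np.abs(dist_r - dist_vr).mean(axis=1)

# units differ (s vs m); correlation is scale-invariant anyway, but z-scores
# also put both scores on a common scale for plotting
r, p = stats.pearsonr(stats.zscore(ttg_err), stats.zscore(dist_err))
rho, p_s = stats.spearmanr(ttg_err, dist_err)  # rank-based alternative
print(r, p, rho, p_s)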



#HackerRank: Correlation and Regression Lines solutions

import numpy as np
import scipy as sp
import scipy.stats  # needed so that sp.stats.linregress below resolves

Correlation and Regression Lines – A Quick Recap #1

Here are the test scores of 10 students in physics and history:

Physics Scores 15 12 8 8 7 7 7 6 5 3

History Scores 10 25 17 11 13 17 20 13 9 15

Compute Karl Pearson’s coefficient of correlation between these scores. Compute the answer correct to three decimal places.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: 0.255

This is NOT the actual answer – just the format in which you should provide your answer.

physicsScores=[15, 12,  8,  8,  7,  7,  7,  6, 5,  3]
historyScores=[10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
print(np.corrcoef(historyScores,physicsScores)[0][1])
0.144998154581
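
As a quick hand check: with physics mean x̄ = 7.8 and history mean ȳ = 15, the deviation sums are Σ(x − x̄)(y − ȳ) = 22, Σ(x − x̄)² = 105.6 and Σ(y − ȳ)² = 218, so r = 22 / √(105.6 × 218) ≈ 0.145, matching the printed value to three decimal places.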

Correlation and Regression Lines – A Quick Recap #2

Here are the test scores of 10 students in physics and history:

Physics Scores 15 12 8 8 7 7 7 6 5 3

History Scores 10 25 17 11 13 17 20 13 9 15

Compute the slope of the line of regression obtained while treating Physics as the independent variable. Compute the answer correct to three decimal places.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: 0.255

This is NOT the actual answer – just the format in which you should provide your answer.

sp.stats.linregress(physicsScores,historyScores).slope
0.20833333333333331
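
Hand check: with Physics as the independent variable, the slope is b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² = 22 / 105.6 ≈ 0.208.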

Correlation and Regression Lines – A Quick Recap #3

Here are the test scores of 10 students in physics and history:

Physics Scores 15 12 8 8 7 7 7 6 5 3

History Scores 10 25 17 11 13 17 20 13 9 15

When a student scores 10 in Physics, what is his probable score in History? Compute the answer correct to one decimal place.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: 0.255

This is NOT the actual answer – just the format in which you should provide your answer.

def predict(pi, x, y):
    # fit history on physics, then evaluate the fitted line at pi
    slope, intercept, rvalue, pvalue, stderr = sp.stats.linregress(x, y)
    return slope * pi + intercept

predict(10,physicsScores,historyScores)
15.458333333333332
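
Hand check: the intercept is ȳ − b·x̄ = 15 − 0.2083 × 7.8 = 13.375, so the predicted History score at Physics = 10 is 13.375 + 0.2083 × 10 ≈ 15.458, i.e. 15.5 to one decimal place.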

Correlation and Regression Lines – A Quick Recap #4

The two regression lines of a bivariate distribution are:

4x – 5y + 33 = 0 (line of y on x)

20x – 9y – 107 = 0 (line of x on y).

Estimate the value of x when y = 7. Compute the correct answer to one decimal place.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: 7.2

This is NOT the actual answer – just the format in which you should provide your answer.

'''
    Line of y on x:  4x - 5y + 33 = 0
        x = (5y - 33) / 4
        y = (4x + 33) / 5

    Line of x on y:  20x - 9y - 107 = 0
        x = (9y + 107) / 20
        y = (20x - 107) / 9
'''
# to estimate x from a given y, use the line of x on y
t = 7
print((9 * t + 107) / 20)
8.5

Correlation and Regression Lines – A Quick Recap #5

The two regression lines of a bivariate distribution are:

4x – 5y + 33 = 0 (line of y on x)

20x – 9y – 107 = 0 (line of x on y).

Find the variance of y when σx = 3.

Compute the correct answer to one decimal place.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: 7.2

This is NOT the actual answer – just the format in which you should provide your answer.

http://www.mpkeshari.com/2011/01/19/lines-of-regression/

Q.3. If the two regression lines of a bivariate distribution are 4x – 5y + 33 = 0 and 20x – 9y – 107 = 0,

  • calculate the arithmetic means of x and y respectively;
  • estimate the value of x when y = 7;
  • find the variance of y when σx = 3.
Solution:

We have

4x – 5y + 33 = 0 => y = 4x/5 + 33/5 … (i)

and

20x – 9y – 107 = 0 => x = 9y/20 + 107/20 … (ii)

(i) Solving (i) and (ii), we get mean of x = 13 and mean of y = 17. [Ans.]

(ii) The second line is the line of x on y, so for y = 7:

x = (9/20) × 7 + (107/20) = 170/20 = 8.5 [Ans.]

(iii) byx = 4/5 and bxy = 9/20, so r = √(byx × bxy) = √{(4/5)(9/20)} = √0.36 = 0.6.
From byx = r(σy/σx): 4/5 = 0.6 × σy/3 => σy = (4/5)(3/0.6) = 4 [Ans.]

variance = σy² = 16
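
A quick numeric check of (ii) and (iii) in Python (variable names are mine):

import math

b_yx = 4 / 5    # slope of the y-on-x line: y = (4x + 33) / 5
b_xy = 9 / 20   # slope of the x-on-y line: x = (9y + 107) / 20

x_at_y7 = b_xy * 7 + 107 / 20   # 8.5, as in (ii)
r = math.sqrt(b_yx * b_xy)      # 0.6; both slopes positive, so r > 0
sigma_y = b_yx * 3 / r          # from b_yx = r * sigma_y / sigma_x with sigma_x = 3
print(x_at_y7, r, sigma_y, sigma_y ** 2)   # 8.5 0.6 4.0 16.0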

What is the difference between linear regression on y with x and x with y?

The Pearson correlation coefficient of x and y is the same, whether you compute pearson(x, y) or pearson(y, x). This suggests that doing a linear regression of y given x or x given y should be the same, but that’s not the case.

The best way to think about this is to imagine a scatter plot of points with y on the vertical axis and x on the horizontal axis. Given this framework, you see a cloud of points, which may be vaguely circular, or may be elongated into an ellipse. What you are trying to do in regression is find what might be called the ‘line of best fit’. However, while this seems straightforward, we need to figure out what we mean by ‘best’, and that means we must define what it would be for a line to be good, or for one line to be better than another, etc. Specifically, we must stipulate a loss function. A loss function gives us a way to say how ‘bad’ something is, and thus, when we minimize that, we make our line as ‘good’ as possible, or find the ‘best’ line.

Traditionally, when we conduct a regression analysis, we find estimates of the slope and intercept so as to minimize the sum of squared errors. Taking x as the predictor and y as the response, these are defined as follows:

SSE = Σ [y_i − (β^0 + β^1 x_i)]²

In terms of our scatter plot, this means we are minimizing the sum of the vertical distances between the observed data points and the line.

[figure: scatter plot with the fitted line and vertical segments from the points to the line, the distances minimized when regressing y on x]

On the other hand, it is perfectly reasonable to regress x onto y, but in that case, we would put x on the vertical axis, and so on. If we kept our plot as is (with x on the horizontal axis), regressing x onto y (again, using a slightly adapted version of the above equation with x and y switched) means that we would be minimizing the sum of the horizontal distances between the observed data points and the line. This sounds very similar, but is not quite the same thing. (The way to recognize this is to do it both ways, and then algebraically convert one set of parameter estimates into the terms of the other. Comparing the first model with the rearranged version of the second model, it becomes easy to see that they are not the same.)

[figure: the same scatter plot with horizontal segments from the points to the line, the distances minimized when regressing x on y]

Note that neither way would produce the same line we would intuitively draw if someone handed us a piece of graph paper with points plotted on it. In that case, we would draw a line straight through the center, but minimizing the vertical distance yields a line that is slightly flatter (i.e., with a shallower slope), whereas minimizing the horizontal distance yields a line that is slightly steeper.

A correlation is symmetrical: x is as correlated with y as y is with x. The Pearson product-moment correlation can be understood within a regression context, however. The correlation coefficient, r, is the slope of the regression line when both variables have been standardized first. That is, you first subtract off the mean from each observation and then divide the differences by the standard deviation. The cloud of data points will now be centered on the origin, and the slope would be the same whether you regressed y onto x, or x onto y.
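
A small simulation (my addition, not part of the original answer) makes this concrete: the two raw-scale slopes differ, yet their product is r², and after standardizing, the slope equals r whichever way you regress:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(0, 2, 200)
y = 0.5 * x + rng.normal(0, 1, 200)   # noisy linear relationship

b_yx = stats.linregress(x, y).slope   # regress y on x
b_xy = stats.linregress(y, x).slope   # regress x on y
r = stats.pearsonr(x, y)[0]

print(b_yx, 1 / b_xy)        # not equal: the two fitted lines differ
print(b_yx * b_xy, r ** 2)   # equal: the slopes' product is r squared
print(stats.linregress(stats.zscore(x), stats.zscore(y)).slope, r)  # equal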

[figure: scatter plot of the standardized variables, centred on the origin, where the same line is obtained either way]

Now, why does this matter? Using our traditional loss function, we are saying that all of the error is in only one of the variables (viz., y). That is, we are saying that x is measured without error and constitutes the set of values we care about, but that y has sampling error. This is very different from saying the converse. This was important in an interesting historical episode: In the late 70’s and early 80’s in the US, the case was made that there was discrimination against women in the workplace, and this was backed up with regression analyses showing that women with equal backgrounds (e.g., qualifications, experience, etc.) were paid, on average, less than men. Critics (or just people who were extra thorough) reasoned that if this was true, women who were paid equally with men would have to be more highly qualified, but when this was checked, it was found that although the results were ‘significant’ when assessed the one way, they were not ‘significant’ when checked the other way, which threw everyone involved into a tizzy. See here for a famous paper that tried to clear the issue up.

Here’s another way to think about this that approaches the topic through the formulas instead of visually:

The formula for the slope of a simple regression line is a consequence of the loss function that has been adopted. If you are using the standard Ordinary Least Squares loss function (noted above), you can derive the formula for the slope that you see in every intro textbook. This formula can be presented in various forms, one of which I call the ‘intuitive’ formula for the slope. Consider this form for both the situation where you are regressing y on x, and where you are regressing x on y:

regressing y on x:  β^1 = Cov(x, y) / Var(x)
regressing x on y:  β^1 = Cov(x, y) / Var(y)

Now, I hope it’s obvious that these would not be the same unless Var(x) equals Var(y). If the variances are equal (e.g., because you standardized the variables first), then so are the standard deviations, and both variances also equal SD(x)SD(y). In this case, β^1 would equal Pearson’s r, which is the same either way by virtue of the principle of commutativity:

r = Cov(x, y) / [SD(x)SD(y)] = Cov(y, x) / [SD(y)SD(x)]

Source