*Bounty: 50*

*Bounty: 50*

## The Problem:

I have measured two binary variables within 1 categorical variable with 5 levels.

Initially, I thought I’d be able to use Fisher’s Exact test or some N x M x K version of it. However I have only found references where the dimensions of the table are specifically 2 x 2 x 2 (see the “hypergea” package in R as an example). In my case it is a 5 x 2 x 2.

I was wondering if there are other approaches that would be appropriate.

Here’s an example of what I’m working with in R.

```
set.seed(66)
group1<-as.data.frame(rbind(cbind(rep("GROUP_1",times=50),sample(x=c(1:50),size=50,replace=TRUE),rbinom(50,1,0.25),rep(NA,times=50)),cbind(rep("GROUP_1",times=50),sample(x=c(51:100),size=50,replace=TRUE),rep(NA,times=50),rbinom(50,1,0.75))))
group2<-as.data.frame(rbind(cbind(rep("GROUP_2",times=50),sample(x=c(101:150),size=50,replace=TRUE),rbinom(50,1,0.05),rep(NA,times=50)),cbind(rep("GROUP_2",times=50),sample(x=c(151:200),size=50,replace=TRUE),rep(NA,times=50),rbinom(50,1,0.95))))
group3<-as.data.frame(rbind(cbind(rep("GROUP_3",times=50),sample(x=c(201:250),size=50,replace=TRUE),rbinom(50,1,0.50),rep(NA,times=50)),cbind(rep("GROUP_3",times=50),sample(x=c(251:300),size=50,replace=TRUE),rep(NA,times=50),rbinom(50,1,0.50))))
group4<-as.data.frame(rbind(cbind(rep("GROUP_4",times=50),sample(x=c(301:350),size=50,replace=TRUE),rbinom(50,1,0.67),rep(NA,times=50)),cbind(rep("GROUP_4",times=50),sample(x=c(351:400),size=50,replace=TRUE),rep(NA,times=50),rbinom(50,1,0.33))))
group5<-as.data.frame(rbind(cbind(rep("GROUP_5",times=50),sample(x=c(301:350),size=50,replace=TRUE),rbinom(50,1,0.20),rep(NA,times=50)),cbind(rep("GROUP_5",times=50),sample(x=c(351:400),size=50,replace=TRUE),rep(NA,times=50),rbinom(50,1,0.20))))
testdata<-rbind(group1,group2,group3,group4)
names(testdata)<-c("GROUP","ID","Var1_","Var2_")
```

I’ll note here that Var1_ and Var2_ were measured from different individuals, though the individuals were sampled from the same population. That’s why I placed the NA’s in the data frame. Also, note that some of the data may be pseudoreplicated by ID – I’ll come back to that later.

I’m looking for a way of analyzing this problem.

**Specifically I want to know:**

**1. Does the state of Var1_ significantly predict the state of Var2_? What is the relationship and how strong is it?**

**2. Is this relationship significantly different between GROUP levels? How different is it (effect size)?**

## My approach so far:

Some searching indicated that a GLM might be appropriate. However, as far as I can tell, the GLM function requires the V1 and V2 variables to be observations from the same sample. So, just to see how it looks, I’ll take the NA’s out and run a GLM.

```
group1<-as.data.frame(cbind(rep("GROUP_1",times=50),rbinom(50,1,0.25),rbinom(50,1,0.75)))
group2<-as.data.frame(cbind(rep("GROUP_2",times=50),rbinom(50,1,0.05),rbinom(50,1,0.95)))
group3<-as.data.frame(cbind(rep("GROUP_3",times=50),rbinom(50,1,0.50),rbinom(50,1,0.50)))
group4<-as.data.frame(cbind(rep("GROUP_4",times=50),rbinom(50,1,0.67),rbinom(50,1,0.33)))
group5<-as.data.frame(cbind(rep("GROUP_5",times=50),rbinom(50,1,0.20),rbinom(50,1,0.20)))
testdata<-rbind(group1,group2,group3,group4)
names(testdata)<-c("GROUP","Var1_","Var2_")
glm_obj_1 <- glm(Var1_ ~ Var2_, data=testdata, family=binomial(link="logit"))
summary(glm_obj)
```

Ok… I think that might be what I’m after, but I’d like to know if the relationship is different between GROUPs.

```
glm_obj_2 <- glm(Var1_ ~ Var2_* GROUP, data=testdata, family=binomial(link="logit"))
summary(glm_obj_2)
```

Ok! Is this a reasonable approach? Removing the NA’s seems questionable to me, but I’m not sure how else to deal with that in the GLM specification.

## Extra Credit:

I mentioned the pseudoreplication before. I assume this could be controlled for with a random effect? I’ll try a GLMM implemented in LME4… This time I’ll add fewer ID levels to increase the odds of pseudoreplication.

```
library(lme4)
group1<-as.data.frame(cbind(rep("GROUP_1",times=50),sample(x=c(1:5),size=50,replace=TRUE),rbinom(50,1,0.25),rbinom(50,1,0.75)))
group2<-as.data.frame(cbind(rep("GROUP_2",times=50),sample(x=c(6:10),size=50,replace=TRUE),rbinom(50,1,0.05),rbinom(50,1,0.95)))
group3<-as.data.frame(cbind(rep("GROUP_3",times=50),sample(x=c(11:15),size=50,replace=TRUE),rbinom(50,1,0.50),rbinom(50,1,0.50)))
group4<-as.data.frame(cbind(rep("GROUP_4",times=50),sample(x=c(16:20),size=50,replace=TRUE),rbinom(50,1,0.67),rbinom(50,1,0.33)))
group5<-as.data.frame(cbind(rep("GROUP_5",times=50),sample(x=c(21:25),size=50,replace=TRUE),rbinom(50,1,0.20),rbinom(50,1,0.20)))
testdata<-rbind(group1,group2,group3,group4)
names(testdata)<-c("GROUP","ID","Var1_","Var2_")
glmer_obj<-glmer(Var1_ ~ Var2_ * GROUP + (1|ID), data=testdata, family=binomial(link="logit"))
summary(glm_obj)
```

Well, it fails to converge, but for illustrative purposes, I hope it suffices. Would this be an appropriate way to tackle this type of data?

I’m eager to hear your responses, suggestions, criticisms, etc.

Thanks in advance!