#StackBounty: #r #multivariate-analysis #covariance #vegan Covariance and independent variance in multivariate datasets – R

Bounty: 50

I have a dataset that describes field reference sites with different groups of environmental indicators (continuous quantitative data). I’m interested in understanding how a group of parameters describes the total statistical heterogeneity (variance, inertia, …) of the dataset. More specifically, I would like to tackle two questions:

  • Question 1: How similarly do the different groups of indicators describe the total heterogeneity of the entire dataset?

  • Question 2: How differently do these (groups of) indicators contribute to the total heterogeneity within my dataset? That is, for one group of parameters, how much of the total heterogeneity of the entire dataset does its unshared variance explain?

Below I tried to make a reproducible example in R, using the environmental part of the "doubs" dataset from the ade4 package as a toy dataset. I think I found solutions to address Question 1 (see the reproducible example below), but I’m looking for statistical means to address Question 2.

library(ade4)
# This data set gives environmental variables, fish species and spatial coordinates for 30 sites.
data("doubs")

# extracting the environmental variables
env_heterogeneity <- doubs$env
head(env_heterogeneity)

# selecting 2 groups of environmental parameters
multivariate_dataset_1 <- env_heterogeneity[,1:4] # physical/morphology parameters
multivariate_dataset_2 <- env_heterogeneity[,5:11] # chemical parameters

# how similar are the two multivariate datasets to each other?
RV.rtest(multivariate_dataset_1,multivariate_dataset_2) 

# RV.rtest(multivariate_dataset_1,multivariate_dataset_2) 
# Monte-Carlo test
# Call: RV.rtest(df1 = multivariate_dataset_1, df2 = multivariate_dataset_2)
# 
# Observation: 0.3940863 
# 
# Based on 99 replicates
# Simulated p-value: 0.01 
# Alternative hypothesis: greater 
# 
#     Std.Obs Expectation    Variance 
# 6.578328982 0.043988310 0.002832357 

The RV coefficient, tested here with RV.rtest from the ade4 package, is a multivariate generalization of the squared Pearson correlation coefficient. It provides a good estimate of the variance shared between multivariate_dataset_1 and multivariate_dataset_2.
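For readers who want to see what the coefficient measures, here is a minimal base-R sketch of the RV coefficient. The helper name rv_coefficient is hypothetical, and the sketch assumes columns are centered but not rescaled; it is not ade4's own code and it skips the permutation test.

```r
# RV coefficient between two tables measured on the same rows (sites).
# Assumption of this sketch: columns are centered but not rescaled.
rv_coefficient <- function(df1, df2) {
  X <- scale(as.matrix(df1), center = TRUE, scale = FALSE)
  Y <- scale(as.matrix(df2), center = TRUE, scale = FALSE)
  # tr(XX'YY') equals the sum of squared entries of X'Y
  shared   <- sum(crossprod(X, Y)^2)
  within_x <- sum(crossprod(X)^2)
  within_y <- sum(crossprod(Y)^2)
  shared / sqrt(within_x * within_y)
}

# Usage on the toy data, only if ade4 is installed
if (requireNamespace("ade4", quietly = TRUE)) {
  data("doubs", package = "ade4")
  print(rv_coefficient(doubs$env[, 1:4], doubs$env[, 5:11]))
}
```

With a single column in each table, this quantity reduces to the squared Pearson correlation between the two columns.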
I could also use a variance partitioning approach based on redundancy analysis in the vegan package. Note that rda() treats its first argument (X) as the response table and its second (Y) as the constraints, so the call below tells me that 71% of the variance of multivariate_dataset_1 can be explained by multivariate_dataset_2:

# how much of the variance of multivariate_dataset_1 can multivariate_dataset_2 explain?
library(vegan)
RDA_1 <- rda(X = multivariate_dataset_1 , Y = multivariate_dataset_2)
summary(RDA_1)

# summary(RDA_1)
# 
# Call:
#   rda(X = multivariate_dataset_1, Y = multivariate_dataset_2) 
# 
# Partitioning of variance:
#               Inertia Proportion
# Total         5300660     1.0000
# Constrained   3786549     0.7144
# Unconstrained 1514111     0.2856
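A side note on the size of the total inertia: rda() does not standardize variables unless scale = TRUE is passed, so the total is the sum of the raw column variances of the table given as X, and variables measured in large units can dominate it. A minimal sketch for checking this, assuming a hypothetical helper named inertia_shares:

```r
# Share of the total inertia (sum of column variances) carried by each variable.
inertia_shares <- function(df) {
  v <- apply(as.matrix(df), 2, var)
  v / sum(v)
}

# Usage on the toy data, only if ade4 is installed
if (requireNamespace("ade4", quietly = TRUE)) {
  data("doubs", package = "ade4")
  print(round(inertia_shares(doubs$env[, 1:4]), 3))
}
```

If one or two variables carry nearly all of the inertia, rerunning the analysis with scale = TRUE may give a more balanced picture.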

I think I have satisfactory solutions for Q1, but I’m completely in the dark about Q2. As my wording is not helping, I also tried the Venn diagram below, which represents the variance and covariance of the datasets. For me, Q1 is about the light grey area, and Q2 is rather about the pure white or pure grey areas (the variance of dataset 1 minus the covariance between datasets 1 and 2). Don’t hesitate to help me improve my wording through comments.

[Image: Venn diagram of the variance and covariance of the two datasets]
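One possible reading of the non-overlapping areas, offered only as a sketch and not as a settled answer to Q2: regress one table on the other and treat the residual variance as the "unshared" part. The helper name split_variance is hypothetical; the sketch assumes linear relationships and unscaled variables, and it expresses the unshared part relative to the variance of the target table only, not of the whole dataset.

```r
# Split the variance of one table into the part linearly predictable from
# another table and the remainder (the "unshared" part in this sketch).
split_variance <- function(target, predictors) {
  Y <- scale(as.matrix(target), center = TRUE, scale = FALSE)
  X <- as.matrix(predictors)
  fit <- lm(Y ~ X)  # multivariate linear regression with an intercept
  total     <- sum(apply(Y, 2, var))
  explained <- sum(apply(as.matrix(fitted(fit)), 2, var))
  unshared  <- sum(apply(as.matrix(resid(fit)), 2, var))
  c(total = total, explained = explained, unshared = unshared,
    unshared_share = unshared / total)
}

# Usage on the toy data, only if ade4 is installed
if (requireNamespace("ade4", quietly = TRUE)) {
  data("doubs", package = "ade4")
  env <- doubs$env
  print(split_variance(env[, 1:4], env[, 5:11]))  # unshared part of group 1
  print(split_variance(env[, 5:11], env[, 1:4]))  # unshared part of group 2
}
```

The first call is expected to mirror the Total, Constrained and Unconstrained rows of the RDA summary, since RDA is built on the fitted values of this kind of regression. The second call swaps the roles to look at the other non-overlapping area.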

