I have a dataset that describes field reference sites using several groups of environmental indicators (continuous quantitative data). I’m interested in understanding how each group of parameters describes the total statistical heterogeneity (variance, inertia, …) of the dataset. More specifically, I would like to tackle two questions:
Question 1: How similarly do the different groups of indicators describe the total heterogeneity of the entire dataset?
Question 2: How differently do these (groups of) indicators contribute to the total heterogeneity within my dataset? That is, for one group of parameters, how much of the total heterogeneity of the entire dataset does its unshared variance explain?
Below is a reproducible example in R, using the environmental part of the "doubs" dataset from the ade4 package as a toy dataset. I think I have found solutions for Question 1 (shown in the example), but I’m looking for statistical means to address Question 2.
    library(ade4)

    # This data set gives environmental variables, fish species and spatial coordinates for 30 sites.
    data("doubs")

    # extracting the environmental variables
    env_heterogeneity <- doubs$env
    head(env_heterogeneity)

    # selecting 2 groups of environmental parameters
    multivariate_dataset_1 <- env_heterogeneity[, 1:4]   # physical/morphology parameters
    multivariate_dataset_2 <- env_heterogeneity[, 5:11]  # chemical parameters

    # how similar to each other the two multivariate datasets?
    RV.rtest(multivariate_dataset_1, multivariate_dataset_2)

    # RV.rtest(multivariate_dataset_1,multivariate_dataset_2)
    # Monte-Carlo test
    # Call: RV.rtest(df1 = multivariate_dataset_1, df2 = multivariate_dataset_2)
    #
    # Observation: 0.3940863
    #
    # Based on 99 replicates
    # Simulated p-value: 0.01
    # Alternative hypothesis: greater
    #
    #     Std.Obs Expectation    Variance
    # 6.578328982 0.043988310 0.002832357
The RV coefficient, tested here with RV.rtest from the ade4 package, is a multivariate generalization of the squared Pearson correlation coefficient. It gives a good estimate of the variance shared between multivariate_dataset_1 and multivariate_dataset_2.
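To make explicit what the RV coefficient measures, here is a minimal sketch that computes it by hand as tr(Sxy Syx) / sqrt(tr(Sxx²) tr(Syy²)), where Sxx, Syy and Sxy are the cross-product matrices of the column-centred tables. The function name rv_coefficient, its scale argument and the simulated tables toy_1 and toy_2 are hypothetical and used only for illustration. I have not checked whether RV.rtest standardizes the columns internally, so its observed value should only be compared with this sketch after verifying that point.

```r
# Hand-rolled RV coefficient between two tables measured on the same sites.
# `scale = TRUE` additionally standardizes each column to unit variance
# (an optional choice made for this sketch, not a claim about RV.rtest).
rv_coefficient <- function(df1, df2, scale = FALSE) {
  stopifnot(nrow(df1) == nrow(df2))
  X <- scale(as.matrix(df1), center = TRUE, scale = scale)
  Y <- scale(as.matrix(df2), center = TRUE, scale = scale)
  Sxy <- crossprod(X, Y)  # t(X) %*% Y
  Sxx <- crossprod(X)     # t(X) %*% X
  Syy <- crossprod(Y)     # t(Y) %*% Y
  # tr(Sxy %*% t(Sxy)) equals the sum of squared entries of Sxy;
  # the same identity holds for the symmetric matrices Sxx and Syy.
  sum(Sxy^2) / sqrt(sum(Sxx^2) * sum(Syy^2))
}

# Simulated toy tables (hypothetical data): toy_2 partly depends on toy_1.
set.seed(1)
toy_1 <- matrix(rnorm(30 * 4), nrow = 30)
toy_2 <- cbind(toy_1[, 1:2] + rnorm(60, sd = 0.5),
               matrix(rnorm(30 * 5), nrow = 30))

rv_self <- rv_coefficient(toy_1, toy_1)  # a table compared with itself
rv_toy  <- rv_coefficient(toy_1, toy_2)  # partially related tables
print(c(self = rv_self, toy = rv_toy))
```

With the objects from my example, the call would be rv_coefficient(multivariate_dataset_1, multivariate_dataset_2). Note that the coefficient is symmetric in its two arguments, so on its own it cannot say which group carries the unshared variance.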
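I could also use a variance partitioning approach based on redundancy analysis from the vegan package. The output below tells me that 71% of the variance of multivariate_dataset_1 can be explained by multivariate_dataset_2 (in rda(), X is the response table and Y is the constraining table, and the variables are not standardized by default):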
    # how much of the variance of multivariate_dataset_1 can multivariate_dataset_2 explain?
    # (in rda(), X is the response table and Y is the constraining table)
    library(vegan)
    RDA_1 <- rda(X = multivariate_dataset_1, Y = multivariate_dataset_2)
    summary(RDA_1)

    # summary(RDA_1)
    #
    # Call:
    # rda(X = multivariate_dataset_1, Y = multivariate_dataset_2)
    #
    # Partitioning of variance:
    #               Inertia Proportion
    # Total         5300660     1.0000
    # Constrained   3786549     0.7144
    # Unconstrained 1514111     0.2856
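Unlike the RV coefficient, the proportion of constrained inertia in an RDA depends on which table is the response, and the raw variables here have very different units. As a hedged sketch rather than a definitive analysis, the block below fits the RDA in both directions on standardized variables (scale = TRUE) and extracts the raw and adjusted R² with vegan's RsquareAdj(). The object names physical, chemical, rda_phys_by_chem and rda_chem_by_phys are hypothetical, introduced only for this illustration.

```r
library(ade4)
library(vegan)

data("doubs")
env <- doubs$env
physical <- env[, 1:4]   # same columns as multivariate_dataset_1
chemical <- env[, 5:11]  # same columns as multivariate_dataset_2

# In rda(), the first table is the response and the second the constraints.
# scale = TRUE standardizes the response variables to unit variance so that
# variables with large units do not dominate the total inertia.
rda_phys_by_chem <- rda(physical, chemical, scale = TRUE)
rda_chem_by_phys <- rda(chemical, physical, scale = TRUE)

# RsquareAdj() returns the raw R2 and the R2 adjusted for the number of
# constraining variables.
r2_phys_by_chem <- RsquareAdj(rda_phys_by_chem)
r2_chem_by_phys <- RsquareAdj(rda_chem_by_phys)

print(unlist(r2_phys_by_chem))
print(unlist(r2_chem_by_phys))
```

The two directions generally give different proportions, so any statement of the form "x% of the variance of one group is explained by the other" needs to name the response table explicitly.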
I think I have satisfactory solutions for Q1, but I’m completely in the dark about Q2. As my wording may not be helping, I also drew the Venn diagram below, which represents the variance and covariance of the datasets. To me, Q1 is about the light grey area, while Q2 is about the pure white or pure grey areas (the variance of dataset 1 minus the covariance between datasets 1 and 2). Don’t hesitate to help me improve my wording through comments.