I have data that is non-IID, and I want to estimate whether the dependence is bad enough that it will have a noticeable effect on a fitted classifier. I don’t think the exact model type will matter in this case, but for argument’s sake let’s say I’m using elastic-net logistic regression. In this case the dependence takes the form of clustering among observations, in that if $Y_k = 1$ has a high probability then the probability that $Y_j = 1$ is very low for all $j \neq k$ within the same cluster.
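To make the dependence structure concrete, here is a toy generator. Everything here is my own assumption for illustration (the sizes, the shared cluster-level feature means, and the "exactly one positive per cluster" rule as an extreme version of the suppression described above):

```python
import numpy as np

rng = np.random.default_rng(0)
n_clusters, cluster_size, n_features = 200, 5, 10  # assumed sizes

# Cluster-level feature means induce within-cluster feature correlation.
cluster_means = rng.normal(size=(n_clusters, n_features))
X = np.repeat(cluster_means, cluster_size, axis=0) + rng.normal(
    scale=0.5, size=(n_clusters * cluster_size, n_features))

# Dependence in Y: exactly one positive per cluster, so P(Y_j = 1)
# is suppressed whenever Y_k = 1 for another k in the same cluster.
beta = rng.normal(size=n_features)
scores = np.exp(X @ beta).reshape(n_clusters, cluster_size)
probs = scores / scores.sum(axis=1, keepdims=True)
winners = np.array([rng.choice(cluster_size, p=p) for p in probs])
Y = np.zeros((n_clusters, cluster_size), dtype=int)
Y[np.arange(n_clusters), winners] = 1
Y = Y.ravel()
groups = np.repeat(np.arange(n_clusters), cluster_size)
```

A matching "similar IID" data set could then be generated from the same $X$ but with each $Y_i$ drawn independently from its marginal logistic probability, which is one candidate answer to the simulation question below.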
Ideally, I would like to be able to compare the fitted model from the non-IID data set to a “comparable” or “similar” IID data set. So I’m thinking I could simulate such a data set, fit the model on both the fake data and the real data, and compare the two fits. This raises a few questions:
- Is there a formal or rigorous definition of “similarity” that makes sense in this case? I certainly know a dissimilar data set when I see one, but it’s hard to quantify exactly how I know.
- Is there a straightforward way to generate an IID dataset from a non-IID data set that otherwise preserves some structure from the joint distribution of features?
- Is this an X-Y problem? Is there a better way to evaluate the effect of data dependence on my estimates?
- For a purely predictive task, does non-IID data even make a difference as long as the cross-validation procedure is constructed correctly (e.g., splitting by cluster rather than by observation)? This answer suggests the answer is “no.”