#StackBounty: #modeling #simulation #independence #non-independent #iid Simulating the impact of non-IID data on a model

Bounty: 200

I have data that is non-IID, and I want to estimate whether the dependence is strong enough to have a noticeable effect on a fitted classifier. I don’t think the exact model type matters here, but for argument’s sake let’s say I’m using elastic-net logistic regression. The dependence takes the form of clustering among observations: within a data cluster, if $Y_k = 1$ has a high probability, then the probability that $Y_j = 1$ is very low for all $j \neq k$.

Ideally, I would like to compare the model fitted on the non-IID data set to one fitted on a “comparable” or “similar” IID data set. So I’m thinking I could simulate such a data set, fit the model on both the fake data and the real data, and compare the results.

  1. Is there a formal or rigorous definition of “similarity” that makes sense in this case? I certainly know a dissimilar data set when I see one, but it’s hard to quantify exactly how I know.
  2. Is there a straightforward way to generate an IID dataset from a non-IID data set that otherwise preserves some structure from the joint distribution of features?
  3. Is this an X-Y problem? Is there a better way to evaluate the effect of data dependence on my estimates?
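To make the kind of dependence described above concrete, here is a minimal NumPy sketch. It rests on assumptions that are not part of the question: labels within a cluster come from a softmax "one winner per cluster" mechanism, the features are Gaussian, and `beta` is a made-up coefficient vector. The comparator draws each label independently with the same marginal probabilities, which is only one possible reading of "similar", not a formal definition.

```python
import numpy as np

def simulate_labels(rng, n_clusters=2000, cluster_size=4, independent=False):
    """Simulate clustered binary labels (hypothetical mechanism).

    independent=False: exactly one Y=1 per cluster, chosen by softmax scores.
    independent=True:  each label drawn on its own with the same marginals.
    """
    beta = np.array([1.0, -1.0, 0.5])  # hypothetical true coefficients
    X = rng.normal(size=(n_clusters, cluster_size, beta.size))
    scores = X @ beta  # shape (n_clusters, cluster_size)

    if independent:
        # Softmax probabilities within each cluster, then independent draws.
        p = np.exp(scores - scores.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        y = (rng.random(p.shape) < p).astype(int)
    else:
        # Gumbel-max trick: samples one winner per cluster from the softmax.
        winners = (scores + rng.gumbel(size=scores.shape)).argmax(axis=1)
        y = np.zeros(scores.shape, dtype=int)
        y[np.arange(n_clusters), winners] = 1

    groups = np.repeat(np.arange(n_clusters), cluster_size)
    return X.reshape(-1, beta.size), y.ravel(), groups

rng = np.random.default_rng(0)
X_dep, y_dep, g_dep = simulate_labels(rng)
X_ind, y_ind, g_ind = simulate_labels(rng, independent=True)

# Number of positives per cluster: constant under dependence, variable otherwise.
sums_dep = np.bincount(g_dep, weights=y_dep)
sums_ind = np.bincount(g_ind, weights=y_ind)
print("positive rate:", y_dep.mean(), y_ind.mean())
print("variance of positives per cluster:", sums_dep.var(), sums_ind.var())
```

Fitting the same model to both data sets and comparing coefficients or predictions would then show how much the within-cluster dependence matters under these assumptions.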

edit:

  1. For a purely predictive task, does non-IID data even make a difference as long as the cross-validation procedure is constructed correctly? This answer suggests the answer is “no”.
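One common reading of a "correctly constructed" cross-validation procedure for clustered data is that every cluster falls entirely into a single fold, so dependent observations never straddle the train/test split. The sketch below shows that idea with plain NumPy; `group_folds` is a hypothetical helper name, and scikit-learn's `GroupKFold` provides the same kind of split.

```python
import numpy as np

def group_folds(groups, n_splits=5, seed=0):
    """Assign a fold id to each row so that no cluster is split across folds."""
    rng = np.random.default_rng(seed)
    unique, inverse = np.unique(groups, return_inverse=True)
    # Shuffle the clusters, then deal them out to folds round-robin.
    fold_per_group = rng.permutation(len(unique)) % n_splits
    return fold_per_group[inverse]

# Toy example: 10 clusters of 3 observations each.
groups = np.repeat(np.arange(10), 3)
folds = group_folds(groups, n_splits=5)

for k in range(5):
    test_mask = folds == k
    train_groups = set(groups[~test_mask])
    test_groups = set(groups[test_mask])
    # No cluster appears on both sides of the split.
    assert not (train_groups & test_groups)
```

The model is then fitted on the rows with `folds != k` and scored on the rows with `folds == k`, for each `k`.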

