I have an interesting question: my code needs to handle structured data whose structure I don't know much about at development time. I do know the samples follow a schema that can be nested, and the leaves contain basic primitives like floats, integers, and categoricals. I will have access to the schema at training time. I want to train a model that can detect anomalies, or more directly, decide whether or not a sample comes from the original data distribution.
One way would be to just flatten the data and train an isolation forest. My intuition is that this breaks down with highly correlated features. Another approach would be to build a neural network architecture that reflects some of the structure and primitive types of the given schema. You could take an autoencoder approach and use reconstruction error as the anomaly score, which I suspect would handle correlated features much better, since those correlations are captured by the encoder-decoder network. A possible issue is how to define a single reconstruction error metric across different primitive types.
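For the mixed-primitives issue, one common pattern is to give the decoder one output head per leaf field and sum per-field losses: squared error for numeric leaves and cross-entropy for categorical ones. Here is a minimal sketch in PyTorch; the `SCHEMA` dict, field names, and the `reconstruction_loss` helper are all hypothetical placeholders for whatever the real (flattened) schema provides:

```python
import torch
import torch.nn.functional as F

# Hypothetical flattened schema: field name -> ("float",) or ("categorical", n_classes).
# A nested schema would be walked and flattened into something like this first.
SCHEMA = {"temperature": ("float",), "status": ("categorical", 4)}

def reconstruction_loss(schema, outputs, targets):
    """Sum per-field losses over a batch.

    outputs/targets are dicts keyed by field name.
    Numeric fields: MSE between predicted and true values.
    Categorical fields: cross-entropy between decoder logits and class labels.
    """
    total = torch.tensor(0.0)
    for name, spec in schema.items():
        if spec[0] == "float":
            total = total + F.mse_loss(outputs[name], targets[name])
        else:  # categorical: logits of shape (batch, n_classes) vs. integer labels
            total = total + F.cross_entropy(outputs[name], targets[name])
    return total

# Usage with a toy batch of size 1.
out = {"temperature": torch.tensor([0.5]), "status": torch.randn(1, 4)}
tgt = {"temperature": torch.tensor([0.4]), "status": torch.tensor([2])}
loss = reconstruction_loss(SCHEMA, out, tgt)
```

One caveat: the loss terms live on different scales, so in practice you would likely want per-field weights (or normalize numeric fields) before using the summed error as an anomaly score.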
Does anyone have ideas on this, or know of papers that tackle related challenges?