I have a question regarding the normality of predictors. I have 100,000 observations in my data. The problem I am analysing is a classification problem so 5% of the data is assigned to class 1, 95,000 observations assigned to class 0, so the data is highly imbalanced. However the observations of the class 1 data is expected to have extreme values.
- What I have done is, trim the top 1% and bottom 1% of the data removing, any possible mistakes in the entry of such data)
- Winsorised the data at the 5% and 95% level (which I have checked and is an accepted practise when dealing with such data that I have).
My question is how should I approach this problem.
First question, should I just leave the data alone at trimming it? or should I continue to winsorise to further condense the extreme values into more meaningful values (since even after trimming the data I am still left with what I feel are extreme values). If I just leave the data after trimming it, I am left with long tails in the distribution like the following (however the observations that I am trying to classify mostly fall at the tail end of these plots).
Second question, since decisions trees and gradient boosted trees decide on splits, does the distribution matter? What I mean by that is if the tree splits on a variable at (using the plots above) <= -10. Then according to plot 2 (after trimming the data) and plot 3 (after winsorisation) all firms <= -10 will be classified as class 1.
Consider the decision tree I created below.
My argument is, irregardless of the spikes in the data (made from winsorisation) the decision tree will make the classification at all observations <= 0. So the distribution of that variable should not matter in making the split? It will only affect at what value that split will occur at? and I do not loose too much predictive power in these tails?