# #StackBounty: #normal-distribution #descriptive-statistics #outliers #boosting #extreme-value Decision trees, Gradient boosting and nor…

### Bounty: 50

I have a question regarding the normality of predictors. I have 100,000 observations in my data. The problem I am analysing is a classification problem so 5% of the data is assigned to class 1, 95,000 observations assigned to class 0, so the data is highly imbalanced. However the observations of the class 1 data is expected to have extreme values.

• What I have done is, trim the top 1% and bottom 1% of the data removing, any possible mistakes in the entry of such data)
• Winsorised the data at the 5% and 95% level (which I have checked and is an accepted practise when dealing with such data that I have).

So;
I plot a density plot of one variable after no outlier manipulation

Here is the same variable after trimming the data at the 1% level

Here is the variable after being trimmed and after being winsorised

My question is how should I approach this problem.

First question, should I just leave the data alone at trimming it? or should I continue to winsorise to further condense the extreme values into more meaningful values (since even after trimming the data I am still left with what I feel are extreme values). If I just leave the data after trimming it, I am left with long tails in the distribution like the following (however the observations that I am trying to classify mostly fall at the tail end of these plots).

Second question, since decisions trees and gradient boosted trees decide on splits, does the distribution matter? What I mean by that is if the tree splits on a variable at (using the plots above) <= -10. Then according to plot 2 (after trimming the data) and plot 3 (after winsorisation) all firms <= -10 will be classified as class 1.

Consider the decision tree I created below.

My argument is, irregardless of the spikes in the data (made from winsorisation) the decision tree will make the classification at all observations <= 0. So the distribution of that variable should not matter in making the split? It will only affect at what value that split will occur at? and I do not loose too much predictive power in these tails?

Get this bounty!!!

This site uses Akismet to reduce spam. Learn how your comment data is processed.