#StackBounty: #machine-learning #normal-distribution #descriptive-statistics #outliers #extreme-value Decision trees, Gradient boosting…

Bounty: 50

I have a question regarding the normality of predictors. I have 100,000 observations in my data. The problem I am analysing is a classification problem: 5% of the data (5,000 observations) is assigned to class 1 and the other 95,000 observations are assigned to class 0, so the data is highly imbalanced. However, the class 1 observations are expected to have extreme values.

  • What I have done is trim the top 1% and bottom 1% of the data (removing any possible mistakes in the entry of such data).
  • I then winsorised the data at the 5% and 95% levels (which I have checked and is an accepted practice when dealing with the kind of data I have).
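For readers who want to reproduce the two steps above, here is a minimal sketch using NumPy. The data is a synthetic heavy-tailed sample standing in for the real variable, and the helper names `trim` and `winsorise` are hypothetical, not taken from the original analysis. It assumes the winsorisation percentiles are computed on the already trimmed data, which the question does not state explicitly.

```python
import numpy as np

def trim(x, lower_pct=1.0, upper_pct=99.0):
    """Drop observations that fall outside the given percentiles."""
    lo, hi = np.percentile(x, [lower_pct, upper_pct])
    return x[(x >= lo) & (x <= hi)]

def winsorise(x, lower_pct=5.0, upper_pct=95.0):
    """Clip values to the given percentiles; no observations are removed."""
    lo, hi = np.percentile(x, [lower_pct, upper_pct])
    return np.clip(x, lo, hi)

# Synthetic stand-in for one predictor (assumption: heavy tails, 100,000 rows).
rng = np.random.default_rng(0)
x = rng.standard_t(df=2, size=100_000)

x_trimmed = trim(x)             # step 1: remove the top and bottom 1%
x_wins = winsorise(x_trimmed)   # step 2: clip at the 5th and 95th percentiles

print(len(x), len(x_trimmed), len(x_wins))
print(x_trimmed.min(), x_trimmed.max())
print(x_wins.min(), x_wins.max())
```

Note that trimming changes the number of rows, so in a real dataset the class labels would have to be filtered with the same mask, whereas winsorising keeps every row and only changes the values in the tails.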

Here is a density plot of one variable with no outlier manipulation:
[Image: density plot of the variable before any outlier handling]

Here is the same variable after trimming the data at the 1% level:
[Image: density plot of the variable after trimming]

Here is the variable after being trimmed and then winsorised:
[Image: density plot of the variable after trimming and winsorisation]

My question is how should I approach this problem.

First question: should I stop after trimming the data, or should I go on to winsorise it to further condense the extreme values into more meaningful values (since even after trimming the data I am still left with what I feel are extreme values)? If I stop after trimming, I am left with long tails in the distribution like the following (however, the observations that I am trying to classify mostly fall in the tails of these plots).
[Image: density plot showing the long tails left after trimming]

Second question: since decision trees and gradient boosted trees decide on splits, does the distribution matter? What I mean is: if the tree splits on a variable at (using the plots above) <= -10, then according to plot 2 (after trimming the data) and plot 3 (after winsorisation) all firms <= -10 will be classified as class 1.

Consider the decision tree I created below.

[Image: decision tree]

My argument is that, regardless of the spikes in the data (created by winsorisation), the decision tree will make the classification for all observations <= 0. So the distribution of that variable should not matter in making the split? It will only affect the value at which that split occurs? And I do not lose too much predictive power in these tails?
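One way to probe this argument is to compare how a single threshold split partitions the raw and the winsorised versions of a feature. The sketch below uses synthetic heavy-tailed data and hypothetical variable names, not the data from the question. It relies on one fact about clipping: any threshold `t` with `lo <= t < hi` sends exactly the same observations left and right before and after winsorisation, while a threshold below `lo` (or at or above `hi`) can no longer separate anything once the values are clipped.

```python
import numpy as np

# Synthetic heavy-tailed feature (assumption: a stand-in for the real predictor).
rng = np.random.default_rng(42)
x = rng.standard_t(df=2, size=10_000)

# Winsorise at the 5th and 95th percentiles.
lo, hi = np.percentile(x, [5, 95])
x_wins = np.clip(x, lo, hi)

# A threshold inside the clip bounds partitions both versions identically.
t_inside = float(np.median(x))
same_inside = np.array_equal(x <= t_inside, x_wins <= t_inside)

# A threshold in the lower tail still separates raw observations,
# but no winsorised value can fall at or below it.
t_tail = lo - 1.0
n_raw_tail = int((x <= t_tail).sum())
n_wins_tail = int((x_wins <= t_tail).sum())

print("identical partition inside clip bounds:", same_inside)
print("raw observations at or below tail threshold:", n_raw_tail)
print("winsorised observations at or below tail threshold:", n_wins_tail)
```

Under these assumptions, winsorisation leaves splits between the clip points untouched but removes the tree's ability to split further out in the tails. Whether that costs predictive power depends on whether class 1 is distinguishable within the tails, which this sketch does not test.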

