I have MC and data samples, each with events in two classes, 0 and 1. I am trying to write an algorithm that matches the class populations of MC to those of data, i.e. I want to correct the MC by moving events from one class to the other so that the ratio of events in the two classes is the same for data and MC. This is how I proceeded:
- Train a GradientBoostingClassifier from sklearn.ensemble separately on data and on MC (call them data_clf and mc_clf):
    mc_clf.fit(X_mc, Y_mc)
    data_clf.fit(X_data, Y_data)
  where Y_mc and Y_data are the corresponding class labels “mc_class” and “data_class”, with value 0 or 1 depending on which class the event belongs to.
- Now, with X_mc as the input, use predict_proba to get the class probabilities from both the data classifier and the MC classifier, using the MC inputs ONLY, i.e.
    y_mc = mc_clf.predict_proba(X_mc)
    y_data = data_clf.predict_proba(X_mc)
- After this, I try to move MC events from one class to the other by comparing their probabilities under the data and MC classifiers:
    for i in range(0, len(mc)):
        if mc.loc[i, 'mc_class'] == 0:
            # predict_proba returns one column per class; column 0 is assumed here
            wgt = y_data[i, 0] / y_mc[i, 0]
            if wgt < 1:
                mc.loc[i, 'mc_class_corrected'] = 1
            else:
                mc.loc[i, 'mc_class_corrected'] = mc.loc[i, 'mc_class']
        if mc.loc[i, 'mc_class'] == 1:
            # column 1 is assumed for class-1 events
            wgt = y_data[i, 1] / y_mc[i, 1]
            if wgt < 1:
                mc.loc[i, 'mc_class_corrected'] = 0
            else:
                mc.loc[i, 'mc_class_corrected'] = mc.loc[i, 'mc_class']
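A minimal, self-contained sketch of the steps above on synthetic data may make the procedure easier to reproduce. The toy feature, the sample sizes, the class fractions, and the helper names `make_sample` and `migrate_by_ratio` are all assumptions made for illustration and are not taken from the actual analysis; the ratio is assumed to be built from the probability that each classifier assigns to the event's current class.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier


def make_sample(n, frac_class0, rng):
    """Toy sample: one feature whose mean depends on the class (illustrative only)."""
    y = (rng.random(n) >= frac_class0).astype(int)  # class 0 with probability frac_class0
    X = rng.normal(loc=y, scale=1.0, size=n).reshape(-1, 1)
    return X, y


def migrate_by_ratio(y_true, p_mc, p_data):
    """Flip every event whose data/MC probability ratio for its own class is below 1."""
    idx = np.arange(len(y_true))
    # Probability each classifier assigns to the event's current class.
    wgt = p_data[idx, y_true] / p_mc[idx, y_true]
    return np.where(wgt < 1, 1 - y_true, y_true)


rng = np.random.default_rng(0)
X_mc, Y_mc = make_sample(2000, 0.7, rng)      # toy MC: about 70% in class 0
X_data, Y_data = make_sample(2000, 0.5, rng)  # toy data: about 50% in class 0

mc_clf = GradientBoostingClassifier(random_state=0).fit(X_mc, Y_mc)
data_clf = GradientBoostingClassifier(random_state=0).fit(X_data, Y_data)

# Both classifiers are evaluated on the MC inputs only.
y_mc = mc_clf.predict_proba(X_mc)      # shape (n_events, 2)
y_data = data_clf.predict_proba(X_mc)  # shape (n_events, 2)

Y_mc_corrected = migrate_by_ratio(Y_mc, y_mc, y_data)
moved = np.mean(Y_mc_corrected[Y_mc == 0] == 1)
print(f"fraction of MC class-0 events moved: {moved:.2f}")
```

The vectorised `np.where` replaces the per-event loop but applies the same hard rule: every event with a ratio below 1 is moved, regardless of how far below 1 the ratio is.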
Suppose that, initially, MC has a larger share of events in class 0 than data does, so I expect events to move from class 0 to class 1. However, I see that more than 95% of the MC events in class 0 move to class 1, whereas I was expecting only about 30% of them to move (based on the numbers of events in each class in data and MC).
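To make the expected number concrete: if a fraction f_mc of MC events and a fraction f_data of data events sit in class 0, then matching the two requires moving a fraction 1 - f_data / f_mc of the MC class-0 events. A small sketch of that arithmetic follows; the function name `expected_move_fraction` and the event counts are hypothetical, chosen only to illustrate the calculation.

```python
def expected_move_fraction(n_mc_0, n_mc_1, n_data_0, n_data_1):
    """Fraction of MC class-0 events that must move to class 1 so that the
    MC class-0 share equals the data class-0 share (0.0 if none need to move)."""
    f_mc = n_mc_0 / (n_mc_0 + n_mc_1)
    f_data = n_data_0 / (n_data_0 + n_data_1)
    return max(0.0, 1.0 - f_data / f_mc)


# Hypothetical counts, not taken from the actual samples.
frac = expected_move_fraction(700, 300, 500, 500)
print(f"{frac:.3f}")  # prints 0.286
```

Comparing this number with the fraction that the ratio cut actually moves gives a direct measure of how far the correction overshoots.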
Is there a mistake in this approach?
Thanks a lot:)