#StackBounty: Classification using independent models for each class

Bounty: 50

One way to explain my data is with the example below. Here, I use the scaled columns of the iris dataset as stand-ins for four independent scores, one per class, computed for each instance. My task is to classify each instance into one of the four classes.

> data(iris)
> iris2 <- as.data.frame(scale(iris[,1:4]))
> colnames(iris2) <- c("class_1","class_2","class_3","class_4")
> head(iris2)
     class_1     class_2   class_3   class_4
1 -0.8976739  1.01560199 -1.335752 -1.311052
2 -1.1392005 -0.13153881 -1.335752 -1.311052
3 -1.3807271  0.32731751 -1.392399 -1.311052
4 -1.5014904  0.09788935 -1.279104 -1.311052
5 -1.0184372  1.24503015 -1.335752 -1.311052
6 -0.5353840  1.93331463 -1.165809 -1.048667

However, the underlying scoring method/logic differs from class to class. Sure, within a single class a higher score means the instance is more likely to belong to that class, but the difficulty arises when comparing the four scores against each other:

Looking at their distributions, class_1 might have a significant skew, and class_2 a widely different value range. This means I cannot simply take the maximum value when selecting the final class:

> iris3 <- cbind(
+   iris2,
+   label_num=max.col(iris2,ties.method="first")
+ )
> head(iris3)
     class_1     class_2   class_3   class_4 label_num
1 -0.8976739  1.01560199 -1.335752 -1.311052         2
2 -1.1392005 -0.13153881 -1.335752 -1.311052         2
3 -1.3807271  0.32731751 -1.392399 -1.311052         2
4 -1.5014904  0.09788935 -1.279104 -1.311052         2
5 -1.0184372  1.24503015 -1.335752 -1.311052         2
6 -0.5353840  1.93331463 -1.165809 -1.048667         2

How should I go about “leveling the playing field” in order to select the final class?

Is removing the mean enough? What if I also divide each column by its standard deviation? Or is this taking it one step too far? Am I losing information by standardizing?
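
To make the two options concrete, here is a minimal sketch rather than a recommendation. It assumes the raw (unscaled) iris columns as stand-ins for the four scores, because scale() with its default arguments already centers and divides by the standard deviation. The names raw, centered, standardized and the label_* vectors are introduced only for this illustration.

```r
# Stand-in for the four raw model scores (assumption: the real scores
# come from four separate models; the iris columns are placeholders).
raw <- iris[, 1:4]
colnames(raw) <- c("class_1", "class_2", "class_3", "class_4")

# Option A: remove the column means only.
centered <- scale(raw, center = TRUE, scale = FALSE)

# Option B: remove the means and divide by the column standard deviations.
standardized <- scale(raw, center = TRUE, scale = TRUE)

# Pick the column with the largest value in each row under each option.
label_centered     <- max.col(centered, ties.method = "first")
label_standardized <- max.col(standardized, ties.method = "first")

# Cross-tabulate the two labelings to see how many instances change class.
table(centered = label_centered, standardized = label_standardized)
```

If the cross-tabulation is far from diagonal, the final label depends heavily on whether the spread of each score is treated as signal or as an artifact of the scoring method.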

What about the skew issue?
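
One way to picture what correcting for skew could look like is to replace each score by its percentile within its own column, which is unaffected by any monotone rescaling of that column. The sketch below is only an illustration under that assumption; the names scores, pct and label_pct are hypothetical. Note that this forces all four columns onto the same marginal distribution, so any information about one class being more common overall is discarded.

```r
# Placeholder scores, built the same way as in the example above.
scores <- as.data.frame(scale(iris[, 1:4]))
colnames(scores) <- c("class_1", "class_2", "class_3", "class_4")

# Empirical percentile of each value within its own column
# (average rank for ties, divided by the number of instances).
pct <- sapply(scores, function(x) rank(x, ties.method = "average") / length(x))

# Choose the class whose score sits highest within its own distribution.
label_pct <- max.col(pct, ties.method = "first")

# Number of instances assigned to each class.
table(label_pct)
```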

I’m having difficulty distinguishing between what should be treated as an indication of a popular class (a heavy skew, or simply larger values) and what should be treated as an artifact of the scoring method to be fixed by scaling/transforming.

Are there any other types of approaches I should try?

I cannot change the way the scores have been calculated, and there is no training set I can use to model the final label using the four model scores as input (there are no prior final labels).

