#StackBounty: #machine-learning #classification #xgboost #multiclass-classification How to approach All vs All classification problem

Bounty: 50

Let’s say you are building a Star Trek-style medical tricorder which can diagnose any medical condition. It needs to be able to detect comorbidities, where a patient has multiple conditions (e.g. perhaps the patient has COVID, diabetes and lung cancer all at the same time).

How would you build a classification system to detect the most likely set of conditions?

I see two approaches:

  1. Build one model per disease, run predictions with every model, and report that the patient is suffering from each disease whose predicted probability exceeds a threshold (e.g. 0.95):

import xgboost as xgb

likely_diseases = []
THRESHOLD = 0.95

for disease in disease_columns:
  model = xgb.XGBClassifier()
  model.fit(X_train, y_train[disease])
  # probability of the positive class for this single patient
  pred_proba = model.predict_proba(patient_data)[0, 1]
  if pred_proba > THRESHOLD:
    likely_diseases.append(disease)
  2. Build one model per combination of diseases and choose the combination with the highest predicted probability:

df['has_covid_lung_cancer_and_stroke'] = df.apply(lambda patient: patient['has_lung_cancer'] and patient['has_covid'] and patient['has_stroke'], axis=1)

# create all other possible combinations of diseases in the same way

highest_probability = 0.0
most_likely_disease_combination = None

for disease_combination in disease_combinations:
  model = xgb.XGBClassifier()
  model.fit(X_train, y_train[disease_combination])
  # probability that this patient has exactly this combination
  pred_proba = model.predict_proba(patient_data)[0, 1]
  if pred_proba > highest_probability:
    most_likely_disease_combination = disease_combination
    highest_probability = pred_proba

It strikes me that approach two would probably be more accurate, but it might be so computationally expensive as to be intractable: with n diseases there are 2^n possible combinations. Perhaps some pruning could take place, discarding combinations that occur exceedingly rarely in the training data set, as in the sketch below.
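A minimal sketch of that pruning step, assuming y_train is a 0/1 DataFrame with one column per disease (disease_columns and the MIN_COUNT threshold are illustrative assumptions, not a prescribed design):

# keep only disease combinations observed at least MIN_COUNT times
# in the training labels; everything rarer is pruned away
MIN_COUNT = 30  # assumed support threshold

observed = y_train[disease_columns].apply(
  lambda row: frozenset(d for d in disease_columns if row[d]), axis=1)
combo_counts = observed.value_counts()

# only these combinations get their own model in approach two
disease_combinations = [c for c, n in combo_counts.items() if n >= MIN_COUNT]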


Get this bounty!!!

#StackBounty: #classification #multiclass-classification #supervised-learning How to deal with broad and narrow variance within classes…

Bounty: 50

Let’s say I’m doing an animal image classification task (it doesn’t have to be image classification – this is just my example), and the training and test data are balanced across classes. The classes might be ['gorilla', 'giraffe', 'dog', 'donkey']. Now we all know that there is a relatively large amount of variance within the 'dog' class compared to the other three classes.

So, is there any way one would treat this problem differently from a problem where all classes have about the same amount of variance (where I might replace 'dog' with 'sheep', for instance)?


Get this bounty!!!

#StackBounty: #machine-learning #multiclass-classification #association-rules Association rule learning for multi-classification sugges…

Bounty: 50

The task is the following: given a training set of medical symptoms and associated diagnoses, output a list of the most likely diagnoses for a given combination of symptoms.
As of now, a solution exists which makes use of association rule learning: we mine rules over the attributes of the training dataset and infer the likely diagnoses and their probabilities from the confidence we have that said rules hold true.
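As a rough sketch of what that rule-mining step might look like with mlxtend (assuming a one-hot encoded DataFrame records with a boolean column per symptom and per diagnosis, and a dx_ prefix on diagnosis columns purely for illustration):

from mlxtend.frequent_patterns import apriori, association_rules

# mine frequent itemsets, then derive rules ranked by confidence
frequent = apriori(records, min_support=0.01, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)

# keep rules whose consequent is a single diagnosis column
diagnosis_cols = {c for c in records.columns if c.startswith("dx_")}
rules = rules[rules["consequents"].apply(
  lambda s: len(s) == 1 and next(iter(s)) in diagnosis_cols)]
rules = rules.sort_values("confidence", ascending=False)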

However, this method doesn’t seem to scale to large datasets because of the sheer number of distinct attributes ($10^4$) and possible classes ($10^3$). Hence my question: are association rules a viable solution for this problem? Are there any alternatives?


Get this bounty!!!

#StackBounty: #python #classification #multiclass-classification #evaluation #pycm Is there any strategy for validating the result of a…

Bounty: 50

Disclaimer:

Recently we developed a Python library named PyCM, specialized in analyzing multi-class confusion matrices. A comparison feature was added in version 2 of this module in order to compare, in a general way (independent of the subject of the problem), the confusion matrices produced by different classification methods on a single dataset.

Now, we are searching for a strategy that can validate the result of this feature.

This strategy can be either a mathematical proof or a counterexample.

For example, a pair of close confusion matrices whose comparison outcome can be independently verified would be helpful in answering this question.
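For concreteness, a minimal sketch of the kind of comparison in question (the label vectors are made-up toy data):

from pycm import ConfusionMatrix, Compare

cm1 = ConfusionMatrix([0, 1, 2, 2, 1, 0], [0, 1, 2, 1, 1, 0])
cm2 = ConfusionMatrix([0, 1, 2, 2, 1, 0], [0, 1, 1, 2, 1, 0])

# Compare ranks the matrices by their overall scores
cp = Compare({"method_a": cm1, "method_b": cm2})
print(cp.scores)     # per-matrix scores used for the ranking
print(cp.best_name)  # name of the best matrix (None on a tie)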


P.S.1. In order to find out how this module works, please read the Compare section in this document.


P.S.2. For further information visit the following links or ask your questions as a comment.

Website: http://www.pycm.ir/

Github: https://github.com/sepandhaghighi/pycm

Paper: https://www.theoj.org/joss-papers/joss.00729/10.21105.joss.00729.pdf


Get this bounty!!!

#StackBounty: #multiclass-classification #online-learning SGDClassifier: Online Learning/partial_fit with a previously unknown label

Bounty: 100

My training set contains about 50k entries, with which I do an initial fit. On a weekly basis, ~5k entries are added, but roughly the same number “disappears” (as it is user data which has to be deleted after some time).

Therefore I use online learning, because I do not have access to the full dataset at a later time. Currently I’m using an SGDClassifier, which works, but here is my big problem: new categories keep appearing, and now I can’t use my model any more because those categories were not present in the initial fit.

Is there a way to handle this with SGDClassifier or some other model? Deep learning?

It doesn’t matter if I have to start from scratch NOW (i.e. use something other than SGDClassifier), but I need something which enables online learning with new labels.
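One common workaround, sketched below under the assumption that you can enumerate a superset of the labels up front (all_classes, X_initial, etc. are placeholder names): scikit-learn only reads the classes argument on the first partial_fit call, so you can declare labels there before any examples of them have been seen.

import numpy as np
from sklearn.linear_model import SGDClassifier

# assumed superset of every label that may ever appear
all_classes = np.array(['cat_a', 'cat_b', 'cat_c', 'future_cat'])

clf = SGDClassifier()
# classes= is only required (and only honoured) on the first call
clf.partial_fit(X_initial, y_initial, classes=all_classes)

# weekly updates may now contain labels unseen in earlier batches,
# as long as they were declared in all_classes above
clf.partial_fit(X_week, y_week)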


Get this bounty!!!

#StackBounty: #deep-learning #keras #tensorflow #multiclass-classification #class-imbalance Deep network not able to learn imbalanced d…

Bounty: 50

I have data with 5 output classes. The training data has the following numbers of samples for these 5 classes:
[706326, 32211, 2856, 3050, 901]

I am using the following keras (tf.keras) code:

import numpy as np
import tensorflow as tf
from sklearn.utils import class_weight
from tensorflow.keras import metrics

# Keras expects class_weight as a {class_index: weight} dict
weights = class_weight.compute_class_weight('balanced',
                                            np.unique(y_train),
                                            y_train)
class_weights = dict(enumerate(weights))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(50, input_shape=(dataX.shape[1],)),
    tf.keras.layers.Dropout(rate=0.5),
    tf.keras.layers.Dense(50, activation=tf.nn.relu),
    tf.keras.layers.Dropout(rate=0.5),
    tf.keras.layers.Dense(50, activation=tf.nn.relu),
    tf.keras.layers.Dropout(rate=0.5),
    tf.keras.layers.Dense(50, activation=tf.nn.relu),
    tf.keras.layers.Dropout(rate=0.5),
    tf.keras.layers.Dense(5, activation=tf.nn.softmax)
])

adam = tf.keras.optimizers.Adam(lr=0.5)

model.compile(optimizer=adam,
              loss='sparse_categorical_crossentropy',
              metrics=[metrics.sparse_categorical_accuracy])

model.fit(X_train, y_train, epochs=5, batch_size=32, class_weight=class_weights)

y_pred = np.argmax(model.predict(X_test), axis=1)

The first line on class_weight is taken from one of the answers to this question: How to set class weights for imbalanced classes in Keras?

I know about this answer: Multi-class neural net always predicting 1 class after optimization. The difference is that in that case class weights weren’t used, whereas I am using them.

I am using sparse_categorical_crossentropy, which accepts classes as integers (no need to convert them to one-hot encoding), but I have also tried categorical_crossentropy and the problem remains the same.

I have of course tried different learning rates, batch sizes, numbers of epochs, optimizers, and depths/lengths of the network. But accuracy is always stuck at ~0.94, which is essentially what I would get by predicting the first class all the time (706326 / 745344 ≈ 0.948).
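A quick sanity check along those lines (a sketch, using y_test and y_pred from the code above) is to look at per-class metrics rather than overall accuracy:

from sklearn.metrics import classification_report, confusion_matrix

# per-class recall exposes a model that collapses onto the majority class
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, digits=3))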

Not sure what is missing here. Is there an error somewhere? Or should I use some other, specialized deep network?


Get this bounty!!!