Let’s say you are building a Star Trek-style medical tricorder that can diagnose any medical condition. It needs to be able to detect comorbidities, where a patient has multiple conditions (e.g. COVID, diabetes and lung cancer all at the same time).
How would you build a classification system to detect the most likely set of conditions?
I see two approaches:
- Build one model per disease, run predictions using every model and report that the patient is suffering from any disease for which the prediction probability exceeded a threshold (e.g. 0.95)
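    import xgboost as xgb

    THRESHOLD = 0.95
    likely_diseases = []
    for disease in disease_columns:
        # One binary classifier per disease
        model = xgb.XGBClassifier()
        model.fit(X_train, y_train[disease])
        # Positive-class probability; patient_data is assumed to hold a single row
        pred_proba = model.predict_proba(patient_data)[0, 1]
        if pred_proba > THRESHOLD:
            likely_diseases.append(disease)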
- Build one model per combination of diseases and choose the combo with the highest probability
    # Label column for one combination (axis=1 applies the lambda per patient row)
    df['has_covid_lung_cancer_and_stroke'] = df.apply(
        lambda patient: patient['has_lung_cancer'] and patient['has_covid'] and patient['has_stroke'],
        axis=1,
    )
    # create all other possible combinations of diseases

    highest_probability = 0.0
    most_likely_disease_combination = None
    for disease_combination in disease_combinations:
        # One binary classifier per combination of diseases
        model = xgb.XGBClassifier()
        model.fit(X_train, y_train[disease_combination])
        # Positive-class probability; patient_data is assumed to hold a single row
        pred_proba = model.predict_proba(patient_data)[0, 1]
        if pred_proba > highest_probability:
            most_likely_disease_combination = disease_combination
            highest_probability = pred_proba
It strikes me that approach two would probably be more accurate, but with n diseases there are up to 2^n combinations to model, so it might be so computationally expensive that it is intractable. Perhaps some pruning could be applied, discarding combinations that occur very rarely in the training data set.
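To make the pruning idea concrete, here is a minimal sketch of how rare combinations might be counted and discarded before any per-combination models are trained. The toy `patient_labels` data, the `MIN_OCCURRENCES` cutoff and the `frequent_combinations` helper are all hypothetical names made up for illustration; a real data set would need its own cutoff.

```python
from collections import Counter

# Hypothetical training labels: one set of diagnosed diseases per patient.
patient_labels = [
    {"covid"},
    {"covid", "diabetes"},
    {"covid", "diabetes"},
    {"diabetes"},
    {"lung_cancer"},
    {"covid", "diabetes", "lung_cancer"},
    set(),
    {"covid"},
]

MIN_OCCURRENCES = 2  # assumed cutoff; tune for the real data set


def frequent_combinations(labels, min_occurrences):
    """Return the disease combinations seen at least `min_occurrences` times."""
    counts = Counter(frozenset(diseases) for diseases in labels)
    return {combo: n for combo, n in counts.items() if n >= min_occurrences}


all_diseases = set().union(*patient_labels)
kept = frequent_combinations(patient_labels, MIN_OCCURRENCES)

# Upper bound if nothing is pruned: every subset of the known diseases.
print(f"possible combinations: {2 ** len(all_diseases)}")
print(f"combinations kept after pruning: {len(kept)}")
for combo, n in sorted(kept.items(), key=lambda item: -item[1]):
    print(sorted(combo), n)
```

With this toy data, only two of the eight possible combinations survive the cutoff, so only two models would need training under approach two. The trade-off is that any combination pruned this way can never be predicted.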