I was wondering whether it makes sense to cluster a dataset to find closely related data points and then train a binary classification model for each of these clusters, treating them as mini-datasets. I'll elaborate a bit more.
I have a dataset with information about an infrequent event (e.g. a crime in one part of the city). The dataset looks like this. Note that the features are not street-specific but timestamp-specific (see the two rows for the 2020-03-12 17:00 timestamp), as they represent the state of the city at that timestamp: weather, hour of the day, visibility, etc.
    Timestamp         StreetID  Feature1  Feature2  Feature3  ...  FeatureN
    2020-03-12 17:00  12        4         C         5         ...  100
    2020-03-12 17:00  145       4         C         5         ...  100
    2020-03-08 19:00  145       8         D         6         ...  98
    ...
    2020-03-06 18:15  76        6         C         8         ...  110
I want to train a machine learning model that predicts whether this rare event is going to happen in a given street or set of streets. I see three options here:
- Train a multiclass classification model that outputs a probability for each street. The problem is that there are thousands of streets and some of them have few positive cases.
- Train one binary classification model for each street. This is also impractical, for the same reasons.
- Group several streets and train a binary classification model for each of these groups.
My question is about the third option. How can we build these groups? I assume it would help to group together streets that share a feature distribution that explains whether this infrequent event happens or not. Could I do that by clustering the dataset shown above (ignoring the StreetID column, obviously) and checking whether a cluster contains a high concentration of one or more StreetIDs? Then, for the StreetIDs of one cluster, I could train a binary model on a dataset that combines the positive cases (timestamps where the event happened) with the negative cases (timestamps where the event didn't happen), as in the example below. The example assumes 15-minute timestamps and that streets 145 and 76 belong to the same cluster, i.e. their events are highly concentrated in one cluster. Hypothetically, this would make more sense than grouping streets by some other criterion.
    Timestamp         StreetID  Feature1  Feature2  Feature3  ...  FeatureN  EventHappened
    2020-03-12 16:45  -         3         F         9         ...  130       no
    2020-03-12 17:00  145       4         C         5         ...  100       yes
    2020-03-12 17:15  -         9         F         19        ...  30        no
    ...
    2020-03-08 19:00  145       8         D         6         ...  98        yes
    ...
    2020-03-06 18:15  76        6         C         8         ...  110       yes
    2020-03-06 18:30  -         7         D         20        ...  80        no
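
To make the idea concrete, here is a rough Python sketch of the two steps I have in mind. It assumes the clustering itself has already been done by some algorithm, so that each positive event comes with a cluster label. The function names, the `min_share` threshold, and the toy data are hypothetical and only for illustration.

```python
from collections import Counter, defaultdict


def group_streets_by_cluster(street_ids, cluster_labels, min_share=0.5):
    """Assign each street to the cluster that holds most of its events.

    street_ids[i] and cluster_labels[i] describe the i-th positive event.
    A street is assigned only if its dominant cluster holds at least
    `min_share` of that street's events; otherwise it is left ungrouped.
    """
    per_street = defaultdict(Counter)
    for street, cluster in zip(street_ids, cluster_labels):
        per_street[street][cluster] += 1

    groups = defaultdict(list)
    for street, counts in per_street.items():
        cluster, n_events = counts.most_common(1)[0]
        if n_events / sum(counts.values()) >= min_share:
            groups[cluster].append(street)
    return {cluster: sorted(streets) for cluster, streets in groups.items()}


def label_timestamps(all_timestamps, events, group):
    """Label a timestamp 'yes' if any street in `group` had an event then."""
    positive = {ts for ts, street in events if street in group}
    return [(ts, "yes" if ts in positive else "no") for ts in all_timestamps]


# Toy data (made up): one cluster label per positive event.
street_ids = [12, 145, 145, 76, 76, 12, 145]
cluster_labels = [0, 1, 1, 1, 1, 0, 0]
groups = group_streets_by_cluster(street_ids, cluster_labels, min_share=0.6)

# Build the labelled dataset skeleton for the group of cluster 1.
timestamps = ["2020-03-12 16:45", "2020-03-12 17:00", "2020-03-12 17:15"]
events = [("2020-03-12 17:00", 145), ("2020-03-12 17:00", 12)]
labelled = label_timestamps(timestamps, events, groups[1])
```

With this toy data, streets 76 and 145 end up in the same group because most of their events fall in cluster 1, and only the 17:00 timestamp is labelled as a positive case for that group. The feature columns would then be joined back onto the labelled timestamps before training the binary model.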