#StackBounty: #machine-learning #clustering #binary-data #anomaly-detection Clustering a dataset and creating a model for each cluster

Bounty: 50

I was wondering whether it makes sense to cluster a dataset to find closely related data points and then train a binary classification model for each of these clusters, treating them as mini-datasets. I’ll elaborate a bit more.

I have a dataset with information about an infrequent event (e.g. a crime in one part of the city). The dataset looks like this. Note that the features are not item-specific but timestamp-specific (as in the 2020-03-12 17:00 timestamp): they represent the state of the city at that timestamp, such as weather, hour of the day, visibility, etc.

Timestamp         StreetID Feature1  Feature2  Feature3 ... FeatureN
2020-03-12 17:00  12       4         C         5            100
2020-03-12 17:00  145      4         C         5            100
2020-03-08 19:00  145      8         D         6            98
...
2020-03-06 18:15  76       6         C         8            110

I want to train a machine learning model that predicts whether this rare event is going to happen in a given street or set of streets. I see three options:

  • Train a multiclass classification model that outputs a probability for each street. The problem is that there are thousands of streets, and some of them have few positive cases.
  • Train one binary classification model per street. This is also impractical, for the reasons mentioned above.
  • Group several streets and train a binary classification model for each of these groups.

My question concerns the third option: how can we build these groups? I assume it would help to group together streets that share a feature distribution that explains whether this infrequent event happens. Could I do that by clustering the dataset shown above (ignoring the StreetID column, obviously) and checking whether those clusters contain a high concentration of one or more StreetIDs? I could then train a binary model for the StreetIDs of one cluster, using a dataset that combines the positive cases (timestamps where the event happened) with the negative cases (timestamps where it didn’t), as in the example table below (assuming 15-minute timestamps and that streets 145 and 76 belong to the same cluster, i.e. their events are highly concentrated in one cluster). Hypothetically, this would make more sense than grouping streets by some other criterion.
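
To make the concentration check concrete, here is a minimal sketch. It assumes cluster labels have already been produced by some clustering algorithm run on the feature columns, and it reads "high concentration" as "most of a street's events fall in one cluster", which is only one possible interpretation. The function name `street_groups`, the `min_share` threshold and the example labels are hypothetical.

```python
from collections import Counter, defaultdict

def street_groups(street_ids, cluster_labels, min_share=0.5):
    """Group streets by the cluster that holds most of their events.

    street_ids and cluster_labels are parallel lists, one entry per event row.
    A street joins a group only if its dominant cluster holds at least
    min_share of that street's events; otherwise it stays ungrouped.
    """
    per_street = defaultdict(Counter)
    for street, label in zip(street_ids, cluster_labels):
        per_street[street][label] += 1

    groups = defaultdict(list)
    ungrouped = []
    for street, counts in per_street.items():
        label, hits = counts.most_common(1)[0]
        if hits / sum(counts.values()) >= min_share:
            groups[label].append(street)
        else:
            ungrouped.append(street)
    return dict(groups), ungrouped

# Made-up cluster labels for six event rows, for illustration only.
streets = [12, 145, 145, 76, 76, 12]
labels = [0, 1, 1, 1, 1, 2]
groups, ungrouped = street_groups(streets, labels, min_share=0.6)
```

In this toy input, streets 145 and 76 have all their events in cluster 1 and end up in the same group, while street 12 is split evenly across two clusters and is left ungrouped.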

Timestamp         StreetID Feature1  Feature2  Feature3 ... FeatureN EventHappened
2020-03-12 16:45  -        3         F         9            130      no
2020-03-12 17:00  145      4         C         5            100      yes
2020-03-12 17:15  -        9         F         19           30       no
....
2020-03-08 19:00  145      8         D         6            98       yes
...
2020-03-06 18:15  76       6         C         8            110      yes
2020-03-06 18:30  -        7         D         20           80       no
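
A table like the one above could be assembled roughly as follows. This is only a sketch: it assumes the event log is a list of (timestamp, street) pairs and that the city-wide features are available for every 15-minute timestamp, including those without events. The function name `build_group_dataset` and the data layout are hypothetical; the values are taken from the example rows.

```python
def build_group_dataset(events, city_states, group_streets):
    """Label every timestamp for one group of streets.

    events: list of (timestamp, street_id) pairs where the event happened.
    city_states: dict mapping timestamp -> tuple of city-wide feature values.
    group_streets: set of street IDs that belong to the group.
    Returns (timestamp, features, label) rows, with label 1 if any street
    in the group had an event at that timestamp and 0 otherwise.
    """
    positive_ts = {ts for ts, street in events if street in group_streets}
    rows = []
    for ts in sorted(city_states):  # "YYYY-MM-DD HH:MM" strings sort by time
        rows.append((ts, city_states[ts], int(ts in positive_ts)))
    return rows

city_states = {
    "2020-03-12 16:45": (3, "F", 9, 130),
    "2020-03-12 17:00": (4, "C", 5, 100),
    "2020-03-12 17:15": (9, "F", 19, 30),
}
events = [("2020-03-12 17:00", 12), ("2020-03-12 17:00", 145)]

rows = build_group_dataset(events, city_states, group_streets={145, 76})
```

Note that an event on a street outside the group (street 12 here) does not make a timestamp positive for this group, so the same timestamp can be a positive case for one group and a negative case for another.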

