#StackBounty: #k-means #anomaly-detection K-Prototype for anomaly detection

Bounty: 50

I have logs of the following form (for example, from a gym login system). A representative case looks like this:

UserName, Login time, time_spent_on_weights, time_spent_on_elliptical

Ava, 5jan 12pm, 10 mins, 20 mins
Bob, 5jan 2pm, 30 mins, 20 mins
Cecilia, 6jan 10am, 40 mins, 0 mins

I then converted the login time column into hour of day (HOD) and day of month (DOM), which gives:

UserName, DOM, HOD, #weights, #elliptical
Ava, 5, 12, 10, 20
Bob, 5, 14, 30, 20
Cecilia, 6, 10, 40, 0
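
For concreteness, the conversion above could be sketched with the Python standard library roughly as follows. This is only a sketch: the helper name `to_dom_hod`, the format string, and the placeholder year are assumptions made for illustration, since the sample logs carry no year.

```python
from datetime import datetime

# Assumption: the sample logs have no year, so a placeholder is appended
# purely so that the string parses unambiguously.
ASSUMED_YEAR = 2000

def to_dom_hod(login_time):
    """Parse a string like '5jan 2pm' into (day_of_month, hour_of_day)."""
    parsed = datetime.strptime(f"{login_time} {ASSUMED_YEAR}", "%d%b %I%p %Y")
    return parsed.day, parsed.hour

# Rows as in the example: (UserName, login time, minutes on weights, minutes on elliptical)
rows = [
    ("Ava", "5jan 12pm", 10, 20),
    ("Bob", "5jan 2pm", 30, 20),
    ("Cecilia", "6jan 10am", 40, 0),
]

# Replace the login time with the two derived columns DOM and HOD.
converted = [
    (user, *to_dom_hod(ts), weights, elliptical)
    for user, ts, weights, elliptical in rows
]

for row in converted:
    print(row)
```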

I treat the first three columns as categorical and the last two as numerical, and I run K-Prototypes with N=2 clusters (anomalous and non-anomalous). The resulting cluster assignments can then be filtered per user to find anomalies specific to that username. I take the anomalous cluster to be the one with fewer elements.
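
To make the setup concrete, here is a rough sketch of the mixed dissimilarity that K-Prototypes is commonly described as using (squared Euclidean distance on the numerical columns plus a weight times the number of categorical mismatches), together with the "smaller cluster is anomalous" rule. The function names and the default weight `gamma=1.0` are assumptions for illustration, not taken from any particular library.

```python
from collections import Counter

def mixed_dissimilarity(a, b, categorical_idx, gamma=1.0):
    """Squared Euclidean distance over the numerical fields plus gamma
    times the number of categorical fields whose values differ."""
    cat = set(categorical_idx)
    numeric = sum((a[i] - b[i]) ** 2 for i in range(len(a)) if i not in cat)
    mismatches = sum(1 for i in cat if a[i] != b[i])
    return numeric + gamma * mismatches

def anomalous_label(labels):
    """Return the cluster label that has the fewest members."""
    counts = Counter(labels)
    return min(counts, key=counts.get)

# Columns: UserName, DOM, HOD, #weights, #elliptical
ava = ("Ava", 5, 12, 10, 20)
bob = ("Bob", 5, 14, 30, 20)

# The first three columns are treated as categorical, as in the question.
d = mixed_dissimilarity(ava, bob, categorical_idx=[0, 1, 2])
print(d)

# Hypothetical cluster assignments, invented only to show the rule.
labels = [0, 0, 1, 0, 0, 1, 0]
print(anomalous_label(labels))
```

The weight `gamma` controls how much a categorical mismatch counts relative to differences in the numerical columns.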

However, for some users, the clusters simply partition on the login time (HOD/DOM). For example, everything before 12am ends up in one cluster and everything after 12am in the other. That doesn’t convey any useful information.
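
One way to check for this situation on a single user's rows is to list which HOD values land in each cluster and see whether any hour is shared between clusters. The sketch below uses made-up hours and labels, and the helper names are hypothetical.

```python
from collections import defaultdict

def hours_by_cluster(hours, labels):
    """Collect the distinct hour-of-day values seen in each cluster."""
    groups = defaultdict(set)
    for hour, label in zip(hours, labels):
        groups[label].add(hour)
    return {label: sorted(values) for label, values in groups.items()}

def splits_on_hour(hours, labels):
    """True if no hour-of-day value appears in more than one cluster."""
    seen = hours_by_cluster(hours, labels)
    all_hours = [h for values in seen.values() for h in values]
    return len(all_hours) == len(set(all_hours))

# Hypothetical data for one user, invented for illustration.
hours = [9, 10, 11, 14, 15, 18]
labels = [0, 0, 0, 1, 1, 1]

print(hours_by_cluster(hours, labels))
print(splits_on_hour(hours, labels))
```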

What is the best way to handle these scenarios?

Is there a better way to do anomaly detection on this kind of dataset?
