I have a 3-class (1/0/unclassified) classification problem where my training data is created using a set of rules.
Problem: Classify whether a person owns a vehicle or travels by public transport.
Dataset: People's expense journal entries in CSV format (around 2 lakh, i.e. 200,000, entries from 20 people over a span of 3 years).
    person_id, date of payment, category, shop, expense, summary
    1, 2020-01-01, fuel, fuel_stop, $20, 'paid for refilling'
    2, 2020-01-01, ticket, bus, $10, 'took a bus to Treasa's house'
Training data generation: No manual labelling is done here.
Instead, a set of rules is used to tag the data.
Rules for vehicle owners:
- Maintenance fee records
- Fuel transactions
- Few transactions in public transport
- Driver salary payments
Rules for non-vehicle owners:
- Multiple transactions in public transport (bus, train, subway, etc.)
- No fuel transactions
- No maintenance transactions
Nuances such as vehicle owners who also travel by public transport can be ignored.
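A minimal sketch of how rule-based tagging of this kind could look. The category names (`fuel`, `maintenance`, `driver_salary`, `ticket`), the threshold for "multiple" transactions, and the helper `label_person` are all hypothetical and only illustrate the idea; they are not the exact rules used here, and the "few public transport transactions" rule for owners is simplified away.

```python
from collections import Counter

# Hypothetical category names and threshold; the post does not specify them.
OWNER_CATEGORIES = {"fuel", "maintenance", "driver_salary"}
PUBLIC_TRANSPORT_CATEGORIES = {"ticket"}
MIN_PUBLIC_TRANSPORT_TXNS = 2  # assumed cut-off for "multiple" transactions


def label_person(categories):
    """Return 1 (vehicle owner), 0 (non-owner) or 'unclassified'."""
    counts = Counter(categories)
    owner_txns = sum(counts[c] for c in OWNER_CATEGORIES)
    transit_txns = sum(counts[c] for c in PUBLIC_TRANSPORT_CATEGORIES)
    if owner_txns > 0:
        # Any fuel, maintenance or driver-salary record tags the person as an owner.
        return 1
    if transit_txns >= MIN_PUBLIC_TRANSPORT_TXNS:
        # No owner-type records and several public transport records.
        return 0
    return "unclassified"


# Toy journal rows; only the columns needed by the rules are kept.
journal = [
    {"person_id": 1, "category": "fuel"},
    {"person_id": 1, "category": "maintenance"},
    {"person_id": 2, "category": "ticket"},
    {"person_id": 2, "category": "ticket"},
    {"person_id": 3, "category": "groceries"},
]

# Group categories per person, then apply the rules.
by_person = {}
for row in journal:
    by_person.setdefault(row["person_id"], []).append(row["category"])

labels = {pid: label_person(cats) for pid, cats in by_person.items()}
print(labels)
```

Note that if the model's input features include the same categories the rules read, the labels are a deterministic function of those features.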
I used an XGBoost model for modelling this data.
During cross-validation, I can see that the classification error (merror) is always 0.00000, even though the log loss (mlogloss) keeps dropping.
    validation_0-merror:0.00000 validation_0-mlogloss:0.12917 validation_1-merror:0.00000 validation_1-mlogloss:0.12983
    validation_0-merror:0.00000 validation_0-mlogloss:0.12524 validation_1-merror:0.00000 validation_1-mlogloss:0.12577
    validation_0-merror:0.00000 validation_0-mlogloss:0.12138 validation_1-merror:0.00000 validation_1-mlogloss:0.12201
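For reference, the two metrics measure different things: merror only checks whether the highest-probability class is the correct one, while mlogloss also depends on how confident the prediction is. The toy sketch below uses made-up probabilities (not the actual model outputs) and hand-written helper functions (`merror`, `mlogloss`, not XGBoost calls) to show how the error can stay at 0 while the log loss keeps falling.

```python
import math


def merror(probs, y_true):
    """Fraction of rows where the argmax class differs from the true class."""
    wrong = sum(
        1 for p, y in zip(probs, y_true) if max(range(len(p)), key=p.__getitem__) != y
    )
    return wrong / len(y_true)


def mlogloss(probs, y_true):
    """Mean negative log-probability assigned to the true class."""
    return -sum(math.log(p[y]) for p, y in zip(probs, y_true)) / len(y_true)


y_true = [0, 1, 2]

# Made-up predicted probabilities at an earlier and a later boosting round.
# The argmax is already correct in both; only the confidence increases.
round_a = [[0.80, 0.10, 0.10], [0.10, 0.80, 0.10], [0.10, 0.10, 0.80]]
round_b = [[0.90, 0.05, 0.05], [0.05, 0.90, 0.05], [0.05, 0.05, 0.90]]

for name, probs in [("earlier round", round_a), ("later round", round_b)]:
    print(name, "merror:", merror(probs, y_true), "mlogloss:", round(mlogloss(probs, y_true), 5))
```

In both rounds every row is classified correctly, so merror is 0, while mlogloss drops because the probability assigned to the true class grows.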
The model identifies the vehicle owners in a separate test set almost correctly, with roughly 96% accuracy.
However, I do not know if the model will be able to identify other cases correctly, or generalise across other features it has not seen.
Could anyone please shed some light on this?