I have a 3-class (1/0/unclassified) classification problem where my training labels are generated by a set of rules rather than by manual annotation.

Problem: Classify whether a person owns a vehicle or travels by public transport.

Dataset: People's expense journal entries in CSV format (around 2 lakh, i.e. 200,000, entries from 20 people over a period of 3 years).

Fields are:

    person_id, date of payment, category, shop,      expense, summary
    1,         2020-01-01,      fuel,     fuel_stop, $20,     'paid for refilling'
    2,         2020-01-01,      ticket,   bus,       $10,     'took a bus to Treasa's house'

Training data generation: No manual labelling is done here.

Instead, a set of rules is used to tag the data.

For example:
Rules for vehicle owners:

  1. Maintenance fee records
  2. Fuel transactions
  3. Few transactions in public transport
  4. Driver salary payments

Rules for non vehicle owners:

  1. Multiple transactions in public transport (bus, train, subway, etc.)
  2. No fuel transactions
  3. No maintenance transactions

Nuances, such as vehicle owners who occasionally travel by public transport, can be ignored.
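
To make the tagging concrete, here is a minimal sketch of how rules like these might be applied. It is not the exact code used: it assumes one label per person, and the category names (`maintenance`, `driver_salary`), the `MIN_TRANSIT_TRIPS` threshold and the `label_people` helper are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical category names; only "fuel" and "ticket" appear in the sample rows.
OWNER_CATEGORIES = {"fuel", "maintenance", "driver_salary"}
TRANSIT_CATEGORY = "ticket"
MIN_TRANSIT_TRIPS = 2  # assumed reading of "multiple transactions"


def label_people(rows):
    """Map person_id -> 1 (owner), 0 (non-owner) or 'unclassified'."""
    # Count how many entries each person has in each category.
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["person_id"]][row["category"].strip()] += 1

    labels = {}
    for person, per_category in counts.items():
        owner_hits = sum(per_category[cat] for cat in OWNER_CATEGORIES)
        if owner_hits > 0:
            labels[person] = 1  # any fuel, maintenance or driver payment
        elif per_category[TRANSIT_CATEGORY] >= MIN_TRANSIT_TRIPS:
            labels[person] = 0  # repeated public transport, no owner signals
        else:
            labels[person] = "unclassified"
    return labels


rows = [
    {"person_id": 1, "category": "fuel"},
    {"person_id": 2, "category": "ticket"},
    {"person_id": 2, "category": "ticket"},
    {"person_id": 3, "category": "groceries"},
]
print(label_people(rows))  # {1: 1, 2: 0, 3: 'unclassified'}
```

Because the labels are a deterministic function of fields such as `category`, a model that sees those same fields as features could in principle reproduce the rules exactly.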

I trained an XGBoost model on this data.

During cross-validation, the classification error (`merror`) is always 0.00000 on both evaluation sets, even though the log loss (`mlogloss`) keeps dropping:

    [62]    validation_0-merror:0.00000 validation_0-mlogloss:0.12917   validation_1-merror:0.00000 validation_1-mlogloss:0.12983
    [63]    validation_0-merror:0.00000 validation_0-mlogloss:0.12524   validation_1-merror:0.00000 validation_1-mlogloss:0.12577
    [64]    validation_0-merror:0.00000 validation_0-mlogloss:0.12138   validation_1-merror:0.00000 validation_1-mlogloss:0.12201
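
For context on the two metrics: `merror` only counts rows whose highest-probability class is wrong, while `mlogloss` averages the negative log of the probability given to the true class. The sketch below uses made-up probabilities (not taken from the model above) and hypothetical helper functions to show that the first can sit at zero while the second keeps falling as the predictions grow more confident.

```python
import math


def merror(probs, labels):
    """Fraction of rows whose argmax class differs from the true label."""
    wrong = sum(
        max(range(len(p)), key=p.__getitem__) != y
        for p, y in zip(probs, labels)
    )
    return wrong / len(labels)


def mlogloss(probs, labels):
    """Mean negative log-probability assigned to the true class."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


labels = [0, 1, 2]
# Illustrative probabilities: every argmax is already correct in both rounds,
# only the confidence in the true class differs.
early = [[0.6, 0.2, 0.2], [0.2, 0.6, 0.2], [0.2, 0.2, 0.6]]
late = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]

for name, probs in [("early", early), ("late", late)]:
    print(name, merror(probs, labels), round(mlogloss(probs, labels), 4))
```

In this toy case the error is 0.0 for both rounds, while the log loss falls from about 0.51 to about 0.11.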

On a separate test set, the model identifies vehicle owners with roughly 96% accuracy.

However, I do not know if the model will be able to identify other cases correctly, or generalise across other features it has not seen.
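
One way this kind of generalisation might be probed is to hold out whole people rather than individual entries, so that no person appears on both sides of the split. How the splits above were made is not stated, so this is only a sketch under that assumption; `split_by_person`, `holdout_fraction` and the toy rows are hypothetical.

```python
import random


def split_by_person(rows, holdout_fraction=0.25, seed=0):
    """Split rows so that each person_id lands entirely in train or test."""
    people = sorted({row["person_id"] for row in rows})
    rng = random.Random(seed)  # fixed seed for a reproducible split
    rng.shuffle(people)
    n_holdout = max(1, round(len(people) * holdout_fraction))
    held_out = set(people[:n_holdout])
    train = [r for r in rows if r["person_id"] not in held_out]
    test = [r for r in rows if r["person_id"] in held_out]
    return train, test


# Toy data: 8 people with 3 entries each.
rows = [{"person_id": p, "expense": 10 * i} for p in range(1, 9) for i in range(3)]
train, test = split_by_person(rows)
train_people = {r["person_id"] for r in train}
test_people = {r["person_id"] for r in test}
print(len(train), len(test), train_people & test_people)  # 18 6 set()
```

With scikit-learn available, `GroupKFold` with `person_id` as the group gives the same guarantee across several folds.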

Could anyone please shed some light on this?


