I posted a question related to this problem over a year ago, and we still have not been able to figure it out.
We have two groups, A and B, that we want to train a model to separate. Both consist of numerous observations of “text”, so for example:
Notably, our original datasets are much larger, with around 4,000 observations for A (with really specific patterns) and around 20,000 for group B.
What we want is a model that sees things like:
- if there is a `C` at position 1 we see a `B` on the end in group A (2/3), and we do not see this in group B (0/3)
- that we only find the motif `AAABBB` in group A
- if we see `AAABBB` we also saw a `C` at the end in group A (1/3) but we did not see this in group B (0/3)
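To make concrete what "seeing" such a rule means, here is a minimal sketch that counts how often a conjunction of conditions holds in each group. The sequences are made-up toy data (not our real observations), and the helper names `starts_with`, `ends_with`, `contains`, and `support` are hypothetical, introduced only for this illustration.

```python
from fractions import Fraction

# Hypothetical toy sequences, invented for illustration only.
GROUP_A = ["CABCAB", "CCAAAB", "AAABBBC"]
GROUP_B = ["ABCABC", "BACCAA", "CABACA"]


def starts_with(letter):
    """Condition: the sequence has `letter` at position 1."""
    return lambda seq: seq.startswith(letter)


def ends_with(letter):
    """Condition: the sequence has `letter` at the last position."""
    return lambda seq: seq.endswith(letter)


def contains(motif):
    """Condition: the sequence contains `motif` anywhere."""
    return lambda seq: motif in seq


def support(group, *conditions):
    """Fraction of sequences in `group` that satisfy ALL conditions at once."""
    hits = sum(all(cond(seq) for cond in conditions) for seq in group)
    return Fraction(hits, len(group))


# The three rules from the list above, each a conjunction of conditions.
RULES = {
    "C at position 1 and B at the end": (starts_with("C"), ends_with("B")),
    "motif AAABBB": (contains("AAABBB"),),
    "motif AAABBB and C at the end": (contains("AAABBB"), ends_with("C")),
}

if __name__ == "__main__":
    for name, conds in RULES.items():
        print(name, "| A:", support(GROUP_A, *conds), "| B:", support(GROUP_B, *conds))
```

The point of the sketch is that each rule is evaluated as a whole: a sequence only counts if every condition of the rule holds together, which is exactly the behaviour we would like the model to learn on its own.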
We have now tried LDA (after converting the data to binary vectors); however, it scores each letter independently. To illustrate, suppose group A had two subgroups:
sub1: position1 = A + position10 = C
sub2: position2 = A + position15 = B
and neither pattern is common in group B. A method like LDA would then also score position1 = A (`sub1`) + position15 = B (`sub2`) extremely high, even though these two features belong to different dependencies within group A. We are therefore looking for an alternative that takes such dependencies into account when differentiating the groups.
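The problem can be reproduced on made-up data. The sketch below is only an illustration under stated assumptions: the sequences are invented (length 15, filler letter `D`), the additive log-odds score is a simple stand-in for any method that scores one-hot features independently (it is not LDA itself), and `make_seq`, `additive_score`, and `joint_support` are hypothetical helper names.

```python
from collections import Counter
from math import log


def make_seq(length, fixed, filler="D"):
    """Build a sequence with the given 1-based positions fixed, rest filler."""
    chars = [filler] * length
    for pos, letter in fixed.items():
        chars[pos - 1] = letter
    return "".join(chars)


# Hypothetical toy data mirroring the two subgroups described above.
SUB1 = [make_seq(15, {1: "A", 10: "C"})] * 3
SUB2 = [make_seq(15, {2: "A", 15: "B"})] * 3
GROUP_A = SUB1 + SUB2
GROUP_B = [make_seq(15, {})] * 6
# A sequence that mixes one feature from each subgroup.
CHIMERA = make_seq(15, {1: "A", 15: "B"})


def features(seq):
    """One-hot style features: the set of (1-based position, letter) pairs."""
    return {(i + 1, ch) for i, ch in enumerate(seq)}


def feature_counts(group):
    """How many sequences in `group` carry each feature."""
    return Counter(f for seq in group for f in features(seq))


def additive_score(seq, group_a, group_b):
    """Sum of per-feature log-odds (add-one smoothing); features scored independently."""
    ca, cb = feature_counts(group_a), feature_counts(group_b)
    na, nb = len(group_a), len(group_b)
    return sum(
        log((ca[f] + 1) / (na + 2)) - log((cb[f] + 1) / (nb + 2))
        for f in features(seq)
    )


def joint_support(seq, group_a, group_b):
    """Number of group-A sequences carrying ALL features of `seq` never seen in B."""
    seen_in_b = set().union(*(features(s) for s in group_b))
    distinctive = features(seq) - seen_in_b
    return sum(distinctive <= features(s) for s in group_a)


if __name__ == "__main__":
    for name, seq in [("sub1 member", SUB1[0]), ("chimera", CHIMERA)]:
        print(name, additive_score(seq, GROUP_A, GROUP_B),
              joint_support(seq, GROUP_A, GROUP_B))
```

On this toy data the mixed sequence receives the same additive score as a genuine `sub1` member, although no sequence in group A carries its two distinctive features together. A check on the joint occurrence separates the two cases, which is the behaviour we are looking for in a proper method.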
We really hope someone here can help us out!