#StackBounty: #classification #model #group-differences #supervised-learning A method to separate classes while taking variable depende…

Bounty: 100

I posted a question related to this problem over a year ago, and we still have not been able to figure it out.

We have two groups, A and B, and we want to train a model to separate them. Each group has numerous observations of “text”, for example:

group A:

  • AAABBBC*CAAAAAAAC
  • CCCBBBC*CAAAAAAAB
  • CBBBBBC*CAAAAAAAB

group B:

  • AAACCCC*CAAAAAAAA
  • CCCCCCC*CAAAAAAAA
  • CBBCCCC*CAAAAAAAA

Notably, our original datasets are much larger, with around 4,000 observations for group A (with really specific patterns) and around 20,000 for group B.
What we want is a model that picks up on things like:

  • If there is a C at position 1, we also see a B at the end in group A (2/3), and we do not see this in group B (0/3)
  • The motif AAABBB occurs only in group A
  • If we see AAABBB, we also see a C at the end in group A (1/3), but we do not see this in group B (0/3)
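
To make the three statements above concrete, here is a minimal Python sketch that counts how often each pattern occurs per group in the toy data. The names `patterns` and `support` are hypothetical helpers introduced only for this illustration; they are not part of any library or of our actual pipeline.

```python
# Toy sequences copied from the example above.
group_a = ["AAABBBC*CAAAAAAAC", "CCCBBBC*CAAAAAAAB", "CBBBBBC*CAAAAAAAB"]
group_b = ["AAACCCC*CAAAAAAAA", "CCCCCCC*CAAAAAAAA", "CBBCCCC*CAAAAAAAA"]

# Each pattern is a predicate on one sequence (hypothetical helper names).
patterns = {
    "C at position 1 and B at end": lambda s: s[0] == "C" and s[-1] == "B",
    "contains AAABBB": lambda s: "AAABBB" in s,
    "contains AAABBB and C at end": lambda s: "AAABBB" in s and s[-1] == "C",
}

def support(seqs, pattern):
    """Number of sequences in `seqs` that satisfy `pattern`."""
    return sum(1 for s in seqs if pattern(s))

# Map each pattern name to (count in group A, count in group B).
counts = {
    name: (support(group_a, p), support(group_b, p))
    for name, p in patterns.items()
}

for name, (a, b) in counts.items():
    print(f"{name}: A {a}/{len(group_a)}, B {b}/{len(group_b)}")
```

On the toy data this reports 2/3 vs 0/3 for the first pattern and 1/3 vs 0/3 for the other two, matching the fractions listed above. The patterns here are written by hand; what we are after is a model that finds such combinations on its own.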

We have now tried LDA (after converting this data to binary vectors); however, it scores each letter independently. To illustrate, suppose group A has two subgroups:

  • sub1: position1 = A + position10 = C
  • sub2: position2 = A + position15 = B

If both are uncommon in group B, then a method like LDA would also score position1 = A (from sub1) + position15 = B (from sub2) extremely high, even though they belong to different dependencies within group A. So we are looking for an alternative that takes such dependencies into account when differentiating the groups.
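
The issue can be reproduced in a small sketch. Everything below is an assumption made for illustration: the data is synthetic, each observation is reduced to four binary indicators, and the scorer is a simple additive per-feature log-odds score (Bernoulli naive Bayes style with add-one smoothing), used as a stand-in for any method that scores features independently and sums them. It is not LDA itself.

```python
import math

# Four binary indicators per observation (synthetic, for illustration only).
FEATURES = ["pos1=A", "pos10=C", "pos2=A", "pos15=B"]

sub1 = (1, 1, 0, 0)   # position1 = A together with position10 = C
sub2 = (0, 0, 1, 1)   # position2 = A together with position15 = B
group_a = [sub1] * 10 + [sub2] * 10
# Group B: each indicator is rare and never co-occurs with another.
group_b = [(0, 0, 0, 0)] * 36 + [
    (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1),
]

def feature_rates(rows):
    """Per-feature frequency with add-one (Laplace) smoothing."""
    n = len(rows)
    return [(sum(r[j] for r in rows) + 1) / (n + 2)
            for j in range(len(FEATURES))]

rate_a = feature_rates(group_a)
rate_b = feature_rates(group_b)

def additive_score(x):
    """Sum of independent per-feature log-odds (A versus B)."""
    score = 0.0
    for xj, pa, pb in zip(x, rate_a, rate_b):
        if xj:
            score += math.log(pa / pb)
        else:
            score += math.log((1 - pa) / (1 - pb))
    return score

def joint_support(rows, i, j):
    """Number of rows in which features i and j are both present."""
    return sum(1 for r in rows if r[i] and r[j])

# A combination that mixes one feature from sub1 with one from sub2.
mixed = (1, 0, 0, 1)

print("additive score, sub1 :", additive_score(sub1))
print("additive score, mixed:", additive_score(mixed))
print("group A rows with pos1=A and pos10=C:", joint_support(group_a, 0, 1))
print("group A rows with pos1=A and pos15=B:", joint_support(group_a, 0, 3))
```

In this construction the additive score gives the mixed combination the same positive, A-like score as a genuine sub1 observation, even though the pair position1 = A + position15 = B never occurs together in group A. The joint counts do tell the two apart, and that is the kind of dependency we want the method to respect.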

We really hope someone here can help us out!

