#StackBounty: #time-series #probability #classification #bernoulli-distribution #sequential-pattern-mining Sequential classification, c…

Bounty: 50

What is the best way to combine the outputs of a binary classifier that produces probabilities and is applied to a sequence of non-iid inputs?

Here’s a scenario: Say I have a classifier which does an OK, but not great, job of classifying whether or not a cat is in an image. I feed the classifier frames from a video, and get as output a sequence of probabilities, near one if a cat is present, near zero if not.

The inputs are clearly not independent. If a cat is present in one frame, it will most likely be present in the next frame as well. Say I have the following sequences of predictions from the classifier (obviously there are more than six frames in one hour of video):

  • 12pm to 1pm: $[0.1, 0.3, 0.6, 0.4, 0.2, 0.1]$
  • 1pm to 2pm: $[0.1, 0.2, 0.45, 0.45, 0.48, 0.2]$
  • 2pm to 3pm: $[0.1, 0.1, 0.2, 0.1, 0.2, 0.1]$

The classifier answers the question, “What is the probability a cat is present in this video frame”. But can I use these outputs to answer the following questions?

  1. What is the probability there was a cat in the video between 12 and 1pm? Between 1 and 2pm? Between 2pm and 3pm?
  2. Given, say, a day of video, what is the probability that we have seen a cat at least once? What is the probability that we have seen a cat exactly twice?

My first attempt at this problem was to simply threshold the classifier at, say, 0.5. In that case, for question 1, we would decide there was a cat between 12 and 1pm but not between 1 and 3pm, despite the fact that the sum of the probabilities between 1 and 2pm is much higher than between 2 and 3pm.
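To make the thresholding issue concrete, here is a minimal sketch in Python that applies a 0.5 threshold to the three example windows and also reports the per-window sum of probabilities. The names `windows`, `any_above`, and `summarize` are hypothetical and chosen only for illustration.

```python
# Example windows copied from the question.
windows = {
    "12pm-1pm": [0.1, 0.3, 0.6, 0.4, 0.2, 0.1],
    "1pm-2pm": [0.1, 0.2, 0.45, 0.45, 0.48, 0.2],
    "2pm-3pm": [0.1, 0.1, 0.2, 0.1, 0.2, 0.1],
}


def any_above(probs, threshold=0.5):
    """Return True if at least one frame exceeds the threshold."""
    return any(p > threshold for p in probs)


def summarize(windows, threshold=0.5):
    """Map each window to (thresholded decision, sum of probabilities)."""
    return {
        name: (any_above(probs, threshold), round(sum(probs), 2))
        for name, probs in windows.items()
    }


summary = summarize(windows)
for name, (flag, total) in summary.items():
    print(name, "cat detected:", flag, "sum of probabilities:", total)
```

Only the first window is flagged, even though the second window has the largest sum (1.88 versus 1.7 and 0.8), which is the mismatch described above.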

I could also imagine this as a sequence of Bernoulli trials, where one sample is drawn for each probability output from the classifier. Given a sequence, one could simulate this to answer these questions. Maybe this is unsatisfactory though, because it treats each frame as iid? I think a sequence of high probabilities should provide more evidence for the presence of a cat than the same high probabilities in a random order.
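The Bernoulli idea can be sketched as follows, under the (questionable) assumption that frames are independent. Under that assumption, the probability of at least one positive frame has a closed form, $1 - \prod_i (1 - p_i)$, and simulation should agree with it. The function names below are hypothetical; `count_distribution` gives the distribution of the number of positive frames, which is not the same thing as the number of distinct cat visits.

```python
import math
import random


def prob_at_least_one(probs):
    """P(at least one positive frame), assuming independent frames."""
    return 1.0 - math.prod(1.0 - p for p in probs)


def count_distribution(probs):
    """Distribution of the number of positive frames (independent frames).

    Returns a list where entry k is P(exactly k positive frames).
    """
    dist = [1.0]
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            new[k] += mass * (1.0 - p)   # this frame is negative
            new[k + 1] += mass * p       # this frame is positive
        dist = new
    return dist


def simulate_at_least_one(probs, n_trials=20_000, seed=0):
    """Monte Carlo estimate: draw one Bernoulli sample per frame."""
    rng = random.Random(seed)
    hits = sum(any(rng.random() < p for p in probs) for _ in range(n_trials))
    return hits / n_trials


window_12_to_1 = [0.1, 0.3, 0.6, 0.4, 0.2, 0.1]
exact = prob_at_least_one(window_12_to_1)
estimate = simulate_at_least_one(window_12_to_1)
dist = count_distribution(window_12_to_1)
print(exact, estimate, dist[2])
```

For the three example windows, the closed form gives roughly 0.89, 0.91, and 0.58. Note that even the low-probability 2pm to 3pm window comes out above 0.5, and the value only grows as more frames are added, which illustrates why treating correlated frames as independent is unsatisfactory.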



#StackBounty: #python #r #data-mining #sequential-pattern-mining Which one to choose to identify patterns of user activity: Sequence an…

Bounty: 50

I have the following user activity data, where for each user the type of activity they engaged in is recorded along with the phase:

User | Phase | ActivityType | Date
321  | 1     | A            | 12/20/2020 15:00
321  | 1     | B            | 12/20/2020 16:00
321  | 2     | A            | 12/21/2020 12:00
321  | 1     | C            | 12/21/2020 13:00
321  | 3     | B            | 12/22/2020 11:00
322  | 1     | A            | 12/20/2020 15:00
322  | 1     | A            | 12/20/2020 16:00
322  | 2     | B            | 12/21/2020 12:00
322  | 1     | C            | 12/21/2020 13:00
322  | 3     | D            | 12/22/2020 11:00

For each user, I also have a satisfaction score for the application.

User | Satisfaction
321  | 90
322  | 60

What I want to see is whether any groups of users emerge with a certain pattern of activities. Then, I want to compare the satisfaction scores across these groups to check whether a specific pattern of activities yields higher satisfaction.
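As a first step toward this kind of analysis, the activity log can be turned into one time-ordered sequence per user. The following is a minimal sketch using only the Python standard library. The names `rows`, `satisfaction`, `build_sequences`, and `mean_satisfaction_by_pattern` are hypothetical, and it assumes the second row of the satisfaction table belongs to user 322. Grouping users by identical sequences is only a placeholder; a real analysis would more likely cluster similar sequences.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Rows copied from the activity table: (user, phase, activity_type, date).
rows = [
    (321, 1, "A", "12/20/2020 15:00"),
    (321, 1, "B", "12/20/2020 16:00"),
    (321, 2, "A", "12/21/2020 12:00"),
    (321, 1, "C", "12/21/2020 13:00"),
    (321, 3, "B", "12/22/2020 11:00"),
    (322, 1, "A", "12/20/2020 15:00"),
    (322, 1, "A", "12/20/2020 16:00"),
    (322, 2, "B", "12/21/2020 12:00"),
    (322, 1, "C", "12/21/2020 13:00"),
    (322, 3, "D", "12/22/2020 11:00"),
]

# Assumption: the second satisfaction score (60) belongs to user 322.
satisfaction = {321: 90, 322: 60}


def build_sequences(rows):
    """Return {user: activity string}, with activities ordered by timestamp."""
    by_user = defaultdict(list)
    for user, _phase, activity, date in rows:
        timestamp = datetime.strptime(date, "%m/%d/%Y %H:%M")
        by_user[user].append((timestamp, activity))
    return {
        user: "".join(activity for _, activity in sorted(events))
        for user, events in by_user.items()
    }


def mean_satisfaction_by_pattern(sequences, satisfaction):
    """Group users with identical sequences and average their scores."""
    groups = defaultdict(list)
    for user, seq in sequences.items():
        groups[seq].append(satisfaction[user])
    return {seq: mean(scores) for seq, scores in groups.items()}


sequences = build_sequences(rows)
by_pattern = mean_satisfaction_by_pattern(sequences, satisfaction)
print(sequences)
print(by_pattern)
```

This sketch ignores the phase column; it could be folded into the sequence alphabet (for example, by pairing each activity with its phase) if phases matter for the patterns of interest.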

To perform this analysis, I identified two approaches: process mining (with PM4Py, a Python library) and sequence analysis (with TraMineR, an R library).

However, I am not sure which one would better fit my needs. I am a total beginner in both areas. Any insights to help me make a good decision here?

