## #StackBounty: #time-series #probability #classification #bernoulli-distribution #sequential-pattern-mining Sequential classification, c…

### Bounty: 50

What is the best way to combine outputs from a binary classifier, which outputs probabilities, and is applied to a sequence of non-iid inputs?

Here’s a scenario: Say I have a classifier which does an OK, but not great, job of classifying whether or not a cat is in an image. I feed the classifier frames from a video, and get as output a sequence of probabilities, near one if a cat is present, near zero if not.

The inputs are clearly not independent: if a cat is present in one frame, it will most likely be present in the next frame as well. Say I have the following sequence of predictions from the classifier (obviously there are more than six frames in one hour of video):

• 12pm to 1pm: $$[0.1, 0.3, 0.6, 0.4, 0.2, 0.1]$$
• 1pm to 2pm: $$[0.1, 0.2, 0.45, 0.45, 0.48, 0.2]$$
• 2pm to 3pm: $$[0.1, 0.1, 0.2, 0.1, 0.2, 0.1]$$

The classifier answers the question, “What is the probability a cat is present in this video frame”. But can I use these outputs to answer the following questions?

1. What is the probability there was a cat in the video between 12 and 1pm? Between 1 and 2pm? Between 2pm and 3pm?
2. Given say, a day of video, what is the probability that we have seen a cat at least once? Probability we have seen a cat exactly twice?

My first attempt at this problem was simply to threshold the classifier at, say, 0.5. In that case, for question 1, we would decide there was a cat between 12 and 1pm but not between 1 and 3pm, despite the fact that the sum of the probabilities between 1 and 2pm is much higher than between 2 and 3pm.

I could also imagine this as a sequence of Bernoulli trials, where one sample is drawn for each probability output from the classifier. Given a sequence, one could simulate this to answer these questions. Maybe this is unsatisfactory though, because it treats each frame as iid? I think a sequence of high probabilities should provide more evidence for the presence of a cat than the same high probabilities in a random order.
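As a reference point for the Bernoulli idea, here is a minimal sketch that treats every frame as an independent trial. That is exactly the assumption questioned above, so the numbers are only a baseline. The function names are hypothetical. Under independence, the probability of at least one positive frame is 1 minus the product of the (1 - p) terms, and the number of positive frames follows a Poisson binomial distribution, computed here by dynamic programming.

```python
from math import prod

def prob_at_least_one(probs):
    """P(at least one positive frame), assuming independent frames."""
    return 1.0 - prod(1.0 - p for p in probs)

def positive_frame_count_pmf(probs):
    """pmf[k] = P(exactly k positive frames), assuming independent frames.

    Note: this counts positive *frames*, not distinct cat visits.
    """
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)   # this frame is negative
            nxt[k + 1] += mass * p       # this frame is positive
        pmf = nxt
    return pmf

hours = {
    "12pm to 1pm": [0.1, 0.3, 0.6, 0.4, 0.2, 0.1],
    "1pm to 2pm": [0.1, 0.2, 0.45, 0.45, 0.48, 0.2],
    "2pm to 3pm": [0.1, 0.1, 0.2, 0.1, 0.2, 0.1],
}
for label, probs in hours.items():
    print(label, round(prob_at_least_one(probs), 3))
```

For the three example hours this gives roughly 0.89, 0.91 and 0.58, so even the independence baseline ranks 1pm to 2pm above 2pm to 3pm, unlike the 0.5 threshold. It does not reward runs of consecutive high probabilities, which is the concern raised above.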

Get this bounty!!!

## #StackBounty: #classification #cross-validation #scikit-learn #hyperparameter #ensemble Should I perform nested CV with Grid Search to …

### Bounty: 50

I’m classifying 8 types of hand gestures with stacking models. I first split the data into training and test sets, then used `GridSearchCV` to tune the hyper-parameters.

Here’s the code:

```python
import numpy as np
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# One parameter grid per model, in the same order as models_to_train.
param_grid = [
    {
        # Random forest
        'bootstrap': [True, False],
        'max_depth': [40, 50, 60, 70, 80],
        # 'max_features': [2, 3],
        'min_samples_leaf': [3, 4, 5],
        'min_samples_split': [8, 10, 12],
        'n_estimators': [10, 15, 20, 25],
        'criterion': ['gini', 'entropy'],
        # 'random_state': ...  (the value is missing in the source)
    },
    {
        # K Nearest Neighbours
        'n_neighbors': [5, 6, 7, 9, 11],
        'leaf_size': [1, 3, 5, 7],
        'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
        'metric': ['euclidean', 'manhattan'],
    },
    {
        # SVM
        'C': list(np.arange(1, 5, 0.01)),
        'gamma': ['scale', 'auto'],
        'kernel': ['rbf', 'poly', 'sigmoid', 'linear'],
        'decision_function_shape': ['ovo', 'ovr'],
        # 'random_state': ...  (the value is missing in the source)
    },
]

models_to_train = [RandomForestClassifier(), KNeighborsClassifier(), svm.SVC()]

# Tune each model on the training split (data_train, label_train)
# and keep the best estimator found for it.
final_models = []
for i, model in enumerate(models_to_train):
    params = param_grid[i]

    clf = GridSearchCV(estimator=model, param_grid=params, cv=20, scoring='accuracy').fit(data_train, label_train)
    final_models.append(clf.best_estimator_)
```

I then built a stacking model from the best estimators returned by `GridSearchCV`, trained it on the training data and evaluated it on the test data:

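```python
from sklearn.ensemble import StackingClassifier
from sklearn.metrics import accuracy_score

# The list indices were lost in the source; they are inferred here from the
# order of models_to_train (0: random forest, 1: KNN, 2: SVM).
estimators = [
    ('rf', final_models[0]),
    ('knn', final_models[1]),
]
clf = StackingClassifier(
    estimators=estimators, final_estimator=final_models[2]
)

category_predicted = clf.fit(data_train, label_train).predict(data_test)
acc = accuracy_score(label_test, category_predicted) * 100
```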

My question is:

I performed the train-test split at the beginning and did not use nested CV, because I thought it would increase the computation time a lot given that I am using an ensemble model. The model produced very good accuracy, more than 95%. Is it likely that the model would give much lower accuracy if the train-test split changed? If so, should I drop the initial train-test split and instead perform nested CV with grid search on the entire data (as described here)?
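For reference, nested CV with grid search can be sketched in scikit-learn by passing a `GridSearchCV` object to `cross_val_score`, so the hyper-parameter search is repeated inside every outer fold. This is a minimal sketch on synthetic data from `make_classification`, with a deliberately tiny grid and a single random forest rather than the stacking model above; the fold counts and grid values are assumptions chosen only to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for the gesture data (assumption: 3 classes, 10 features).
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=0
)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # tuning
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # evaluation

search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={"max_depth": [3, None], "min_samples_leaf": [1, 3]},
    cv=inner_cv,
    scoring="accuracy",
)

# Each outer fold re-runs the whole grid search on its own training part,
# then scores the refitted best model on the held-out part.
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="accuracy")
print(outer_scores.mean(), outer_scores.std())
```

The spread of `outer_scores` gives some indication of how much the accuracy depends on the particular split, which a single train-test split cannot show.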

## #StackBounty: #classification #natural-language #active-learning Implementing active learning in practice

### Bounty: 150

I am working on a project where we are coding written survey responses about a person’s mother tongue. For example, if the person wrote in "English" it would be coded as 0001, "Spanish" would be coded as 0002, and so on. To do this we have created a reference file that catches everything we expect to see. For instance, the reference file contains English, Spanish, etc.

The issue is that we have potentially millions of write-in responses that may not match the reference file, for example because of spelling mistakes or colloquial terms, and sometimes the response is just nonsense. We would like to use machine learning to process the write-ins that the reference file does not catch. The problem is that we do not have "true" values to train on beyond the reference file.

We could try using the reference file as a training set, but the performance will likely be poor. We do have "experts" who can look at a write-in and assign the correct code, so I was wondering if we could build an initial model from the reference file and use active learning to improve it in the following way.

1. Build initial model on the reference file
2. Select two samples from the records that do not match the reference file: the first is a simple random sample (SRS) from the population, used to analyze performance; the second is selected from records that were particularly difficult for the model to predict (i.e. those with nearly equal probabilities between classes).
3. Have our expert code both samples
4. Calculate performance on the first sample
5. Retrain model with reference file data plus data from both samples
6. Repeat 1-5 until performance stops increasing significantly

Does this approach sound valid? Am I leaking data somehow by doing this? Is there a better approach?
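Step 2 of the procedure above can be sketched as follows. This is a minimal illustration, not a statement of how it must be done: the function name `select_samples` is hypothetical, "difficult" records are taken to be those with the smallest margin between the two most probable classes (one possible reading of "equal probabilities between classes"), and the two samples are kept disjoint by assumption.

```python
import numpy as np

def select_samples(proba, n_eval, n_query, seed=0):
    """Pick an SRS for evaluation and the most uncertain records for labeling.

    proba: array of shape (n_records, n_classes) with predicted probabilities.
    Returns (eval_idx, query_idx) as index arrays into proba.
    """
    rng = np.random.default_rng(seed)
    n = proba.shape[0]

    # Simple random sample, drawn without looking at the model output.
    eval_idx = rng.choice(n, size=n_eval, replace=False)

    # Margin between the two most probable classes: small margin = uncertain.
    top2 = np.sort(proba, axis=1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]

    # Query only records that are not already in the evaluation sample.
    candidates = np.setdiff1d(np.arange(n), eval_idx)
    query_idx = candidates[np.argsort(margin[candidates])[:n_query]]
    return eval_idx, query_idx

proba = np.array([[0.9, 0.1], [0.5, 0.5], [0.6, 0.4],
                  [0.99, 0.01], [0.55, 0.45], [0.8, 0.2]])
eval_idx, query_idx = select_samples(proba, n_eval=2, n_query=2)
print(eval_idx, query_idx)
```

Because the evaluation sample is drawn at random, it reflects the population; the uncertainty-based sample does not, so performance would be measured on the former only.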


## #StackBounty: #classification #scikit-learn #computer-vision #preprocessing #dimensionality-reduction Reducing the size of a dataset

### Bounty: 100

I am trying to classify gestures using the classification algorithms in Python’s scikit-learn library. I have collected depth images for this purpose, with 200 samples per gesture. Each gesture is made up of 25 frames, and each frame is of size 240×420. I tried frame-wise PCA for dimensionality reduction, to shrink each gesture (200 samples each) so that it is easier to run on my machine. Even so, the data is too large to run on my machine when there are more than 4 gestures to classify. I am looking for methods to make it run on my machine.
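To make the size problem concrete, here is a small sketch of the arithmetic using the figures given above (200 samples, 25 frames, 240×420 pixels), together with one simple way to shrink a frame before any PCA: averaging over non-overlapping pixel blocks. The block size of 4, the float32 storage, and the helper names are assumptions for illustration only.

```python
import numpy as np

N_SAMPLES, N_FRAMES, HEIGHT, WIDTH = 200, 25, 240, 420

def gesture_bytes(height, width, dtype):
    """Memory needed to hold all samples of one gesture as a dense array."""
    return N_SAMPLES * N_FRAMES * height * width * np.dtype(dtype).itemsize

def block_mean(frames, factor):
    """Downsample frames of shape (n, h, w) by averaging factor x factor blocks."""
    n, h, w = frames.shape
    h2, w2 = h // factor, w // factor
    trimmed = frames[:, :h2 * factor, :w2 * factor]
    return trimmed.reshape(n, h2, factor, w2, factor).mean(axis=(2, 4))

print(gesture_bytes(HEIGHT, WIDTH, np.float64) / 1e9, "GB per gesture, float64")
print(gesture_bytes(HEIGHT // 4, WIDTH // 4, np.float32) / 1e9, "GB after 4x4 averaging, float32")

# Random data standing in for one recorded sample of 25 depth frames.
one_sample = np.random.default_rng(0).random((N_FRAMES, HEIGHT, WIDTH), dtype=np.float32)
small = block_mean(one_sample, 4)
print(small.shape)
```

At full resolution in float64, a single gesture takes about 4 GB, which is consistent with running out of memory beyond a few gestures; whether 4×4 averaging keeps enough detail for the gestures in question would need to be checked.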


## #StackBounty: #classification significance test and sample size estimation for classifiers

### Bounty: 50

Which test tells whether, e.g., an F1 score of 0.69 for classifier A and 0.72 for classifier B are truly different and not just different by chance? (For mean values one would use a "t-test" and obtain a "p-value".) I have access to the underlying data and not only to the F1 scores.

… and how can one estimate the sample size needed in the test set in order not to miss a true difference between the F1 scores, as in the example above? (For mean values one would use a "power analysis".) In other words, if I want to know which classifier (A or B) is truly better at a given significance level, how many test cases do I need?

Google just returns some research papers, but I need established standard methods for the sample size and the significance test (ideally implemented as a Python package).
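One commonly used option, offered here as a sketch and not as the established standard asked for, is a paired bootstrap over the test cases: resample the test set with replacement, recompute both F1 scores on each resample, and look at the distribution of the difference. The code assumes binary labels coded 0/1 and that both classifiers were evaluated on the same test cases; the function names are hypothetical.

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """F1 score for binary labels coded as 0/1 (0.0 if undefined)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def paired_bootstrap_f1_diff(y_true, pred_a, pred_b, n_boot=2000, seed=0):
    """Bootstrap the difference F1(B) - F1(A) over shared test cases.

    Returns (mean difference, 2.5th percentile, 97.5th percentile).
    """
    rng = np.random.default_rng(seed)
    n = len(y_true)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # same resampled cases for A and B
        diffs[b] = f1_binary(y_true[idx], pred_b[idx]) - f1_binary(y_true[idx], pred_a[idx])
    low, high = np.percentile(diffs, [2.5, 97.5])
    return diffs.mean(), low, high
```

If the resulting interval excludes 0, the difference is unlikely to be due to the choice of test cases alone. Running the same function on simulated test sets of increasing size is one way to approach the sample-size question.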

— EDIT —

Thanks for pointing out this post in the comments – it points in the right direction but unfortunately does not solve my two related problems as stated above.


## #StackBounty: #machine-learning #classification #probability #bayesian Problem understanding probabilistic generative models for classi…

### Bounty: 50

I am a student and I am studying machine learning. I am focusing on probabilistic generative models for classification and I am having some troubles understanding this topic.

My professor’s slides contain the following, which I don’t understand: (slide image not preserved)

So far, I have understood that in probabilistic generative models we want to estimate $$P(C_i|x)$$, the probability of class $$i$$ given a data point $$x$$, using the likelihood and Bayes’ theorem.

The slide starts by writing Bayes’ rule, but then says that we can write this as a sigmoid. Why?

If I had to guess, I would say it is because the sigmoid gives a number between $$0$$ and $$1$$, and therefore a probability, but this is only a guess.

Moreover, it continues by saying that we can use a Gaussian distribution for $$P(x|C_i)$$, so that $$P(x|C_i)=N(\mu, \sigma)$$, and so: (slide image not preserved)

I don’t know if my question is clear, so sorry if it is not; I am really confused. If it is not clear, please tell me and I will try to edit it. Thanks in advance.

Note: if it can be useful, this has been taken from the Bishop book at page 197
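For reference, the step from Bayes’ rule to the sigmoid in the two-class case can be written out as below. This follows the standard two-class derivation; the symbols $$a$$ and $$\sigma$$ are introduced here and may differ from the notation on the slide, which is not available. Dividing numerator and denominator by $$P(x|C_1)P(C_1)$$ gives:

$$P(C_1|x) = \frac{P(x|C_1)P(C_1)}{P(x|C_1)P(C_1) + P(x|C_2)P(C_2)} = \frac{1}{1 + \frac{P(x|C_2)P(C_2)}{P(x|C_1)P(C_1)}} = \frac{1}{1 + \exp(-a)} = \sigma(a)$$

$$a = \ln \frac{P(x|C_1)P(C_1)}{P(x|C_2)P(C_2)}$$

So the sigmoid is not an extra modelling choice here: it is Bayes’ rule rewritten in terms of the log-odds $$a$$.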


## #StackBounty: #classification #calibration #scoring-rules #isotonic calibration of classifier scores: isotonic regression

### Bounty: 100

I am investigating the isotonic regression approach to calibrate the scores from a classifier.

If I understand correctly, we do the following. First, we get the calibration plot (or reliability curve), which shows the mean predicted value against the fraction of positives. Then we want the "fraction of positives" to be a non-decreasing function of the "mean predicted value", which is achieved by isotonic regression.

Here is my confusion: how come that in some cases the "fraction of positives" is not a non-decreasing function? For example, here (plot not preserved) the calibrated curve is not an increasing function. The plot is taken from

https://www.svds.com/classifiers2/

One can find other examples with the same issue. I have read the original paper

B. Zadrozny and C. Elkan. Transforming classifier scores into accurate multiclass probability estimates.

In their results the calibrated function is monotone.
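To illustrate what isotonic regression enforces, here is a minimal pure-Python sketch of the pool-adjacent-violators idea: whenever a value is lower than the block before it, the two are merged and replaced by their mean. The function name `pava` and the example numbers are made up for illustration; in practice a library implementation such as scikit-learn’s `IsotonicRegression` would be used.

```python
def pava(values):
    """Non-decreasing least-squares fit to a sequence (equal weights)."""
    blocks = []  # each block is [sum of values, number of values]
    for v in values:
        blocks.append([float(v), 1])
        # Merge backwards while the previous block's mean exceeds this one's.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted

# Fraction of positives per bin, ordered by mean predicted value (made-up numbers).
fraction_of_positives = [0.1, 0.4, 0.3, 0.8, 0.6, 0.9]
print(pava(fraction_of_positives))
```

On the data it is fitted to, the result is non-decreasing by construction. A reliability curve is usually drawn by binning predictions on a separate test set, so sampling noise in the bins could make the plotted curve dip even when the calibration map itself is monotone; whether that explains the plot in question is an assumption.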


## #StackBounty: #bayesian #classification #prior Priors for discriminative methods?

### Bounty: 50

Say we want to build a classifier for a binary classification problem using a discriminative method (e.g. SVM) and be able to impose a prior on the classes.

For example, let’s assume that we want to use the prior $$\text{Beta}(10,20)$$ on the positive class. It would look like this: (plot not preserved)

How can I estimate the posterior probability of classification resulting from combining the output of my discriminative predictor with the above prior? What steps would I need to take to compute this posterior probability?
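One possible reading of the question, stated here as an assumption, is that the prior describes the proportion of positives, and that the classifier’s scores have already been calibrated into probabilities under the class proportion of the training data. In that case Bayes’ rule gives a simple re-weighting of the odds. The sketch below plugs in the mean of the Beta prior, $$10/(10+20)$$, which is a simplification and does not integrate over the full prior; the function name and the training proportion of 0.5 are hypothetical.

```python
def reweight_posterior(p, train_prior, new_prior):
    """Adjust a calibrated P(positive | x) from one class prior to another.

    p: probability produced under class proportion train_prior.
    Returns the probability implied by new_prior, with the same likelihoods.
    """
    pos = p * new_prior / train_prior
    neg = (1.0 - p) * (1.0 - new_prior) / (1.0 - train_prior)
    return pos / (pos + neg)

beta_mean = 10 / (10 + 20)  # mean of Beta(10, 20)
for p in (0.2, 0.5, 0.8):
    print(p, round(reweight_posterior(p, train_prior=0.5, new_prior=beta_mean), 3))
```

An SVM decision value is not a probability, so a calibration step would have to come first for this to apply.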


## Background

I would like to build a model that predicts a month label $$\mathbf{y}$$ from a given set of features $$\mathbf{X}$$. The data structure is as follows.

• $$\mathbf{X} : N_{samples} \times N_{features}$$.
• $$\mathbf{y}: N_{samples} \times 1$$, which takes values in $$1,2,\cdots,12$$.

It would be more helpful to have the predicted probability of each label as output, since I would like to make use of the prediction uncertainty. Any multi-class algorithm could be used to build such a model, and I have tried some of scikit-learn’s multiclass algorithms.

However, I found that none of them was very useful, because of the following problem.

## Problem: I cannot make use of class similarity

By class similarity, I mean the similar characteristics that temporally adjacent months generally share. Most algorithms do not provide any way to make use of such prior knowledge. In other words, they miss the following requirement:

It is quite okay to predict January(1) for February(2), but very undesirable to predict August(8) for February(2)

For instance, I may try a multi-layer perceptron (MLP) classifier to build a model. However, algorithms such as MLP are designed for problems such as classification of hand-written digits. In those problems, predicting 1 for 2 is just as undesirable as predicting 8 for 2.

In other words, most algorithms are agnostic to the relative similarity across labels. If a classifier could exploit such class similarity as prior knowledge, it might perform much better. If I were to express such a prior as a distribution, I might choose a cosine-shaped distribution across months.

Some may suggest algorithms based on linear models, such as one-vs-rest logistic regression. However, since months wrap around, such regression models may not work well. For instance, treating $$\mathbf{y}$$ as a continuous variable misses the fact that January(1) and December(12) are actually very similar.
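The wrap-around structure described above can be made explicit in code. The sketch below shows a circular distance between months and a cosine-shaped "soft" target distribution centred on the true month, which could stand in for a one-hot label when training a probabilistic classifier. The function names and the sharpness parameter `kappa` are hypothetical, and this is one possible way to encode the prior, not a recommendation of a specific algorithm.

```python
import numpy as np

N_MONTHS = 12

def circular_distance(a, b):
    """Distance between months a and b (1..12) on a 12-month circle."""
    d = abs(a - b) % N_MONTHS
    return min(d, N_MONTHS - d)

def soft_month_target(month, kappa=2.0):
    """Cosine-shaped distribution over months 1..12, peaked at `month`.

    Larger kappa concentrates more mass on the true month.
    """
    months = np.arange(1, N_MONTHS + 1)
    angle = 2 * np.pi * (months - month) / N_MONTHS
    weights = np.exp(kappa * np.cos(angle))
    return weights / weights.sum()

print(circular_distance(1, 12), circular_distance(2, 8))
print(np.round(soft_month_target(2), 3))
```

With such targets, predicting January for February costs less than predicting August for February, and December is treated as a neighbour of January.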

## Questions

As a beginner in machine learning, I am not very familiar with the available algorithms. Any help is welcome, including ideas about my problem or recommendations of related papers, threads, or websites.
