#StackBounty: #machine-learning #maximum-likelihood #interpretation #model-selection #gaussian-process GP: How to select a model for a …

Bounty: 50

I have fitted a Gaussian Process (GP) to perform a binary classification task. The dataset is balanced, so I have an equal number of samples with 0/1 labels for training. The covariance function used is an RBF kernel, whose "length scale" hyperparameter needs to be tuned.

To make sure I am not overfitting the data and that I am selecting proper kernel hyperparameters, I performed a grid search over the percentage of training data and the length scale, recording the overall accuracy (OA) and the log-marginal likelihood (LML) on the test set as evaluation metrics.

You can see the results in the following image (left for OA, right for LML):

OA and LML after fitting a GP to data

EDIT: I re-uploaded the image with the normalized log-marginal likelihood. Common sense indicates that the optimal model should strike a trade-off between model complexity and accuracy. Thus, these models lie somewhere between 30%-40% of training data and a length scale of 0.7-0.9 for the RBF kernel within the GP. This is great for model selection, but unfortunately I think I still cannot answer the questions below… Any new insights on the interpretation of the LML?

Overall accuracy and normalized log-marginal likelihood

After exploring the effect of training size and the hyperparameter on these metrics, I think it would be safe to select a model using at least 30% of the data for training and an RBF length scale of 0.1. However, I do not understand the role of the LML in selecting the model (or even whether it needs to be considered at all). Common sense suggests it should be as high as possible, i.e. as close to zero as it gets here (around -400, shown in yellow), which would place my best model at a training size of 10-20% and length_scale = 0.1.
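For reference, here is a minimal sketch of how one cell of such a grid can be evaluated. Scikit-learn's GaussianProcessClassifier is my own assumption (the post does not name a library), and the optimizer is disabled so the length scale stays fixed at the grid value:

import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import train_test_split

def evaluate_cell(X, y, train_fraction, length_scale):
    """Return (overall accuracy on the test split, log-marginal likelihood)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=train_fraction, stratify=y, random_state=0)
    gpc = GaussianProcessClassifier(kernel=RBF(length_scale=length_scale),
                                    optimizer=None)  # keep length_scale fixed
    gpc.fit(X_tr, y_tr)
    # Note: scikit-learn's LML refers to the training data, and its magnitude
    # grows with the number of training points, which is one reason raw LML
    # values are hard to compare across training sizes.
    return gpc.score(X_te, y_te), gpc.log_marginal_likelihood_value_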

I have seen that other people (here and here) have asked (somewhat) similar questions about the LML, but I can’t find ideas that help me understand the link between good OA metrics and the LML. In other words, I am having trouble interpreting the LML.

Concretely, I would like to get more insight on:

  1. What is the impact of a high/low LML on the predictive power of the GP?
  2. How much better is a model with LML = -400 compared to one with LML = -700?
  3. What does it mean to have an LML of -400? Isn’t -400 a lot for a statistical metric?
  4. Did I really find a solution to my problem with these LML metrics?

Thanks for your help!


Get this bounty!!!

#StackBounty: #machine-learning #classification #statistical-significance How to find a statistical significance difference of classifi…

Bounty: 50

I am trying to compare some metrics on the same data set, so I calculated several performance measures and obtained the results for each metric.

My question is how to know whether there are significant differences between the results. Is there any statistical test that can help find a statistically significant difference for each row of the table below? Would a t-test or ANOVA work for that?

For example, in the table below, is there a statistically significant difference between the accuracies 95.43, 95.78, 96.66, …, and similarly for the other performance measures such as sensitivity, F1 score, etc.? I am also not familiar with Kappa and McNemar’s test p-values from classification results.

Note: I have checked other related questions, but I did not find a helpful answer. Also, my question is not only about accuracy but also about the other performance measures.

I would really appreciate an informative, detailed answer with an application.

[Image: table of classification performance measures for several models]
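For what it’s worth, here is a minimal sketch of McNemar’s test for comparing two classifiers evaluated on the same test set; statsmodels and the counts below are my own assumptions, purely for illustration:

import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical 2x2 disagreement table between classifiers A and B on the
# same test samples:
#                    B correct   B wrong
#   A correct           850         15
#   A wrong               8         27
table = np.array([[850, 15],
                  [  8, 27]])

# The exact test uses only the off-diagonal counts (samples that exactly one
# classifier gets wrong); a small p-value suggests the accuracies differ.
result = mcnemar(table, exact=True)
print(result.statistic, result.pvalue)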


Get this bounty!!!


#StackBounty: #machine-learning #cnn #computer-vision Classification problem with many images per instance

Bounty: 50

I am working on the following kind of classification problem: I have to classify every instance as class A or class B using several images of that instance. That is, every training example comes not with one image (the usual setting in image classification) but with many, and the number of images per instance is not fixed: instance 1 may have 3 images from which we must classify it as A or B, while instance 2 may have 5.

As in any machine learning problem, I am provided with many labelled examples and have to build a classifier.

Although ideas are also welcome, I am mainly looking for documented ways to attack this kind of problem (Kaggle competitions, papers or books).

My main idea was the following: train a model $f$ that, given one image, outputs the probability of that image belonging to class A. Then, for every instance, evaluate $f$ on each of its images and compute aggregate statistics of the resulting probabilities, such as the mean, median, maximum and minimum. Finally, train a model $g$ that takes these aggregates as inputs, and use the composition of $f$, the aggregation and $g$ as the final model (a rough sketch is given below). This idea is a bit simple, so I am looking for something better.
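A sketch of that two-stage pipeline follows; the function names and the use of scikit-learn for $g$ are my own assumptions, just to make the idea concrete:

import numpy as np
from sklearn.linear_model import LogisticRegression

def aggregate(per_image_probs):
    """Collapse a variable-length list of per-image probabilities into a
    fixed-length feature vector for the instance-level model g."""
    p = np.asarray(per_image_probs)
    return np.array([p.mean(), np.median(p), p.max(), p.min()])

def instance_features(f, instances):
    """instances: list of arrays, one (n_images_i, n_features) array per instance.
    f: any per-image model exposing predict_proba."""
    return np.vstack([aggregate(f.predict_proba(imgs)[:, 1]) for imgs in instances])

# g is then trained on the aggregated statistics, e.g.:
# g = LogisticRegression().fit(instance_features(f, train_instances), train_labels)
# predictions = g.predict(instance_features(f, test_instances))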


Get this bounty!!!

#StackBounty: #machine-learning #online #radial-basis #rbf-network #rbf-kernel Adding new center in an RBF network without memorizing p…

Bounty: 100

Suppose we train an RBF network by minimizing the LSE on a set of training points, and we do it incrementally in an online fashion. So basically we update the QR factorization using e.g. Givens rotations whenever a new training example is presented.

Suppose now that, at some point and due to some criterion, we decide it is time to add a new center because we judged a new point to be a novelty. This amounts to adding a new feature to the regression problem (the RBF kernel centered at that point), and this can, in theory, again be done by updating the QR factorization with Givens rotations.

The problem is that this would require computing the new feature on all of the previous examples in order to correctly update the reduced QR factorization, because we would need to append the column corresponding to the new feature, whose elements are the values of that feature on ALL of the training examples, before proceeding with the reduced QR update. Is it possible to do this, with some controlled error I presume, without having to store all previous examples?
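For reference, the standard column-append update makes the difficulty explicit. With the thin factorization $A = QR$, where each row of $A$ holds the existing features evaluated at one past example, appending the new feature column $a$ (the new RBF evaluated at all past examples) requires

$$ r = Q^\top a, \qquad q = a - Qr, \qquad \rho = \lVert q \rVert_2, \qquad \begin{bmatrix} A & a \end{bmatrix} = \begin{bmatrix} Q & q/\rho \end{bmatrix} \begin{bmatrix} R & r \\ 0 & \rho \end{bmatrix}, $$

so the exact update needs $a$ in full, i.e. the new kernel evaluated at every stored example; any memory-free scheme has to approximate $r$ and $\rho$ instead.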


Get this bounty!!!

#StackBounty: #algorithm #machine-learning #svm Understanding Support Vector Regression (SVR)

Bounty: 50

I’m working with SVR, using this resource. Everything is super clear with the epsilon-insensitive loss function (from the figure): the prediction comes with a tube that covers most training samples and generalizes the bounds using support vectors.

[Figures from the linked resource: epsilon-insensitive loss and the SVR tube]

Then we have this explanation: this can be described by introducing (non-negative) slack variables $\xi_i, \xi_i^*$ to measure the deviation of training samples outside the $\varepsilon$-insensitive zone. I understand this error outside the tube, but I don’t know how we can use it in the optimization. Could somebody explain this?

[Figures from the linked resource: the slack variables and the SVR optimization problem]
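For context, the standard soft-margin SVR primal (which, as far as I can tell, is the formulation the linked resource uses) shows exactly where the slack variables enter the optimization:

$$ \min_{w,\,b,\,\xi,\,\xi^*} \;\; \tfrac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*) $$

subject to

$$ y_i - \langle w, x_i \rangle - b \le \varepsilon + \xi_i, \qquad \langle w, x_i \rangle + b - y_i \le \varepsilon + \xi_i^*, \qquad \xi_i,\ \xi_i^* \ge 0, $$

so deviations beyond the $\varepsilon$-tube are not forbidden but penalized linearly in the objective, with $C$ controlling how heavily.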


Locally, I’m trying to implement a very simple optimization solution without libraries. This is what I have for the loss function:

import numpy as np

# Prediction: apply the kernel function element-wise, then take a linear
# combination with the weights. Linear (identity) kernel by default.
def hypothesis(x, weight, k=None):
    k = k if k else lambda z: z
    k_x = np.vectorize(k)(x)
    return np.dot(k_x, np.transpose(weight))

.......

def boundary_loss(x, y, weight, epsilon):
    prediction = hypothesis(x, weight)

    # Absolute deviation of every prediction from its target.
    scatter = np.absolute(np.transpose(y) - prediction)

    # Epsilon-insensitive thresholding: deviations inside the tube
    # (smaller than epsilon) contribute nothing to the loss.
    bound = lambda z: z if z >= epsilon else 0

    return np.sum(np.vectorize(bound)(scatter))
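For completeness, a small hypothetical usage example (the numbers are made up; linear/identity kernel):

# 4 samples with 2 features each, and a weight vector of matching length.
X = np.array([[1.0, 2.0], [2.0, 1.0], [0.5, 0.5], [3.0, 0.0]])
y = np.array([3.2, 2.9, 1.1, 2.8])
w = np.array([1.0, 1.0])

# Deviations are [0.2, 0.1, 0.1, 0.2]; with epsilon = 0.15 only the two
# deviations of 0.2 are counted, so the loss is approximately 0.4.
print(boundary_loss(X, y, w, epsilon=0.15))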


Get this bounty!!!

#StackBounty: #machine-learning #cross-validation #sampling #feature-selection Feature Distribution in Cross-Validation

Bounty: 50

In the case of binary classification, stratified cross-validation only ensures that each fold contains roughly the same proportion of the two class labels.

When does it make sense to also ensure that the feature distribution is maintained?

(I would expect most algorithms to be biased not only by the class distribution.)
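One practical way to approximate this, sketched below under my own assumptions (scikit-learn, a single continuous feature to control for, and an arbitrary quantile binning), is to stratify on the combination of the class label and a discretized version of the feature:

import numpy as np
from sklearn.model_selection import StratifiedKFold

def label_and_feature_stratified_folds(X, y, feature_idx, n_bins=4, n_splits=5):
    """Yield CV folds stratified jointly on the class label and on quantile
    bins of one feature, so both distributions are roughly preserved per fold."""
    feature = X[:, feature_idx]
    edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])
    feature_bin = np.digitize(feature, edges)
    # Combined stratification key: one stratum per (label, feature-bin) pair.
    strata = [f"{label}_{b}" for label, b in zip(y, feature_bin)]
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    return skf.split(X, strata)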


Get this bounty!!!

#StackBounty: #machine-learning #poisson-distribution #survey Probability of successful charging based on independent historical or sam…

Bounty: 100

I have tried my best to find a solution, but failed to find a decent one.
Imagine I want to charge a customer and I have the last few days of charge-attempt data:

  +------ Date -------+---- Time ----+--- Amount ---+---- Status -----+
  |     2018/05/05    |     08:00    |     500      |       --        |  
  |     2018/05/05    |     12:00    |     500      |       --        |  
  |     2018/05/05    |     16:00    |     500      |       --        |  
  |     2018/05/05    |     20:00    |     500      |       OK        | <-
  +-------------------+--------------+--------------+-----------------+
  |     2018/05/06    |     08:00    |     500      |       --        |  
  |     2018/05/06    |     12:00    |     500      |       --        |  
  |     2018/05/06    |     16:00    |     500      |       OK        |  
  +-------------------+--------------+--------------+-----------------+
  |     2018/05/07    |     08:00    |     500      |       --        |  
  |     2018/05/07    |     12:00    |     500      |       --        |  
  |     2018/05/07    |     16:00    |     500      |       OK        |  <-
  +-------------------+--------------+--------------+-----------------+
  |     2018/05/08    |     08:00    |     500      |       --        |  
  |     2018/05/08    |     12:00    |     500      |       --        |  
  |     2018/05/08    |     20:00    |     500      |       --        |
  |     2018/05/08    |     22:00    |     500      |       OK        |  <-
  +-------------------+--------------+--------------+-----------------+

1- What is the best way to find the probability of a successful charge at 11:00 tomorrow?

2- I also have access to 2K users’ historical data. How can I use that data to improve the accuracy of the probability estimate?
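As a point of reference, a crude empirical baseline under my own assumptions (pandas, column names mirroring the table, add-one smoothing) would be to estimate a per-hour success rate from the attempt history:

import pandas as pd

# Hypothetical frame mirroring the table above (column names are assumptions).
attempts = pd.DataFrame({
    "time":   ["08:00", "12:00", "16:00", "20:00", "08:00", "12:00", "16:00"],
    "status": ["--",    "--",    "--",    "OK",    "--",    "--",    "OK"],
})

attempts["hour"] = attempts["time"].str.slice(0, 2).astype(int)
attempts["success"] = (attempts["status"] == "OK").astype(int)

# Add-one (Laplace) smoothed success rate per hour of day. Hours with no
# attempts at all (such as 11:00 in the table) get no row here, which is
# exactly where pooling the 2K other users' histories (question 2) helps.
per_hour = attempts.groupby("hour")["success"].agg(["sum", "count"])
per_hour["p_success"] = (per_hour["sum"] + 1) / (per_hour["count"] + 2)
print(per_hour)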


Get this bounty!!!

#StackBounty: #machine-learning #neural-network #recommender-system #supervised-learning #k-nn Taking Neural Network's false positi…

Bounty: 50

I am creating a recommendation system and considering two parallel ways of formalizing the problem. One is classical, based on proximity (recommend the product to a customer if a majority of the 2k+1 nearest customers have the product); the other I have trouble understanding, but it seems valid to some extent.

The approach I’m thinking about is:

1) Fit a highly regularized neural network (to make sure it doesn’t overfit the training set) for a classification task that predicts whether a person does or doesn’t have a given product

2) Make sure test accuracy is as close to train accuracy as possible

3) Take the false positives (customers who don’t have the product but for whom the NN predicted that they do), computed on the whole dataset (training set included), as the result: the people I should recommend the product to (a rough sketch follows below)
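A minimal sketch of the approach; scikit-learn’s MLPClassifier, the architecture and the regularization strength are all my own placeholder choices:

import numpy as np
from sklearn.neural_network import MLPClassifier

def candidate_recommendations(X, y):
    """X: customer feature matrix; y: 1 if the customer already has the product.
    Returns indices of customers to recommend the product to."""
    # Step 1: heavily L2-regularized network (alpha is an arbitrary "high" value).
    nn = MLPClassifier(hidden_layer_sizes=(32,), alpha=1.0,
                       max_iter=1000, random_state=0)
    nn.fit(X, y)
    # Step 2 (checking that test accuracy stays close to train accuracy) is
    # omitted here for brevity.
    # Step 3: "false positives" on the whole dataset -- predicted to have the
    # product but not actually having it -- become the recommendation list.
    predicted = nn.predict(X)
    return np.where((predicted == 1) & (y == 0))[0]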

Now, I am aware of why one generally wouldn’t want to take this approach, but I also can’t exactly explain why it wouldn’t return people ‘close’ to each other who ‘should’ have the product, in a sense similar to the kNN-based approach. I’m not sure how to analyse this problem in order to validate, modify or reject the idea altogether.


Get this bounty!!!