#StackBounty: #machine-learning #classification #binary-data #accuracy Difference between cumulative gains chart and cumulative accurac…

Bounty: 50

I am confused about the following:

Here I find the definition of cumulative gains chart as the plot of

  • x-axis: the rate of predicted positive
  • y-axis: true positive rate

It turns out that we can, e.g., order by the probability of a positive outcome and plot the corresponding rate of true positives.

On the other hand, there is the cumulative accuracy profile (CAP), explained here, where it is said to be constructed as:

The CAP of a model represents the cumulative number of positive
outcomes along the y-axis versus the corresponding cumulative number
of a classifying parameter along the x-axis.

Another source for the CAP is “Measuring the Discriminative Power of Rating Systems” by Engelmann, Hayden and Tasche.
Together with the plots, I would say that the two concepts are the same. Is this true?
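To make the definition from the first source concrete, here is a minimal sketch (my own, not taken from either source) of how a cumulative gains curve can be computed from predicted probabilities:

```python
import numpy as np

def cumulative_gains(y_true, y_score):
    """Cumulative gains curve: x = rate of predicted positives, y = true positive rate."""
    order = np.argsort(-np.asarray(y_score))             # rank by descending predicted probability
    y_sorted = np.asarray(y_true)[order]
    captured = np.cumsum(y_sorted)                       # positives captured so far
    x = np.arange(1, len(y_sorted) + 1) / len(y_sorted)  # fraction of cases flagged positive
    y = captured / y_sorted.sum()                        # fraction of all positives captured
    return x, y

# Toy example: 3 positives among 5 cases, scored by some model.
x, y = cumulative_gains([1, 0, 1, 1, 0], [0.9, 0.8, 0.7, 0.3, 0.1])
```

Plotting `y` against `x` gives the gains chart; the CAP construction in the second source orders by the same classifying parameter, which is why the two plots coincide up to axis scaling.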


Get this bounty!!!

#StackBounty: #machine-learning #multimodality Confusion about multimodal machine learning

Bounty: 50

I recently browsed through this tutorial on multimodal data.

Attention: according to the slides, "multimodal" here means features of very different types that express the same thing (think the picture and the voice of someone talking), not a probability distribution with multiple modes.

What I do not understand about the whole approach to multimodal machine learning is why these techniques are applied only to features that obviously express the same thing, and not also to features for which it is unclear whether they express the same thing. Example: you want to predict which webshop visitors are likely to buy something, and you measure, e.g., their mouse movements as well as the time they spend looking at products; these features might be correlated too.
For example, the search for correlations between features that are assumed to express the same thing could also be applied in the case I outlined above. Of course, applying multimodal techniques in the latter case might not yield anything, but it seems no one is even trying.



#StackBounty: #python #machine-learning #scikit-learn #jupyter #decision-tree Plot Interactive Decision Tree in Jupyter Notebook

Bounty: 50

Is there a way to plot a decision tree in a Jupyter Notebook, such that I can interactively explore its nodes? I am thinking about something like this dt (an example from KNIME).

I have found https://planspace.org/20151129-see_sklearn_trees_with_d3/ and https://bl.ocks.org/ajschumacher/65eda1df2b0dd2cf616f, and I know you can run d3 in Jupyter, but I have not found any packages that do that.
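The d3 approaches linked above need the fitted tree in d3's nested-dict format. As a sketch (the helper `tree_to_dict` is my own, not from any package), the conversion from a scikit-learn tree could look like this:

```python
import json
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def tree_to_dict(clf, node=0):
    """Convert a fitted sklearn tree into the nested dict that d3 hierarchies expect."""
    t = clf.tree_
    if t.children_left[node] == -1:                      # -1 marks a leaf node
        return {"name": "leaf", "samples": int(t.n_node_samples[node])}
    return {
        "name": f"X[{t.feature[node]}] <= {t.threshold[node]:.2f}",
        "children": [tree_to_dict(clf, t.children_left[node]),
                     tree_to_dict(clf, t.children_right[node])],
    }

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
d3_json = json.dumps(tree_to_dict(clf))  # feed this to a d3 collapsible-tree snippet
```

The resulting JSON can then be passed to any d3 collapsible-tree example rendered in a notebook cell.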



#StackBounty: #machine-learning #optimization #gradient-descent #sgd How to set up a linear system to interpolate the train data perfec…

Bounty: 50

Consider a (consistent) regression problem (i.e. we are trying to predict a real-valued function, and there are no inconsistencies in the way x’s map to y’s).

I am trying to perfectly fit/interpolate the (train) data set with Gradient Descent with a fixed step size (to better understand Gradient Descent academically):

$$ w^{(t+1)} = w^{(t)} - \eta \, \nabla_w L(w^{(t)}) $$

I’ve tried things empirically by minimizing L2 loss:

$$ L(w) = \| Xw - y \|^2 $$

I noticed that it is sometimes hard to find a step size such that the loss value $L(w)$ reaches zero within machine precision (fit/interpolate the data in the sense of https://arxiv.org/abs/1712.06559). I suspect this depends strongly on the basis/kernels I use, since the gradient and Hessian are:

$$ \nabla L(w) = 2 X^{\top} (Xw - y) $$

$$ \nabla^2 L(w) = 2 X^{\top} X $$

Since I want to use only first-order methods to solve this problem, how do I choose a good step size and/or basis/feature matrix $X$?

If I decide to use, say, Hermite polynomials, why would that be better than other polynomials for fitting/interpolating the data perfectly?

What if I used a Gaussian or Laplacian kernel? How would the kernel matrix $X = K$ change, and how would Gradient Descent be affected? How does the curvature change as I change the kernel matrix? How can I set up the problem so that optimization via (S)GD fits the data perfectly?
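As a concrete illustration (my own sketch, not from the cited paper), the classical step-size rule for this quadratic uses the largest eigenvalue of the Hessian $2X^{\top}X$, and convergence speed is governed by the conditioning of the feature matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Build a deliberately well-conditioned square X (kappa(X) = 3): exact interpolation
# is possible and plain GD converges fast. An ill-conditioned basis slows GD down,
# which is why the choice of basis/kernel matters for interpolating to machine precision.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
X = U @ np.diag(np.linspace(1.0, 3.0, n)) @ V.T
y = rng.standard_normal(n)

# The Hessian of L(w) = ||Xw - y||^2 is 2 X^T X; its largest eigenvalue L bounds
# the curvature, and any fixed step size below 2/L converges on this quadratic.
L = 2 * np.linalg.eigvalsh(X.T @ X).max()
eta = 1.0 / L

w = np.zeros(n)
for _ in range(1000):
    w -= eta * 2 * X.T @ (X @ w - y)      # gradient step

loss = float(np.linalg.norm(X @ w - y) ** 2)
```

With a kernel matrix $X = K$ in place of the feature matrix, the same rule applies: the eigenvalue spread of $K$ (e.g. Gaussian kernels with small bandwidth are nearly singular) controls how small the loss can get in a fixed number of GD steps.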



#StackBounty: #machine-learning #neural-networks #object-detection #labeling Do more object classes increase or decrease the accuracy o…

Bounty: 100

Assume you have an object detection dataset (e.g., MS COCO or Pascal VOC) with N images in which k object classes have been labeled. You train a neural network (e.g., Faster R-CNN or YOLO) and measure the accuracy (e.g., IoU@0.5).

Now you introduce x additional object classes and add the corresponding labels to your original dataset, giving you a dataset with N images where k+x object classes have been labeled.

Will the accuracy of the trained network increase or decrease?

To be more specific, we have a traffic sign dataset with around 20 object classes. Now we are thinking about adding more traffic sign classes (labeling the new classes without adding new images or changing our network architecture), and we are wondering whether this will increase or decrease performance.

On the one hand, I think more object classes will make the distinction between classes harder. Additionally, a neural network can only hold a limited amount of information, so if the number of classes becomes very large there might not be enough weights to cope with all of them.

On the other hand, more object classes mean more labels, which may help the neural network. Additionally, transfer-learning effects between classes might increase the accuracy of the network.

In my opinion there should be some kind of sweet spot for each network architecture, but I could not find any literature, research or experiments on this topic.



#StackBounty: #machine-learning #correlation #linear-model #canonical-correlation Canonical correlation analysis with a tiny example an…

Bounty: 150

I’ve tried reading many explanations of CCA, and I don’t understand it. For example, on Wikipedia, it refers to two “vectors” $a$ and $b$ such that $\rho = \operatorname{corr}(a^{\top} X, b^{\top} Y)$ is maximal. But if $a$ and $X$ are vectors, isn’t $a^{\top} X$ a scalar? What does it mean for two scalars to be correlated?

Other explanations use matrix notation without any dimensions, e.g.

[image: matrix-notation CCA equations without dimensions]

Can someone explain CCA with a small example and all the dimensions provided?



#StackBounty: #machine-learning #python #boosting #catboost IncToDec Catboost Explained

Bounty: 100

I am struggling to understand how the overfitting detector in CatBoost works:

https://tech.yandex.com/catboost/doc/dg/concepts/overfitting-detector-docpage/#overfitting-detector

I find CatBoost to work well relative to other options, but I would like to understand what is happening. Can anyone explain what the three steps are and why this makes sense?



#StackBounty: #machine-learning #normal-distribution #descriptive-statistics #outliers #extreme-value Decision trees, Gradient boosting…

Bounty: 50

I have a question regarding the normality of predictors. I have 100,000 observations in my data. The problem I am analysing is a classification problem: 5% of the observations (5,000) are assigned to class 1 and 95,000 observations to class 0, so the data is highly imbalanced. However, the class 1 observations are expected to have extreme values.

  • What I have done is trim the top 1% and bottom 1% of the data, removing any possible mistakes in the entry of such data
  • Winsorised the data at the 5% and 95% levels (which I have checked and is an accepted practice when dealing with the kind of data I have).

So: I plot a density plot of one variable after no outlier manipulation:
[figure: density plot, no outlier treatment]

Here is the same variable after trimming the data at the 1% level
[figure: density plot after trimming at the 1% level]

Here is the variable after being trimmed and after being winsorised
[figure: density plot after trimming and winsorising]

My question is how should I approach this problem.

First question: should I just leave the data alone after trimming it, or should I continue to winsorise to condense the extreme values further into more meaningful values (since even after trimming I am still left with what I feel are extreme values)? If I just leave the data after trimming, I am left with long tails in the distribution like the following (however, the observations that I am trying to classify mostly fall at the tail ends of these plots).
[figure: density plots with long tails after trimming]

Second question: since decision trees and gradient-boosted trees decide on splits, does the distribution matter? What I mean is: if the tree splits on a variable at (using the plots above) <= -10, then according to plot 2 (after trimming) and plot 3 (after winsorisation) all firms <= -10 will be classified as class 1.

Consider the decision tree I created below.

[figure: example decision tree]

My argument is: regardless of the spikes in the data (made by winsorisation), the decision tree will make the classification at all observations <= 0. So the distribution of that variable should not matter in making the split? It will only affect the value at which that split occurs? And I do not lose too much predictive power in these tails?
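The argument above can be checked numerically. Here is a small sketch (synthetic data, not the firm data from the question) showing that a tree trained on a winsorised feature makes the same predictions as one trained on the raw feature, because splits depend only on the ordering around the boundary:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
x = rng.standard_normal(500)
y = (x <= 0).astype(int)                  # class 1 iff the variable is <= 0

# Winsorise at the 5%/95% quantiles: a monotone clipping that leaves
# the ordering around the decision boundary untouched.
x_w = np.clip(x, np.quantile(x, 0.05), np.quantile(x, 0.95))

raw = DecisionTreeClassifier(max_depth=1).fit(x.reshape(-1, 1), y)
win = DecisionTreeClassifier(max_depth=1).fit(x_w.reshape(-1, 1), y)

# Both trees recover the same split region and make identical predictions.
same = bool((raw.predict(x.reshape(-1, 1)) == win.predict(x_w.reshape(-1, 1))).all())
```

The exact threshold value may shift slightly (sklearn splits at midpoints between sorted values), but the partition of the observations, and hence the predictions, is unchanged.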



#StackBounty: #machine-learning #time-series #forecasting #cross-validation #lags Cross-validation for timeseries data with regression

Bounty: 50

I am familiar with “regular” cross-validation, but now I want to make time-series predictions while using cross-validation with a simple linear regression function.
Below is a simple example to help clarify my two questions: one about the train/test split, and one about how to train/test models when the aim is to predict n steps in advance.

(1) The data

Assume I have data for timepoints 1,…,10 as follows:

timeseries = [0.5,0.3,10,4,5,6,1,0.4,0.1,0.9]

(2) Transforming the data into a format useful for supervised learning

As far as I understand, we can use “lags”, i.e. shifts in the data to create a dataset suited for supervised learning:

input = [NaN,0.5,0.3,10,4,5,6,1,0.4,0.1]
output/response = [0.5,0.3,10,4,5,6,1,0.4,0.1,0.9]

Here I have simply shifted the time series by one to create the output vector.
As far as I understand, I could now use input as the input for a linear regression model and output as the response (the NaN could be approximated or replaced with a random value).

(3) Question 1: Cross-validation (“backtesting”)

Say I now want to do 2 splits; do I have to shift the train as well as the test sets?

I.e. something like:

Train-set:

Independent variable: [NaN,0.5,0.3,10,4,5]

Output/response variable:[0.5,0.3,10,4,5,6]

Test-set:

Independent variable: [1,0.4,0.1]

Output/response variable:[0.4,0.1,0.9]

(4) Question 2: Predicting different lags in advance

As is obvious, I have shifted the dependent relative to the independent variable by 1. Assuming now that I would like to train a model which can predict 5 time steps in advance: can I keep this lag of one and nevertheless use the model to predict n+1,…,n+5, or do I change the shift from independent to dependent variable to 5? What exactly is the difference?
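The lag construction and the time-ordered splits from the example can be sketched as follows (a minimal illustration using scikit-learn's `TimeSeriesSplit`, which answers Question 1's concern: each training fold strictly precedes its test fold):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

series = np.array([0.5, 0.3, 10, 4, 5, 6, 1, 0.4, 0.1, 0.9])

lag = 1                                   # predict y[t] from y[t-lag]
X = series[:-lag].reshape(-1, 1)          # inputs:  y[0], ..., y[8]
y = series[lag:]                          # targets: y[1], ..., y[9]

tscv = TimeSeriesSplit(n_splits=2)
for train_idx, test_idx in tscv.split(X):
    # every training index precedes every test index, so no future data leaks
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])
```

Dropping the leading NaN by aligning the arrays (rather than imputing it) keeps the example clean; for 5-steps-ahead prediction, `lag = 5` would train the model directly on the 5-step mapping instead of iterating 1-step predictions.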



#StackBounty: #machine-learning #conv-neural-network #loss-functions #information-theory #cross-entropy Cross Entropy vs. Sparse Cross …

Bounty: 50

I am playing with convolutional neural networks using Keras+Tensorflow to classify categorical data. I have a choice of two loss functions: categorical_crossentropy and sparse_categorical_crossentropy.

I have a good intuition about the categorical_crossentropy loss function, which is defined as follows:

$$
J(\textbf{w}) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]
$$

where,

  • $\textbf{w}$ refers to the model parameters, e.g. the weights of the neural network
  • $y_i$ is the true label
  • $\hat{y}_i$ is the predicted label

Both labels use the one-hot encoded scheme.

Questions:

  • How does the above loss function change for sparse_categorical_crossentropy?
  • What is the mathematical intuition behind it?
  • When to use one over the other?
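For reference, the two losses compute the same quantity and differ only in the target format (one-hot vs. integer indices). A small NumPy sketch of my own, not from the Keras source:

```python
import numpy as np

probs = np.array([[0.7, 0.2, 0.1],        # predicted class probabilities, 2 samples
                  [0.1, 0.8, 0.1]])

labels_int = np.array([0, 1])             # integer targets (sparse form)
labels_onehot = np.eye(3)[labels_int]     # the same targets, one-hot encoded

# categorical cross-entropy expects one-hot targets ...
cce = -np.mean(np.sum(labels_onehot * np.log(probs), axis=1))

# ... sparse categorical cross-entropy expects integer targets; same value.
scce = -np.mean(np.log(probs[np.arange(len(labels_int)), labels_int]))
```

Since the one-hot vector zeroes out every term except the true class, both reduce to $-\frac{1}{N}\sum_i \log \hat{y}_{i,c_i}$; the sparse form just skips materializing the one-hot matrix.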

