#StackBounty: #machine-learning #feature-selection #supervised-learning Regarding "modification" of feature columns in superv…

Bounty: 50

I have a training set with columns as follows:
I want to know whether I should treat x3, x4, and x3/x4 as separate features (given that x3/x4 is derived from the other two). I also found that dropping x3 as a feature increases my k-fold neg-loss score (marginally!).

The underlying question is: how would I know whether adding 1/x1, or x1/x2, or even
…(you get the idea!) as another feature column will or won’t increase my score?
Is there any method or algorithm for determining which “modified” features will increase my score?
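One pragmatic way to answer this is a wrapper-style search: treat each candidate derived feature as a hypothesis and compare the cross-validated score with and without it. Below is a minimal sketch assuming scikit-learn; the data and the x1…x4 feature names are synthetic stand-ins, not the question's actual training set.

```python
# Sketch: empirically test whether a derived feature (e.g. x3/x4) helps,
# by comparing cross-validated log-loss with and without it.
# The data below are synthetic stand-ins for the question's x1..x4 columns.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 4))          # columns play the role of x1, x2, x3, x4
X[:, 3] = np.abs(X[:, 3]) + 0.1      # keep "x4" away from zero so the ratio is stable
ratio = (X[:, 2] / X[:, 3]).reshape(-1, 1)
y = (ratio.ravel() + 0.1 * rng.normal(size=n) > 0).astype(int)  # label driven by x3/x4

X_aug = np.hstack([X, ratio])        # base features plus the derived one

model = LogisticRegression(max_iter=1000)
base = cross_val_score(model, X, y, cv=5, scoring="neg_log_loss").mean()
aug = cross_val_score(model, X_aug, y, cv=5, scoring="neg_log_loss").mean()
print(f"neg log-loss without ratio: {base:.3f}, with ratio: {aug:.3f}")
```

If the derived feature carries signal the model cannot express from the raw columns, the augmented score should come out better; running this comparison inside cross-validation (rather than on the training fit) is what guards against keeping features that only help by chance.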

Get this bounty!!!

#StackBounty: #python #machine-learning #tensorflow #neural-network #keras Get audiences insights using Keras and TensorFlow

Bounty: 50

Recently I discovered Keras and TensorFlow and I’m trying to get into ML. I have manually classified training and test data from my users DB like so:

9 features and a label; the features are events in my system, like “user added a profile picture” or “user paid X for a service”, and the label is positive or negative R.O.I. (1 or 0).


I have used the following code to classify the users:

import numpy as np
from keras.layers import Dense
from keras.models import Sequential

train_data = np.loadtxt("train.csv", delimiter=",", skiprows=1)
test_data = np.loadtxt("test.csv", delimiter=",", skiprows=1)

X_train = train_data[:, 0:9]
Y_train = train_data[:, 9]

X_test = test_data[:, 0:9]
Y_test = test_data[:, 9]

model = Sequential()
model.add(Dense(8, input_dim=9, activation='relu'))
model.add(Dense(6, activation='relu'))
model.add(Dense(3, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Fit the model
model.fit(X_train, Y_train, epochs=12000, batch_size=10)

# evaluate the model
scores = model.evaluate(X_test, Y_test)
print("\n\n\nResults: %s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

And I got 89% accuracy. That worked great for labeling a user as a valued customer.

Q: How can I extract the features that contributed to the positive R.O.I., so I can give them more focus in the UX?

Or: What is the approach to finding the best combined segment of audiences?
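One model-agnostic way to get at this is permutation importance: shuffle one feature column in the test set and measure how much the score drops. The sketch below uses a scikit-learn classifier and synthetic data as stand-ins for the trained Keras model (both are illustrative assumptions); the same loop works with any fitted model, e.g. by calling `model.evaluate` on the permuted array in Keras.

```python
# Model-agnostic permutation importance: shuffle one feature at a time and
# measure how much test accuracy drops. A larger drop => the feature mattered more.
# A sklearn classifier stands in for the Keras model; the loop itself is generic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n, d = 400, 9                       # 9 event features, as in the question
X = rng.normal(size=(n, d))
y = (X[:, 0] + 2 * X[:, 3] + 0.5 * rng.normal(size=n) > 0).astype(int)

model = LogisticRegression(max_iter=1000).fit(X[:300], y[:300])
X_test, y_test = X[300:], y[300:]
base_acc = model.score(X_test, y_test)

drops = []
for j in range(d):
    X_perm = X_test.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])   # break feature j's relationship to y
    drops.append(base_acc - model.score(X_perm, y_test))

ranking = np.argsort(drops)[::-1]
print("features by importance (column index):", ranking)
```

The features whose permutation hurts accuracy most are the ones the model relies on; in this synthetic setup that should be the two columns that actually drive the label.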


#StackBounty: #machine-learning #hypothesis-testing #statistical-significance #anova #repeated-measures How to Compare Two Algorithms w…

Bounty: 50

I have two computational methods (A and B), each with random behavior, i.e., if you run the same method 10 times, you get 10 different results (usually with a small variance). To compare the two methods, we selected 5 different databases (it’s hard to get more) and ran method A and method B 10 times each on each of the five databases. This resulted in a 10x5 matrix of measurements (a row for each run and a column for each database) for each method. All measurements are paired between the two methods, because we can control the seed for each run and the database can be reused for both methods, i.e., $\text{run}_i$ on $\text{database}_j$ uses the same $\text{seed}_i$ for both methods.

Example (the values in the tables are the accuracies of the methods):

Method A

| Run/Database|   1    |   2    |   3    |   4    |   5    |
|           1 | 88.92% | 44.60% | 69.49% | 73.37% | 85.63% |
|           2 | 89.00% | 42.72% | 64.10% | 71.94% | 85.92% |
|           3 | 88.35% | 45.07% | 65.13% | 72.14% | 85.78% |
|           4 | 88.92% | 43.66% | 67.95% | 72.76% | 85.28% |
|           5 | 87.94% | 50.23% | 67.18% | 71.94% | 85.92% |
|           6 | 87.78% | 43.19% | 68.72% | 73.47% | 86.27% |
|           7 | 89.08% | 45.54% | 66.41% | 71.33% | 85.56% |
|           8 | 88.83% | 42.72% | 66.15% | 72.45% | 86.77% |
|           9 | 88.43% | 45.07% | 68.97% | 72.45% | 86.49% |
|          10 | 88.59% | 40.38% | 66.15% | 73.67% | 86.13% |

Method B

| Run/Database|   1    |   2    |   3    |   4    |   5    |
|           1 | 22.73% | 53.99% | 59.74% | 65.20% | 79.59% |
|           2 | 75.97% | 46.95% | 58.46% | 71.63% | 84.42% |
|           3 | 76.94% | 53.05% | 58.97% | 68.37% | 85.06% |
|           4 | 76.54% | 42.25% | 46.67% | 68.67% | 85.92% |
|           5 | 46.60% | 52.11% | 52.82% | 68.98% | 85.14% |
|           6 | 76.78% | 48.83% | 55.90% | 68.27% | 78.38% |
|           7 | 79.37% | 47.89% | 58.72% | 71.12% | 85.06% |
|           8 | 77.83% | 54.93% | 50.77% | 72.14% | 87.06% |
|           9 | 83.01% | 46.95% | 56.15% | 67.96% | 84.92% |
|          10 | 78.24% | 49.30% | 58.21% | 67.96% | 81.29% |

Which statistical method should I use to find out which method is best in terms of overall performance? Or to find out whether method A is statistically different from method B in terms of average accuracy?

I investigated Student’s t-test and one- and two-way repeated-measures ANOVA, but they didn’t seem appropriate for this analysis. Any suggestion of a valid statistical analysis is appreciated.
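One analysis often suggested for comparing two classifiers across multiple datasets (cf. Demšar's guidelines) is to reduce each database to its mean accuracy per method and run a Wilcoxon signed-rank test on the paired means. A sketch assuming SciPy; the numbers are illustrative placeholders in the spirit of the tables above, not exact values computed from them.

```python
# Reduce each database to one number per method (mean accuracy over the 10 runs),
# then compare the two methods with a paired, non-parametric test.
# The values below are illustrative placeholders, not the exact tables above.
from scipy.stats import wilcoxon

mean_acc_A = [88.6, 44.3, 67.0, 72.6, 86.0]   # per-database mean accuracy, method A
mean_acc_B = [69.4, 49.6, 55.6, 69.0, 83.7]   # per-database mean accuracy, method B

stat, p = wilcoxon(mean_acc_A, mean_acc_B)    # paired signed-rank test over databases
print(f"Wilcoxon statistic={stat}, p-value={p:.4f}")
```

A caveat: with only five databases the test has very little power (the smallest achievable two-sided p-value is 1/16), which is one reason the run-level pairing by seed is tempting, even though runs within a database are not independent replicates of "a new problem".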


#StackBounty: #machine-learning #cross-validation #performance LFW face pair-matching performance evaluation, why retrain model on view2?

Bounty: 50

I am trying to understand how performance evaluation works on the LFW (Labeled Faces in the Wild) dataset: http://vis-www.cs.umass.edu/lfw/.

I am interested in the pair-matching task. However, as I dug deeper, I found myself confused.

Here is a brief summary on evaluating pair-matching performance in LFW dataset:

  1. The LFW dataset is divided into View1 and View2. View1 is for development of algorithms: you can use it to select models, tune parameters, and choose features. View2 is for reporting the accuracy of the model produced with View1.

  2. View1 description:

    For development purposes, we recommend using the below training/testing split, which was generated randomly and independently of the splits for 10-fold cross validation, to avoid unfairly overfitting to the sets above during development. For instance, these sets may be viewed as a model selection set and a validation set. See the tech report below for more details.

    pairsDevTrain.txt, pairsDevTest.txt

  3. View2 description:

    As a benchmark for comparison, we suggest reporting performance as 10-fold cross validation using splits we have randomly generated.

I also found an example of carrying out the experiment with PCA for face pair-matching in the LFW 2008 paper.

Eigenfaces for pair matching. We computed eigenvectors from the training set of View 1 and determined the threshold value for classifying pairs as matched or mismatched that gave the best performance on the test set of View 1. For each run of View 2, the training set was used to compute the eigenvectors, and pairs were classified using the threshold on Euclidean distance from View 1.

State of the art pair matching. To determine the current best performance on pair matching, we ran an implementation of the current state of the art recognition system of Nowak and Jurie [14]. The Nowak algorithm gives a similarity score to each pair, and View 1 was used to determine the threshold value for classifying pairs as matched or mismatched. For each of the 10 folds of View 2 of the database, we trained on 9 of the sets, computed similarity measures for the held-out test set, and classified pairs using the threshold.

My questions are:

  1. How do I train with the View1 data using 10-fold cross validation?

    The data are already split into pairsDevTrain.txt and pairsDevTest.txt. Does that mean I need to merge these two files and then do a standard 10-fold cross validation to train my model?

  2. Why is 10-fold cross validation required in View2?

    Since the model and parameters are all determined using the View1 data, why not just use all of the View2 data to report performance?

  3. Since 10-fold cross validation is required in View2, there must be a training process. Why retrain another model?

    It is worth mentioning that in both View1 and View2 the train and test data don’t share identities, i.e., a person who appears in train will not appear in test.

  4. 10-fold cross validation is recommended for both View1 and View2, but 10-fold splits are given only for View2, not View1. Is there a reason why?
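The View2 protocol quoted above (train on 9 of the 10 given splits, test on the held-out one, report the mean over folds) can be sketched as follows; the `train_fn` callable and the pair arrays are hypothetical placeholders for an actual pair-matching pipeline, not part of LFW's tooling.

```python
# Sketch of the View2 evaluation loop: for each of the 10 predefined folds,
# retrain on the other 9 and report mean accuracy (and spread) over held-out folds.
import numpy as np

def evaluate_view2(pairs, labels, folds, train_fn):
    """pairs/labels: all View2 pairs; folds: list of 10 index arrays (the given splits);
    train_fn(train_pairs, train_labels) -> callable returning predicted labels."""
    accs = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        scorer = train_fn(pairs[train_idx], labels[train_idx])   # retrain per fold
        preds = scorer(pairs[test_idx])
        accs.append(np.mean(preds == labels[test_idx]))
    return np.mean(accs), np.std(accs)   # LFW asks for mean accuracy with std over folds

# Toy check: 20 dummy pairs, 10 folds of 2, a "model" that always predicts 1
pairs = np.zeros((20, 2))
labels = np.ones(20)
folds = [np.arange(2 * i, 2 * i + 2) for i in range(10)]
always_one = lambda train_pairs, train_labels: (lambda p: np.ones(len(p)))
mean_acc, std_acc = evaluate_view2(pairs, labels, folds, always_one)
print(mean_acc, std_acc)
```

Note that in this protocol the model is retrained inside each fold; only fixed choices made on View1 (architecture, hyperparameters, thresholds) are carried over, which is exactly what questions 2 and 3 are probing.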

Thank you beforehand for helping me understand the performance evaluation for LFW.


#StackBounty: #machine-learning #ensemble Combining predictions on different inputs

Bounty: 50

I’m building a classifier that predicts the class of an image on my custom dataset. This task is identical to the ImageNet ILSVRC, except for the dataset and classes.

The dataset has an additional property: multiple images are known to originate from the same class. These subsets of the dataset vary in size from 1 to 50.

I have approached the problem by not including this information at training time (due to the varying sizes of my subsets). I used the pre-trained Inception-v3 network, retraining it to fit my own dataset.

However, I would like to combine the predictions from the different images in the same subset. This is similar to ensemble methods, but instead of testing the same input on different classifiers, I would like to test different inputs on the same classifier before combining the results.
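For what it's worth, the most direct analogue of ensemble soft voting here is to average the per-image probability vectors within a subset and take the argmax. A minimal sketch with made-up softmax outputs:

```python
# Combine per-image predictions within a subset by averaging class probabilities
# (soft voting), then taking the argmax. This mirrors ensemble averaging, but
# over inputs to one classifier rather than over classifiers for one input.
import numpy as np

def combine_subset_predictions(probs):
    """probs: (n_images, n_classes) array of softmax outputs for one subset."""
    avg = probs.mean(axis=0)            # average probabilities across the images
    return int(np.argmax(avg)), avg

# Example: three images from the same subset, three classes (made-up numbers)
probs = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3],
                  [0.7, 0.2, 0.1]])
label, avg = combine_subset_predictions(probs)
print(label, avg)
```

A common variant is averaging log-probabilities (a geometric mean of the probabilities), which is less sensitive to a single overconfident image in the subset.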

I have failed to find any relevant literature. Does anyone know of papers that describe this exact problem? Would it be correct to approach it in the same way as ensemble methods?

I know it would probably be better to use this information at training time as well, but I have no idea how I would approach that.


#StackBounty: #machine-learning #neural-networks #theano How are convolutional layers connected in Theano?

Bounty: 50

How are feature maps connected between two layers in Theano/Caffe/TensorFlow?

For instance, if we have 32 feature maps in conv layer 1 and 64 feature maps in conv layer 2, with 64 kernels, how does the implementation connect the two layers? Is it fully connected? And if so, does it average across all inputs?
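For reference, in the standard implementations the layers are fully connected across channels: each of the 64 kernels spans all 32 input maps, and the 32 per-channel responses are summed (not averaged). A numpy sketch of one output position makes the shapes explicit:

```python
# Each output map's kernel covers ALL input maps; the per-channel convolution
# results are summed into one output value per position. Shapes for a 3x3 kernel:
import numpy as np

in_ch, out_ch, k = 32, 64, 3
x_patch = np.random.randn(k, k, in_ch)            # one 3x3 receptive field across all 32 input maps
kernels = np.random.randn(k, k, in_ch, out_ch)    # the weight layout used by TensorFlow/Keras
bias = np.zeros(out_ch)

# Output value of each of the 64 maps at this position:
# sum over kernel height, kernel width, AND input channel.
out = np.tensordot(x_patch, kernels, axes=([0, 1, 2], [0, 1, 2])) + bias
print(out.shape)       # one value per output map
print(kernels.size)    # total weights: 3 * 3 * 32 * 64
```

So the "connection" is encoded entirely in the kernel tensor's third dimension; there is no averaging, and no per-map pairing unless you explicitly use grouped or depthwise convolutions.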


#StackBounty: #regression #machine-learning #variance #cross-validation #predictive-models Does $K$-fold CV with $K=N$ (LOO) provide th…

Bounty: 50

TL,DR: It appears that, contrary to oft-repeated advice, leave-one-out cross validation (LOO-CV) — that is, $K$-fold CV with $K$ (the number of folds) equal to $N$ (the number of training observations) — yields estimates of the generalization error that are the least variable for any $K$, not the most variable, assuming a certain stability condition on either the model/algorithm, the dataset, or both (I’m not sure which is correct as I don’t really understand this stability condition).

  • Can someone clearly explain what exactly this stability condition is?
  • Is it true that linear regression is one such “stable” algorithm, implying that in that context, LOO-CV is strictly the best choice of CV as far as bias and variance of the estimates of generalization error are concerned?

The conventional wisdom is that the choice of $K$ in $K$-fold CV follows a bias-variance tradeoff, such that lower values of $K$ (approaching 2) lead to estimates of the generalization error that have a more pessimistic bias but lower variance, while higher values of $K$ (approaching $N$) lead to estimates that are less biased but have greater variance. The conventional explanation for this phenomenon of variance increasing with $K$ is given perhaps most prominently in The Elements of Statistical Learning (Section 7.10.1):

With K=N, the cross-validation estimator is approximately unbiased for the true (expected) prediction error, but can have high variance because the N “training sets” are so similar to one another.

The implication being that the $N$ validation errors are more highly correlated so that their sum is more variable. This line of reasoning has been repeated in many answers on this site (e.g., here, here, here, here, here, here, and here) as well as on various blogs and etc. But a detailed analysis is virtually never given, instead only an intuition or brief sketch of what an analysis might look like.

One can however find contradictory statements, usually citing a certain “stability” condition that I don’t really understand. For example, this contradictory answer quotes a couple paragraphs from a 2015 paper which says, among other things, “For models/modeling procedures with low instability, LOO often has the smallest variability” (emphasis added). This paper (section 5.2) seems to agree that LOO represents the least variable choice of $K$ as long as the model/algorithm is “stable.” Taking even another stance on the issue, there is also this paper (Corollary 2), which says “The variance of $k$ fold cross validation […] does not depend on $k$,” again citing a certain “stability” condition.

The explanation about why LOO might be the most variable $K$-fold CV is intuitive enough, but there is a counter-intuition. The final CV estimate of the mean squared error (MSE) is the mean of the MSE estimates in each fold. So as $K$ increases up to $N$, the CV estimate is the mean of an increasing number of random variables. And we know that the variance of a mean decreases with the number of variables being averaged over. So in order for LOO to be the most variable $K$-fold CV, it would have to be true that the increase in variance due to the increased correlation among the MSE estimates outweighs the decrease in variance due to the greater number of folds being averaged over. And it is not at all obvious that this is true.

Having become thoroughly confused thinking about all this, I decided to run a little simulation for the linear regression case. I simulated 10,000 datasets with $N$=50 and 3 uncorrelated predictors, each time estimating the generalization error using $K$-fold CV with $K$=2, 5, 10, or 50=$N$. The R code is here. Here are the resulting means and variances of the CV estimates across all 10,000 datasets (in MSE units):

|          | k = 2 | k = 5 | k = 10 | k = n = 50 |
| mean     | 1.187 | 1.108 |  1.094 |      1.087 |
| variance | 0.094 | 0.058 |  0.053 |      0.051 |

These results show the expected pattern that higher values of $K$ lead to a less pessimistic bias, but also appear to confirm that the variance of the CV estimates is lowest, not highest, in the LOO case.
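For readers who prefer Python to R, here is a much smaller-scale analogue of that simulation (50 replications instead of 10,000, with scikit-learn assumed); it can be scaled up to reproduce the pattern in the table above.

```python
# For each K, repeatedly: draw a fresh dataset, estimate generalization MSE of
# linear regression by K-fold CV, and record the estimate. Then compare the
# mean (bias) and variance of the estimates across choices of K.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n, p, reps = 50, 3, 50            # far fewer replications than the question's 10,000

results = {}
for k in (2, 5, 10, 50):          # k = 50 = n is LOO
    ests = []
    for _ in range(reps):
        X = rng.normal(size=(n, p))
        y = X @ np.ones(p) + rng.normal(size=n)    # true noise variance 1
        scores = cross_val_score(LinearRegression(), X, y, cv=k,
                                 scoring="neg_mean_squared_error")
        ests.append(-scores.mean())                # CV estimate of the MSE
    results[k] = (np.mean(ests), np.var(ests))

for k, (m, v) in results.items():
    print(f"K={k:2d}: mean={m:.3f}, variance={v:.3f}")
```

With only 50 replications the variance estimates are noisy, so the ordering across K may fluctuate from run to run; increasing `reps` tightens them toward the pattern reported above.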

So it appears that linear regression is one of the “stable” cases mentioned in the papers above, where increasing $K$ is associated with decreasing rather than increasing variance in the CV estimates. But what I still don’t understand is:

  • What precisely is this “stability” condition? Does it apply to models/algorithms, datasets, or both to some extent?
  • Is there an intuitive way to think about this stability?
  • What are other examples of stable and unstable models/algorithms or datasets?
  • Is it relatively safe to assume that most models/algorithms or datasets are “stable” and therefore that $K$ should generally be chosen as high as is computationally feasible?


#StackBounty: #machine-learning #python #ranking From pairwise comparisons to ranking – python

Bounty: 50

I have to solve an ML ranking problem. To start with, I successfully applied the pointwise ranking approach.

Now I’m playing around with pairwise ranking algorithms. I’ve created the pairwise probabilities (i.e., the probability of item i being ranked above item j), but I’m not sure how I can transform these into a ranking.

For the historical data (let’s assume these are queries), I have their pairwise probabilities AND the actual ranking (the ideal one). I want a solution that will provide a ranking for a new query as well (i.e., the ideal ranking is what I’m looking for here).
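As a simple baseline while looking for a package, pairwise probabilities can be aggregated into a ranking with a Borda-count style score: sum each item's probabilities of beating every other item, then sort. The matrix below is a hypothetical example, not data from the question.

```python
# Turn a pairwise-probability matrix into a ranking by scoring each item with
# its expected number of "wins" against all other items, then sorting.
import numpy as np

def rank_from_pairwise(P):
    """P[i, j] = estimated probability that item i ranks above item j."""
    P = np.asarray(P, dtype=float)
    np.fill_diagonal(P, 0.0)          # ignore self-comparisons
    scores = P.sum(axis=1)            # expected wins for each item
    return np.argsort(-scores)        # indices of items, best first

# Hypothetical 3-item example
P = np.array([[0.0, 0.9, 0.8],
              [0.1, 0.0, 0.6],
              [0.2, 0.4, 0.0]])
print(rank_from_pairwise(P))
```

For a more principled aggregation, fitting a Bradley-Terry style model to the pairwise preferences is a common next step (the `choix` package, for instance, implements several such models).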

Any python package that has, at least partially, the functionality I’m looking for?


#StackBounty: #machine-learning Comparison between Helmholtz machines and Boltzmann machines

Bounty: 50

Today I started reading about Helmholtz machines. So far they seem very similar to – though clearly not the same as – Boltzmann machines, and I feel that my learning process would be much easier if I clearly understood what the key differences were. I come from a statistical physics background and understand Boltzmann machines very well (I’ve developed several of my own variations on the Boltzmann machine concept for various purposes), so I’m really looking for a brief explanation of the basic idea behind Helmholtz machines, assuming prior knowledge of Boltzmann machines and stat mech, but not necessarily much knowledge about belief nets or other types of neural network. (Though I do understand the difference between directed and undirected models, which seems like it should be relevant.)

To be specific, I suppose my questions are: How do Helmholtz machines and Boltzmann machines relate to each other? Is one a special case of the other, or are they just different; if the latter, what is the key difference in the assumptions they’re built on? Is the difference to do with the difference between directed and undirected models, and if so, how exactly does that difference translate into the two different architectures?


#StackBounty: #machine-learning #matlab #distance #distance-functions #metric Distance Metric Learning not returning Positive Matrix

Bounty: 50

I’m using the MATLAB code released by Eric P. Xing, related to their NIPS 2002 paper (pdf): “Distance metric learning, with application to clustering with side-information. Eric P. Xing, Andrew Y. Ng, Michael I. Jordan and Stuart Russell”.

The code is available for download (.tar.gz) at this webpage.

When using the Newton-Raphson method (the Newton.m file), it is supposed to return a diagonal matrix, and when using the projections method (the opt_sphere.m file), it is supposed to return a full matrix with entries greater than or equal to zero. Please see the paper for more on this.

However, when I try this on a sample dataset (the Iris dataset), I sometimes get a matrix with negative entries when using the latter method. Similarly, with the former method, I sometimes get a matrix with two zero diagonal entries (which results in the transformed features collapsing to a point).

Has anyone else experienced this before? Do you know what I am doing wrong?

As an example, consider the following code snippet (I have extracted the MATLAB code into the directory “code_metric_online”; these pairs of rows have the same labels and hence are similar: 30th and 42nd, 78th and 83rd, 9th and 49th; these pairs of rows have different labels and hence are dissimilar: 23rd and 61st, 96th and 150th, 45th and 80th):

load fisheriris;

[N,d] = size(meas);
data = meas;
S = sparse(N, N);
D = sparse(N, N);

%S(9,49) = 1;
S(30,42) = 1;
S(78,83) = 1;

%D(45,80) = 1;
D(23,61) = 1;
D(96,150) = 1;

A = Newton(meas, S, D, 1);

%A = opt_sphere(meas, S, D, 100);

transformed_data = data * (A^(1/2))';  % matrix square root of A (A^1/2 would compute (A^1)/2)
scatter(transformed_data(:, 1), transformed_data(:, 2));

The resulting matrix A in the above example will have two diagonal entries equal to zero, resulting in the plot being a single point. Similarly, if you comment out the Newton method and use opt_sphere instead, you will get a matrix A with negative elements.

If, however, you add two new constraints (by un-commenting S(9,49) = 1; and D(45,80) = 1;), then the plot will be a straight line.

I cannot understand this strange behavior, while the paper clearly states that $A succeq 0$, i.e., A is positive semidefinite.
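For anyone sanity-checking the output outside MATLAB: positive semidefiniteness means all eigenvalues are non-negative, and the intended transform uses the matrix square root, which in MATLAB must be written A^(1/2) or sqrtm(A) (plain A^1/2 parses as (A^1)/2). A numpy/scipy sketch, with a stand-in diagonal matrix in place of a learned metric:

```python
# Check that a learned metric A is positive semidefinite and apply the
# Mahalanobis transform x -> A^(1/2) x, under which the learned distance
# becomes plain Euclidean distance. A here is a stand-in, not learned.
import numpy as np
from scipy.linalg import sqrtm

A = np.diag([2.0, 1.0, 0.5, 0.25])         # stand-in for a learned diagonal metric

eigvals = np.linalg.eigvalsh(A)
is_psd = bool(np.all(eigvals >= -1e-10))   # tolerate tiny numerical negatives
print("PSD:", is_psd)

L = np.real(sqrtm(A))                      # A^(1/2); real part guards numerical noise
X = np.random.randn(5, 4)                  # stand-in data (5 points, 4 features)
X_transformed = X @ L.T                    # transformed features for plotting
```

If the eigenvalue check fails for the matrix returned by opt_sphere, the optimizer has genuinely left the PSD cone, which narrows the bug down to the projection step rather than to the plotting code.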
