#StackBounty: #machine-learning #neural-networks #theano How are convolutional layers connected in Theano?

Bounty: 50

How are feature maps connected between two layers in Theano/Caffe/TensorFlow?

For instance, if we have 32 feature maps in Conv Layer 1 and 64 feature maps in Conv Layer 2 (with 64 kernels), how does the implementation connect the two layers? Is it fully connected across feature maps? And if so, does it average across all input maps?
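To make the setup concrete, here is a minimal numpy/scipy sketch of the shapes I have in mind (the (output maps, input maps, kernel height, kernel width) weight layout and the summation over input maps are assumptions; whether the frameworks actually do this is exactly what I am asking):

import numpy as np
from scipy.signal import convolve2d

in_maps, out_maps, kh, kw = 32, 64, 3, 3
x = np.random.randn(in_maps, 16, 16)            # 32 input feature maps of size 16x16
W = np.random.randn(out_maps, in_maps, kh, kw)  # one 3x3 kernel per (output map, input map) pair

# "Fully connected across maps" reading: each output map combines contributions
# from all 32 input maps (whether by sum, average, or something else is the question)
out = np.stack([
    sum(convolve2d(x[i], W[j, i], mode="valid") for i in range(in_maps))
    for j in range(out_maps)
])
print(out.shape)  # (64, 14, 14)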


Get this bounty!!!

#StackBounty: #regression #machine-learning #variance #cross-validation #predictive-models Does $K$-fold CV with $K=N$ (LOO) provide th…

Bounty: 50

TL;DR: It appears that, contrary to oft-repeated advice, leave-one-out cross validation (LOO-CV) — that is, $K$-fold CV with $K$ (the number of folds) equal to $N$ (the number of training observations) — yields estimates of the generalization error with the lowest variance across choices of $K$, not the highest, assuming a certain stability condition on either the model/algorithm, the dataset, or both (I’m not sure which is correct, as I don’t really understand this stability condition).

  • Can someone clearly explain what exactly this stability condition is?
  • Is it true that linear regression is one such “stable” algorithm, implying that in that context, LOO-CV is strictly the best choice of CV as far as bias and variance of the estimates of generalization error are concerned?

The conventional wisdom is that the choice of $K$ in $K$-fold CV follows a bias-variance tradeoff, such that lower values of $K$ (approaching 2) lead to estimates of the generalization error that have more pessimistic bias but lower variance, while higher values of $K$ (approaching $N$) lead to estimates that are less biased but have greater variance. The conventional explanation for this phenomenon of variance increasing with $K$ is given perhaps most prominently in The Elements of Statistical Learning (Section 7.10.1):

With K=N, the cross-validation estimator is approximately unbiased for the true (expected) prediction error, but can have high variance because the N “training sets” are so similar to one another.

The implication is that the $N$ validation errors are more highly correlated, so that their sum is more variable. This line of reasoning has been repeated in many answers on this site (e.g., here, here, here, here, here, here, and here) as well as on various blogs, etc. But a detailed analysis is virtually never given; instead, only an intuition or brief sketch of what an analysis might look like.

One can however find contradictory statements, usually citing a certain “stability” condition that I don’t really understand. For example, this contradictory answer quotes a couple of paragraphs from a 2015 paper which says, among other things, “For models/modeling procedures with low instability, LOO often has the smallest variability” (emphasis added). This paper (section 5.2) seems to agree that LOO represents the least variable choice of $K$ as long as the model/algorithm is “stable.” Taking yet another stance on the issue, there is also this paper (Corollary 2), which says “The variance of $k$ fold cross validation […] does not depend on $k$,” again citing a certain “stability” condition.

The explanation about why LOO might be the most variable $K$-fold CV is intuitive enough, but there is a counter-intuition. The final CV estimate of the mean squared error (MSE) is the mean of the MSE estimates in each fold. So as $K$ increases up to $N$, the CV estimate is the mean of an increasing number of random variables. And we know that the variance of a mean decreases with the number of variables being averaged over. So in order for LOO to be the most variable $K$-fold CV, it would have to be true that the increase in variance due to the increased correlation among the MSE estimates outweighs the decrease in variance due to the greater number of folds being averaged over. And it is not at all obvious that this is true.
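To put a formula behind this counter-intuition: if the $K$ fold-level MSE estimates were identically distributed with common variance $\sigma^2$ and common pairwise correlation $\rho$ (an idealization, since both quantities themselves change with $K$), then

$$\operatorname{Var}\left(\frac{1}{K}\sum_{k=1}^{K} \widehat{\mathrm{MSE}}_k\right) = \frac{\sigma^2}{K} + \frac{K-1}{K}\,\rho\,\sigma^2,$$

so whether LOO ends up more or less variable comes down to whether the growth of $\rho$ (and of $\sigma^2$ for the tiny folds) outpaces the $1/K$ shrinkage from averaging.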

Having become thoroughly confused thinking about all this, I decided to run a little simulation for the linear regression case. I simulated 10,000 datasets with $N$=50 and 3 uncorrelated predictors, each time estimating the generalization error using $K$-fold CV with $K$=2, 5, 10, or 50=$N$. The R code is here. Here are the resulting means and variances of the CV estimates across all 10,000 datasets (in MSE units):

         k = 2 k = 5 k = 10 k = n = 50
mean     1.187 1.108  1.094      1.087
variance 0.094 0.058  0.053      0.051

These results show the expected pattern that higher values of $K$ lead to a less pessimistic bias, but also appear to confirm that the variance of the CV estimates is lowest, not highest, in the LOO case.
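For reference, a rough Python analogue of this simulation (not the linked R script, and with fewer repetitions so it runs more quickly) would be:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n, p, n_datasets = 50, 3, 1000          # fewer datasets than the 10,000 used above

results = {k: [] for k in (2, 5, 10, 50)}
for _ in range(n_datasets):
    # Simulate a dataset: 3 uncorrelated predictors, unit-coefficient linear model plus noise
    X = rng.normal(size=(n, p))
    y = X @ np.ones(p) + rng.normal(size=n)
    for k in results:
        mse = -cross_val_score(LinearRegression(), X, y,
                               scoring="neg_mean_squared_error",
                               cv=KFold(n_splits=k, shuffle=True, random_state=0)).mean()
        results[k].append(mse)

for k, est in results.items():
    est = np.array(est)
    print(f"k = {k:2d}: mean = {est.mean():.3f}, variance = {est.var():.3f}")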

So it appears that linear regression is one of the “stable” cases mentioned in the papers above, where increasing $K$ is associated with decreasing rather than increasing variance in the CV estimates. But what I still don’t understand is:

  • What precisely is this “stability” condition? Does it apply to models/algorithms, datasets, or both to some extent?
  • Is there an intuitive way to think about this stability?
  • What are other examples of stable and unstable models/algorithms or datasets?
  • Is it relatively safe to assume that most models/algorithms or datasets are “stable” and therefore that $K$ should generally be chosen as high as is computationally feasible?


Get this bounty!!!

#StackBounty: #machine-learning #python #ranking From pairwise comparisons to ranking – python

Bounty: 50

I have to solve a ranking ML problem. To start with, I have successfully applied the pointwise ranking approach.

Now, I'm playing around with pairwise ranking algorithms. I've created the pairwise probabilities (i.e. the probability of item i being ranked above item j), but I'm not sure how I can transform these into rankings.

For the historical data (let's assume these are queries), I have both their pairwise probabilities AND the actual (ideal) ranking. I want a solution that will provide a ranking for a new query as well (i.e. the ideal ranking is what I'm looking for here).

Is there a Python package that has, at least partially, the functionality I'm looking for?
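For concreteness, the most naive transformation I can think of is to score each item by its summed pairwise win probabilities (Borda-count style) and sort, as in the numpy sketch below; I'm hoping a package offers something more principled than this:

import numpy as np

# Toy pairwise-probability matrix: P[i, j] = probability that item i is ranked above item j
rng = np.random.default_rng(0)
n_items = 5
upper = np.triu(rng.uniform(size=(n_items, n_items)), 1)
P = upper + np.tril(1 - upper.T, -1)      # enforce P[i, j] + P[j, i] = 1
np.fill_diagonal(P, 0.5)

scores = P.sum(axis=1)                    # expected number of pairwise "wins" per item
ranking = np.argsort(-scores)             # item indices, best first
print(ranking)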


Get this bounty!!!

#StackBounty: #machine-learning Comparison between Helmholtz machines and Boltzmann machines

Bounty: 50

Today I started reading about Helmholtz machines. So far they seem very similar to – though clearly not the same as – Boltzmann machines, and I feel that my learning process would be much easier if I clearly understood what the key differences were. I come from a statistical physics background and understand Boltzmann machines very well (I’ve developed several of my own variations on the Boltzmann machine concept for various purposes), so I’m really looking for a brief explanation of the basic idea behind Helmholtz machines, assuming prior knowledge of Boltzmann machines and stat mech, but not necessarily much knowledge about belief nets or other types of neural network. (Though I do understand the difference between directed and undirected models, which seems like it should be relevant.)

To be specific, I suppose my questions are: How do Helmholtz machines and Boltzmann machines relate to each other? Is one a special case of the other, or are they just different; if the latter, what is the key difference in the assumptions they’re built on? Is the difference to do with the difference between directed and undirected models, and if so, how exactly does that difference translate into the two different architectures?


Get this bounty!!!

#StackBounty: #machine-learning #matlab #distance #distance-functions #metric Distance Metric Learning not returning Positive Matrix

Bounty: 50

I’m using the MATLAB code released by Eric P. Xing for the NIPS 2002 paper (pdf) “Distance metric learning, with application to clustering with side-information” by Eric P. Xing, Andrew Y. Ng, Michael I. Jordan and Stuart Russell.

The code is available for download (.tar.gz) at this webpage.

When using the Newton-Raphson method (Newton.m file), it is supposed to return a diagonal matrix, and when using the projections method (opt_sphere.m file), it is supposed to return a full matrix with entries greater than or equal to zero. Please see the paper for more on this.

However, when I try this on a sample dataset (the Iris dataset), I sometimes get a matrix with negative entries when using the latter method. Similarly, with the former method, I sometimes get a matrix with two zero diagonal entries (which results in the transformed features collapsing to a point).

Has anyone else experienced this before? Do you know what I am doing wrong?

As an example, consider the following code snippet (I have extracted the matlab code into the directory “code_metric_online”; these pairs of rows have the same labels, hence are similar: 30th and 42nd, 78th and 83rd, 9th and 49th; these pairs of rows have different labels, hence are dissimilar: 23rd and 61st, 96th and 150th, 45th and 80th):

addpath('code_metric_online/');
clear;
load fisheriris;

[N,d] = size(meas);
data = meas;
S = sparse(N, N); % similarity constraints (pairs of rows with the same label)
D = sparse(N, N); % dissimilarity constraints (pairs of rows with different labels)

%S(9,49) = 1;
S(30,42) = 1;
S(78,83) = 1;

%D(45,80) = 1;
D(23,61) = 1;
D(96,150) = 1;

A = Newton(meas, S, D, 1);
A

%A = opt_sphere(meas, S, D, 100);
%A

transformed_data = data * (A^(1/2))'; % matrix square root of A; note A^1/2 would compute (A^1)/2
figure;
scatter(transformed_data(:, 1), transformed_data(:, 2));

The resulting matrix A in the above example will have two diagonal entries equal to zero, resulting in the plot being a single point. Similarly, if you comment out the Newton method and use opt_sphere instead, you will get a matrix A with negative elements.

If, however, you add two new constraints (by un-commenting S(9,49) = 1; and D(45,80) = 1;), then the plot will be a straight line.

I cannot understand this strange behavior, since the paper clearly says that A should be greater than or equal to zero.
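A sanity check one can run on the returned A (sketched here in numpy rather than MATLAB, and assuming, as far as I understand, that "greater than or equal to zero" in the paper means positive semidefinite rather than elementwise nonnegative) is to inspect its eigenvalues and form the matrix square root explicitly:

import numpy as np

A = np.array([[2.0, 0.5], [0.5, 1.0]])    # stand-in for the learned metric
eigvals, eigvecs = np.linalg.eigh(A)
print("eigenvalues:", eigvals)             # any negative eigenvalue violates A >= 0 (PSD)

# Matrix square root via the eigendecomposition (clipping tiny negative eigenvalues)
A_sqrt = eigvecs @ np.diag(np.sqrt(np.clip(eigvals, 0, None))) @ eigvecs.T
# transformed_data = data @ A_sqrt         # the analogue of data * (A^(1/2))' in MATLAB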


Get this bounty!!!

#StackBounty: #machine-learning #fitting #image-processing #parameter-optimization #segmentation Machine learning to find an optimal se…

Bounty: 50

Using machine learning to find an optimal set of parameters for a given segmentation algorithm.

In the “classical” case of machine learning, the data set is fixed during the training phase, and the model fit produces a weight vector that maps each image in the data set to its label.

Now let’s assume a given segmentation problem X, which is solved using a given classic segmentation algorithm Y (classic, not deep learning). The goal is to find an optimal parameter set for algorithm Y given a set of ground-truth segmentations. (Motivation: every segmentation algorithm has parameters that need tuning; we want to learn them rather than fine-tune them by hand.)

I am thinking about two approaches to this:

  1. Offline – extract general image properties (say, Haralick texture features) and try to fit a model connecting the parameters of segmentation algorithm Y to those texture features.
  2. Online – select random parameters for segmentation algorithm Y, perform a segmentation with those parameters plus a specific delta, calculate the error, and then update the parameters accordingly (sketched below).

Any example/reference(paper) for the “online” approach would be welcome.
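To make the “online” idea concrete, the loop I have in mind looks roughly like the sketch below (segment() and segmentation_error() are placeholders for algorithm Y and a ground-truth comparison such as IoU, not real library calls):

import numpy as np

def segmentation_error(predicted_mask, ground_truth_mask):
    """Placeholder error: 1 - IoU between two boolean masks."""
    inter = np.logical_and(predicted_mask, ground_truth_mask).sum()
    union = np.logical_or(predicted_mask, ground_truth_mask).sum()
    return 1.0 - inter / max(union, 1)

def tune_online(segment, images, ground_truths, params, delta=0.1, n_iter=200, rng=None):
    """Random-perturbation hill climbing over the parameter vector of segment()."""
    rng = rng or np.random.default_rng(0)
    best_err = np.mean([segmentation_error(segment(im, params), gt)
                        for im, gt in zip(images, ground_truths)])
    for _ in range(n_iter):
        candidate = params + rng.normal(scale=delta, size=params.shape)
        err = np.mean([segmentation_error(segment(im, candidate), gt)
                       for im, gt in zip(images, ground_truths)])
        if err < best_err:                 # keep the perturbation only if it helps
            params, best_err = candidate, err
    return params, best_err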


Get this bounty!!!

#StackBounty: #matlab #machine-learning #signal-processing #least-squares #moving-average Extending Least squares for sparse coefficien…

Bounty: 50

My observation $y$ is obtained from the model $y(n) = \sum_{i=0}^{p-1} r(i)\,x(n-i) + v(n)$, where $r$ contains the sparse channel coefficients, $x$ is the one-dimensional input, and $v$ is additive white Gaussian noise of zero mean. The y = filter(.) command is used to model the above equation, thus creating an FIR filter, i.e. a moving average model. The order of the FIR filter is $p$.

So, y = [y(1), y(2), ..., y(100)] is a vector of 100 elements, and I am generating noise of variance 0.1. I want to estimate the sparse channel coefficients using LASSO; as there are $p$ channel coefficients, I should get $p$ estimates. According to the LASSO objective, $\|rx - y\|_2^2 + \lambda\,\|r\|_1$, I am estimating the sparse coefficients $r$, but I am not quite sure if this is the right way to do it. I have not found any example of LASSO being applied to a univariate time series model such as ARMA, and I don’t know how to estimate the sparse coefficients using an appropriate algorithm, so I need help.

The first part of the objective, $\|rx - y\|_2^2$, is a least-squares formulation which I can solve using a least-squares approach. In order to implement LS, I have to arrange the input in terms of regressors. However, if the coefficients are sparse, then I should use the LASSO approach. I have tried using Matlab’s lasso function: I rearranged the input data in terms of regressors, but I don’t know if this is the correct approach.

I need help. Is there an approach to include the sparsity term in the LS?

I also don’t quite understand the difference between lassoglm (https://www.mathworks.com/help/stats/lassoglm.html) and lasso.

Please find below the code for LASSO using the Matlab function. As a toy example I am assuming the model order to be 3 (lag 3), but I know that LASSO can be applied efficiently to much larger models.

% Code for LASSO estimation technique for 
%MA system, assuming L = 3 is the order,  

%Generate input
 x = -5:.1:5;

r = [1    0.0   0.0];% L elements of the channel coefficients     
%Data preparation into regressors    
X1 = [ ones(length(x),1) x' x']; %first column treated as all ones since    x_1=1

y = filter(r,1,x); % Generate the MA model
[r_hat_lasso, FitInfo] = lasso(X1, y, 'alpha', 1, 'Lambda', 1, 'Standardize', 1);

OUTPUT :

The estimates returned are r_hat_lasso = 0, 0.657002829714982, 0

This differs very much from the actual r.
Where am I going wrong? Am I doing it correctly? Please correct me if the implementation is wrong.
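For comparison, this is how I would arrange the lagged regressors in Python/scikit-learn under the same toy setup (the regularization strength is arbitrary, and whether this is the right way to build the regressor matrix is part of what I am asking):

import numpy as np
from scipy.linalg import toeplitz
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
p = 3                                    # FIR / MA model order
r_true = np.array([1.0, 0.0, 0.0])       # sparse channel coefficients
x = np.linspace(-5, 5, 101)              # input signal
y = np.convolve(x, r_true)[:len(x)]      # same as MATLAB's filter(r, 1, x)
y = y + rng.normal(scale=np.sqrt(0.1), size=len(y))   # noise of variance 0.1

# Regressor matrix: column k holds x delayed by k samples (zeros before the start),
# so row n is [x(n), x(n-1), x(n-2)]
X = toeplitz(x, np.zeros(p))

fit = Lasso(alpha=0.01, fit_intercept=False).fit(X, y)
print(fit.coef_)                         # estimates of r; should be close to [1, 0, 0]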


Get this bounty!!!

#StackBounty: #machine-learning #sampling #predictive-models #oversampling How to interpret results of a predictive model when an exter…

Bounty: 50

I have a prediction task in which I use scikit-learn’s DecisionTreeRegressor to predict a target label describing a certain user behaviour in a web platform (with a range of 0-4). The features are generated from the users’ other activities in the web platform.

I have separate training and test sets. The training set is from the 2nd-week activities, and the test set is from the 4th-week activities of the users. So, I want to train a model using the 2nd-week data, and test it on the 3rd week. In both sets, the target labels are imbalanced. The reason for the imbalance is that users are encouraged to participate a certain number of times, namely 3. Thus, in both sets there is an accumulation at 3. For example, the number of samples with 3-time participation is 400, whereas the number of users with 1 participation is 65 and the number of users with 0 participation is 55.

To obtain balanced target labels in the training set, we oversampled it to have equal numbers at each participation level (e.g., 0-participation: 250, 1-participation: 250, 2-participation: 250, 3-participation: 250, 4-participation: 250). Just to explore, when splitting the training set into train and test, the prediction results are very good (mean absolute error around 0.20); see Figure 1.

After training the model on the whole training set, we make predictions on the test set (which is itself imbalanced), and the results do not seem as promising (mean absolute error around 0.55); see Figure 2. When I oversample the test set as well, the prediction performance worsens (MAE increases to 0.80); see Figure 3.

The figures actually tell the story:

Figure 1: [image]

Figure 2: [image]

Figure 3: [image]

At this point I do not know how to proceed. Should I just go with the results in Figure 2 and discuss the effects of the external factor (being required to participate 3 times) on user behavior? That is, even though users have different activity patterns (which were used to generate the features), they may participate in an activity simply because they are required to. I wonder what would be a good approach to understanding these results. This is going to be for an academic work.
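For reference, a minimal sketch of the workflow described above on synthetic stand-in data (the column names, class proportions, and the oversample-to-250 choice are assumptions, not my actual pipeline):

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.utils import resample

rng = np.random.default_rng(0)

def make_week(n):
    """Stand-in for one week of user features plus the 0-4 participation target."""
    df = pd.DataFrame(rng.normal(size=(n, 5)), columns=[f"feat_{i}" for i in range(5)])
    df["participation"] = rng.choice([0, 1, 2, 3, 4], size=n, p=[0.09, 0.11, 0.08, 0.64, 0.08])
    return df

train_df, test_df = make_week(600), make_week(600)   # both imbalanced, like the real data

# Oversample each participation level in the training set to 250 rows
train_bal = pd.concat([
    resample(g, replace=True, n_samples=250, random_state=0)
    for _, g in train_df.groupby("participation")
]).reset_index(drop=True)

model = DecisionTreeRegressor(random_state=0)
model.fit(train_bal.drop(columns="participation"), train_bal["participation"])

pred = model.predict(test_df.drop(columns="participation"))
print("MAE on the imbalanced test set:", mean_absolute_error(test_df["participation"], pred))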


Get this bounty!!!

#StackBounty: #machine-learning #neural-networks #lags #state-space-models How to determine appropriate lagged features for learning sy…

Bounty: 50

In much of machine learning literature, the systems being modelled are instantaneous. Inputs -> outputs, with no notion of impact from past values.

In some systems, inputs from previous time-steps are relevant, e.g. because the system has internal states/storage. For example, in a hydrological model, you have inputs (rain, sun, wind), and outputs (streamflow), but you also have surface- and soil-storage at various depths. In a physically-based model, you might model those states as discrete buckets, with inflow, out-flow, evaporation, leakage, etc. all according to physical laws.

If you want to model streamflow in a purely empirical sense, e.g. with a neural network, you could just create an instantaneous model, and you’d get OK first-approximation results (and actually, in land surface modelling, you could easily do better than a physically based model…). But you would be missing a lot of relevant information – streamflow is inherently lagged relative to rainfall, for instance.

One way to get around this would be to include lagged variants of the input features: e.g. if your data is hourly, include rain over the last 2 days and rain over the last month. These inputs do improve model results in my experience, but it’s basically a matter of experience and trial-and-error as to how you choose the appropriate lags. There is a huge array of possible lagged variables to include (straight lagged data, lagged averages, exponential moving windows, etc.; multiple variables, with interactions, and often with high covariances). I guess a grid search for the best model is theoretically possible, but it would be prohibitively expensive.
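As a concrete example of the kind of lagged and windowed features I mean (the column names, window lengths, and hourly index are just placeholders), in pandas this might look like:

import numpy as np
import pandas as pd

hours = pd.date_range("2020-01-01", periods=24 * 90, freq="h")
df = pd.DataFrame({"rain": np.random.default_rng(0).gamma(0.3, 1.0, len(hours))}, index=hours)

# Straight lags, rolling accumulations, and an exponential moving average of rainfall
df["rain_lag_24h"] = df["rain"].shift(24)
df["rain_sum_2d"] = df["rain"].rolling("2D").sum()
df["rain_sum_30d"] = df["rain"].rolling("30D").sum()
df["rain_ewm_7d"] = df["rain"].ewm(span=24 * 7).mean()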

I’m wondering a) if there is a reasonable, cheapish, and relatively objective way to select the best lags to include from the almost infinite choices, or b) if there is a better way of representing storage pools in a purely empirical machine-learning model.


Get this bounty!!!

#StackBounty: #machine-learning #feature-selection #scikit-learn Being able to detect the important features sklearn.make_classificatio…

Bounty: 50

I am trying to learn about feature selection, and I thought using make_classification in sklearn would be helpful. I’m confused, though, because I can’t find as many informative features as expected.

I am using SelectKBest to identify the informative features, and the ones it selects (either via chi2 or f_classif) correlate well with which features turn out to be useful when training a RandomForestClassifier or any other classifier.
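The kind of classifier-based check I mean looks roughly like this (a sketch, not my exact code):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=0, n_classes=2, n_clusters_per_class=1,
                           shuffle=False, random_state=6)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
for idx, imp in sorted(enumerate(rf.feature_importances_), key=lambda t: -t[1]):
    print(idx, round(imp, 3))   # with shuffle=False, columns 0-2 are the intended informative ones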

By adding repeated features and seeing which ones repeat, I have been able to determine that it is the first n features generated by make_classification (n = the intended number of informative features) that are the informative ones.

However, in many cases the number of actually helpful features is less than the intended number of informative features. (I have noticed the number of clusters has an impact.) For instance, n_informative might be 3, but only one feature appears useful, whether via SelectKBest or by actually training a classifier.

So my two questions are:

1.) How can I detect the importance of the features that make_classification intends to be informative?

2.) What distinguishes the informative features that chi2/f_classif are able to detect from the informative features they are unable to detect?

The code I am using (output is below):

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
import pandas as pd
import numpy as np

np.random.seed(10)
def illustrate(n_informative, n_clusters_per_class):
    data_set = make_classification(n_samples = 500,
                                   n_features = 10,
                                   n_informative = n_informative,
                                   n_redundant=0,
                                   n_repeated=0,
                                   n_classes=2,
                                   n_clusters_per_class = n_clusters_per_class,
                                   weights=None,
                                   flip_y=0.0,
                                   class_sep=1.0,
                                   hypercube=True,
                                   shift=0.0,
                                   scale=1.0,
                                   shuffle = False,
                                   random_state = 6)

    X,Y  = pd.DataFrame(data_set[0]), pd.Series(data_set[1],name='class')
    X = X + abs(X.min().min())
    sel1 = SelectKBest(k=1)
    sel1.fit(X,Y)
    sel2 = SelectKBest(chi2, k=1)
    sel2.fit(X,Y)
    res = pd.concat([pd.Series(sel1.scores_,name='f_classif_score'),
                     pd.Series(sel1.pvalues_,name='f_classif_p_value'),
                     pd.Series(sel2.scores_, name='chi2_score'),
                     pd.Series(sel2.pvalues_,name='chi2_pvalue')],
                    axis=1).sort_values('f_classif_score',ascending=False)
    print(res)

for n_informative in [1,2,3,4]:
    for n_clusters_per_class in range(1, n_informative):
        print('Informative Features: {} Clusters Per Class : {}'.format(
            n_informative, n_clusters_per_class))
        illustrate(n_informative, n_clusters_per_class)

Output of Above Code:

Informative Features: 2 Clusters Per Class : 1
   f_classif_score  f_classif_p_value  chi2_score   chi2_pvalue
0      1016.973810      2.130399e-122  134.325167  4.638173e-31
1       772.724765      2.300631e-103  146.799731  8.679832e-34
5         4.078865       4.395792e-02    1.105015  2.931682e-01
8         1.979141       1.601046e-01    0.554276  4.565756e-01
7         1.374163       2.416583e-01    0.372371  5.417147e-01
3         0.443690       5.056552e-01    0.113065  7.366816e-01
4         0.197154       6.572205e-01    0.060201  8.061782e-01
9         0.186371       6.661408e-01    0.056129  8.127227e-01
6         0.169497       6.807367e-01    0.050526  8.221512e-01
2         0.054381       8.157042e-01    0.016877  8.966354e-01
Informative Features: 3 Clusters Per Class : 1
   f_classif_score  f_classif_p_value  chi2_score   chi2_pvalue
0       687.446137       7.661852e-96  162.798076  2.769074e-37
2       568.414329       2.215744e-84  175.119185  5.638711e-40
9         4.233500       4.015367e-02    1.353756  2.446226e-01
4         2.181651       1.402967e-01    0.649694  4.202221e-01
6         0.416503       5.189845e-01    0.127764  7.207621e-01
5         0.250830       6.167129e-01    0.067124  7.955711e-01
7         0.225946       6.347547e-01    0.068300  7.938284e-01
3         0.210548       6.465381e-01    0.065311  7.982908e-01
8         0.149100       6.995618e-01    0.046806  8.287169e-01
1         0.011565       9.144025e-01    0.003235  9.546456e-01
Informative Features: 3 Clusters Per Class : 2
   f_classif_score  f_classif_p_value  chi2_score   chi2_pvalue
2       812.090540      1.144207e-106  150.031081  1.706735e-34
0       106.629707       8.813981e-23   31.707663  1.792137e-08
7         3.907313       4.862763e-02    1.165847  2.802561e-01
5         1.941582       1.641185e-01    0.634154  4.258357e-01
9         1.456108       2.281233e-01    0.449901  5.023821e-01
6         1.010343       3.153089e-01    0.317138  5.733325e-01
3         0.918498       3.383347e-01    0.278306  5.978138e-01
4         0.892927       3.451437e-01    0.285967  5.928169e-01
1         0.206608       6.496370e-01    0.098889  7.531666e-01
8         0.106946       7.437854e-01    0.029129  8.644814e-01
Informative Features: 4 Clusters Per Class : 1
   f_classif_score  f_classif_p_value  chi2_score   chi2_pvalue
2       823.390874      1.344646e-107  126.561785  2.316755e-29
5         4.964055       2.632530e-02    1.234543  2.665253e-01
4         2.088944       1.489976e-01    0.511490  4.744944e-01
3         2.048932       1.529403e-01    0.812675  3.673306e-01
9         1.234054       2.671562e-01    0.254791  6.137213e-01
1         0.315991       5.742796e-01    0.041092  8.393598e-01
6         0.043817       8.342805e-01    0.010935  9.167180e-01
8         0.033963       8.538599e-01    0.007824  9.295150e-01
7         0.012199       9.120972e-01    0.002627  9.591195e-01
0         0.002108       9.634011e-01    0.000199  9.887401e-01
Informative Features: 4 Clusters Per Class : 2
   f_classif_score  f_classif_p_value  chi2_score   chi2_pvalue
2        59.446089       6.882444e-14   20.306324  6.598215e-06
3        45.413607       4.422173e-11   25.331602  4.827347e-07
6         4.355442       3.739881e-02    0.965005  3.259291e-01
7         2.444909       1.185419e-01    0.490491  4.837084e-01
9         1.508166       2.199992e-01    0.366551  5.448901e-01
5         1.438351       2.309767e-01    0.303560  5.816592e-01
1         0.956231       3.286131e-01    0.176588  6.743222e-01
8         0.886270       3.469467e-01    0.215632  6.423882e-01
4         0.175559       6.753984e-01    0.042743  8.362091e-01
0         0.064596       7.994786e-01    0.025981  8.719465e-01
Informative Features: 4 Clusters Per Class : 3
   f_classif_score  f_classif_p_value  chi2_score  chi2_pvalue
0        37.608756       1.762369e-09   15.340979     0.000090
3        35.104866       5.834908e-09   17.716788     0.000026
5         7.474495       6.480748e-03    1.632879     0.201305
8         6.424434       1.156120e-02    1.636956     0.200744
6         0.566897       4.518503e-01    0.130881     0.717521
4         0.225665       6.349655e-01    0.057623     0.810293
7         0.149020       6.996387e-01    0.031846     0.858367
2         0.033591       8.546550e-01    0.015237     0.901759
1         0.028674       8.656032e-01    0.011647     0.914058
9         0.004558       9.461984e-01    0.001164     0.972785


Get this bounty!!!