#StackBounty: #classification #cart #boosting #xgboost Gradient boosting – extreme predictions vs predictions close to 0.5

Bounty: 50

Let’s say you train two different Gradient Boosting Classifier models on two different datasets. You use leave-one-out cross-validation, and you plot the histograms of predictions that the two models output. The histograms look like this:
[Histogram: predictions clustered near 0 and 1]

and this:

[Histogram: predictions clustered near 0.5]

So, in one case, predictions (on out-of-sample / validation sets) are mostly extreme (close to 0 and 1), and in the other case predictions are close to 0.5.

What, if anything, can be inferred from each graph? How could one explain the difference? Can anything be said about the dataset/features/model?

My gut feeling is that in the first case the features explain the data well, so the model fits the data closely (and possibly overfits it, but not necessarily: performance on the validation/test sets could still be good if the features genuinely explain the data). In the second case the features do not explain the data well, so the model does not fit the data closely. The performance of the two models could nevertheless be the same in terms of precision and recall. Would that be correct?
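For context, a minimal sketch of how such out-of-fold prediction histograms can be produced with leave-one-out cross-validation, assuming a numeric feature matrix X and a 0/1 label vector y (hypothetical names) and the xgboost R package:

library(xgboost)

n   <- nrow(X)
oof <- numeric(n)                       # out-of-fold predicted probabilities
for (i in seq_len(n)) {
  fit <- xgboost(data = X[-i, , drop = FALSE], label = y[-i],
                 objective = "binary:logistic", nrounds = 100, verbose = 0)
  oof[i] <- predict(fit, X[i, , drop = FALSE])
}
hist(oof, breaks = 20, main = "Out-of-fold predicted probabilities")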


Get this bounty!!!

#StackBounty: #classification #supervised-learning Incremental learning for classification models in R

Bounty: 50

Assume I have a classifier (it could be any of the standard classifiers like a decision tree, random forest, logistic regression, etc.) for fraud detection, built using the code below:

library(randomForest)
rfFit = randomForest(Y ~ ., data = myData, ntree = 400) # A very basic classifier 

Say Y is a binary outcome: Fraud/Not-Fraud.

Now I have predicted on an unseen data set.

pred = predict(rfFit, newData)

Then I obtained feedback from the investigation team on my classification and found that I made the mistake of classifying a fraud as Non-Fraud (i.e. one false negative). Is there any way I can let my algorithm understand that it has made a mistake, i.e. any way of adding a feedback loop to the algorithm so that it can correct its mistakes?

One option I can think of off the top of my head is to build an AdaBoost classifier so that the new classifier corrects the mistakes of the old one. I have also heard of incremental learning or online learning. Are there any existing implementations (packages) in R?

Is this the right approach? Or is there any other way to tweak the model instead of rebuilding it from scratch?
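Not an existing package, but as a minimal sketch of what "online learning" means here: keep a simple model as a parameter vector and nudge it with one gradient step each time the investigation team feeds back a corrected label. This is plain logistic regression in base R, not a random forest, and x_new / y_true are hypothetical names for the corrected case.

# one stochastic-gradient update of a logistic-regression weight vector w
update_online <- function(w, x_new, y_true, lr = 0.01) {
  x <- c(1, x_new)                    # prepend intercept
  p <- 1 / (1 + exp(-sum(w * x)))     # current predicted probability of fraud
  w + lr * (y_true - p) * x           # move the weights towards the corrected label
}

# w <- rep(0, 1 + n_features)                # initialise once (hypothetical n_features)
# w <- update_online(w, x_new, y_true = 1)   # feedback: this case was actually Fraud

For tree ensembles, the closest analogue is usually to refit periodically with the corrected labels (optionally giving previously misclassified cases more sampling weight), since a fitted randomForest object cannot be updated case by case.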


Get this bounty!!!

#StackBounty: #classification #keras #convnet #training #audio-recognition CNN for phoneme recognition

Bounty: 50

I am currently studying this paper, in which a CNN is applied to phoneme recognition using a visual representation of log mel filter banks and a limited weight sharing scheme.

The visualisation of log mel filter banks is a way of representing and normalizing the data. They suggest visualizing it as a spectrogram with RGB colors; the closest I could come up with is plotting it using matplotlib's colormap cm.jet. They (the paper) also suggest that each frame should be stacked with its [static delta delta_delta] filterbank energies. This looks like this:

[Figure: 15-frame patch of stacked static, delta and delta-delta filter bank energies]

The input consists of an image patch of 15 frames; with the stacked [static delta delta_delta] set, the input shape would be (40, 45, 3).

Limited weight sharing consists of restricting the weight sharing to a specific filter bank area, as speech is interpreted differently in different frequency areas; thus full weight sharing, as in normal convolution, would not work.

Their implementation of limited weight sharing consists of controlling the weights in the weight matrix associated with each convolutional layer, so they apply a convolution on the complete input.
The paper applies only one convolutional layer, as using multiple layers would destroy the locality of the feature maps extracted from the convolutional layer.
The reason they use filter bank energies rather than the usual MFCC coefficients is that the DCT destroys the locality of the filter bank energies.


Instead of controlling the weight matrix associated with the convolution layer, I chose to implement the CNN with multiple inputs, so each input consists of a (small filter bank range, total_frames_with_deltas, 3) patch. For instance, the paper states that a filter size of 8 should be good, so I decided on a filter bank range of 8. Each small image patch is therefore of size (8,45,3). Each small image patch is extracted with a sliding window with a stride of 1 – so there is a lot of overlap between the inputs – and each input has its own convolutional layer.

[Figure: model diagram with one convolutional branch per filter bank patch]

(input_3 , input_3, input3, should have been input_1, input_2, input_3 …)

Doing it this way makes it possible to use multiple convolutional layers, as locality is no longer a problem since each convolution is applied inside a single filter bank area; this is my theory.

The paper doesn't explicitly state it, but I guess the reason they do phoneme recognition on multiple frames is to have some left and right context, so only the middle frame is predicted/trained for. In my case, the first 7 frames form the left context window, the middle frame is the one being trained for, and the last 7 frames form the right context window. So given multiple frames, only one phoneme, the middle one, is recognised.

My neural network currently looks like this:

import keras
from keras import metrics
from keras.models import Model
from keras.layers import Input, Conv2D, MaxPooling2D, Dense, Reshape
from keras.callbacks import ReduceLROnPlateau, EarlyStopping

# Assumed to be defined elsewhere: train_generator, test_generator and batch_size.

def model3():

    window_height = 8                         # filter bank range per input patch
    total_frames = 15                         # frames in the context window
    total_frames_with_deltas = 45             # 15 frames x [static, delta, delta-delta]
    splits = ((40 - window_height) + 1) // 1  # = 33 overlapping patches along the 40 filter banks
    next(test_generator())
    next(train_generator(batch_size))

    kernel_number = 200  # int(math.ceil(splits))
    # one input (and one convolutional branch) per overlapping filter bank patch
    list_of_input = [Input(shape=(window_height, total_frames_with_deltas, 3)) for i in range(splits)]
    list_of_conv_output = []
    list_of_conv_output_2 = []
    list_of_conv_output_3 = []
    list_of_conv_output_4 = []
    list_of_conv_output_5 = []
    list_of_max_out = []
    for i in range(splits):
        # earlier attempts:
        # list_of_conv_output.append(Conv2D(filters=kernel_number, kernel_size=(15, 6))(list_of_input[i]))
        # list_of_conv_output.append(Conv2D(filters=kernel_number, kernel_size=(window_height - 1, 3))(list_of_input[i]))
        list_of_conv_output.append(Conv2D(filters=kernel_number, kernel_size=(window_height, 3), activation='relu')(list_of_input[i]))
        list_of_conv_output_2.append(Conv2D(filters=kernel_number, kernel_size=(1, 5))(list_of_conv_output[i]))
        list_of_conv_output_3.append(Conv2D(filters=kernel_number, kernel_size=(1, 7))(list_of_conv_output_2[i]))
        list_of_conv_output_4.append(Conv2D(filters=kernel_number, kernel_size=(1, 11))(list_of_conv_output_3[i]))
        list_of_conv_output_5.append(Conv2D(filters=kernel_number, kernel_size=(1, 13))(list_of_conv_output_4[i]))
        # list_of_conv_output_3.append(Conv2D(filters=kernel_number, kernel_size=(3, 3), padding='same')(list_of_conv_output_2[i]))
        list_of_max_out.append(MaxPooling2D(pool_size=(1, 11))(list_of_conv_output_5[i]))

    merge = keras.layers.concatenate(list_of_max_out)
    print(merge.shape)
    reshape = Reshape((1, -1))(merge)         # total_frames / total_frames == 1

    dense1 = Dense(units=1000, activation='relu', name="dense_1")(reshape)
    dense2 = Dense(units=1000, activation='relu', name="dense_2")(dense1)
    dense3 = Dense(units=145, activation='softmax', name="dense_3")(dense2)
    # dense4 = Dense(units=1, activation='linear', name="dense_4")(dense3)

    model = Model(inputs=list_of_input, outputs=dense3)
    model.compile(loss="categorical_crossentropy", optimizer="SGD", metrics=[metrics.categorical_accuracy])

    reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=3, verbose=1, mode='auto', epsilon=0.001, cooldown=0)
    stop = EarlyStopping(monitor='val_loss', min_delta=0, patience=5, verbose=1, mode='auto')

    print(model.summary())

    raw_input("okay?")                        # Python 2; use input() on Python 3
    hist_current = model.fit_generator(train_generator(batch_size),
                                       steps_per_epoch=10,
                                       epochs=10000,
                                       verbose=1,
                                       validation_data=test_generator(),
                                       validation_steps=1,
                                       callbacks=[reduce_lr, stop])  # the callbacks above were defined but never passed
                                       # pickle_safe=True,
                                       # workers=4

So, now comes the issue.

I have been training the network and have only been able to get a validation accuracy of at most 0.17, while the (training) accuracy after a lot of epochs ends up being 1.0.
(Plot is currently being made)

I am not sure why I am not getting better results. Why is the error rate so high?
I am using the TIMIT dataset, which the others also use, so why am I getting worse results?

And sorry for the long post; I hope the extra information about my design decisions is useful, and that seeing how I understood the paper versus how I applied it helps pinpoint where my mistake is.


Get this bounty!!!

#StackBounty: #code-golf #decision-problem #classification #audio Lossy or Lossless?

Bounty: 200

Given an audio file, determine whether it is encoded in a lossy format or a lossless format. For the purposes of this challenge, only the following formats need to be classified:

Rules

  • If input is taken in the form of a filename, no assumptions should be made about the filename (e.g. the extension is not guaranteed to be correct for the format, or even present).
  • There will be no ID3 or APEv2 metadata present in input files.
  • Any two unique and distinguishable outputs may be used, such as 0 and 1, lossy and lossless, foo and bar, etc.

Test Cases

The test cases for this challenge consist of a zip file located here, which contains two directories: lossy and lossless. Each directory contains several audio files that are all 0.5-second 440 Hz sine waves, encoded in various formats. All of the audio files have extensions matching the formats above, with the exception of A440.m4a (which is AAC audio in an MPEG Layer 4 container).


Get this bounty!!!

#StackBounty: #classification #random-forest #unbalanced-classes #discriminant-analysis multiclass classification when x variable is no…

Bounty: 100

I have a dataset as below. It is a classification problem with multiple x variables and a y variable. The y variable has 5 levels. The picture below has the y variable on the x axis and one x variable on the y axis. As you can see, for each value of y the distribution of the x variable is not very different (I plotted similar charts for all x variables, one at a time against y, and see the same trend). Because of this, most of the observations get classified as class 0 or class 4. I built a random forest with 500 trees. How could I improve accuracy? I thought of taking the square or cube of each x so that the distance between class means would increase, but that won't help with the variance within each class.

The distribution of the 5 classes in my data is as below:

   0    1    2    3    4 
5104 2639 2322 2661 5274 

[Boxplot of one x variable across the five classes of y]

Column means and standard deviations of my x variables are very similar 🙁
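One knob worth trying before transforming the features is the sampling scheme of the forest itself. A minimal sketch, assuming the response is a factor Y in a data frame myData (hypothetical names): draw an equal number of cases from every class for each tree, so classes 1-3 are not swamped by classes 0 and 4. This addresses the imbalance, though it cannot create separation that the features themselves do not contain.

library(randomForest)

n_min <- min(table(myData$Y))                 # size of the smallest class
rfFit <- randomForest(Y ~ ., data = myData, ntree = 500,
                      strata   = myData$Y,    # stratify each tree's sample by class
                      sampsize = rep(n_min, nlevels(myData$Y)))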


Get this bounty!!!

#StackBounty: #r #hypothesis-testing #classification #experiment-design #double-blind Design a double-blind expert test when there is i…

Bounty: 50

I would like to submit to an expert group a set of images from two genetically different types of plants to see if they can find a difference. I have 20 images for each of the two types. The images are images of cells that are so specific that only a small group of 10 experts is able to see the potential differences (without knowing which type is which).

To have a powerful test, I chose to work with a two-out-of-five test, which means that I show the experts five images of which 3 are from type 1 and 2 are from type 2. They have to split the images into two groups. I chose this test because of its power: the probability of grouping the images correctly by chance is only 1/10. Because the number of experts is really small, using this test instead of a triangular test or a simple two-image test is, in my view, more powerful.

The variability of images within a type is quite high. If there is a difference between the two types, it will be small within this set of 2*20 images. Hence, if there is a difference, I would also like to know which images are the most different. I cannot show more than ~50 sets of 5 images to my experts, as they do not have a lot of time to perform the analysis. The problem is that there are about 400,000 possible sets of 5 images (3 type 1 – 2 type 2).
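As a quick check of the combinatorics mentioned above (in R): the two-out-of-five test has 10 possible groupings, hence the 1/10 chance level, and the quoted ~400,000 is consistent with counting sets where the 3-2 split can fall either way.

choose(5, 2)                          # 10 ways to split 5 images into a pair and a triple
choose(20, 3) * choose(20, 2)         # 216600 sets with 3 type-1 and 2 type-2 images
2 * choose(20, 3) * choose(20, 2)     # 433200 if either type can supply the triple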

I think that choosing my 50 sets randomly among the 400,000 for each expert is not really representative. It would be more interesting to choose the sets so that each image has been compared to every other image, so that I can determine whether they are similar or different (which can be inferred from the 2-out-of-5 grouping). With random selection, I will not be sure that I have tested all images against all others (at least indirectly) with only 10 experts. Thus, I decided to create a sample of 50 sets, different for each expert, that maximises the number of pairs compared either directly or indirectly. However, with this sampling, some pairs of images are compared more often than others, which is why I constrained my selection so that each pair is compared at least 3 times, directly or indirectly.
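A minimal sketch of such a selection in R (hypothetical image identifiers; a greedy heuristic, not an optimal design): for each of the 50 sets, draw a batch of candidate 3+2 sets at random and keep the one whose image pairs have been shown together least often so far.

images_t1 <- paste0("t1_", 1:20)
images_t2 <- paste0("t2_", 1:20)
all_imgs  <- c(images_t1, images_t2)
pair_count <- matrix(0, 40, 40, dimnames = list(all_imgs, all_imgs))

sets <- vector("list", 50)
for (s in seq_len(50)) {
  candidates <- replicate(200, c(sample(images_t1, 3), sample(images_t2, 2)),
                          simplify = FALSE)
  score <- sapply(candidates, function(st) {
    prs <- combn(st, 2)                       # the 10 image pairs in this set
    sum(1 / (1 + pair_count[cbind(prs[1, ], prs[2, ])]))
  })
  best <- candidates[[which.max(score)]]
  prs  <- combn(best, 2)
  pair_count[cbind(prs[1, ], prs[2, ])] <- pair_count[cbind(prs[1, ], prs[2, ])] + 1
  pair_count[cbind(prs[2, ], prs[1, ])] <- pair_count[cbind(prs[2, ], prs[1, ])] + 1
  sets[[s]] <- best
}

min(pair_count[upper.tri(pair_count)])  # many pairs stay at 0 for one expert, hence different sets per expert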

This seems to be quite a complicated sampling procedure. For me, it is the only one that makes me confident I will be able to compare all my images properly, in a few sets per expert, knowing that once I have shown the sets to my experts I will not have another chance to find another expert group, and they won't be able to take more time to do it again.

To summarize, I want to know:

  1. Are my two types different?
  2. Which cell images are more different than the others?

Do you think I am going about this in too complicated a way? Is random selection a better solution, even if I will only be able to test 50 sets * 10 experts among the 400,000 possible sets? How can I be sure not to bias my set selection procedure?

By the way, I work with R, but I don't think this is really important for this question.


Get this bounty!!!

#StackBounty: #machine-learning #classification #multinomial #matching Multi-label classification: Predict product category

Bounty: 50

I want to predict to which product category a product belongs. A total of 400k products need to be translated from the old (less refined) to the new product category tree. (E.g. an alarm clock used to fall under ‘Electronics’ and will now belong to ‘Alarm clocks’.) So far, 36k products have already been partly allocated to ~400 (out of 800) new product categories. The filling rate ranges from 1% to 95%.

The product data contains (among other things) the variables name, description, price, dimensions, color and the old label. The idea was to construct features from the unstructured variables through tokenisation -> TF-IDF.

Proposed Approach:

  1. Train one multi-label prediction model (e.g. Ridge classification + stratified CV) on the labelled data; a sketch of this step follows after this list. Then predict the category only for the subset that, based on the old product tree, contains all possible candidates (e.g. predict whether unlabelled ‘Electronics’ products are ‘Alarm clocks’).
  2. Based on the predicted probability, present to a content manager the unlabelled product that, if labelled, would result in the highest information gain.
  3. Propose to what extent the remaining 400 categories should be filled (e.g. 60%) and which products to label first.
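As referenced in step 1, a minimal sketch of the feature construction and the Ridge step, assuming a data frame products with a text column name and a (partly NA) new_category column, all hypothetical names; TF-IDF is hand-rolled in base R and the ridge classifier comes from glmnet (alpha = 0, multinomial):

library(glmnet)

# tokenise product names and build a document-term matrix
docs   <- tolower(products$name)
tokens <- strsplit(gsub("[^a-z0-9 ]", " ", docs), "\\s+")
tokens <- lapply(tokens, function(tk) tk[nzchar(tk)])
vocab  <- sort(unique(unlist(tokens)))
dtm    <- t(sapply(tokens, function(tk) tabulate(match(tk, vocab), nbins = length(vocab))))

# TF-IDF weighting
tf    <- dtm / pmax(rowSums(dtm), 1)
idf   <- log(nrow(dtm) / pmax(colSums(dtm > 0), 1))
tfidf <- sweep(tf, 2, idf, `*`)

# ridge (alpha = 0) multinomial classifier on the already-labelled products
labelled <- !is.na(products$new_category)
fit <- cv.glmnet(x = tfidf[labelled, ], y = factor(products$new_category[labelled]),
                 family = "multinomial", alpha = 0)

# class probabilities for the unlabelled products, e.g. to rank them for manual review
probs <- predict(fit, newx = tfidf[!labelled, ], s = "lambda.min", type = "response")

In practice a sparse document-term matrix and a subset of the 800 categories at a time would be needed for 400k products; the sketch only illustrates the feature construction and the ridge step.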

What would your preferred approach be?


Get this bounty!!!

#StackBounty: #classification #subsampling Choosing subsample size (helping a friend analysing a smaller data set)

Bounty: 100

A friend of mine is analysing 2000 tweets per day and categorizing them as positive, negative or neutral.
This is a really boring task, but the algorithms that do this classification are not very good because they can't detect sarcasm.
A simple solution to make this task easier is to take a subsample of the original $N = 2000$ data points.

Doing some tests, we saw that with $30\%$ of the data the normalized histograms of the subsample and of the original data points look very similar, but we need a better estimate of the error introduced by this subsampling.

Theoretically the data points are an i.i.d. sequence $(X_i)_{i=1}^N$ (a big assumption) taking values in the space $A = \{0,1,2\}$ (positive, negative, neutral). Let $(X_{(i)})_{i=1}^n$ be a subsample of size $n \leq N$ (draw $n$ elements uniformly without replacement).
In some sense I want to characterize the distribution of $(X_{(i)})_{i=1}^n$ in order to choose an $n$ such that the empirical distribution of $(X_{(i)})_{i=1}^n$ is close to the empirical distribution of $(X_i)_{i=1}^N$.
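One way to put a number on that, assuming the subsample is a simple random sample without replacement, is the finite-population-corrected standard error of each category proportion, $\sqrt{\frac{p(1-p)}{n}\cdot\frac{N-n}{N-1}}$. A minimal R sketch with the figures above and hypothetical class proportions:

N <- 2000
n <- 0.30 * N                       # the 30% subsample used in the tests above
p <- c(0.40, 0.35, 0.25)            # hypothetical proportions of positive / negative / neutral

se <- sqrt(p * (1 - p) / n * (N - n) / (N - 1))   # SE of each estimated proportion
round(se, 3)                        # roughly 0.015-0.017 per category at n = 600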

Any help will be appreciated


Get this bounty!!!

#StackBounty: Classification using independent models for each class

Bounty: 50

One way to explain my data is to use the example data below. Here, I use the iris dataset to depict the four independent scores for each instance. My task is to classify each instance into one of the four classes.

> data(iris)
> iris2 <- as.data.frame(scale(iris[,1:4]))
> colnames(iris2) <- c("class_1","class2","class_3","class_4")
> head(iris2)
     class_1      class2   class_3   class_4
1 -0.8976739  1.01560199 -1.335752 -1.311052
2 -1.1392005 -0.13153881 -1.335752 -1.311052
3 -1.3807271  0.32731751 -1.392399 -1.311052
4 -1.5014904  0.09788935 -1.279104 -1.311052
5 -1.0184372  1.24503015 -1.335752 -1.311052
6 -0.5353840  1.93331463 -1.165809 -1.048667

However, the underlying scoring method/logic differs from class to class. Sure, when looking at one class only, a higher score means the instance is more likely to be of that class, but the difficulty arises when comparing the four scores:

Looking at their distributions, class_1 might have a significant skew, and class_2 a widely different value range. This means I cannot simply use the maximum value when selecting the final class:

> iris3 <- cbind(
+   iris2,
+   lable_num=max.col(iris2,ties.method="first")
+ )
> head(iris3)
     class_1      class2   class_3   class_4 lable_num
1 -0.8976739  1.01560199 -1.335752 -1.311052         2
2 -1.1392005 -0.13153881 -1.335752 -1.311052         2
3 -1.3807271  0.32731751 -1.392399 -1.311052         2
4 -1.5014904  0.09788935 -1.279104 -1.311052         2
5 -1.0184372  1.24503015 -1.335752 -1.311052         2
6 -0.5353840  1.93331463 -1.165809 -1.048667         2

How should I go about “levelling the playing field” in order to select the final class?

Is removing the mean enough? What if I also divide each column by its standard deviation? Or is that taking it one step too far? Am I losing information by standardizing?

What about the skew issue?

I’m having difficulty discerning between what should be treated as an indication of a popular class (due to a heavy skew, or just larger values), and what traits should be fixed by scaling/transforming.

Are there any other types of approaches I should try?
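One option worth considering, which levels both scale and skew without any training labels: replace each score column by its within-column percentile rank (its empirical CDF value) and only then take the column-wise maximum. A minimal sketch on the example data above:

# convert each class score to a within-column percentile rank in (0, 1]
iris_rank <- as.data.frame(apply(iris2, 2, function(x) rank(x) / length(x)))
iris4 <- cbind(iris_rank, lable_num = max.col(iris_rank, ties.method = "first"))
head(iris4)

The trade-off is that ranking discards the absolute magnitude of each score, so an instance that scores weakly on every class can still be assigned confidently; whether that matters depends on what the scores mean.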

I cannot change the way the scores have been calculated, and there is no training set I can use to model the final label from the four model scores (there are no prior final labels).


Get this bounty!!!