#StackBounty: #classification #accuracy #precision-recall #multi-class Summarising Precision/Recall Measures in Multi-class Problem

Bounty: 50

I have a hierarchical multi-class classification system that classifies records into about 500 different categories. I want to summarise the performance of the classifier in a simple way.

A measure of accuracy on validation data is easy to implement: correctly coded/all coded. For each class, we can look at binary measures of precision and recall to summarise the performance relative to that class.

However, there doesn’t seem to be a generally accepted way to combine the binary precisions and recalls into summaries of precision and recall across the entire set of classes. There appear to be a few ways to approach this summary:

  1. Take a simple average (arithmetic/geometric/harmonic) of each class’s precision/recall.

  2. Take a weighted average (weighted by number of examples, etc) of each class’s precision/recall.

  3. Use bookmaker’s informedness/markedness which seems to have a natural generalisation in the multiclass context.

Are there advantages to using one of these approaches particularly? Is there a generally accepted way to do this that I’ve just been missing?
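
For concreteness, here is a minimal sketch (assuming scikit-learn, with toy placeholder labels) of how options 1 and 2 map onto macro- and weighted-averaged precision/recall; micro-averaging is shown for comparison:

from sklearn.metrics import precision_score, recall_score

# toy placeholder labels, not real data
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 2, 2, 2, 1, 0, 1, 1]

for avg in ("macro", "weighted", "micro"):
    p = precision_score(y_true, y_pred, average=avg)
    r = recall_score(y_true, y_pred, average=avg)
    print("%8s: precision=%.3f recall=%.3f" % (avg, p, r))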


Get this bounty!!!

#StackBounty: #classification #scikit #apache-spark #preprocessing #sentiment-analysis Extracting individual emails from an email thread

Bounty: 50

Most open-source datasets are well formatted, i.e. each email message is cleanly separated, as in the Enron email dataset. But out in the real world it is very difficult to separate the top email message from a thread of emails.

For example consider the below message.

Hi,

Can you offer me a better discount.

Thanks,
Mr.X
Customer Relations.

---- On Wed, 10 May 2017 04:05:16 -0700 someone@somewhere.com wrote ------

Hello Mr.X,

Does the below work out. Do let us know your thoughts.

Thanks,
Mr.Y

Sales Manager.

The reason we want to split the emails is that we want to do sentiment analysis; if we fail to split an email, the results will be wrong.

I searched around and found this very comprehensive research paper. I also found an implementation by Mailgun called Talon. Unfortunately, it does not work well for certain kinds of patterns.

For example, when the second message in the email thread is introduced by a separator like

---------- Forwarded message ---------- 

instead of the above

---- On Wed, 10 May 2017 04:05:16 -0700 someone@somewhere.com wrote ------

My question is: many people trying to do this kind of thing must have faced these problems, yet the area remains pretty murky. Is there any solid implementation of the paper, or anything else, that splits emails reliably?
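
For what it's worth, here is a rough Python sketch of the regex-based approach most tools take for keeping only the top message. The separator patterns below are my own illustrative assumptions, not an exhaustive list:

import re

SEPARATOR_PATTERNS = [
    r"^-{2,}\s*On .+ wrote\s*-{2,}\s*$",        # "---- On Wed, ... wrote ------"
    r"^-{5,}\s*Forwarded message\s*-{5,}\s*$",  # "---------- Forwarded message ----------"
    r"^On .+ wrote:\s*$",                       # "On Wed, May 10, 2017, ... wrote:"
]
SEPARATOR_RE = re.compile("|".join(SEPARATOR_PATTERNS), re.IGNORECASE)

def top_message(raw_email):
    """Return only the newest message in a plain-text email thread."""
    kept = []
    for line in raw_email.splitlines():
        if SEPARATOR_RE.match(line.strip()):
            break
        kept.append(line)
    return "\n".join(kept).strip()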


Get this bounty!!!

#StackBounty: #classification #overfitting #hyperparameter #bagging #xgboost XGBoost feature subsampling

Bounty: 50

I have a dataset with ~30k samples and 35 features (after feature selection; these seem to be the most important features for this dataset and they have low correlation between each other).

After doing grid search with 10-fold CV on the hyperparameters, to my surprise I get the lowest validation error when colsample_bytree is such that only 1 feature is sampled for each tree! (Edit: actually, with 2 features sampled per tree it works slightly better – but if I increase the number of features sampled per tree the performance keeps getting worse). The depth of each tree is 3 and I am building 2000 trees. That is, for each tree, a feature is randomly selected, and then xgboost tries to fit to residuals using only that feature.

That seems very unusual. How should I interpret this? If I have feature interactions in my trees, do I start to overfit? But then I would expect trees of depth 1 with no feature subsampling to perform just as well, yet they don’t. In fact, in the grid search, nearly all models with such extreme feature subsampling did better than models without feature subsampling.

Edit: is it possible that I have some features that fit the training set well but generalize very poorly, and that sampling single features helps keep those features from dominating the model? I am struggling to see what else this could mean.

Edit 2: I tried removing individual features; performance does not improve, which suggests that my hypothesis from the previous edit is unlikely. On the other hand, I found that performance is actually optimal when I sample 2 features per tree. At least now my features are interacting, but I am still not sure how to explain this gain in performance.
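
For reference, a sketch of the kind of grid search described above, varying only colsample_bytree (this assumes the xgboost and scikit-learn Python APIs; the specific grid values and learning rate are my own assumptions):

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# fractions corresponding to roughly 1, 2, 4, 8, 16 or all of the 35 features per tree
param_grid = {"colsample_bytree": [1/35, 2/35, 4/35, 8/35, 16/35, 1.0]}

model = XGBClassifier(max_depth=3, n_estimators=2000, learning_rate=0.05)
search = GridSearchCV(model, param_grid, cv=10, scoring="neg_log_loss")
# search.fit(X, y)  # X, y would be the 30k x 35 training matrix and labels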


Get this bounty!!!

#StackBounty: #classification #cart #boosting #xgboost Gradient boosting – extreme predictions vs predictions close to 0.5

Bounty: 50

Let’s say you train two different Gradient Boosting Classifier models on two different datasets. You use leave-one-out cross-validation, and you plot the histograms of predictions that the two models output. The histograms look like this:
[histogram: out-of-sample predictions clustered near 0 and 1]

and this:

[histogram: out-of-sample predictions clustered near 0.5]

So, in one case, predictions (on out-of-sample / validation sets) are mostly extreme (close to 0 and 1), and in the other case predictions are close to 0.5.
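
For reference, a minimal sketch (assuming scikit-learn, with synthetic placeholder data) of how such out-of-sample prediction histograms can be produced:

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_predict

X, y = make_classification(n_samples=60, n_features=10, random_state=0)  # placeholder data

# out-of-sample probabilities from leave-one-out cross-validation
proba = cross_val_predict(GradientBoostingClassifier(), X, y,
                          cv=LeaveOneOut(), method="predict_proba")[:, 1]

plt.hist(proba, bins=20)
plt.xlabel("out-of-sample predicted probability of class 1")
plt.show()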

What, if anything, can be inferred from each graph? How could one explain the difference? Can anything be said about the dataset/features/model?

My gut feeling is that in the first case, the features explain the data better so the model gets a better fit to the data (and possibly overfits it, but not necessarily – the performance on the validation/test sets could still be good if the features actually explain the data well). In the second case, the features do not explain the data well and so the model does not fit too closely to the data. The performance of the two models could still be the same in terms of precision and recall, however. Would that be correct?


Get this bounty!!!

#StackBounty: #classification #supervised-learning Incremental learning for classification models in R

Bounty: 50

Assume I have a classifier (it could be any of the standard classifiers: decision tree, random forest, logistic regression, etc.) for fraud detection, built using the code below:

library(randomForest)
rfFit = randomForest(Y ~ ., data = myData, ntree = 400) # A very basic classifier 

Say, Y is a binary outcome - Fraud/Not-Fraud

Now I have predicted on an unseen data set:

pred = predict(rfFit, newData)

Then I obtained feedback from the investigation team on my classifications and found that I made the mistake of classifying a fraud as Non-Fraud (i.e. one false negative). Is there any way I can let my algorithm understand that it has made a mistake, i.e. any way of adding a feedback loop to the algorithm so that it can correct its mistakes?

One option I can think of off the top of my head is to build an AdaBoost classifier so that the new classifier corrects the mistakes of the old one. I have also heard of incremental learning or online learning. Are there any existing implementations (packages) in R?

Is that the right approach, or is there any other way to tweak the model instead of rebuilding it from scratch?


Get this bounty!!!

#StackBounty: #classification #keras #convnet #training #audio-recognition CNN for phoneme recognition

Bounty: 50

I am currently studying this paper, in which a CNN is applied to phoneme recognition using a visual representation of log mel filter banks and a limited weight-sharing scheme.

Visualising the log mel filter banks is a way of representing and normalising the data. The paper suggests visualising them as a spectrogram with RGB colours; the closest I could come up with is plotting them using matplotlib's cm.jet colormap. The paper also suggests that each frame should be stacked with its [static delta delta_delta] filter-bank energies. This looks like this:

[image: a stacked [static delta delta_delta] filter-bank patch rendered with cm.jet]

The input consists of an image patch of 15 frames stacked with [static delta delta_delta], so the input shape is (40, 45, 3).
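
To make that concrete, here is a rough sketch of building and viewing one such (40, 45, 3) patch. The data is a random placeholder and the delta computation is a crude stand-in for what a speech toolkit would produce:

import numpy as np
import matplotlib.pyplot as plt

static = np.random.rand(40, 15)              # 40 log mel filter banks x 15 frames (placeholder)
delta = np.gradient(static, axis=1)          # crude delta / delta-delta approximations
delta_delta = np.gradient(delta, axis=1)

patch = np.concatenate([static, delta, delta_delta], axis=1)   # shape (40, 45)
rgb = plt.cm.jet(patch)[:, :, :3]   # colormapped to RGB -> (40, 45, 3); out-of-range values are clipped
print(rgb.shape)

plt.imshow(rgb, origin="lower", aspect="auto")
plt.xlabel("frame (static | delta | delta_delta)")
plt.ylabel("mel filter bank")
plt.show()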

The limited weight sharing consists of restricting weight sharing to a specific filter-bank area, since speech is interpreted differently in different frequency ranges; full weight sharing, as in normal convolution, would therefore not work.

Their implementation of limited weight sharing consists of controlling the weights in the weight matrix associated with each convolutional layer, so they apply a single convolution over the complete input. The paper applies only one convolutional layer, since using several would destroy the locality of the feature maps extracted by the convolutional layer. The reason they use filter-bank energies rather than the usual MFCC coefficients is that the DCT destroys the locality of the filter-bank energies.


Instead of controlling the weight matrix associated with the convolutional layer, I chose to implement the CNN with multiple inputs, each of shape (small filter-bank range, total_frames_with_deltas, 3). For instance, the paper states that a filter size of 8 should be good, so I chose a filter-bank range of 8. Each small image patch is therefore of size (8, 45, 3). The patches are extracted with a sliding window with a stride of 1 – so there is a lot of overlap between inputs – and each input has its own convolutional layer.

[diagram of the multi-input CNN architecture]

(The inputs labelled input_3, input_3, input3 in the diagram should have been input_1, input_2, input_3, …)

Doing it this way makes it possible to use multiple convolutional layers, since locality is no longer a problem: each convolution is applied inside a single filter-bank area. That is my theory, at least.
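
A small sketch of that sliding-window split (the array names and NumPy slicing are my own illustration, not the paper's code): a (40, 45, 3) input is cut into 33 overlapping (8, 45, 3) patches with a stride of 1 along the filter-bank axis.

import numpy as np

def extract_patches(features, window_height=8, stride=1):
    """features: array of shape (40, 45, 3) -> list of (8, 45, 3) patches."""
    n_banks = features.shape[0]
    return [features[i:i + window_height]
            for i in range(0, n_banks - window_height + 1, stride)]

patches = extract_patches(np.zeros((40, 45, 3)))
print(len(patches), patches[0].shape)   # 33 (8, 45, 3)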

The paper doesn’t state it explicitly, but I guess the reason they do phoneme recognition on multiple frames is to have some left and right context, so only the middle frame is predicted/trained on. In my case the first 7 frames are the left context window, the middle frame is the one being trained on, and the last 7 frames are the right context window. So, given multiple frames, only one phoneme is recognised: the one in the middle.

My neural network currently looks like this:

import keras
from keras import metrics
from keras.models import Model
from keras.layers import Input, Conv2D, MaxPooling2D, Dense, Reshape
from keras.callbacks import ReduceLROnPlateau, EarlyStopping

# train_generator, test_generator and batch_size are defined elsewhere

def model3():

    stride = 1
    dim = 40                                         # number of mel filter banks
    window_height = 8                                # filter-bank rows per input patch
    splits = ((dim - window_height) + 1) // stride   # = 33 overlapping patches
    total_frames = 15                                # frames per example
    total_frames_with_deltas = 45                    # 15 frames x [static, delta, delta_delta]

    # warm up the generators once before building the model
    next(test_generator())
    next(train_generator(batch_size))

    kernel_number = 200  # int(math.ceil(splits))

    # one input per filter-bank patch of shape (8, 45, 3)
    list_of_input = [Input(shape=(window_height, total_frames_with_deltas, 3)) for i in range(splits)]
    list_of_conv_output = []
    list_of_conv_output_2 = []
    list_of_conv_output_3 = []
    list_of_conv_output_4 = []
    list_of_conv_output_5 = []
    list_of_max_out = []

    # each patch gets its own stack of convolutions followed by max pooling
    # (earlier attempts used kernel_size=(15, 6) and (window_height - 1, 3) in the first layer)
    for i in range(splits):
        list_of_conv_output.append(Conv2D(filters=kernel_number, kernel_size=(window_height, 3), activation='relu')(list_of_input[i]))
        list_of_conv_output_2.append(Conv2D(filters=kernel_number, kernel_size=(1, 5))(list_of_conv_output[i]))
        list_of_conv_output_3.append(Conv2D(filters=kernel_number, kernel_size=(1, 7))(list_of_conv_output_2[i]))
        list_of_conv_output_4.append(Conv2D(filters=kernel_number, kernel_size=(1, 11))(list_of_conv_output_3[i]))
        list_of_conv_output_5.append(Conv2D(filters=kernel_number, kernel_size=(1, 13))(list_of_conv_output_4[i]))
        list_of_max_out.append(MaxPooling2D(pool_size=(1, 11))(list_of_conv_output_5[i]))

    # concatenate the per-patch feature maps and flatten them into a single row
    merge = keras.layers.concatenate(list_of_max_out)
    print(merge.shape)
    reshape = Reshape((total_frames // total_frames, -1))(merge)   # (1, features)

    dense1 = Dense(units=1000, activation='relu', name="dense_1")(reshape)
    dense2 = Dense(units=1000, activation='relu', name="dense_2")(dense1)
    dense3 = Dense(units=145, activation='softmax', name="dense_3")(dense2)

    model = Model(inputs=list_of_input, outputs=dense3)
    model.compile(loss="categorical_crossentropy", optimizer="SGD", metrics=[metrics.categorical_accuracy])

    reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=3, verbose=1, mode='auto', epsilon=0.001, cooldown=0)
    stop = EarlyStopping(monitor='val_loss', min_delta=0, patience=5, verbose=1, mode='auto')

    print(model.summary())

    raw_input("okay?")  # Python 2; use input() on Python 3
    hist_current = model.fit_generator(train_generator(batch_size),
                        steps_per_epoch=10,
                        epochs=10000,
                        verbose=1,
                        validation_data=test_generator(),
                        validation_steps=1,
                        callbacks=[reduce_lr, stop])
                        # pickle_safe=True,
                        # workers=4

    return model, hist_current

So, now comes the issue.

I have been training the network and the highest validation accuracy I have been able to reach is 0.17, while the training accuracy after many epochs ends up at 1.0.
(Plot is currently being made)

I am not sure why I am not getting better results. Why is the error rate so high? I am using the TIMIT dataset, which the others also use, so why am I getting worse results?

Sorry for the long post – I hope the extra information about my design decisions is useful, and that seeing how I understood the paper versus how I applied it helps pinpoint where my mistake is.


Get this bounty!!!

#StackBounty: #code-golf #decision-problem #classification #audio Lossy or Lossless?

Bounty: 200

Given an audio file, determine whether it is encoded in a lossy format or a lossless format. For the purposes of this challenge, only the following formats need to be classified:

Rules

  • If input is taken in the form of a filename, no assumptions should be made about the filename (e.g. the extension is not guaranteed to be correct for the format, or even present).
  • There will be no ID3 or APEv2 metadata present in input files.
  • Any two unique and distinguishable outputs may be used, such as 0 and 1, lossy and lossless, foo and bar, etc.

Test Cases

The test cases for this challenge consist of a zip file located here, which contains two directories: lossy and lossless. Each directory contains several audio files that are all 0.5-second 440 Hz sine waves, encoded in various formats. All of the audio files have extensions matching the formats above, with the exception of A440.m4a (which is AAC audio in an MPEG Layer 4 container).


Get this bounty!!!

#StackBounty: #classification #random-forest #unbalanced-classes #discriminant-analysis multiclass classification when x variable is no…

Bounty: 100

I have a dataset as below. It is a classification problem with multiple x variables and a y variable; the y variable has 5 levels. The picture below has the y variable on the x-axis and one x variable on the y-axis. As you can see, for each value of y the distribution of that x variable is not very different (I plotted a similar chart for every x variable, one at a time against y, and see the same trend). Because of this, most of the observations are getting classified as class 0 or 4. I built a random forest with 500 trees. How could I improve accuracy? I thought of taking the square or cube of each x so that the distance between class means would increase, but that won’t help with the variance of each class.

The distribution of the 5 classes in my data is as below:

   0    1    2    3    4 
5104 2639 2322 2661 5274 

[plot: distribution of one x variable for each level of y]

Column means and standard deviations of my x variables are very similar 🙁


Get this bounty!!!

#StackBounty: #r #hypothesis-testing #classification #experiment-design #double-blind Design a double-blind expert test when there is i…

Bounty: 50

I would like to submit to an expert group a set of images from two genetically different types of plants to see if they can find a difference. I have 20 images for each of the two types. The images are of cells and are so specific that only a small group of 10 experts is able to see the potential differences (without knowing which type is which).

To have a powerful test, I chose to work with a two-out-of-five test: I show the experts five images, of which 3 are from type 1 and 2 are from type 2, and they have to split them into two groups. I chose this test because of its power: the probability of grouping the images correctly by chance is only 1/10. Because the number of experts is so small, using this test instead of a triangle test or a simple two-image test is, in my view, more powerful.

The variability of images within a type is quite high. If there is a difference between the two types, it will be small within this set of 2 × 20 images. Hence, if there is a difference, I would like to know which images are the most different. I cannot show more than ~50 sets of 5 images to my experts, as they do not have a lot of time to perform the analysis. The problem is that there are about 400,000 possible sets of 5 images (3 type 1 – 2 type 2).

I think that choosing my 50 sets randomly from the 400,000 for each expert is not really representative. It would be more interesting to choose the sets so that every image has been compared with every other image, so that I can say whether they are similar or different (which can be inferred from the 2-out-of-5 grouping). With random selection I cannot be sure that I have tested all images against all the others (at least indirectly) with only 10 experts. Thus, I decided to create a sample of 50 sets, different for each expert, that maximises the number of pairs compared either directly or indirectly. However, with this sampling some pairs of images are compared more often than others, which is why I forced my selection so that each pair is compared at least 3 times, directly or indirectly.

This seems like quite a complicated sampling procedure, but to me it is the only one that ensures I can properly compare all my images in only a few sets per expert, knowing that once I have shown them to my experts I will not get another chance to find another expert group, and they will not be able to take more time to do it again.
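
For illustration, here is a rough Python sketch of one way to build such a design greedily (the question mentions R, but the idea translates; the set count and 3-comparisons target come from the question, while the candidate subsampling and the greedy heuristic are my own assumptions, not the procedure actually used):

import itertools
import random

type1 = ["t1_%02d" % i for i in range(20)]
type2 = ["t2_%02d" % i for i in range(20)]

# candidate sets: 3 type-1 images + 2 type-2 images
candidates = [frozenset(a + b)
              for a in itertools.combinations(type1, 3)
              for b in itertools.combinations(type2, 2)]
candidates = random.sample(candidates, 5000)   # subsample to keep the greedy search fast

pair_counts = {frozenset(p): 0 for p in itertools.combinations(type1 + type2, 2)}

def gain(s):
    # number of image pairs in this set still below the 3-comparisons target
    return sum(1 for p in itertools.combinations(s, 2) if pair_counts[frozenset(p)] < 3)

chosen = []
for _ in range(50):                            # 50 sets for one expert
    best = max(candidates, key=gain)
    chosen.append(best)
    for p in itertools.combinations(best, 2):
        pair_counts[frozenset(p)] += 1

# Repeating the loop for each of the 10 experts while carrying pair_counts over
# pushes every pair towards the "compared at least 3 times" target.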

To summarize, I want to know:

  1. Are my two types different ?
  2. Which cell images are more different than the others ?

Do you think I am going about this in an overly complicated way? Is random selection a better solution even though I can only test 50 sets × 10 experts out of the 400,000 possible sets? How can I be sure not to bias my set selection procedure?

By the way, I work with R, but I don’t think that is really important for this question.


Get this bounty!!!