#StackBounty: #machine-learning #classification #data-mining Predicting column use in a table

Bounty: 50

I have a set of tables $\mathcal{T} = \{T_1, \ldots, T_n\}$, where each $T_i$ is a collection of named columns $\{c_0, \ldots, c_{j_i}\}$. In addition, I have a large sequence of observations $\mathcal{D}$ of the form $(T_i, c_k)$, indicating that, given access to table $T_i$, a user decided to use column $c_k$ for a particular task (the task itself is not relevant to the problem formulation). Given a new table $T_j \notin \mathcal{T}$, I’d like to rank the columns of $T_j$ by the likelihood that a user would pick each column for the same task.

My first intuition was to expand each observation $(T_i, c_k) \in \mathcal{D}$ into $\{(c_k, \text{True})\} \cup \{(c_j, \text{False}) \mid c_j \in T_i \land j \neq k\}$, view this as a classification problem, and use the predicted probability of the positive class as my ranking metric. My concern is that this ignores the relationships between columns within a given table.
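The expansion above can be sketched in a few lines (a minimal illustration; the table/column representation and any feature extraction are left abstract):

```python
# Expand each observation (table, chosen_column) into one labeled example
# per column: the chosen column gets True, every other column False.

def expand_observations(observations, tables):
    """observations: list of (table_id, chosen_column);
    tables: dict mapping table_id -> list of column names."""
    examples = []
    for table_id, chosen in observations:
        for column in tables[table_id]:
            examples.append((table_id, column, column == chosen))
    return examples

tables = {"T1": ["age", "name", "salary"]}
observations = [("T1", "salary")]
print(expand_observations(observations, tables))
# [('T1', 'age', False), ('T1', 'name', False), ('T1', 'salary', True)]
```

Each `(table, column)` pair would then be featurized and fed to a binary classifier; the ranking for a new table comes from sorting its columns by the predicted positive-class probability.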

I also thought there might be a reasonable way to summarize $T_i$, call it $\phi$, and then restate the problem as $(\phi(T_i), f(c_k))$, where $f$ is some function over the column.

I suspect this is a problem that people have tackled before, but I cannot seem to find good information. Any suggestions would be greatly appreciated.


Here’s an idea I’ve been tossing around, and I was hoping to get input from more knowledgeable people. Let’s assume users pick $c_j \in T_i$ as a function of how “interesting” that column is. We can estimate the distribution that generated $c_j$; call it $\hat{X}_j$. If we assume a normal distribution is “uninteresting”, then define $\text{interest}(c_j) = \delta(\hat{X}_j, \text{Normal})$, where $\delta$ is some distance metric (e.g. the Bhattacharyya distance, https://en.wikipedia.org/wiki/Bhattacharyya_distance). The interest level of a table is $\text{interest}(T_i) = \text{op}(\{\text{interest}(c_j) \mid c_j \in T_i\})$, where $\text{op}$ is an aggregator (e.g. the average). Now I expand the original observations $(T_i, c_k) \in \mathcal{D}$ into triplets $(\text{interest}(T_i), \text{interest}(c_j), [c_j = c_k])$ and treat these as a classification problem. Thoughts?
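A quick way to prototype this interest score, using the Kolmogorov–Smirnov statistic as a stand-in for $\delta$ (the Bhattacharyya distance would require a density estimate; this is a sketch, not the definitive choice of metric):

```python
import numpy as np
from scipy import stats

def interest(column_values):
    """Distance of the column's empirical distribution from a fitted normal.
    KS statistic used as a stand-in for the delta metric."""
    mu, sigma = np.mean(column_values), np.std(column_values)
    if sigma == 0:
        return 0.0
    return stats.kstest(column_values, "norm", args=(mu, sigma)).statistic

def table_interest(columns, op=np.mean):
    """Aggregate per-column interest into a table-level score."""
    return op([interest(c) for c in columns])

rng = np.random.default_rng(0)
normal_col = rng.normal(size=500)                     # "uninteresting"
bimodal_col = np.concatenate([rng.normal(-3, 0.5, 250),
                              rng.normal(3, 0.5, 250)])  # "interesting"
print(interest(normal_col) < interest(bimodal_col))  # True
```

A clearly non-normal column (e.g. bimodal) scores higher than one that is approximately normal, which matches the intended notion of “interesting”.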

Get this bounty!!!

#StackBounty: #classification #unbalanced-classes #theory #generative-models Is generating training data for a classifier mathematicall…

Bounty: 50

I have an application with a class imbalance problem: A lot of positive data points, but very few (comparatively) negative points.

One colleague recommended that I train a generative model on the limited subset of the negative data I have, and then generate data from that model to use for training my classifier.

I have the feeling that this shouldn’t work: after all, the generative model only has access to the small sample of negative data, not the true population.

Is there a way to make this claim formal? I was thinking of something along the lines of computational learning theory or the no-free-lunch theorem.
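The intuition can at least be illustrated empirically (a sketch, not a formal argument): a generative model fit to a small negative sample can only reproduce that sample’s statistics, not the population’s. Here a kernel density estimate plays the role of the generative model:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(42)
population = rng.normal(loc=10.0, scale=3.0, size=100_000)  # "true" negatives
small_sample = rng.choice(population, size=15, replace=False)

# Fit a KDE to the tiny sample and "augment" by sampling from it.
kde = gaussian_kde(small_sample)
synthetic = kde.resample(10_000, seed=0).ravel()

# The synthetic data clusters around the sample mean, not the population mean:
print(population.mean(), small_sample.mean(), synthetic.mean())
```

However many synthetic points are drawn, their distribution is anchored to the 15 observed points, so the classifier gains no information beyond the original sample; it only sees a smoothed copy of it.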


#StackBounty: #r #machine-learning #classification #caret comparing caret models with mean or median?

Bounty: 50

I am using caret to evaluate the classification performance of several models on a small dataset (190 observations) with two classes and just a handful of features.

When I inspect the train() object for one of the models, I get what look to be the mean metric values (ROC, Sens, and Spec):

Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 171, 171, 171, 171, 171, 171, ... 
Resampling results across tuning parameters:

  nIter  method         ROC        Sens       Spec
   50    Adaboost.M1    0.8866667  0.9866667  0.58
   50    Real adaboost  0.5566667  0.9844444  0.50
  100    Adaboost.M1    0.8844444  0.9877778  0.58
  100    Real adaboost  0.5738889  0.9833333  0.52
  150    Adaboost.M1    0.8800000  0.9877778  0.60
  150    Real adaboost  0.5994444  0.9833333  0.52

When I use the resamples() function with all of the models in a list, I get the means again, but also the median values (other model results omitted for clarity):

Number of resamples: 50 

ROC:
            Min. 1st Qu. Median   Mean 3rd Qu. Max. NA's
ADABOOST 0.25000  0.8958 0.9444 0.8867       1    1    0

Sens:
           Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
ADABOOST 0.8889  1.0000 1.0000 0.9867  1.0000 1.0000    0

Spec:
         Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
ADABOOST    0       0      1 0.58       1    1    0

The bwplot() function appears to display the median values as the point estimates.

[bwplot output]

It seems like the train() output wants me to evaluate the models based on the means, while bwplot() focuses on the medians. My first thought was that the median would be the better metric given such spread.
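The choice matters here because the resample distribution is left-skewed. A quick illustration with synthetic numbers shaped like the ROC summary above (these are not the actual resamples):

```python
import numpy as np

# 50 synthetic resampled ROC values: mostly high, with a few poor folds,
# mimicking the skew in the summary (Min 0.25, Median ~0.94).
rng = np.random.default_rng(1)
roc = np.clip(rng.normal(0.94, 0.05, 50), 0, 1)
roc[:3] = [0.25, 0.40, 0.55]  # a handful of bad folds drag the mean down

print(np.mean(roc), np.median(roc))
# The median ignores the outlying folds; the mean is pulled toward them.
```

With a skewed resample distribution like this, the median summarizes the “typical” fold while the mean reflects the occasional catastrophic fold; which is more appropriate depends on whether those bad folds are noise or a real failure mode.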

Which would you use, and why?


#StackBounty: #machine-learning #classification Can The linearly non-separable data be learned using polynomial logistic regression?

Bounty: 50

I know that polynomial logistic regression can easily learn typical data like that in the following image:
first image

I was wondering whether the following two datasets can also be learned using polynomial logistic regression:

second image

third image
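As a general sanity check of what polynomial logistic regression can learn (the pictured datasets aren’t reproducible here, so this uses scikit-learn’s concentric-circles generator as a stand-in): a class boundary is learnable exactly when some polynomial of the chosen degree separates the classes, e.g. degree 2 handles concentric rings.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Two concentric rings: not linearly separable in (x1, x2),
# but separable after adding squared and cross terms.
X, y = make_circles(n_samples=500, noise=0.05, factor=0.4, random_state=0)

model = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression())
model.fit(X, y)
print(model.score(X, y))  # near-perfect training accuracy
```

If the pictured datasets require a more convoluted boundary, the same pipeline applies with a higher degree, at the cost of more features and a greater risk of overfitting.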


#StackBounty: #r #classification #cart #rpart How can I use estimated probabilities of a class from rpart to identify the top N classes?

Bounty: 50

Using the rpart library, I’m trying to predict which class each observation belongs to. Here is a reproducible example explaining the steps I am taking:


library(rpart)

# training set
df_train <- data.frame(
  tag = c('123', '123', '124', '124', '125'),
  p1 = c('home', 'work', 'work', 'work', 'home'),
  p2 = c(1, 1, 1, 0, 1)
)

# testing set
df_test <- data.frame(
  tag = c('123', '124', '125'),
  p1 = c('home', 'work', 'home'),
  p2 = c(1, 1, 0)
)

# train model
model.rpart = rpart(tag~p1+p2, data=df_train, method="class")

# predict probabilities of class
pred.rpart = predict(model.rpart, data=df_test, method="prob")

# list out results

My problem is that I don’t fully understand the output in pred.rpart:

> pred.rpart
  123 124 125
1 0.4 0.4 0.2
2 0.4 0.4 0.2
3 0.4 0.4 0.2
4 0.4 0.4 0.2
5 0.4 0.4 0.2

I thought it was giving me the class probabilities for my test dataset, but I don’t understand why there are five rows when my test set has only three observations.

Why does pred.rpart contain five rows of data?

My overall objective is to find the top N predicted classes for each observation. So for the first observation in my df_test data frame, I would like to be able to say:

Top 2 predictions for the first observation:
  #1: '123': 40%
  #2: '124': 40% 

Once I understand the output of pred.rpart, I want to summarize it with the following command, which gives each class prediction ordered by probability:

n_classes <- 2
apply(pred.rpart, 1, function(xx) head(names(sort(xx, decreasing = TRUE)), n_classes))


#StackBounty: #machine-learning #classification #clustering What algorithms are available to cluster sequences of data?

Bounty: 50

I have a data set containing points through time, generated by multiple Markov processes (each time step contains N points). I know the statistical nature of the processes (the same for all), but my task is to determine which points belong together (i.e. come from the same process). Are there established algorithms for this type of problem? My more general problem has missing data and an unknown number of processes, but I’d also be interested in approaches to the “easy” version, where no points are missing and N is known.
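For the “easy” version (N known, no missing points, known dynamics), one standard baseline is to link points frame-to-frame by maximum transition likelihood, which reduces to an assignment problem. A sketch assuming Gaussian random-walk dynamics, where maximizing likelihood means minimizing squared displacement:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def link_frames(prev_points, next_points):
    """Assign each point at time t+1 to a track at time t by maximizing the
    (Gaussian random-walk) transition likelihood, i.e. minimizing squared
    displacement. Returns next_points reordered to match the track order."""
    # cost[i, j] = squared distance from track i's last point to candidate j
    diff = prev_points[:, None, :] - next_points[None, :, :]
    cost = np.sum(diff ** 2, axis=-1)
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one matching
    return next_points[cols]

# Two well-separated random walks: linking recovers the pairing even when
# the next frame's points arrive in scrambled order.
prev_pts = np.array([[0.0, 0.0], [10.0, 10.0]])
next_pts = np.array([[10.2, 9.9], [0.1, -0.2]])  # scrambled
print(link_frames(prev_pts, next_pts))
```

Chaining this over all time steps gives greedy track assembly; for other known dynamics, the cost matrix would be the negative log transition likelihood instead of squared distance. The harder variants (missing points, unknown number of processes) are the domain of multi-target tracking methods.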


#StackBounty: #image-processing #classification #keras #conv-neural-network #pixels How to apply CNN for multi-channel pixel data based…

Bounty: 50

I have an image with 8 channels. I have a conventional algorithm that applies a weight to each of these channels to produce an output of ‘0’ or ‘1’. This works well across many samples and complex scenarios. I would like to implement the same thing in machine learning using a CNN.

I am new to ML and started looking at tutorials, which seem to deal exclusively with image-processing problems: handwriting recognition, feature extraction, etc.



I have set up Keras with Theano as the backend. Basic Keras samples run without problems.

What steps do I need to follow to achieve the same result with a CNN? I do not understand how filters, kernels, and strides apply to my use case. How do I provide training data to Keras if the pixel channel values and outputs have the following form?

Pixel#1: f(C1, C2, ..., C8) = 1
Pixel#2: f(C1, C2, ..., C8) = 1
Pixel#3: f(C1, C2, ..., C8) = 0
...
Pixel#N: f(C1, C2, ..., C8) = 1
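If the decision really is made per pixel from its 8 channel values, the training data is simply an (N, 8) feature matrix and an (N,) label vector, and the conventional weighted sum is a logistic unit. A minimal numpy sketch (the weights here are illustrative, not the real algorithm’s):

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels = 1000

# X: one row per pixel, one column per channel C1..C8.
X = rng.normal(size=(n_pixels, 8))

# Hypothetical "conventional algorithm": weighted sum of channels, thresholded.
true_weights = np.array([0.5, -1.0, 2.0, 0.0, 0.3, -0.7, 1.5, 0.1])
y = (X @ true_weights > 0).astype(int)  # label 0 or 1 per pixel

print(X.shape, y.shape)  # (1000, 8) (1000,)
```

This `(X, y)` pair is exactly the shape that `model.fit(X, y)` expects in Keras. Since each pixel is classified independently of its neighbours, a small dense network (or equivalently a 1×1 convolution over the image) is the natural model here; spatial filters, larger kernels, and strides only become relevant if neighbouring pixels should influence the decision.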


#StackBounty: #classification #accuracy #precision-recall #multi-class Summarising Precision/Recall Measures in Multi-class Problem

Bounty: 50

I have a hierarchical multi-class classification system, that classifies records into about 500 different categories. I want to summarise the performance of the classifier in a simple way.

A measure of accuracy on validation data is easy to implement: correctly coded/all coded. For each class, we can look at binary measures of precision and recall to summarise the performance relative to that class.

However, there doesn’t seem to be a generally accepted way to combine the per-class binary precisions and recalls into a single summary across the entire set of classes. There appear to be a few ways to approach this:

  1. Take a simple average (arithmetic/geometric/harmonic) of each class’s precision/recall.

  2. Take a weighted average (weighted by number of examples, etc) of each class’s precision/recall.

  3. Use bookmaker’s informedness/markedness, which seem to have a natural generalisation to the multiclass context.
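Options 1 and 2 are straightforward to compute from the confusion matrix; a small sketch with three classes for brevity (the 500-class case is identical):

```python
import numpy as np

# Rows = true class, columns = predicted class.
cm = np.array([[50,  5,  5],
               [10, 20,  0],
               [ 0,  5,  5]])

support = cm.sum(axis=1)                  # true examples per class
precision = np.diag(cm) / cm.sum(axis=0)  # per-class precision
recall = np.diag(cm) / support            # per-class recall

macro_recall = recall.mean()                           # option 1 (arithmetic)
weighted_recall = np.average(recall, weights=support)  # option 2
print(macro_recall, weighted_recall)
```

The gap between the two summaries is itself informative: macro averaging weights every class equally, so rare classes with poor recall drag it down, while support weighting lets the frequent classes dominate.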

Are there advantages to using one of these approaches particularly? Is there a generally accepted way to do this that I’ve just been missing?


#StackBounty: #classification #scikit #apache-spark #preprocessing #sentiment-analysis Extracting individual emails from an email thread

Bounty: 50

Most open-source datasets are well formatted, i.e. each email message is cleanly separated, as in the Enron email dataset. Out in the real world, however, it is very difficult to separate the top message from the rest of an email thread.

For example consider the below message.


Can you offer me a better discount.

Customer Relations.

---- On Wed, 10 May 2017 04:05:16 -0700 someone@somewhere.com wrote ------

Hello Mr.X,

Does the below work out. Do let us know your thoughts.


Sales Manager.

The reason we want to split the emails is that we want to do sentiment analysis; if we fail to split them, the results will be wrong.

I searched around and found this very comprehensive research paper, as well as an implementation by Mailgun called talon. Unfortunately, it does not work well for certain kinds of patterns.

For example, the second message in the thread may break like this:

---------- Forwarded message ---------- 

instead of the above

---- On Wed, 10 May 2017 04:05:16 -0700 someone@somewhere.com wrote ------

My question: many people trying to do this kind of thing must have faced these problems, yet the area remains pretty murky. Is there any solid implementation of the paper, or something else, that splits emails reliably?
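A pragmatic stopgap, while no library covers every client, is to maintain a list of separator regexes and cut at the earliest match. A sketch covering the two patterns quoted above (the list would grow as new mail-client formats are encountered):

```python
import re

# Known reply/forward separator patterns; extend as new formats show up.
SEPARATORS = [
    re.compile(r"^-+ On .+ wrote -+\s*$", re.MULTILINE),
    re.compile(r"^-+ Forwarded message -+\s*$", re.MULTILINE),
]

def top_message(email_body):
    """Return the text above the earliest separator match (or the whole body)."""
    cut = len(email_body)
    for pattern in SEPARATORS:
        match = pattern.search(email_body)
        if match:
            cut = min(cut, match.start())
    return email_body[:cut].strip()

body = ("Can you offer me a better discount.\n\n"
        "---- On Wed, 10 May 2017 04:05:16 -0700 someone@somewhere.com wrote ------\n"
        "Hello Mr.X,\n")
print(top_message(body))  # "Can you offer me a better discount."
```

This is essentially what talon does internally with a larger pattern set; the fundamental weakness (you only catch patterns you have already seen) is exactly why the paper resorts to learned classifiers over line features instead of pure regexes.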


#StackBounty: #classification #overfitting #hyperparameter #bagging #xgboost XGBoost feature subsampling

Bounty: 50

I have a dataset with ~30k samples and 35 features (after feature selection; these seem to be the most important features for this dataset and they have low correlation between each other).

After a grid search with 10-fold CV over the hyperparameters, to my surprise I get the lowest validation error when colsample_bytree is set so that only 1 feature is sampled for each tree! (Edit: actually, with 2 features sampled per tree it works slightly better, but performance keeps getting worse as I increase the number of features sampled per tree.) Each tree has depth 3 and I am building 2000 trees. That is, for each tree a single feature is randomly selected, and xgboost then fits the residuals using only that feature.

That seems very unusual. How should I interpret it? Do I start to overfit as soon as features interact within a tree? But then I would expect trees of depth 1 with no feature subsampling to perform just as well, yet they don’t. In fact, in the grid search nearly all models with such extreme feature subsampling did better than models without feature subsampling.

Edit: is it possible that I have some features that fit well to the training set but generalize very poorly, and such individual feature sampling helps to avoid those features dominating the model? I am struggling to see what else this could mean.

Edit 2: I tried removing individual features; performance did not improve, which suggests the hypothesis from the previous edit is unlikely. On the other hand, I found that optimal performance actually comes from sampling 2 features per tree. At least the features now interact, but I am still not sure how to explain the gain in performance.
