#StackBounty: #machine-learning #python #neural-network #classification #supervised-learning Recognising made up terms

Bounty: 50

Say I have a tagging system for electrical circuits:


Name          Description
BT104         Battery. Power source
SW104         Circuit switch
LBLB-F104     Fluorescent light bulb
LBLB104       Light bulb
...           ...

I have hundreds of tags created by people who should have followed my naming conventions, but they sometimes make mistakes and add unnecessary extra characters to the tag names (e.g. BTwq104).

Up until now I have used regular expressions, built up over time as I observed the various inconsistencies users introduce when naming different parts of their circuits, to parse the names and tell me what the different elements are. For example, the name ‘BT104‘ would tell me it’s a battery on circuit 104.
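
For illustration, a rough sketch of this kind of regex-based parsing; the prefix table and the pattern here are simplified stand-ins based on the examples above, not my real rule set:

     import re

     # simplified prefix-to-component table based on the examples above
     PREFIXES = {"LBLB-F": "fluorescent light bulb", "LBLB": "light bulb",
                 "BT": "battery", "SW": "circuit switch"}

     TAG_RE = re.compile(r"^(?P<prefix>[A-Z-]+?)(?P<circuit>\d+)$")

     def parse_tag(tag):
         m = TAG_RE.match(tag)
         if not m or m.group("prefix") not in PREFIXES:
             return None  # e.g. 'BTwq104' does not follow the convention
         return PREFIXES[m.group("prefix")], int(m.group("circuit"))

     # parse_tag("BT104") -> ("battery", 104)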

I would like to investigate or use a machine learning technique to identify what a tag name is (same way I used reg ex’s). Any suggestions and approaches are welcome.

So far I have tried named-entity recognition and the suggested “bag of words“ technique. I followed a few tutorials here and here (the latter being the most useful for learning), but none of them produced the desired results, if any. I think “bag of words“ is mostly intended for real words rather than made-up terms.

Thanks


Get this bounty!!!

#StackBounty: #time-series #classification #normalization Normalization and train/dev/test split for multi-dimensional time series

Bounty: 50

We have multi-dimensional sensor data (30 different measurements) measured every ten seconds on over 500 assets. These assets were put into production at different times in the past five years. Some of these assets have failed in the past. (For our purposes, a failed asset does not make it back into the dataset after it gets fixed.) The assets are not all similar, so their sensor readings can be in different ranges. Our goal is to estimate the probability of asset failure in the next 30 days using machine/deep learning.

  • For data normalization, is it reasonable to normalize each asset’s data independently? For example, in the case of standard scaling, say we use the first 3 months of data from each asset (all our assets are functional for at least the first 3 months) to compute the mean and standard deviation, and then use those to normalize all the data points for that asset, doing the same for each asset independently. The idea is that a single model can then work well for all the assets. (A rough sketch of this per-asset scaling and a time-based split is given after this list.)

  • How can we do train/dev/test set splits?
    • Should we do a time-based split, where the training data comes from the
      earliest period (say a year), the dev set is a few months of data starting
      at least 30 days after the last training data point, and the test data
      starts at least 30 days after the last dev set data point?
    • Or can we just split by assets? This has the concern of potentially using
      “future data” to predict the past, but can we think of assets as
      independent?
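
A rough sketch of the per-asset scaling and the time-based split described above (pandas, with hypothetical asset_id / timestamp column names and illustrative split boundaries; not meant as a final recipe):

     import pandas as pd

     def normalize_per_asset(df, feature_cols, warmup="90D"):
         # standard-scale each asset's sensors using only its first ~3 months
         out = []
         for asset_id, g in df.groupby("asset_id"):
             g = g.sort_values("timestamp").copy()
             warm = g[g["timestamp"] <= g["timestamp"].min() + pd.Timedelta(warmup)]
             mu, sigma = warm[feature_cols].mean(), warm[feature_cols].std()
             g[feature_cols] = (g[feature_cols] - mu) / sigma
             out.append(g)
         return pd.concat(out)

     def time_split(df, t_train_end, t_dev_end, gap="30D"):
         # train on the earliest period, then leave a 30-day gap before the
         # dev period and another 30-day gap before the test period
         gap = pd.Timedelta(gap)
         train = df[df["timestamp"] < t_train_end]
         dev = df[(df["timestamp"] >= t_train_end + gap) & (df["timestamp"] < t_dev_end)]
         test = df[df["timestamp"] >= t_dev_end + gap]
         return train, dev, test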


Get this bounty!!!

#StackBounty: #hypothesis-testing #classification #confidence-interval How to measure confidence in classifier of non-independent data?

Bounty: 100

I have some noisy high dimensional data, and each data point has a “score”. The scores are roughly normally distributed. Some scores are known and some are unknown; I want to separate the unknown points into two groups, based on whether I think the score is positive or not.

I have a black box which, given some data points and their scores, gives me a hyperplane correctly separating the points (if one exists).

I separate the points with known score into two disjoint sets for training and validation respectively.

Then, repeatedly (say k times), I do the following:

  • Randomly select m data points with positive score and n points with negative score from the training set (for some fixed positive values of m and n).
  • Use the black box to (try to) get a separating hyperplane for these sampled points.
  • If I get a hyperplane back, save it.

Now I have some hyperplanes (say I have 0 < k’ <= k of them).

I use these hyperplanes to separate the validation set. I select the hyperplane which correctly classifies the most points as having positive or negative score (number of correct positives + number of correct negatives).

My question is: How can I measure the statistical confidence that the finally selected hyperplane is better than random?

Here’s what I’ve done so far:

Say there are n points in the validation set. If a hyperplane correctly classifies a point with probability p, and this is independent for all the points, we can use a binomial distribution.

Let F be the cdf of the binomial distribution. Let X be the number of correctly classified points in the validation set (so we are assuming X ~ B(n, p)). Then P(X <= x) = F(x).

Now, we have k’ hyperplanes. Let’s assume these can be represented as k’ IID variables X1, X2, …, Xk’.

Now P(max(X1, X2, …, Xk’) <= x) = F(x) ^ k’.

Let’s say a random hyperplane is one as above where p equals the proportion of positive scores in the total (so if it’s three quarters positive, p = 0.75).

Sticking some numbers in: let p = 0.5 for simplicity, and suppose I want to check whether the selected hyperplane is better than random with probability > 0.95.

If n = 2000, I need to classify 1080 points correctly to have confidence greater than 0.95 that this classifier is better than random (I think, unless I did the calculation wrong).
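
For reference, a minimal sketch of that threshold calculation with scipy; the exact number depends on k’ (how many hyperplanes came back from the black box), which isn’t fixed above:

     from scipy.stats import binom

     def min_correct_for_confidence(n, p, kprime, conf=0.95):
         # smallest x with F(x)^kprime >= conf, i.e. F(x) >= conf ** (1 / kprime),
         # where F is the cdf of Binomial(n, p)
         return int(binom.ppf(conf ** (1.0 / kprime), n, p))

     # e.g. min_correct_for_confidence(2000, 0.5, kprime=1) gives about 1037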

However, if the points themselves are not independent, this doesn’t work. Suppose many of the points are identical so the effective size of the set is much smaller than n. If n = 20, you need to get 18 correct for 0.95 confidence; extrapolating that suggests you’d need 1800/2000.

I am sure that the points are not independent, but I’m not sure in what way, or how to go about measuring that and accounting for it in a calculation similar to the above.


Get this bounty!!!

#StackBounty: #machine-learning #classification #scikit-learn #algorithms Algorithm to move events of MC from one class to other to mat…

Bounty: 50

I have MC (Monte Carlo) and data, each with events in two classes, 0 and 1. I am trying to write an algorithm to match the number of events in classes 0 and 1 of MC to the data, i.e. I want to correct the MC events by moving them from one class to the other so that the ratio of events in the two classes is the same for data and MC. I proceeded as follows:

  1. Train a GradientBoostingClassifier from the scikit-learn ensemble module for data and MC individually (say data_clf and mc_clf):
     from sklearn.ensemble import GradientBoostingClassifier
     mc_clf = GradientBoostingClassifier().fit(X_mc, Y_mc)
     data_clf = GradientBoostingClassifier().fit(X_data, Y_data)
    

where Y_mc and Y_data are the corresponding class labels “mc_class” and “data_class”, with values 0 or 1 depending on which class each event belongs to.

  2. Now, with X_mc as my input variable, use predict_proba to get the class probabilities from both the data and the MC classifier, using the MC inputs ONLY, i.e.
     y_mc = mc_clf.predict_proba(X_mc)
     y_data = data_clf.predict_proba(X_mc)
    
  3. After this, I try to move MC events from one class to the other by comparing their probabilities under the data and MC classifiers.
     for i in range(len(mc)):
         if mc.loc[i, 'mc_class'] == 0:
             # move a class-0 MC event to class 1 if the data classifier assigns it
             # a lower probability of class 0 than the MC classifier does
             wgt = y_data[i][0] / y_mc[i][0]
             if wgt < 1: mc.loc[i, 'mc_class_corrected'] = 1
             else: mc.loc[i, 'mc_class_corrected'] = mc.loc[i, 'mc_class']

         if mc.loc[i, 'mc_class'] == 1:
             wgt = y_data[i][1] / y_mc[i][1]
             if wgt < 1: mc.loc[i, 'mc_class_corrected'] = 0
             else: mc.loc[i, 'mc_class_corrected'] = mc.loc[i, 'mc_class']
    

In the end, suppose that initially I had proportionally more events in class 0 than in class 1 in MC compared to data, so I expect events to move from class 0 to class 1. However, I see that almost all (>95%) of my class-0 MC events move to class 1, while I was expecting only about 30% to move (based on comparing the numbers of events in data and MC).
Is there any mistake in this way of working?
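
For what it’s worth, the ~30% expectation above comes from simple ratio arithmetic, roughly like this (assuming pandas DataFrames mc and data with the class columns named as above):

     # fraction of events in class 0 for MC and for data
     f_mc = (mc['mc_class'] == 0).mean()
     f_data = (data['data_class'] == 0).mean()

     # when f_mc > f_data, matching the data ratio means moving roughly this
     # fraction of the class-0 MC events over to class 1
     expected_move_fraction = (f_mc - f_data) / f_mc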

Thanks a lot:)


Get this bounty!!!

#StackBounty: #regression #classification #discriminant-analysis Looking for good, recent examples of Discriminant Analysis (Linear, Qu…

Bounty: 50

We have looked at LDA/QDA several times during my stats master’s coursework, but I’m not convinced that this is due to the usefulness of the techniques rather than my school being stuck with a 20-year-old curriculum.

This answer by Frank Harrell suggests that these techniques aren’t very useful nowadays, and even the textbook we’re using indicates that LDA/QDA is only expected to perform significantly better than logistic regression when the assumptions are met, which seems fairly disqualifying for most purposes.

To make my title request more specific, I’m looking for at least two different examples (i.e. not the same type of problem, preferably different disciplines) that are

  • Good: clearly the best tool to solve the problem at hand, not just used in place of logistic regression because of researcher preference

  • Recent: published in the last six years

  • “… or else”: need not be textbook LDA/QDA. It’s okay if the technique is an extension of the above models, but it should clearly follow the same reasoning, i.e. be based on a decision rule derived from distributional assumptions on the predictors conditional on the outcome of interest

Alternatively, it would also be acceptable if someone can provide proof that any formulation of a discriminant-type model can be re-expressed as a regression problem (e.g. linear regression produces results equivalent to LDA).


Get this bounty!!!

#StackBounty: #hypothesis-testing #classification #confidence-interval How to measure confidence in classifier chosen from several avai…

Bounty: 100

I have some noisy high dimensional data, and each data point has a “score”. The scores are roughly normally distributed. Some are known and some are unknown; I want to separate the unknown points into two groups, based on whether I think the score is positive or not.

I have a black box which, given some data points and their scores, gives me a hyperplane correctly separating the points (if one exists).

I separate the points with known score into two disjoint sets for training and validation respectively.

Then, repeatedly (say k times), I do the following:

  • Randomly select m data points with positive score and n points with negative score from the training set (for some fixed positive values for m and n).
  • Use the black box to (try to) get a separating hyperplane for these sampled points.
  • If I get a hyperplane back, save it.

Now I have some hyperplanes (say I have 0 < k’ <= k of them).

I use these hyperplanes to separate the validation set. I rank them either by the average score of the validation data points categorised as positive, or by recall (positive validation points categorised as positive / total positive validation points)[1]. Then I select the top-ranking hyperplane and use it to label my data with unknown scores.
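
For concreteness, a rough sketch of the two ranking criteria; here I assume each hyperplane is represented as a weight vector and offset (w, b), with w·x + b > 0 meaning “categorised as positive” (that representation is an assumption, not stated above):

     import numpy as np

     def rank_hyperplanes(hyperplanes, X_val, scores_val):
         positives = scores_val > 0
         ranked = []
         for w, b in hyperplanes:
             pred_pos = X_val @ w + b > 0
             # criterion 1: average score of points categorised as positive
             avg_score = scores_val[pred_pos].mean() if pred_pos.any() else -np.inf
             # criterion 2: recall on the positive validation points
             recall = (pred_pos & positives).sum() / positives.sum()
             ranked.append((avg_score, recall))
         return ranked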

My question is: How can I measure the statistical confidence that the finally selected hyperplane is better than random?

I have a vague idea of how to test the significance of a single classifier (using a t-test maybe?) but I am not sure how this is affected by the classifier being the “best” of several.

[1]: I’m not sure if the choice of ranking scheme between these two makes a difference to the confidence calculation. I haven’t decided which ranking method to use, so I mentioned both as possibilities.


Get this bounty!!!

#StackBounty: #classification #unbalanced-classes #data-cleaning #active-learning Using ML to assist human labelling in dataset with hi…

Bounty: 100

Is there anything wrong with using ML to assist human labeling in a scientific setting?

I’ve got a 3-class dataset where only 1 in 500 elements belongs to the 2 classes of interest, and a simple NN could likely be used to filter out most irrelevant elements, bringing the ratio down to around 1 in 100 and increasing the effectiveness of human annotators’ time by 50x. The dataset will be used to train, test and validate a classifier.

However I can foresee reasons why this could cause an issue:

  • If the annotated data is unrepresentative because of bias in the ML used before human annotation, the classifier might struggle to generalise
  • Using an ML data cleaner, which isn’t based on human-supplied, justifiable rules, puts a black box at the beginning of the data analysis process
  • Only annotating a small proportion of the highly prevalent class makes the dataset very selective; would this invite criticism about misuse of this bias (i.e. manipulation toward a desired hypothesis)?

All thoughts appreciated


Get this bounty!!!

#StackBounty: #machine-learning #classification #binary-data #accuracy Difference between cumulative gains chart and cumulative accurac…

Bounty: 50

I am confused about the following:

Here I find the definition of cumulative gains chart as the plot of

  • x-axis: the rate of predicted positive
  • y-axis: true positive rate

It turns out that we can e.g. order by the probability of a positive outcome and plot the corresponding rate of true positives.
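
A rough sketch of that construction (order by predicted probability of a positive outcome, then plot the cumulative fraction of true positives against the fraction of points considered), assuming numpy arrays y_true and y_score:

     import numpy as np

     def cumulative_gains(y_true, y_score):
         # order points by predicted probability of the positive class, descending
         order = np.argsort(y_score)[::-1]
         y_sorted = np.asarray(y_true)[order]
         # x: fraction of points considered so far; y: fraction of all positives captured
         x = np.arange(1, len(y_sorted) + 1) / len(y_sorted)
         y = np.cumsum(y_sorted) / y_sorted.sum()
         return x, y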

On the other hand, there is the cumulative accuracy profile (CAP), explained here, where it is said to be constructed as follows:

The CAP of a model represents the cumulative number of positive
outcomes along the y-axis versus the corresponding cumulative number
of a classifying parameter along the x-axis.

Another source for the CAP is “Measuring the Discriminative Power of Rating Systems” by Engelmann, Hayden and Tasche.
Judging also from the plots, I would say that the two concepts are the same. Is this true?


Get this bounty!!!

#StackBounty: #probability #classification #ensemble #feature-weighting How to describe most important features of ensemble model as li…

Bounty: 100

I have created 3 different models whose outputs are class probabilities in a binary classification problem. The models are a bit different, each drawing importance from different features. I of course have a single data matrix as the source for this exercise, with 70% of the data used as the training sample.

How can one summarize the importance of the different feature values for the final class probability estimate if, besides this class probability estimate, only the data matrix and the list of features used are known?

Individual models can of course be explained by different methods, but how can one explain the averaged ensemble predictions?


Get this bounty!!!

#StackBounty: #classification #neural-networks #dataset #overfitting Higher overfitting using data augmentation with noise?

Bounty: 50

I am training a neural network for Audio classification.

I trained it on the UrbanSound8K dataset, and then I wanted to evaluate how different levels of added noise to the inputs influenced prediction accuracy.

As expected, higher levels of noise resulted in lower accuracy.

Then, I decided to perform data augmentation with noise. So I took the dataset, and I duplicated it with the same files but adding pink noise (+0 dB SNR) to them.
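
For context, this is roughly what adding noise at a target SNR looks like; the 1/f spectral shaping is just one simple way to generate pink noise, shown here only as an illustrative sketch:

     import numpy as np

     def add_pink_noise(signal, snr_db=0.0, rng=None):
         rng = np.random.default_rng() if rng is None else rng
         # shape white noise into pink (1/f power) noise in the frequency domain
         spectrum = np.fft.rfft(rng.standard_normal(len(signal)))
         freqs = np.fft.rfftfreq(len(signal))
         spectrum[1:] /= np.sqrt(freqs[1:])
         pink = np.fft.irfft(spectrum, n=len(signal))
         # scale the noise so that signal power / noise power matches snr_db
         gain = np.sqrt(np.mean(signal ** 2) / (np.mean(pink ** 2) * 10 ** (snr_db / 10)))
         return signal + gain * pink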

As expected (by me), the overall accuracy increased (a very tiny bit though, 0.5%), and the network became more robust to noise corruption of the inputs.

However! One thing that I was not expecting was that the network now has reduced accuracy when predicting on the inputs that are not corrupted with noise (the validation inputs). Somehow, it has overfitted to the clean inputs, thus reducing prediction accuracy on these audios.

Is there any explanation or intuition into this result?

I was expecting that the network, now having more and more varied training data, would learn more meaningful features. I guess it is more difficult to overfit to the noisy inputs, but I still don’t understand why it has mainly overfitted to the clean inputs.


Get this bounty!!!