#StackBounty: #machine-learning #distributions #optimization #exponential #tensorflow Batch loss of objective function contains exp bec…

Bounty: 100

I am trying to solve a survival analysis problem where all data are either left-censored or right-censored. I use an objective function which contains the CDF of the Gumbel distribution.

I have $m$ features and $m+1$ coefficients which need to be learned. The scale of the distribution, $\lambda$, can be represented by a linear regression. Since the scale must be a positive number, I use softplus. (I think an exp transformation could too easily grow unbounded.)

$\lambda = \mathrm{softplus}\left(\theta_0 + \sum_{j=1}^{m} \theta_j x_j\right) = \ln\left[1 + \exp\left(\theta_0 + \sum_{j=1}^{m} \theta_j x_j\right)\right]$

The scale is fed into a Gumbel distribution.

$h(t) = e^{-e^{-(t-\mu)/\lambda}}$, where the location $\mu$ is pre-specified.

$h(t)$ is the probability that the patient is dead before time $t$, i.e., left-censoring. $1 - h(t)$ is the probability that the patient is dead after $t$, i.e., right-censoring.

In the ground truth, the binary target $y^{(i)}$ indicates whether the patient is left-censored. Since the model outputs how likely the patient is to be left-censored at $t$, I use log-loss to measure the loss of the model.

I use Tensorflow to implement the model:

    input_vectors = tf.placeholder(tf.float32,
                                  shape=[None, num_features],
                                  name='input_vectors')

    time = tf.placeholder(tf.float32, shape=[None], name='time')
    event = tf.placeholder(tf.int32, shape=[None], name='event')

    weights = tf.Variable(tf.truncated_normal(shape=(num_features, 1), mean=0.0, stddev=0.02))
    scale = tf.nn.softplus(self.regression(input_vectors, weights))
    ''' 
    if event == 0, right-censoring
    if event == 1, left-censoring 
    '''
    not_survival_proba = self.distribution.left_censoring(time, scale)  # the left area
    logloss = tf.losses.log_loss(labels=event, predictions=not_survival_proba)

The implementation of the Gumbel distribution:

    class GumbelDistribution:
        def __init__(self, location=1.0):
            self.location = location

        def left_censoring(self, time, scale):
            # as written, this evaluates exp(-exp(time - location) / scale)
            return tf.exp(-1 * tf.exp(time - self.location) / scale)

        def right_censoring(self, time, scale):
            return 1 - self.left_censoring(time, scale)

However, the batch loss becomes NaN after several iterations. After I change the distribution to Weibull, it works. So I guess the problem is the two $\exp$s in the CDF of the Gumbel distribution.

    Epoch 1 - Batch 1/99693: batch loss = 16.3606
    Epoch 1 - Batch 2/99693: batch loss = 25.5445
    Epoch 1 - Batch 3/99693: batch loss = 17.1181
    Epoch 1 - Batch 4/99693: batch loss = 10.6815
    Epoch 1 - Batch 5/99693: batch loss = 17.2127
    Epoch 1 - Batch 6/99693: batch loss = 28.7549
    Epoch 1 - Batch 7/99693: batch loss = 13.8332
    Epoch 1 - Batch 8/99693: batch loss = 19.3377
    Epoch 1 - Batch 9/99693: batch loss = 19.7385
    Epoch 1 - Batch 10/99693: batch loss = 17.7479
    Epoch 1 - Batch 11/99693: batch loss = 13.1403
    Epoch 1 - Batch 12/99693: batch loss = 15.0979
    Epoch 1 - Batch 13/99693: batch loss = 17.5434
    Epoch 1 - Batch 14/99693: batch loss = 21.5072
    Epoch 1 - Batch 15/99693: batch loss = 10.4660
    Epoch 1 - Batch 16/99693: batch loss = 26.9554
    Epoch 1 - Batch 17/99693: batch loss = nan
    Epoch 1 - Batch 18/99693: batch loss = nan
    Epoch 1 - Batch 19/99693: batch loss = nan
    Epoch 1 - Batch 20/99693: batch loss = nan
    Epoch 1 - Batch 21/99693: batch loss = nan
    Epoch 1 - Batch 22/99693: batch loss = nan
    Epoch 1 - Batch 23/99693: batch loss = nan
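
For reference, a minimal sketch of one thing to try: clamping the inner exponent so that the nested exp cannot overflow to inf in float32. It follows $h(t)$ as written above rather than the posted code, and the method name and clipping bounds are made up:

    def left_censoring_clipped(self, time, scale):
        # h(t) = exp(-exp(-(t - mu) / lambda)), with the inner exponent clamped
        # so that tf.exp stays finite in float32; the bounds are arbitrary
        z = -(time - self.location) / scale
        z = tf.clip_by_value(z, -50.0, 50.0)
        return tf.exp(-tf.exp(z))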

Any idea how to solve this problem?


Get this bounty!!!

#StackBounty: #machine-learning #decision-trees #categorical-data Catboost Categorical Features Handling Options (CTR settings)?

Bounty: 50

I am working with a dataset with a large number of categorical features (>80%) predicting a continuous target variable (i.e. regression). I have been reading quite a bit about ways to handle categorical features, and learned that the one-hot encoding I have been using in the past is a really bad idea, especially when it comes to lots of categorical features with many levels (read this post, and this).

I have come across methods like target-based encoding (smoothing) of categorical features, often based on the mean of the target values for each feature, for example this post/kernel on Kaggle. Still, I was struggling to find a more concrete way until I found CatBoost, an open-source gradient boosting library on decision trees released last year by the Yandex group. It seems to offer extra statistical counting options for categorical features that are likely much more efficient than simple one-hot encoding or smoothing.

The problem is that the documentation is not helpful, and I have not figured out how to set the CTR settings. I have tried more than 10 different ways to make it work, but it does not accept the way I pass the CTR settings as simple_ctr (see here, under the CTR settings section):

    ['CtrType[:TargetBorderCount=BorderCount][:TargetBorderType=BorderType][:CtrBorderCount=Count][:CtrBorderType=Type][:Prior=num_1/denum_1]..[:Prior=num_N/denum_N]',
     'CtrType[:TargetBorderCount=BorderCount][:TargetBorderType=BorderType][:CtrBorderCount=Count][:CtrBorderType=Type][:Prior=num_1/denum_1]..[:Prior=num_N/denum_N]',
     ...]
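
For illustration, a minimal sketch of passing such descriptor strings through the Python API; the specific CtrType and values below are hypothetical examples of the format above, and the exact behaviour may depend on the CatBoost version:

    from catboost import CatBoostRegressor

    # simple_ctr as a list of descriptor strings following the documented format;
    # 'Borders', the border count and the priors are illustrative values only
    model = CatBoostRegressor(
        iterations=500,
        simple_ctr=['Borders:TargetBorderCount=5:Prior=0/1:Prior=1/1'],
    )

    # cat_feature_indices: column indices of the categorical features (assumed to exist)
    # model.fit(X_train, y_train, cat_features=cat_feature_indices)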

Thanks for your time.


Get this bounty!!!

#StackBounty: #machine-learning #python #neural-network #classification #supervised-learning Recognising made up terms

Bounty: 50

Say I have a tagging system for electrical circuits:

    Name          Description
    BT104         Battery. Power source
    SW104         Circuit switch
    LBLB-F104     Fluorescent light bulb
    LBLB104       Light bulb
    ...           ...

I have hundreds of tags created by people who should have followed my naming conventions, but sometimes they make mistakes or add unnecessary extra characters to tag names (e.g. BTwq104, etc.).

Up until now I have used regular expressions, built up over time whilst observing the various inconsistencies that users introduce when naming different parts of their circuits, to parse the names and tell me what the different elements are. For example, the name 'BT104' would tell me it's a battery on circuit 104.
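
For illustration, a minimal sketch of this kind of regex parsing; the pattern and the prefix-to-description table are invented for the example rather than being my actual rules:

    import re

    # split a tag into a letter prefix and a trailing circuit number
    TAG_PATTERN = re.compile(r'^(?P<prefix>[A-Za-z-]+)(?P<circuit>\d+)$')
    PREFIXES = {'BT': 'Battery. Power source', 'SW': 'Circuit switch'}

    def parse_tag(tag):
        match = TAG_PATTERN.match(tag)
        if match is None:
            return None  # tag does not follow the naming convention
        prefix, circuit = match.group('prefix'), match.group('circuit')
        return PREFIXES.get(prefix, 'unknown'), circuit

    print(parse_tag('BT104'))  # ('Battery. Power source', '104')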

I would like to investigate or use a machine learning technique to identify what a tag name is (the same way I used regexes). Any suggestions and approaches are welcome.

So far I have tried named-entity recognition and the suggested "bag of words" technique, following a few tutorials here and here (the latter being the most useful for learning). None of them produced the wanted results, if any. I think that "bag of words" is mostly used for real words rather than made-up words.
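
As an illustration of that difference, a minimal sketch of treating each tag as a string of characters (character n-grams) rather than as dictionary words; the tags, labels and the expectation in the final comment are all invented for the sketch:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # made-up training tags and labels, purely for illustration
    tags = ['BT104', 'BTwq104', 'SW104', 'SW09', 'LBLB104', 'LBLB-F104']
    labels = ['battery', 'battery', 'switch', 'switch', 'bulb', 'bulb']

    # character 2-4-grams instead of whole-word tokens
    model = make_pipeline(
        TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4)),
        LogisticRegression(),
    )
    model.fit(tags, labels)
    print(model.predict(['BTxx77']))  # hoped to lean towards 'battery'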

Thanks


Get this bounty!!!

#StackBounty: #machine-learning #theory #vc-theory #pac-learning Intuition behind Occam's Learner Algorithm using VC-Dimension

Bounty: 50

So I'm learning about Occam's Learning algorithm and PAC-Learning, where for a given hypothesis space $H$, if we want to have a model/hypothesis $h$ that has a true error of $\mathrm{error}_D \leq \epsilon$ with a probability of $(1-\delta)$ for a given probability $\delta$, we need to train it on $m$ examples, with $m$ being defined as:

$$ m > \frac{1}{2\epsilon^2}\left[\log(|H|)+\log\left(\frac{1}{\delta}\right)\right]$$

Now, I'm looking for some way to explain the terms of the equation in very simple terms, to gain some intuition into why this equation is the way it is, without resorting to a complicated mathematical proof. And so I was introduced to the concept of VC-dimensionality, which is, as I understand it, a measure of how complicated the example space $X$ can be while still being linearly separable within $H$.

Substituting $\log(|H|)$ for $VC(H)$ (and adding a few constants), I found the following equation:

$$ m > \frac{1}{\epsilon}\left[8\,VC(H)\log\left(\frac{13}{\epsilon}\right)+4\log\frac{2}{\delta}\right]$$

Which then made sense to me because, basically, if $|H|$ is the size of $H$, then to shatter $m$ examples we must have that $|H| \geq 2^m$, or in other words $\log(|H|) \geq VC(H)$.

So far I’ve just been restating things I’ve learned online in my own words.

Now, if you will allow me to try to very hand-wavingly reprove the first equation above, here is my restatement of Occam’s Learning Algorithm in simple terms:

In order to have an algorithm that performs perfectly over hypothesis space $H$, we must train it on at least every possible partition of $H$, which is defined as: $2^m$ where $2^m > |H|$

Question 1: Are all my assumptions so far correct?

Question 2: Does this mean my intuition is correct?

If so, then the rest is just algebraic manipulation, where we first add some probability that $|H|$ is not expressive enough by some constant, which we call $\delta$:

$$ 2^m > \frac{|H|}{\delta} $$

Isolate $m$ and split up the $\log$ terms using logarithmic identities:

$$ m > \log(|H|) + \log\left(\frac{1}{\delta}\right) $$
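
Spelled out, assuming the logarithm is taken base 2 to match the $2^m$ on the left-hand side, this step is:

$$ 2^m > \frac{|H|}{\delta} \;\Rightarrow\; m > \log_2\left(\frac{|H|}{\delta}\right) = \log_2(|H|) + \log_2\left(\frac{1}{\delta}\right) $$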

Adding some margin of error $\epsilon$ to the size of $H$, squaring it, and adding a few constants, we get:

$$ m > \frac{1}{2\epsilon^2}\left[\log(|H|) + \log\left(\frac{1}{\delta}\right)\right] $$

Question 3: Did I just understand this in simple terms or am I way off?


Get this bounty!!!
