## #StackBounty: #algorithms #optimization #machine-learning What is the difference in SMO algorithm for SVM and SMO for one class?

### Bounty: 100

Please let me know if this is not the correct forum for this question. If not, can anyone tell me where I should ask it?

I am trying to understand the difference between two papers, the first of which is:
https://pdfs.semanticscholar.org/59ee/e096b49d66f39891eb88a6c84cc89acba12d.pdf

The first paper is about SMO for SVM and the second paper is about SMO for the one-class case. The question is: what are the differences between these two algorithms? And what do the two papers have in common?

The second paper does not show the details of the algorithm, so I am not sure what minimal change to the first algorithm yields the second.

Get this bounty!!!

## #StackBounty: #machine-learning #deep-learning #variational-bayes #generative-models Is the optimization of the Gaussian VAE well-posed?

### Bounty: 50

In a Variational Autoencoder (VAE), given some data $$x$$ and latent variables $$t$$ with prior distribution $$p(t) = \mathcal{N}(t \mid 0, I)$$, the encoder aims to learn a distribution $$q_{\phi}(t)$$ that approximates the true posterior $$p(t|x)$$, and the decoder aims to learn a distribution $$p_{\theta}(x|t)$$ that approximates the true underlying distribution $$p^*(x|t)$$.

These models are then trained jointly to maximize an objective $$L(\phi, \theta)$$, which is a lower bound for the log-likelihood of the training set:

$$L(\phi, \theta) = \sum_i \mathbb{E}_{q_{\phi}} \log \frac{p_{\theta}(x_i|t)\,p(t)}{q_{\phi}(t)} \leq \sum_i \log \int p_{\theta}(x_i|t)\,p(t)\, dt$$

According to section C.2 in the original paper by Kingma and Welling (https://arxiv.org/pdf/1312.6114.pdf), when we model $$p_{\theta}(x|t)$$ as a family of Gaussians, the decoder should output both the mean $$\mu(t)$$ and the (diagonal) covariance $$\sigma^2(t) I$$ of the Gaussian distribution.

My question is: isn’t this optimization problem ill-posed (just like maximum-likelihood training in GMMs)? Having an output for the variance (or log-variance, as is most common), if the decoder can produce a perfect reconstruction for a single image in the training set (i.e. $$\mu(t_i)=x_i$$), then it can set the corresponding variance $$\sigma^2(t_i)$$ arbitrarily close to zero, and the likelihood goes to infinity regardless of what happens with the remaining training examples.
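The blow-up is easy to check numerically; a small sketch of my own (not from the paper):

```python
import numpy as np

def gaussian_logpdf(x, mu, sigma2):
    """Log-density of N(x | mu, sigma^2) for scalar inputs."""
    return -0.5 * np.log(2 * np.pi * sigma2) - 0.5 * (x - mu) ** 2 / sigma2

# at a perfectly reconstructed point (x = mu) only the normalizer remains,
# and -0.5 * log(2*pi*sigma^2) grows without bound as sigma^2 -> 0
for sigma2 in (1.0, 1e-2, 1e-6, 1e-12):
    print(sigma2, gaussian_logpdf(0.0, 0.0, sigma2))
```

The log-likelihood term for that one example therefore diverges, while the finite penalty from the remaining examples cannot compensate.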

I know that most Gaussian VAE implementations have a simplified decoder that outputs only the mean, replacing the term $$\mathbb{E}_{q_{\phi}} \log p_{\theta}(x_i|t)$$ by the squared error between the original image and the reconstruction (which is equivalent to fixing the covariance to the identity matrix). Is this because of the ill-posedness of the original formulation?

Get this bounty!!!

## #StackBounty: #machine-learning #mathematical-statistics #causality How does a causal tree optimize for heterogenous treatment effects?

### Bounty: 50

I have a very specific question regarding how the causal tree in the causal forest/generalized random forest optimizes for heterogeneity in treatment effects.

This question comes from the Athey & Imbens (2016) paper “Recursive partitioning for heterogeneous causal effects” from PNAS. Another paper is Wager & Athey (2018), “Estimation and inference of heterogeneous treatment effects using random forests” in JASA (arxiv.org link here). I know that the answer to my question is in those papers, but I, unfortunately, can’t parse some of the equations to extract it. I know I understand an algorithm well when I can express it in words, so it has been irking me that I can’t do so here.

In my understanding, an honest causal tree is generally constructed as follows.

Given a dataset with an outcome $$Y$$, covariates $$X$$, and a randomized condition $$W$$ that takes on the value of 0 for control and 1 for treatment:

1. Split the data into subsample $$I$$ and subsample $$J$$

2. Train a decision tree on subsample $$I$$ predicting $$Y$$ from $$X$$, with the requirement that each terminal node has at least $$k$$ observations from each condition in subsample $$J$$

3. Apply the decision tree constructed on subsample $$I$$ to subsample $$J$$

4. At each terminal node, get the mean of predictions for the $$W$$ = 1 cases from subsample $$J$$ and subtract the mean of predictions for the $$W$$ = 0 cases from subsample $$J$$; the resulting difference is the estimated treatment effect
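The four steps above can be roughly sketched as follows. This is my reading of the procedure with a synthetic dataset, and an off-the-shelf regression tree stands in for the paper's custom criterion (so the splits here are ordinary CART splits, not the modified MSE splits Athey & Imbens propose):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
W = rng.integers(0, 2, size=n)                 # randomized condition
tau = np.where(X[:, 0] > 0, 2.0, 0.0)          # true effect, heterogeneous in X[:, 0]
Y = X[:, 1] + W * tau + rng.normal(size=n)

# 1. split the data into subsample I (tree building) and subsample J (estimation)
idx = rng.permutation(n)
I, J = idx[: n // 2], idx[n // 2:]

# 2. train a decision tree on subsample I only
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=100).fit(X[I], Y[I])

# 3. drop subsample J down the tree
leaves = tree.apply(X[J])

# 4. per leaf: mean outcome of the W=1 cases minus mean outcome of the W=0 cases,
#    computed on subsample J only ("honest" estimation)
effects = {}
for leaf in np.unique(leaves):
    m = leaves == leaf
    yj, wj = Y[J][m], W[J][m]
    effects[leaf] = yj[wj == 1].mean() - yj[wj == 0].mean()
```

Out-of-sample cases are then assigned the `effects` value of the leaf they land in via `tree.apply`.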

Any future, out-of-sample cases (such as those scored after deploying the model) will be dropped down the tree and assigned the predicted treatment effect of the node in which they are placed.

This is called “honest,” because the actual training and estimation are done on completely different data. Athey and colleagues have a nice asymptotic theory showing that you can derive variance estimates for these treatment effects, which is part of the motivation behind making them “honest.”

This is then applied to a causal random forest by using bagging or bootstrapping.

Now, Athey & Imbens (2016) note that this procedure uses a modified mean squared error criterion for splitting, which rewards “a partition for finding strong heterogeneity in treatment effects and penalize a partition that creates variance in leaf estimates” (p. 7357).

My question is: Can you explain how this is the case, using words?

In the previous two sections before this quotation, Modifying Conventional CART for Treatment Effects and Modifying the Honest Approach, the authors use the Rubin causal model/potential outcomes framework to derive an estimation for the treatment effect.

They note that we are not trying to predict $$Y$$—like in most machine learning cases—but the difference between the expectation of $$Y$$ in two conditions, given some covariates $$X$$. In line with the potential outcomes framework, this is “infeasible”: We can only measure the outcome of someone in one of the two conditions.

In a series of equations, they show how we can use a modified splitting criterion that predicts the treatment effect. They say: “…the treatment effect analog is infeasible, but we can use an unbiased estimate of it, which leads to…” (p. 7357) and they show the equation for it using observed data. As someone who has a background in social science and applied statistics, I can’t connect the dots between what they have set up and how we can estimate it from the data.

Any help at explaining how this criterion maximizes the variance in treatment effects (i.e., the heterogeneity of causal effects) OR any correction on my description of how to build a causal tree that might be leading to my confusion would be greatly appreciated.

Get this bounty!!!

## #StackBounty: #machine-learning #semi-supervised Why does using pseudo-labeling non-trivially affect the results?

### Bounty: 50

I’ve been looking into semi-supervised learning methods, and have come across the concept of “pseudo-labeling”.

As I understand it, with pseudo-labeling you have a set of labeled data as well as a set of unlabeled data. You first train a model on only the labeled data. You then use that initial data to classify (attach provisional labels to) the unlabeled data. You then feed both the labeled and unlabeled data back into your model training, (re-)fitting to both the known labels and the predicted labels. (Iterate this process, re-labeling with the updated model.)
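The loop described above can be sketched in a few lines; a minimal self-training sketch (the two-cluster toy data, cluster centers, and number of rounds are my own invented setup, not from any particular paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
centers = np.array([[-2.0, 0.0], [2.0, 0.0]])

# small labeled set and a larger unlabeled pool from the same two clusters
y_lab = np.array([0, 1] * 10)
X_lab = centers[y_lab] + rng.normal(size=(20, 2))
y_pool_true = np.array([0, 1] * 250)           # hidden from the learner
X_pool = centers[y_pool_true] + rng.normal(size=(500, 2))

# 1. train on the labeled data only
clf = LogisticRegression().fit(X_lab, y_lab)

# 2-3. pseudo-label the pool, refit on labeled + pseudo-labeled data, iterate
for _ in range(3):
    pseudo = clf.predict(X_pool)
    clf = LogisticRegression().fit(
        np.vstack([X_lab, X_pool]),
        np.concatenate([y_lab, pseudo]),
    )
```

Note that in this linear toy case the refits indeed tend to reinforce the initial boundary, which is exactly the intuition the question challenges.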

The claimed benefits are that you can use the information about the structure of the unlabeled data to improve the model. A variation of the following figure is often shown, “demonstrating” that the process can make a more complex decision boundary based on where the (unlabeled) data lies.

Image from Wikimedia Commons by Techerin CC BY-SA 3.0

However, I’m not quite buying that simplistic explanation. Naively, if the original labeled-only training result was the upper decision boundary, the pseudo-labels would be assigned based on that decision boundary. Which is to say that the left hand of the upper curve would be pseudo-labeled white and the right hand of the lower curve would be pseudo-labeled black. You wouldn’t get the nice curving decision boundary after retraining, as the new pseudo-labels would simply reinforce the current decision boundary.

Or to put it another way, the current labeled-only decision boundary would have perfect prediction accuracy for the unlabeled data (as that’s what we used to make them). There’s no driving force (no gradient) which would cause us to change the location of that decision boundary simply by adding in the pseudo-labeled data.

Am I correct in thinking that the explanation embodied by the diagram is lacking? Or is there something I’m missing? If not, what is the benefit of pseudo-labels, given the pre-retraining decision boundary has perfect accuracy over the pseudo-labels?

Get this bounty!!!

## #StackBounty: #machine-learning #mathematical-statistics Maximizing AUC based on point cloud distance

### Bounty: 50

Let $$V$$ be an $$n$$-dimensional space with a set of positive class vectors $$P$$ and a set of negative class vectors $$N$$. The task is to find a vector $$x$$ such that the AUC is maximized, based on the ranking generated by computing distances between $$x$$ and $$P, N$$. So, in a sense, $$x$$ is closer to $$P$$ than to $$N$$. It looks like this doesn’t have a unique solution, but I’m curious whether there is a really easy explicit solution, or a short algorithm? Or is this NP-hard?

Surely this is a well-known classical problem? One algorithm that I think works is to sample triplets $$(x,p,n)$$ with one positive and one negative vector and then minimize the usual triplet loss:

$$L(x,p,n) = \max(0, \|x-p\|^2 - \|x-n\|^2 + \epsilon),$$

which pushes $$x$$ closer to $$p$$ and further from $$n$$. I’m just hoping for something easier.
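A minimal sketch of this sampled-triplet idea with a plain subgradient step (the toy clouds, step size, and iteration count are invented for illustration; note that with squared distances the hinge subgradient is the constant $$2(n-p)$$ whenever the loss is active):

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.normal(loc=1.0, scale=1.0, size=(50, 5))    # positive cloud
N = rng.normal(loc=-1.0, scale=1.0, size=(50, 5))   # negative cloud

def triplet_subgrad(x, p, n, eps=0.1):
    # subgradient of max(0, |x-p|^2 - |x-n|^2 + eps) w.r.t. x
    if np.sum((x - p) ** 2) - np.sum((x - n) ** 2) + eps <= 0:
        return np.zeros_like(x)                      # hinge inactive
    return 2 * (n - p)                               # constant in x

x = np.zeros(5)
lr = 0.05
for _ in range(500):
    p = P[rng.integers(len(P))]
    n = N[rng.integers(len(N))]
    x -= lr * triplet_subgrad(x, p, n)

# AUC from the distance ranking: fraction of (p, n) pairs
# where the positive is ranked closer to x than the negative
d_pos = np.linalg.norm(P - x, axis=1)
d_neg = np.linalg.norm(N - x, axis=1)
auc = np.mean(d_pos[:, None] < d_neg[None, :])
```

The updates stop moving $$x$$ once the sampled margins are satisfied, so the iterate settles near the positive cloud.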

Get this bounty!!!

## #StackBounty: #python #machine-learning #keras #deep-learning #forecasting How to handle Shift in Forecasted value

### Bounty: 50

I implemented a forecasting model using an LSTM in Keras. The data points are separated by 15-minute intervals, and I am forecasting 12 future steps.

The model performs well on the problem, but there is a small issue with the forecasts: they show a small shift effect. For a clearer picture, see the attached figure below.

How can I handle this problem? How must the data be transformed to address this kind of issue?

The model I used is given below

init_lstm = RandomUniform(minval=-.05, maxval=.05)
init_dense_1 = RandomUniform(minval=-.03, maxval=.06)

model = Sequential()

history = model.fit(X, y, epochs=1000, batch_size=16, validation_data=(X_valid, y_valid), verbose=1, shuffle=False)

I made the forecasts like this

my_forecasts = model.predict(X_valid, batch_size=16)

The time series data is transformed into a supervised learning problem to feed the LSTM using this function:

from pandas import DataFrame, concat

# convert time series into supervised learning problem
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg

super_data = series_to_supervised(data, 12, 1)

My time series is multivariate; var2 is the one I need to forecast. I dropped the future var1 like this:

del super_data['var1(t)']

I separated the train and validation sets like this:

features = super_data[feat_names]
values = super_data[val_name]

n_test = 3444

train_feats, test_feats = features[0:-n_test], features[-n_test:]
train_vals, test_vals = values[0:-n_test], values[-n_test:]

X, y = train_feats.values, train_vals.values
X = X.reshape(X.shape[0], 1, X.shape[1])

X_valid, y_valid = test_feats.values, test_vals.values
X_valid = X_valid.reshape(X_valid.shape[0], 1, X_valid.shape[1])

I haven’t made the data stationary for this forecast. I also tried differencing and making the series as stationary as I could, but the issue remains.

I have also tried different scaling ranges for the min-max scaler, hoping it might help the model, but the forecasts only got worse.

Other Things I have tried

* Tried other optimizers
* Tried mse loss and custom log-mae loss functions
* Tried varying batch_size
* Tried adding more past timesteps

I understand that the model is replicating the last known value to it, thereby minimizing the loss.

The validation and training losses remain low enough throughout the training process. This makes me wonder whether I need to come up with a new loss function for this purpose.

Is that necessary? If so, what loss function should I go for?
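One way to test the "replicating the last known value" hypothesis is to compare the model's validation error against a naive persistence baseline that literally repeats the last observed value; if the two errors are close, the LSTM has effectively learned a shifted copy of the input. A sketch on a toy random walk (my own illustration, not the asker's data):

```python
import numpy as np

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=500))   # toy random walk

actual = series[1:]
persistence = series[:-1]                  # forecast = last observed value
mae_persistence = np.mean(np.abs(actual - persistence))
```

Compare your model's one-step validation MAE against `mae_persistence` computed on the same series; a model whose forecasts look like the series shifted by one step will score roughly the same as this baseline.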

Get this bounty!!!

## #StackBounty: #machine-learning #mean #accuracy #image-processing #computer-vision Difference between Mean/average accuracy and Overall…

### Bounty: 50

I got confused while reading the paper “Local Binary Pattern-Based Hyperspectral Image Classification With Superpixel Guidance”.

They mention that they repeated each experiment 10 times and calculated both the mean and standard deviation; after that, they also calculated the overall accuracy. In the results they report the mean and standard deviation of the accuracy for each class, and then the overall accuracy. What is the difference between the mean (average) accuracy and the overall accuracy? Shouldn’t they be the same?
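They generally are not the same when classes are imbalanced: overall accuracy is the fraction of all samples classified correctly, while mean (average) accuracy computes each class's accuracy separately and then averages those per-class numbers. A tiny invented example:

```python
import numpy as np

# 90 samples of class 0, 10 of class 1; half of class 1 is misclassified
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.array([0] * 90 + [1] * 5 + [0] * 5)

overall = np.mean(y_true == y_pred)                        # 95/100 = 0.95
per_class = [np.mean(y_pred[y_true == c] == c) for c in (0, 1)]
mean_acc = np.mean(per_class)                              # (1.0 + 0.5)/2 = 0.75
```

The overall accuracy is dominated by the large class, while the mean accuracy weights every class equally, which is why papers on imbalanced data often report both.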

Get this bounty!!!

## #StackBounty: #r #regression #machine-learning #predictive-models #r-squared Multiple Regression, good P-value, but Low R2

### Bounty: 50

I am trying to build a model in R to predict Conversion Rate (CR) based on age, gender, and interest (and also the campaign_Id):

The CR values look like this:

The correlation coefficients are not very promising:

rcorr(as.matrix(data.numeric))

correlations with CR:

xyz_campaign_id (-0.19), age (-0.1), gender(-0.04), interest(-0.03)

So, the model below:

library(caret)
set.seed(100)
TrainIndex <- sample(1:nrow(data), 0.8*nrow(data))
data.train <- data[TrainIndex,]
data.test <- data[-TrainIndex,]
nrow(data.test)
model <- lm(CR ~ age + gender + interest + xyz_campaign_id , data=data.train)

will not have a good adjusted r-squared (0.04):

Call:
lm(formula = CR ~ age + gender + interest + xyz_campaign_id,
data = data.train)

Residuals:
Min      1Q  Median      3Q     Max
-18.636 -11.858  -4.087   0.115  96.421

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)     47.231250   6.287738   7.512  1.4e-13 ***
age35-39         1.214713   1.916649   0.634  0.52639
age40-44        -1.971037   1.986316  -0.992  0.32131
age45-49        -3.064858   1.866713  -1.642  0.10097
genderM          3.709192   1.412311   2.626  0.00878 **
interest         0.030384   0.027617   1.100  0.27154
xyz_campaign_id -0.037856   0.006076  -6.231  7.1e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 21.16 on 907 degrees of freedom
Multiple R-squared:  0.05237,   Adjusted R-squared:  0.04611
F-statistic: 8.355 on 6 and 907 DF,  p-value: 7.81e-09

I also understand that I should probably convert “interest” from numeric to a factor (I have tried that too, although I used all 40 interest levels, which is not ideal).

So, based on the provided information, is there any way to improve the model? What other models should I try besides linear models to make sure I end up with a good predictive model?

If you need more information, the challenge is available Here. Data is Here

Get this bounty!!!

## #StackBounty: #machine-learning #python #scikit-learn #nlp Pass 2 different kinds of X training data to ML model simultaneously

### Bounty: 50

I’m trying to classify if a book is fiction/nonfiction based on title and summary.

This is 2 distinct types of information – is there a way to segment title and summary before feeding it to a model, rather than concatenating the information?

For example:

Title: "such a long journey"

Summary: "it is bombay in 1971, the year india went to..."

Label: "fiction" (where fiction =1)

Current procedure:

What I’ve been doing until now is concatenating the information, so the above becomes,

example = "such a long journey it is bombay in 1971, the year india went to..."
label = 1

Then the usual setup, something like,

X.append(example)
y.append(label)
...
X = lemmatize(X)
...
X_train, X_test, y_train, y_test = split_data(X,y)

vectorizer = TfidfVectorizer(...)
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

classifier.fit(X_train, y_train)
y_predict = classifier.predict(X_test)

But feeding the data concatenated feels intuitively wrong. Is there a better way to do this?

If for some reason it’s possible with a library other than sklearn (keras, tensorflow), I’d also be open to hearing about that.
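One common way to keep the fields separate (a sketch, not the only option) is to fit one vectorizer per field and horizontally stack the resulting sparse matrices, so title and summary get independent vocabularies and IDF weights; the titles/summaries below are invented filler:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from scipy.sparse import hstack

titles = ["such a long journey", "a brief history of time"]
summaries = ["it is bombay in 1971, the year india went to war",
             "a landmark volume in science writing"]
labels = [1, 0]   # 1 = fiction

# one vectorizer (and thus one vocabulary) per field
title_vec = TfidfVectorizer()
summary_vec = TfidfVectorizer()
X = hstack([title_vec.fit_transform(titles),
            summary_vec.fit_transform(summaries)])

clf = LogisticRegression().fit(X, labels)
```

At prediction time, transform each field with its own vectorizer and `hstack` again. The same idea is what sklearn's `ColumnTransformer` automates.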

## UPDATE

Going from,

X = [['two'],['two'],['four'],['two'],['four'],['four']]
y = ['human','human','dog','human','dog','dog']

to,

X = [['two','hello'],['two','hello'],['four','bark'],['two','hi'],['four','bark'],['four','woof']]
y = ['human','human','dog','human','dog','dog']

causes errors to be thrown.

'list' object has no attribute 'lower' if X is a list, and 'numpy.ndarray' object has no attribute 'lower' if X is an array.

The error is thrown when I call,

X_train = vectorizer.fit_transform(X_train)

Is it possible to pass in a vector of features?

Get this bounty!!!

## #StackBounty: #machine-learning #maximum-likelihood #interpretation #model-selection #gaussian-process GP: How to select a model for a …

### Bounty: 50

I have fitted a Gaussian Process (GP) to perform a binary classification task. The dataset is balanced, so I have an equal number of samples with 0/1 label for the training. The covariance function used is an RBF kernel, which needs the hyperparameter “length scale” to be tuned.

To be sure that I am not overfitting the data and that I am selecting proper kernel hyperparameters, I performed a grid search over the percentage of training data and the length scale, using the overall accuracy (OA) and the log-marginal likelihood (LML) on the test set as statistical metrics.

You can see the results in the following image (left for OA, right for LML):

EDIT: I re-uploaded the image with the normalized log-marginal likelihood. Common sense indicates that the optimal model should find a trade-off between model complexity and accuracy metrics. Thus, these models lie somewhere between 30%-40% of training data and 0.7-0.9 length scale of the RBF kernel within the GP. This is great for model selection, but unfortunately, I think I still cannot answer the questions below… Any new insights on the interpretation of the LML?

After exploring the effect of training size and the hyperparameter on the statistical metrics, I think it would be safe to select a model using at least 30% of the data for training and an RBF length scale of 0.1. However, I do not understand the role of the LML in selecting the model (or even whether it needs to be considered), but common sense suggests that it should be as small as possible (i.e. around -400, represented in yellow). This means my best model is located at training size = 10-20% and length_scale = 0.1.

I have seen that other people (here and here) have asked (somewhat) similar questions about the LML, but I can’t find ideas that help me understand the link between good OA metrics and the LML. In other words, I am having trouble interpreting the LML.

In concrete, I would like to get more insights on:

1. What is the impact of a high/low LML on the predictive power of the GP?
2. How much better is a model with LML=-400 compared to one with LML=-700?
3. What does it mean to have an LML of -400? Isn’t -400 a lot for a statistical metric?
4. Did I really find a solution to my problem with these LML metrics?
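For what it’s worth, the raw LML number can be reproduced outside a grid search; a minimal scikit-learn sketch with toy 1-D data and fixed length scales (not the asker's setup; note the LML is evaluated on the training data, and its magnitude grows with the number of training points, which is one reason comparing raw values like -400 and -700 across different training sizes is tricky):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(60, 1))
y = (X[:, 0] > 0).astype(int)        # balanced binary labels

# one LML value per candidate length scale; optimizer=None keeps the kernel fixed
lml = {}
for ls in (0.1, 0.5, 1.0):
    gp = GaussianProcessClassifier(kernel=RBF(length_scale=ls),
                                   optimizer=None).fit(X, y)
    lml[ls] = gp.log_marginal_likelihood()
```

Since the LML is the log of a probability of the observed labels, it is always negative, and higher (closer to zero) means the kernel explains the training labels better.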