#StackBounty: #machine-learning #deep-learning #data-mining #text-mining #rnn How to extract specific information from text using Machi…

Bounty: 100

Suppose I have a text like the one below, which usually has 2-3 sentences and 100-200 characters.

Johny bought milk of 50 dollars from walmart. Now he has left only 20 dollars.

I want to extract

Person Name: Johny

Spent: 50 dollars

Money left: 20 dollars.

Spent where: Walmart.

I have gone through a lot of material on recurrent neural networks. I watched the cs231n video on RNNs and understood next-character prediction. In that setting we have a fixed set of 26 characters to use as output classes, and we pick the next character by probability. But this problem seems entirely different, because we don’t know the output classes: the output depends on the words and numbers in the text, which can be any arbitrary word or number.

I read on Quora that convolutional neural networks can also extract features from text. Could that approach solve this particular problem as well?
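Not part of the original question, but to make the framing concrete: one common way to cast this is as named-entity recognition / sequence labeling, where the output classes are a fixed set of tag types (PERSON, MONEY, ORG, ...) rather than the words themselves. A minimal sketch with spaCy’s pretrained English pipeline (assuming it suits the domain; the model has to be downloaded separately with "python -m spacy download en_core_web_sm"):

import spacy

nlp = spacy.load("en_core_web_sm")   # pretrained pipeline with an NER component

text = "Johny bought milk of 50 dollars from walmart. Now he has left only 20 dollars."
doc = nlp(text)

for ent in doc.ents:
    print(ent.text, ent.label_)      # spans tagged e.g. PERSON, MONEY, ORG

Mapping the tagged spans to the specific slots above (Spent vs. Money left, Spent where) would still need rules or a separate relation/slot-filling step on top of the tagger.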


Get this bounty!!!

#StackBounty: #machine-learning #predictive-models #performance #sensitivity #specificity Assessing correlated predictions

Bounty: 50

Let’s assume we have a prediction algorithm (if it helps, imagine it’s using some boosted tree method) that makes daily predictions of whether some event will happen to a unit (e.g. a machine that might break down, a patient that might have a problematic medical event) during the next week. We can only get training and test data for a small number of units (e.g. 400 + 100 or so) across a limited period of time (e.g. half a year).

How would one assess the prediction performance (e.g. sensitivity/specificity/AUROC) of such an algorithm (e.g. some tree method) on test data in this setting? Presumably there is a potential issue in that the prediction intervals overlap, and even non-overlapping intervals for the same unit are somewhat correlated (i.e. if I can predict well for a particular unit because of its specific characteristics, I may do well on all time intervals for that unit, but that does not mean the algorithm generalizes well).

Perhaps I have just not hit on the right search terms, but I have failed to find anything published on this topic (surely someone has written about this before?!). Any pointers to any literature?

My initial thought was that naively calculated point estimates of sensitivity/specificity (i.e. just treating these as independent observations and predictions) might be fine, and that the real problem would be with the uncertainty around those estimates. If so, could one just bootstrap (drawing whole units with replacement), as in the sketch below, and get decent assessments that way?
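Not from the original question, just to make the proposal concrete: a minimal sketch of a unit-level (cluster) bootstrap, assuming a hypothetical data frame df with one row per unit-week prediction and columns "unit", "y_true" and "y_pred":

import numpy as np
import pandas as pd

def sens_spec(d):
    # sensitivity and specificity from binary truths/predictions
    tp = ((d.y_pred == 1) & (d.y_true == 1)).sum()
    fn = ((d.y_pred == 0) & (d.y_true == 1)).sum()
    tn = ((d.y_pred == 0) & (d.y_true == 0)).sum()
    fp = ((d.y_pred == 1) & (d.y_true == 0)).sum()
    return tp / (tp + fn), tn / (tn + fp)   # resamples with no events yield NaN sensitivity

def unit_bootstrap(df, n_boot=2000, seed=0):
    # resample whole units with replacement; a unit drawn twice contributes all its rows twice
    rng = np.random.default_rng(seed)
    units = df["unit"].unique()
    stats = []
    for _ in range(n_boot):
        sampled = rng.choice(units, size=len(units), replace=True)
        boot = pd.concat([df[df["unit"] == u] for u in sampled])
        stats.append(sens_spec(boot))
    # percentile confidence intervals for (sensitivity, specificity)
    return np.percentile(np.array(stats), [2.5, 97.5], axis=0)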


Get this bounty!!!

#StackBounty: #machine-learning #cross-validation #rms same cross-validation set for parameter tuning and RMSE calculations

Bounty: 50

I am missing some very basic distinction between the cross-validation used for parameter tuning and the cross-validation used for estimating the performance of my algorithms (RMSE).

I have two functions: one performs grid search and the other calculates cross-validated RMSE.

from sklearn.model_selection import GridSearchCV, KFold, cross_val_score, train_test_split
import numpy as np

def grid_search(clf, param_grid, x_train, y_train, kf):
    # tune hyperparameters with cross-validation on the supplied folds
    grid_model = GridSearchCV(estimator=clf,
                              param_grid=param_grid,
                              cv=kf, verbose=2)
    grid_model.fit(x_train, y_train)
    return grid_model

def rmse_cv(clf, x_train, y_train, kf):
    # cross-validated RMSE on the supplied folds
    rmses_cross = np.sqrt(-cross_val_score(clf, x_train, y_train,
                                           scoring="neg_mean_squared_error", cv=kf))
    return rmses_cross

The functions are called this way:

X_train, X_test, y_train, y_test = train_test_split(dataset, Y, test_size=0.2, random_state=26)
kf = KFold(10, shuffle=True, random_state=26)

# tune the regressor's hyperparameters
grid_model = grid_search(clf, param_grid, X_train, y_train, kf)
# cross-validated RMSE on the same folds
rmses_cross = rmse_cv(clf, X_train, y_train, kf)

As you can see, I use the same KFold splits for parameter tuning and exactly the same KFold splits for the cross-validated RMSE calculation.

On the basis of the calculated cross-validated RMSEs I choose which algorithm performs better. But the RMSEs are calculated on exactly the same folds on which the hyperparameter tuning was performed.

Is it incorrect to do so? I feel that during tuning the model effectively learns from the held-out folds, so it would be incorrect to reuse them when calculating the RMSEs. Should I choose a different KFold for the RMSE calculation?
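Not from the original post, but for concreteness: a minimal sketch of nested cross-validation (hypothetical estimator and grid), where an inner KFold is used only for tuning and a separate outer KFold only for the RMSE estimate:

from sklearn.model_selection import KFold, GridSearchCV, cross_val_score
from sklearn.svm import SVR
import numpy as np

inner_kf = KFold(10, shuffle=True, random_state=26)   # folds used only for tuning
outer_kf = KFold(5, shuffle=True, random_state=11)    # folds used only for the performance estimate

param_grid = {"C": [0.1, 1, 10]}                      # hypothetical grid
tuned = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=inner_kf)

# cross_val_score clones and refits the whole grid search inside each outer training split
rmses = np.sqrt(-cross_val_score(tuned, X_train, y_train,
                                 scoring="neg_mean_squared_error", cv=outer_kf))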

EDIT:

Why do the two snippets below produce different results? I thought cross_val_score refits the given model on each fold, and therefore applying cross_val_score to grid_model or to the re-parameterised model should be the same.

kf = KFold(10, shuffle = True, random_state = 26)

First:

grid_model = grid_search(clf, param_grid, X_train, y_train, kf)   # already fitted inside grid_search
clf = SVR(kernel='rbf', C=grid_model.best_params_['C'])           # SVR from sklearn.svm, rebuilt with the tuned C
rmses_cross = np.sqrt(-cross_val_score(clf, X_train, y_train,
                      scoring="neg_mean_squared_error", cv=kf))

Second:

grid_model = grid_search(clf, param_grid, X_train, y_train, kf)   # already fitted inside grid_search
rmses_cross = np.sqrt(-cross_val_score(grid_model, X_train, y_train,
                      scoring="neg_mean_squared_error", cv=kf))   # re-runs the grid search on each fold


Get this bounty!!!

#StackBounty: #machine-learning #bayesian #feature-selection #hierarchical-bayesian #shrinkage Feature selection on a Bayesian hierarch…

Bounty: 50

I am looking to estimate a hierarchical GLM, but with feature selection to determine which covariates are relevant to include at the population level.

Suppose I have $G$ groups with $N$ observations per group and $K$ possible covariates. That is, I have a design matrix of covariates $\boldsymbol{x}_{(N\cdot G) \times K}$ and outcomes $\boldsymbol{y}_{(N\cdot G) \times 1}$. The coefficients on these covariates are $\beta_{K \times 1}$.

Suppose $Y \sim \mathrm{Bernoulli}\left(p(x,\beta)\right)$.

Below is a standard hierarchical Bayesian GLM with a logit sampling model and normally distributed group-level coefficients.

$${\cal L}\left(\boldsymbol{y}\mid\boldsymbol{x},\beta_{1},\dots,\beta_{G}\right)\propto\prod_{g=1}^{G}\prod_{t=1}^{N}\left(\Pr\left\{j=1\mid p_{t},\beta^{g}\right\}\right)^{y_{g,t}}\left(1-\Pr\left\{j=1\mid p_{t},\beta^{g}\right\}\right)^{1-y_{g,t}}$$

$$\beta_{1},\dots,\beta_{G}\mid\mu,\Sigma\overset{iid}{\sim}{\cal N}_{d}\left(\mu,\Sigma\right)$$

$$\mu\mid\Sigma\sim{\cal N}\left(\mu_{0},a^{-1}\Sigma\right)$$
$$\Sigma\sim{\cal IW}\left(v_{0},V_{0}^{-1}\right)$$

I want to modify this model (or find a paper or other work that does) so that there is some sharp feature selection (as in the LASSO) on the dimensionality of $\beta$.

(1) The simplest, most direct way would be to regularize at the population level, so that we essentially restrict the dimensionality of $\mu$ and all $\beta$ share the same (reduced) dimension.

(2) A more nuanced model would have shrinkage at the group level, where the dimension of $\beta$ depends on the hierarchical unit.

I am interested in solving both (1) and (2), but (1) is much more important.
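Not part of the original model, but to make (1) concrete: one standard way to obtain sharp population-level selection is a spike-and-slab prior on each component of $\mu$, with the group coefficients still drawn around $\mu$ as above:

$$\gamma_{k}\sim\mathrm{Bernoulli}(\pi),\qquad \mu_{k}\mid\gamma_{k}\sim\left(1-\gamma_{k}\right)\delta_{0}+\gamma_{k}\,{\cal N}\left(0,\tau^{2}\right),\qquad k=1,\dots,K$$

Here $\delta_{0}$ is a point mass at zero, so $\gamma_{k}=0$ drops covariate $k$ at the population level; continuous alternatives such as the horseshoe or the Bayesian LASSO shrink $\mu_{k}$ toward zero but do not set it exactly to zero.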


Get this bounty!!!

#StackBounty: #machine-learning #convnet #backpropagation #cnn #kernel back propagation in CNN

Bounty: 100

I have the following CNN:

[Figure: network layout]

  1. I start with an input image of size 5×5.
  2. Then I apply a convolution with a 2×2 kernel and stride = 1, which produces a feature map of size 4×4.
  3. Then I apply 2×2 max-pooling with stride = 2, which reduces the feature map to size 2×2.
  4. Then I apply a logistic sigmoid.
  5. Then there is one fully connected layer with 2 neurons.
  6. And an output layer.

For the sake of simplicity, let’s assume I have already completed the forward pass and computed $\delta_{H1}=0.25$ and $\delta_{H2}=-0.15$.

So after the complete forward pass and partially completed backward pass my network looks like this:

[Figure: network after forward pass]

Then I compute the deltas for the non-linear layer (logistic sigmoid):

$$
\delta_{11}=(0.25 \cdot 0.61 + (-0.15) \cdot 0.02) \cdot 0.58 \cdot (1 - 0.58) = 0.0364182\\
\delta_{12}=(0.25 \cdot 0.82 + (-0.15) \cdot (-0.50)) \cdot 0.57 \cdot (1 - 0.57) = 0.068628\\
\delta_{21}=(0.25 \cdot 0.96 + (-0.15) \cdot 0.23) \cdot 0.65 \cdot (1 - 0.65) = 0.04675125\\
\delta_{22}=(0.25 \cdot (-1.00) + (-0.15) \cdot 0.17) \cdot 0.55 \cdot (1 - 0.55) = -0.06818625
$$

Then I propagate the deltas back to the 4×4 layer, set all values that were filtered out by max-pooling to 0, and the gradient map looks like this:

[Figure: 4×4 gradient map after unpooling]

How do I update the kernel weights from there? And if my network had another convolutional layer before the 5×5 input, what values should I use to update its kernel weights? And overall, is my calculation correct?
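Not from the original question, just to make the quantities concrete: a minimal NumPy sketch (placeholder values, shapes matching the post: 5×5 input, 2×2 kernel, stride 1, 4×4 delta map) of how the kernel gradient and the gradient passed back to an earlier layer are usually computed:

import numpy as np

x = np.random.rand(5, 5)     # input image (placeholder values)
W = np.random.rand(2, 2)     # current 2x2 kernel (placeholder values)
delta = np.zeros((4, 4))     # upstream deltas after unpooling (mostly zeros)
delta[0, 1] = 0.0364182      # example: one surviving delta routed to its max position

# Gradient w.r.t. the kernel: "valid" cross-correlation of the input with the delta map.
dW = np.zeros((2, 2))
for a in range(2):
    for b in range(2):
        dW[a, b] = np.sum(delta * x[a:a+4, b:b+4])

# Gradient w.r.t. the input (only needed if an earlier conv layer exists):
# "full" convolution of the delta map with the kernel.
dx = np.zeros((5, 5))
for i in range(4):
    for j in range(4):
        dx[i:i+2, j:j+2] += delta[i, j] * W

The weight update would then be W -= learning_rate * dW, and dx plays for the previous layer the same role that the 4×4 delta map plays here.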


Get this bounty!!!

#StackBounty: #machine-learning #logistic-regression #gradient-descent #cost-function logistic regression algorithm fails to work

Bounty: 100

I’m trying to code my own logistic regression algorithm in Octave, following Andrew Ng’s machine learning lectures. So I made a CSV file, with the first column being some parameter and the second column being the result:

121,1
124,0
97,0
104,0
110,0
...

Overall there are only 24 examples, but I’ve chosen points such that some pattern can be followed.

Here is my code:

data = load('data.dat');
x = data(:, 1);
y = data(:, 2);
m = length(y);

#plot(x, y, 'rx', 'MarkerSize', 10);
#xlabel('IQ');
#ylabel('Pass/Fail');
#title('Logistic Regression');

x = [ones(size(x, 1), 1) x];  # add intercept column
alpha = 0.00001;              # learning rate
i = 15000;                    # number of iterations

g = @(z) 1 ./ (1 + exp(-z));  # sigmoid

theta = zeros(size(x(1, :)))';
j = zeros(i, 1);              # cost history

for num = 1:i
  z = x * theta;
  h = g(z);
  j(num) = (1/m) * (-y' * log(h) - (1 - y') * log(1 - h));  # cross-entropy cost
  grad = (1/m) * x' * (h - y);                              # gradient
  theta = theta - alpha * grad;                             # gradient-descent update
end

However, the output of the sigmoid function shows every value below 0.5, which surely has to be wrong. I’ve also tried different learning rates and iteration counts, but to no avail. What is wrong with the code, or with the data?

Help would be appreciated.


Get this bounty!!!

#StackBounty: #machine-learning #deep-learning #gradient-descent Convergence of Stochastic Gradient Descent as a function of training s…

Bounty: 50

I am going through the following section of the book by Goodfellow et al. (2016), and I don’t quite understand it.

Stochastic gradient descent has many important uses outside the
context of deep learning. It is the main way to train large linear
models on very large datasets. For a fixed model size, the cost per SGD
update does not depend on the training set size $m$. In practice, we often
use a larger model as the training set size increases, but we are not
forced to do so. The number of updates required to reach convergence
usually increases with training set size. However,
as $m$ approaches infinity, the model will eventually converge to its best
possible test error before SGD has sampled every example in the
training set.
Increasing $m$ further will not extend the amount of training
time needed to reach the model’s best possible test error. From this
point of view, one can argue that the asymptotic cost of training a
model with SGD is $O(1)$ as a function of $m$.

Section 5.9, p.150

  1. “The number of updates required to reach convergence usually increases with training set size.” I can’t get my head around this one. In batch gradient descent it becomes computationally expensive to compute the gradient at each step as the number of training examples increases, but I don’t understand why the number of updates increases with the training set size.
  2. “However, as $m$ approaches infinity, the model will eventually converge to its best possible test error before SGD has sampled every example in the training set. Increasing $m$ further will not extend the amount of training time needed to reach the model’s best possible test error.” I don’t understand this either.

Can you provide some intuition or arguments for the above two points?

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.


Get this bounty!!!

#StackBounty: #machine-learning #precision-recall #performance #confusion-matrix #curves How can Precision-Recall (PR) curves be used t…

Bounty: 50

How can Precision-Recall (PR) curves be used to judge overall classifier performance when precision and recall are class-based metrics?

In a binary classifier there are two classes, often labelled positive (+1) and negative (-1). Yet the classifier performance metrics used to plot PR curves, precision (PPV, on the y-axis) and recall (TPR, on the x-axis), can take different values for each of the two classes (i.e. if you swap the positive and negative classes). In almost all examples of PR curves there is only a single curve, when there should surely be at least two curves (one per class); a small sketch illustrating this per-class behaviour follows the questions below.

More specifically:

  1. Does the PR curve really represent the precision and recall of only a single class, or has some operation (e.g. averaging) been performed to combine the precision and recall of both classes?

  2. Does it make sense to judge a classifier’s performance by looking only at the PR curve for the positive class?

  3. Why are the metrics TNR and NPV not somehow integrated into the curve or graph, to give a better overview of classifier performance?
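Not part of the original question, just to illustrate the per-class point: a minimal scikit-learn sketch (synthetic data) computing one PR curve per class by swapping which label counts as positive:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)

# PR curve treating class 1 as the positive class
p1, r1, _ = precision_recall_curve(y_te, proba[:, 1], pos_label=1)
# PR curve treating class 0 as the positive class (scored with the class-0 probability)
p0, r0, _ = precision_recall_curve(y_te, proba[:, 0], pos_label=0)
# The two curves generally differ, especially under class imbalance.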


Get this bounty!!!

#StackBounty: #machine-learning #deep-learning #feature-construction #supervised-learning #image-processing Detecting manipulation (pho…

Bounty: 50

I am looking for a solution to detect photos that are manipulated with tools such as Photoshop.
For a start, I want to detect copy-pasted images.

Any idea how to detect photos that are manipulated by pasting another photo on top of the original photo?

For example, detecting a photo of an ID card where a photo of a face has been pasted in place of the original face.

To make it even more difficult, let’s assume we downsample the image after pasting the face in, which smooths the sharp edges of the pasted region.

UPDATE:

1) It seems that compression-based techniques as well as straightforward CNN training didn’t work; a sketch of one such compression-based baseline is shown after this list.

2) This is a relevant post

3) This is a summary of photo-forensic methods.
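Not from the original post; just to make "compression-based techniques" concrete, here is a minimal error-level-analysis (ELA) sketch using Pillow, with a hypothetical file name and assuming a JPEG input. Spliced regions often show a different error level after recompression, although downsampling and resaving can wash the signal out:

from PIL import Image, ImageChops
import io

def ela(path, quality=90):
    # Recompress at a known JPEG quality and take the pixel-wise difference;
    # regions pasted in from another source often stand out in the difference image.
    orig = Image.open(path).convert("RGB")
    buf = io.BytesIO()
    orig.save(buf, "JPEG", quality=quality)
    buf.seek(0)
    recompressed = Image.open(buf).convert("RGB")
    return ImageChops.difference(orig, recompressed)

diff = ela("id_card.jpg")      # hypothetical input file
diff.save("id_card_ela.png")   # inspect bright regions for a mismatched error level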

Since there has been no real progress here, I am starting a bounty.


Get this bounty!!!

#StackBounty: #r #machine-learning #lime LIME explanation confusion

Bounty: 100

I am working in R, building a GBM model with H2O, and trying to use LIME to look at some local explanations to get a feel for what the model is doing. It’s a binary classifier and I’m passing n_features = 8 to the LIME package. However, I keep running into situations where all or most of the 8 features show as contradicting the highest-probability class, even though the predicted probability of that class is in the 90s.

How would one interpret this? Is there a problem in the LIME package implementation?

Here are a couple of examples:
[Figure: LIME explanation plot, example 1]

[Figure: LIME explanation plot, example 2]


Get this bounty!!!