#StackBounty: #machine-learning #predictive-modeling #marketing Predict time of dispatch for marketing campaign

Bounty: 50

What would be appropriate models/algorithms/strategies for predicting best individual send times for marketing campaigns based on past response timestamps?


Given for example

    customer  campaign  campaign_time        response_time
1   100       a         2017-01-01 06:50:01  2017-01-01 08:02:21
2   101       a         2017-01-01 06:50:01  2017-01-01 16:45:31
3   101       a         2017-01-01 06:50:01  2017-01-02 07:20:00
4   100       b         2017-01-07 06:30:21  2017-01-08 08:15:21
5   101       b         2017-01-07 06:30:21  2017-01-07 17:00:12
6   100       c         2017-01-14 06:43:55  2017-01-14 07:59:44
7   101       d         2017-01-21 14:02:01  2017-01-21 16:50:01
  • two customers, 100 and 101,
  • four past campaigns, a-d,
  • with each campaign having a different time of dispatch,
  • and multiple, one or no response time(s) (e.g. buying a product) per customer and campaign


Assuming that

  1. campaign_time can vary for 100 and 101 (personalized times of dispatch), and
  2. past response times are an indicator for when customers are most receptive for a campaign

I would like to predict the best next campaign_time (2017-01-28 ??:??:??) for each customer based on past response_times, so that the number of respondents per campaign is maximized.

Anyone having any experience with something similar or any ideas where to start? I’d be happy to hear some ideas.

To simplify things, I’d consider the first response_time the most valuable one (=> it should be predicted) and I’d also abstract from weekdays (=> it’s about predicting a time between 0:00 and 23:59, marked by the ? above); however, it would be nice to have a continuous prediction instead of a discretized one (like suggested here).
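One possible starting point for the simplified problem (first response only, weekdays ignored, continuous output) is to treat the time of day as an angle on a 24-hour circle and average the first response times per customer. The sketch below is only a baseline under that assumption, not a recommended model. `circular_mean_time` and `format_time` are hypothetical helpers, and the timestamps are the first responses per customer and campaign from the example table.

```python
import math
from datetime import datetime

# First response per (customer, campaign), taken from the example table.
first_responses = {
    100: ["2017-01-01 08:02:21", "2017-01-08 08:15:21", "2017-01-14 07:59:44"],
    101: ["2017-01-01 16:45:31", "2017-01-07 17:00:12", "2017-01-21 16:50:01"],
}

DAY_SECONDS = 24 * 60 * 60


def seconds_of_day(stamp):
    """Seconds since midnight for a 'YYYY-MM-DD HH:MM:SS' timestamp."""
    t = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return t.hour * 3600 + t.minute * 60 + t.second


def circular_mean_time(stamps):
    """Hypothetical helper: mean time of day on a 24-hour circle, in seconds."""
    angles = [2 * math.pi * seconds_of_day(s) / DAY_SECONDS for s in stamps]
    mean_sin = sum(math.sin(a) for a in angles) / len(angles)
    mean_cos = sum(math.cos(a) for a in angles) / len(angles)
    mean_angle = math.atan2(mean_sin, mean_cos) % (2 * math.pi)
    return mean_angle * DAY_SECONDS / (2 * math.pi)


def format_time(seconds):
    """Render seconds since midnight as HH:MM:SS."""
    seconds = int(round(seconds)) % DAY_SECONDS
    return f"{seconds // 3600:02d}:{seconds % 3600 // 60:02d}:{seconds % 60:02d}"


typical = {c: format_time(circular_mean_time(s)) for c, s in first_responses.items()}
print(typical)
```

The circular representation keeps times just before and just after midnight close together, which a plain arithmetic mean would not. How to turn a typical response time into a dispatch time (for example by subtracting a lead time) is left open here.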

Get this bounty!!!

#StackBounty: #machine-learning #neural-networks #deep-learning #backpropagation #batch-normalization Matrix form of backpropagation wi…

Bounty: 100

Batch normalization has been credited with substantial performance improvements in deep neural nets. Plenty of material on the internet shows how to implement it on an activation-by-activation basis. I’ve already implemented backprop using matrix algebra, and since I’m working in a high-level language (relying on Rcpp, and eventually GPUs, for dense matrix multiplication), ripping everything out and resorting to for-loops would probably slow my code substantially, in addition to being a huge pain.

The batch normalization function is
$$b(x_p) = \gamma \left(x_p - \mu_{x_p}\right) \sigma^{-1}_{x_p} + \beta$$

  • $x_p$ is the $p$th node, before it gets activated
  • $\gamma$ and $\beta$ are scalar parameters
  • $\mu_{x_p}$ and $\sigma_{x_p}$ are the mean and SD of $x_p$. (Note that the square root of the variance plus a fudge factor is normally used; let’s assume nonzero elements for compactness)

In matrix form, batch normalization for a whole layer would be
$$b(\mathbf{X}) = \left(\gamma\otimes\mathbf{1}_N\right)\odot \left(\mathbf{X} - \mu_{\mathbf{X}}\right) \odot\sigma^{-1}_{\mathbf{X}} + \left(\beta\otimes\mathbf{1}_N\right)$$

  • $\mathbf{X}$ is $N\times p$
  • $\mathbf{1}_N$ is a column vector of ones
  • $\gamma$ and $\beta$ are now row $p$-vectors of the per-layer normalization parameters
  • $\otimes$ is the Kronecker product and $\odot$ is the elementwise (Hadamard) product
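In array libraries, the Kronecker products with $\mathbf{1}_N$ only tile the row vectors $\gamma$ and $\beta$ down the $N$ rows, which broadcasting does implicitly. Below is a minimal NumPy sketch of the forward pass, assuming column-wise (per-node) means and standard deviations and a small `eps` fudge factor. The function name `batch_norm` and the test shapes are illustrative assumptions.

```python
import numpy as np


def batch_norm(X, gamma, beta, eps=1e-8):
    """Matrix-form batch normalization of an N x p array X.

    gamma and beta are length-p vectors; broadcasting applies them to every row.
    """
    mu = X.mean(axis=0, keepdims=True)                    # 1 x p column means
    sigma = np.sqrt(X.var(axis=0, keepdims=True) + eps)   # 1 x p column SDs
    return gamma * (X - mu) / sigma + beta


rng = np.random.default_rng(0)
N, p = 6, 3
X = rng.normal(size=(N, p))
gamma = np.array([1.0, 2.0, 0.5])
beta = np.array([0.0, -1.0, 3.0])

# The same computation written with explicit Kronecker products, for comparison.
ones_N = np.ones((N, 1))
mu = X.mean(axis=0, keepdims=True)
sigma = np.sqrt(X.var(axis=0, keepdims=True) + 1e-8)
explicit = (np.kron(gamma[None, :], ones_N) * (X - mu) / sigma
            + np.kron(beta[None, :], ones_N))

broadcasted = batch_norm(X, gamma, beta)
print(np.allclose(broadcasted, explicit))
```

Because the tiled matrices never need to be materialized, the Hadamard products reduce to ordinary elementwise operations on an $N \times p$ array and a $1 \times p$ row.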

A very simple one-layer neural net with no batch normalization and a continuous outcome is
$$y = a\left(\mathbf{X\Gamma}_1\right)\Gamma_2 + \epsilon$$

  • $\Gamma_1$ is $p_1 \times p_2$
  • $\Gamma_2$ is $p_2 \times 1$
  • $a(.)$ is the activation function

If the loss is $R = N^{-1}\displaystyle\sum\left(y - \hat{y}\right)^2$, then the gradients are
$$\frac{\partial R}{\partial \Gamma_1} = N^{-1}\mathbf{X}^T \left(a'(\mathbf{X}\mathbf{\Gamma}_1) \odot \left(-2\hat\epsilon \mathbf{\Gamma}_2^T\right)\right)$$
$$\frac{\partial R}{\partial \Gamma_2} = -2N^{-1}\mathbf{V}^T \hat\epsilon$$

  • $\mathbf{V} = a\left(\mathbf{X}\Gamma_1\right)$
  • $\hat{\epsilon} = y-\hat{y}$

Under batch normalization, the net becomes
$$y = a\left(b\left(\mathbf{X}\Gamma_1\right)\right)\Gamma_2$$
$$y = a\Big(\left(\gamma\otimes\mathbf{1}_N\right)\odot \left(\mathbf{X\Gamma_1} - \mu_{\mathbf{X\Gamma_1}}\right) \odot\sigma^{-1}_{\mathbf{X\Gamma_1}} + \left(\beta\otimes\mathbf{1}_N\right)\Big)\mathbf{\Gamma_2}$$
I have no idea how to compute the derivatives of Hadamard and Kronecker products. On the subject of Kronecker products, the literature gets fairly arcane.

Is there a practical way of computing $\partial R/\partial \gamma$, $\partial R/\partial \sigma$, and $\partial R/\partial \mathbf{\Gamma_1}$ within the matrix framework? A simple expression, without resorting to node-by-node computation?


#StackBounty: #machine-learning #deep-learning #data-mining #text-mining #rnn How to extract specific information from text using Machi…

Bounty: 100

Suppose I have a text like the one below, which usually has 2-3 sentences and 100-200 characters.

Johny bought milk of 50 dollars from walmart. Now he has left only 20 dollars.

I want to extract

Person Name: Johny

Spent: 50 dollars

Money left: 20 dollars.

Spent where: Walmart.

I have gone through lots of material on recurrent neural networks. I watched the cs231n video on RNNs and understood next-character prediction. In those cases we have a set of 26 characters that we can use as output classes, and we find the next character using probabilities. But here the problem seems entirely different because we don’t know the output classes. The output depends on the words and numbers in the text, which can be any word or number.

I read on Quora that convolutional neural networks can also extract features from text. I am wondering whether that could also solve this particular problem.


#StackBounty: #machine-learning #predictive-models #performance #sensitivity #specificity Assessing correlated predictions

Bounty: 50

Let’s assume we have a prediction algorithm (if it helps, imagine it’s using some boosted tree method) that does daily predictions for whether some event will happen to a unit (e.g. a machine that might break down, a patient that might get a problematic medical event etc.) during the next week. We can only get training and test data for a low number of units (e.g. 400 + 100 or so) across a limited period of time (e.g. half a year).

How would one assess prediction performance (e.g. sensitivity/specificity/AuROC etc.) of some algorithm (e.g. some tree method) in this setting on test data? Presumably there is a potential issue in that the prediction intervals overlap and even non-overlapping intervals for the same unit are somewhat correlated (i.e. if I can predict well for a particular unit due to its specific characteristics, I may do well on all time intervals for that unit, but this does not mean the algorithm would generalize well).

Perhaps I have just not hit on the right search terms, but I have failed to find anything published on this topic (surely someone has written about this before?!). Any pointers to any literature?

My initial thought was that perhaps naively calculated (i.e. just treating this as independent observations and predictions) point estimates of sensitivity/specificity might be fine, but that any problem would be more with the uncertainty around these? If so, could one just bootstrap (drawing whole units with replacement) and get decent assessments that way?
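The unit-level bootstrap mentioned above can be sketched as follows: whole units are drawn with replacement, all of their daily predictions are kept together, and sensitivity and specificity are recomputed on each resample. This is only an illustration under stated assumptions. The data are synthetic (100 units with 30 daily predictions each, and unit-specific event and error rates), and `sens_spec` and `cluster_bootstrap` are hypothetical helper names.

```python
import numpy as np


def sens_spec(y_true, y_pred):
    """Sensitivity and specificity from binary labels and predictions."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return tp / (tp + fn), tn / (tn + fp)


def cluster_bootstrap(unit_ids, y_true, y_pred, n_boot=500, seed=0):
    """Percentile intervals obtained by resampling whole units with replacement."""
    rng = np.random.default_rng(seed)
    units = np.unique(unit_ids)
    rows_by_unit = {u: np.flatnonzero(unit_ids == u) for u in units}
    stats = []
    for _ in range(n_boot):
        drawn = rng.choice(units, size=len(units), replace=True)
        idx = np.concatenate([rows_by_unit[u] for u in drawn])
        stats.append(sens_spec(y_true[idx], y_pred[idx]))
    # Rows: 2.5% and 97.5% percentiles; columns: sensitivity, specificity.
    return np.percentile(np.array(stats), [2.5, 97.5], axis=0)


# Synthetic test set: predictions within a unit share that unit's error rate.
rng = np.random.default_rng(1)
n_units, n_days = 100, 30
unit_ids = np.repeat(np.arange(n_units), n_days)
unit_risk = rng.beta(2, 8, size=n_units)     # unit-specific event rate
unit_error = rng.beta(2, 8, size=n_units)    # unit-specific misclassification rate
y_true = rng.binomial(1, np.repeat(unit_risk, n_days))
flip = rng.random(y_true.size) < np.repeat(unit_error, n_days)
y_pred = np.where(flip, 1 - y_true, y_true)

point = sens_spec(y_true, y_pred)
ci = cluster_bootstrap(unit_ids, y_true, y_pred)
print(point, ci)
```

Comparing these intervals with ones from a naive row-level bootstrap on the same data would show how much the within-unit correlation matters for the uncertainty.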


#StackBounty: #machine-learning #cross-validation #rms same cross-validation set for parameter tuning and RMSE calculations

Bounty: 50

I am missing a very basic distinction between cross-validation used for parameter tuning and cross-validation used for calculating the performance of my algorithms (RMSE).

I have two functions: one performs grid search and the other calculates cross-validated RMSE.

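import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score, train_test_split

def grid_search(clf, param_grid, x_train, y_train, kf):
    # Exhaustive search over param_grid, scored by cross-validation on the folds in kf
    grid_model = GridSearchCV(estimator = clf,
                              param_grid = param_grid,
                              cv = kf, verbose = 2)
    grid_model.fit(x_train, y_train)
    return grid_model

def rmse_cv(clf, x_train, y_train, kf):
    # Per-fold RMSE: the scorer returns negative MSE, so negate before taking the square root
    rmses_cross = np.sqrt(-cross_val_score(clf, x_train, y_train, scoring="neg_mean_squared_error", cv = kf))
    return rmses_cross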

The functions are called this way:

X_train, X_test, y_train, y_test = train_test_split(dataset, Y, test_size=0.2, random_state=26)
kf = KFold(10, shuffle = True, random_state = 26)

grid_model = grid_search(clf, param_grid, X_train, y_train, kf)
# adjust parameters of a regressor
rmses_cross = rmse_cv(clf, X_train, y_train, kf)

As you can see, I use the same KFold for parameter tuning and exactly the same KFold for calculating the cross-validated RMSE.

On the basis of the calculated cross-validated RMSEs I choose which algorithm performs better. BUT the RMSEs are calculated on exactly the same folds on which hyperparameter tuning was performed.

Is it incorrect to do so? I feel that during tuning my model learns from the hold-out folds, so it would be incorrect to use them when calculating the RMSEs. Should I choose a different KFold for the calculation of RMSE?


Why do the following two pieces of code produce different results? I thought that cross_val_score refits a given model on each fold, and that applying cross_val_score to grid_model or to the parameterised model should therefore give the same result.

kf = KFold(10, shuffle = True, random_state = 26)

# Variant 1: tune C, then cross-validate a fresh regressor with the best C fixed
grid_model = grid_search(clf, param_grid, X_train, y_train, kf)
grid_model.fit(X_train, y_train)
clf = SVR(kernel='rbf', C=grid_model.best_params_['C'])  # sklearn.svm.SVR assumed for the original SVM
rmses_cross = np.sqrt(-cross_val_score(clf, X_train, y_train,
                      scoring="neg_mean_squared_error", cv = kf))

# Variant 2: cross-validate the grid search itself, which is cloned and re-tuned inside every fold
grid_model = grid_search(clf, param_grid, X_train, y_train, kf)
grid_model.fit(X_train, y_train)
rmses_cross = np.sqrt(-cross_val_score(grid_model, X_train, y_train,
                      scoring="neg_mean_squared_error", cv = kf))


#StackBounty: #machine-learning #bayesian #feature-selection #hierarchical-bayesian #shrinkage Feature selection on a Bayesian hierarch…

Bounty: 50

I am looking to estimate a hierarchical GLM but with feature selection to determine which covariates are relevant at the population level to include.

Suppose I have $G$ groups with $N$ observations and $K$ possible covariates.
That is, I have a design matrix of covariates $\boldsymbol{x}_{(N\cdot G) \times K}$ and outcomes $\boldsymbol{y}_{(N\cdot G) \times 1}$. Coefficients on these covariates are $\beta_{K \times 1}$.

Suppose $Y \sim \text{Bernoulli}(p(x,\beta))$.

The below is a standard hierarchical Bayesian GLM with logit sampling model and normally distributed group coefficients.

$${\cal L}\left(\boldsymbol{y}|\boldsymbol{x},\beta_{1},\dots,\beta_{G}\right)\propto\prod_{g=1}^{G}\prod_{t=1}^{N}\left(\Pr\{j=1|p_{t},\beta^{g}\}\right)^{y_{g,t}}\left(1-\Pr\{j=1|p_{t},\beta^{g}\}\right)^{1-y_{g,t}}$$

$$\beta_{1},\dots,\beta_{G}|\mu,\Sigma\overset{iid}{\sim}{\cal N}_{d}\left(\mu,\Sigma\right)$$

$$\mu|\Sigma\sim{\cal N}\left(\mu_{0},a^{-1}\Sigma\right)$$
$$\Sigma\sim{\cal IW}\left(v_{0},V_{0}^{-1}\right)$$

I want to modify this model (or find a paper that does, or work that discusses it) in such a way that there is some sharp feature selection (as in LASSO) on the dimensionality of $\beta$.

(1) The simplest, most direct way would be to regularize this at the population level so that we essentially restrict the dimensionality of $\mu$ and all $\beta$ have the same dimension.

(2) The more nuanced model would have shrinkage at the group level, where the dimension of $\beta$ depends on the hierarchical unit.

I am interested in solving 1 and 2, but much more important is 1.


#StackBounty: #machine-learning #convnet #backpropagation #cnn #kernel back propagation in CNN

Bounty: 100

I have the following CNN:

[Figure: network layout]

  1. I start with an input image of size 5×5
  2. Then I apply convolution using 2×2 kernel and stride = 1, that produces feature map of size 4×4.
  3. Then I apply 2×2 max-pooling with stride = 2, that reduces feature map to size 2×2.
  4. Then I apply logistic sigmoid.
  5. Then one fully connected layer with 2 neurons.
  6. And an output layer.

For the sake of simplicity, let’s assume I have already completed the forward pass and computed δH1=0.25 and δH2=-0.15

So after the complete forward pass and partially completed backward pass my network looks like this:

[Figure: network after forward pass]

Then I compute deltas for non-linear layer (logistic sigmoid):

$$\delta_{11} = (0.25 * 0.61 + -0.15 * 0.02) * 0.58 * (1 - 0.58) = 0.0364182$$
$$\delta_{12} = (0.25 * 0.82 + -0.15 * -0.50) * 0.57 * (1 - 0.57) = 0.068628$$
$$\delta_{21} = (0.25 * 0.96 + -0.15 * 0.23) * 0.65 * (1 - 0.65) = 0.04675125$$
$$\delta_{22} = (0.25 * -1.00 + -0.15 * 0.17) * 0.55 * (1 - 0.55) = -0.06818625$$
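The four delta computations can be checked in vectorized form. This is a small NumPy sketch that uses only the numbers appearing in the equations above. The variable names (`delta_h`, `W`, `s`) and the ordering of the weights as one row per fully connected neuron are assumptions made for the illustration.

```python
import numpy as np

# Deltas of the two fully connected neurons H1 and H2.
delta_h = np.array([0.25, -0.15])

# Weights from the four pooled units (11, 12, 21, 22) to H1 (row 0) and H2 (row 1).
W = np.array([[0.61, 0.82, 0.96, -1.00],
              [0.02, -0.50, 0.23, 0.17]])

# Sigmoid outputs of the four pooled units.
s = np.array([0.58, 0.57, 0.65, 0.55])

# Backpropagate through the weights, then through the sigmoid derivative s * (1 - s).
delta = ((delta_h @ W) * s * (1 - s)).reshape(2, 2)
print(delta)
```

The result is arranged as a 2×2 array matching the pooled feature map, so that each value can be routed back to the position that won the corresponding max-pooling window.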

Then I propagate the deltas to the 4×4 layer and set all the values that were filtered out by max-pooling to 0, so the gradient map looks like this:

[Figure: gradient map]

How do I update the kernel weights from there? And if my network had another convolutional layer prior to the 5×5 one, what values should I use to update its kernel weights? And overall, is my calculation correct?


#StackBounty: #machine-learning #logistic-regression #gradient-descent #cost-function logistic regression algorithm fails to work

Bounty: 100

I’m trying to code my own logistic regression algorithm in Octave, following Andrew Ng’s machine learning lectures. So what I did was make a csv file, the first column being some parameter and the second one being the result:


Overall there are only 24 examples, but I’ve chosen points such that some pattern can be followed.

Here is my code:

data = load('data.dat');
x = data(:, 1);               # feature column

y = data(:, 2);               # 0/1 labels
m = length(y);

#plot(x, y, 'rx', 'MarkerSize', 10);
#title('Logistic Regression');

x = [ones(size(x, 1), 1) x];  # prepend the intercept column
alpha = 0.00001;              # learning rate
i = 15000;                    # number of iterations

g = @(z) 1 ./ (1 + exp(-z));  # sigmoid

theta = zeros(size(x(1, :)))';
j = zeros(i, 1);              # cost history, one entry per iteration

for num = 1:i
  z = x * theta;
  h = g(z);
  j(num) = (1 / m) * (-y' * log(h) - (1 - y') * log(1 - h));
  grad = (1 / m) * x' * (h - y);
  theta = theta - alpha * grad;
end

However the output of the sigmoid function shows every value below 0.5… surely this has to be wrong. I’ve also tried with different learning rates and iterations, but to no avail. What is wrong with the code, or data?

Help would be appreciated.


#StackBounty: #machine-learning #deep-learning #gradient-descent Convergence of Stochastic Gradient Descent as a function of training s…

Bounty: 50

I am going through the following section of the book by Goodfellow et al. (2016), and I don’t quite understand it.

Stochastic gradient descent has many important uses outside the
context of deep learning. It is the main way to train large linear
models on very large datasets. For a fixed model size, the cost per SGD
update does not depend on the training set size $m$. In practice, we often
use a larger model as the training set size increases, but we are not
forced to do so. The number of updates required to reach convergence
usually increases with training set size. However,
as $m$ approaches infinity, the model will eventually converge to its best
possible test error before SGD has sampled every example in the
training set.
Increasing $m$ further will not extend the amount of training
time needed to reach the model’s best possible test error. From this
point of view, one can argue that the asymptotic cost of training a
model with SGD is $O(1)$ as a function of $m$.

Section 5.9, p.150

  1. “The number of updates required to reach convergence usually increases with training set size”. I can’t get my head around this one. In normal gradient descent, it becomes computationally expensive to calculate the gradient at each step as the number of training examples increases. But I don’t understand why the number of updates increases with the training set size.
  2. “However, as $m$ approaches infinity, the model will eventually converge to its best possible test error before SGD has sampled every example in the training set. Increasing $m$ further will not extend the amount of training time needed to reach the model’s best possible test error.” I don’t understand this either.

Can you provide some intuition/ arguments for the above two cases?
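The claims in the quoted passage can be made concrete with a toy experiment. The sketch below is only an illustration, assuming a one-parameter linear model, a fixed budget of 500 minibatch updates, and synthetic data; the function name `sgd_linear` is hypothetical. Each update reads `batch_size` rows regardless of $m$, and for the larger dataset the run ends after touching at most 16,000 of the 1,000,000 examples.

```python
import numpy as np


def sgd_linear(x, y, n_updates=500, batch_size=32, lr=0.05, seed=0):
    """Minibatch SGD for y ~ theta * x; each update reads only batch_size rows."""
    rng = np.random.default_rng(seed)
    theta = 0.0
    for _ in range(n_updates):
        idx = rng.integers(0, len(x), size=batch_size)   # cost independent of len(x)
        residual = theta * x[idx] - y[idx]
        theta -= lr * 2.0 * np.mean(residual * x[idx])   # gradient of the batch MSE
    return theta


rng = np.random.default_rng(42)
estimates = {}
for m in (1_000, 1_000_000):
    x = rng.normal(size=m)
    y = 3.0 * x + rng.normal(scale=0.1, size=m)   # true slope is 3
    estimates[m] = sgd_linear(x, y)

print(estimates)
```

Both runs use the same number of updates and the same work per update, which is the sense in which the training cost stops depending on $m$ once $m$ is large enough.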

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.


#StackBounty: #machine-learning #precision-recall #performance #confusion-matrix #curves How can Precision-Recall (PR) curves be used t…

Bounty: 50

How can Precision-Recall (PR) curves be used to judge overall classifier performance when Precision and Recall are class based metrics?

A binary classifier has two classes, often labelled positive (+1) and negative (-1). Yet the classifier performance metrics used to plot PR curves, precision (PPV) on the y-axis and recall (TPR) on the x-axis, can have different values for each of the two classes (if you swap the positive and negative classes). In almost all examples of PR curves there is only a single curve, when there should surely be at least two curves (one curve per class)?

More specifically:

  1. Does the PR curve really only represent the precision and recall of a single class, or has some operation been done (e.g. averaging) to combine the precision and recall of both classes?

  2. Does it make sense to judge a classifier’s performance based on only looking at the PR curve for the positive class?

  3. Why are the metrics TNR and NPV not somehow integrated into the curve or graph, for a better overview of classifier performance?
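To make the class dependence concrete, the following sketch computes precision and recall for both labelings of one confusion matrix at a single threshold. The counts are hypothetical and chosen only for illustration, and `precision_recall` is a helper name introduced here. With the negative class treated as the class of interest, precision equals NPV and recall equals TNR.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall for whichever class is treated as 'positive'."""
    return tp / (tp + fp), tp / (tp + fn)


# Hypothetical confusion-matrix counts for an imbalanced problem.
tp, fp, fn, tn = 8, 2, 4, 86

# Positive class as the class of interest: PPV and TPR.
pos_precision, pos_recall = precision_recall(tp, fp, fn)

# Classes swapped: true negatives become the hits, false negatives become
# false alarms for the negative class, and false positives become its misses.
neg_precision, neg_recall = precision_recall(tn, fn, fp)

print(f"positive class: precision={pos_precision:.3f}, recall={pos_recall:.3f}")
print(f"negative class: precision (NPV)={neg_precision:.3f}, recall (TNR)={neg_recall:.3f}")
```

Sweeping the decision threshold and repeating this computation would trace one PR curve per labeling, which is the pair of curves the question refers to.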
