#StackBounty: #machine-learning #neural-networks #cross-validation #hyperparameter After Deep Learning Hyperparam tuning, what adjustme…

Bounty: 50

I’m dealing with a fully connected NN, and I’m wondering if there are any rules of thumb for adjusting hyperparameters when the dataset size changes. For example, if I increase the number of observations by 20%, should I reduce the number of epochs by 20%, or increase the batch size by 20%, or decrease the learning rate by X%, or whatever…

For context: After hyperparameter tuning on a validation or test set, I’m taking my final model and retraining it on all available training data to maximize performance. Since the training data has now slightly increased in size, I want to know if I should make final fine-tuning adjustments (which can’t be validated) to any part of the model.

If using 10-fold CV or something similar, this increase is only about 10%, so it is not a big deal. But two situations come to mind where the increase could be more substantial: 1) the feature space is so big that 10-fold, or even 5-fold, CV could be computationally cost-prohibitive; 2) with time series data, out-of-time validation is preferred, which means the validation data must always come after the training data, so it is not possible to get 10 folds each trained on 90% of the data. If you want many "folds", you are likely using 50% or less of the training data in each fold.


Get this bounty!!!

#StackBounty: #machine-learning #time-series #neural-networks #classification #siamese How to set up a DL classification model so that …

Bounty: 50

The question is edited for clarity after tchainzzz’s comments about meta-learning.

Let’s say we have 10,000 pet pictures and 10,000 kids. Each kid is presented with 10 randomly picked pet pictures at a time. Each time, they have to pick the one picture that they like best. Our goal, during inference, is to predict probabilities over which picture (from 0 to 9) the (same) kids will pick. My struggle is how to construct an NN to make this classification.

Paths I’ve been thinking about or tried:

  • I have created embeddings for kids and pictures using a (Netflix competition winner style) factorization method. The embeddings are pretty good: Visualizing the picture embeddings in a projector, similar pets are grouped together.

  • The first thing I tried was to concatenate the embeddings of the 10 pictures and feed this, together with the embedding of the kid, into the network. The output layer is a softmax with a CE loss. But it doesn’t work – I guess it’s too difficult for the model to "understand" where one picture embedding starts and another stops, and to relate each of the embeddings to the 10 categories in the output layer.

  • tchainzzz pointed me in the direction of meta-learning, including few-shot learning (before I had clarified my case). But these methods are mainly aimed at classifying the entities themselves (is the pet a dog or a cat?) and at settings with limited training data. In our case, we’re not classifying the pictures (we already know which ones are cats and which ones are dogs) and we have ample training data.

  • Why not use metric learning with siamese networks? I don’t think it will work here, because this method assumes that there is one ideal pet that each kid would select, and we just need to figure out which picture is more like that ideal pet. But we don’t have an ideal pet for each kid, only the previously performed selections.

  • Why not use some kind of ranking solution? (We could probably create a system, like Elo chess ranking. Every time a kid selects a picture, that picture would get a higher ranking, particularly for that kid, and more so if the competing picture already has a high ranking.) Because that’s not a neural network classification architecture. I can add such a ranking as a feature, but the question is how to create an NN model so that it "understands" that the classification should happen from a menu of 10 available dishes.

  • There are, however, elements from ‘siamese networks’ that I’ve been thinking about. Not the metric part of the siamese architecture, but the ‘shared weights’ part: A possible solution to my problem could be that I insert the embedding for the kid next to the embedding of one picture (i from 0 to 9) into 10 siamese twin networks (i from 0 to 9) sharing the same weights. Each twin would have one output, mapped to the 10 classes in a softmax layer. (The softmax layer is on the outside of the siamese part of the network; a rough sketch of what I mean is below.) I have tried this quickly, without much luck. But so far, this is my best idea and I’m continuing to work in this direction.
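
To make the shared-weights idea concrete, here is a rough Keras sketch of what I mean (the embedding sizes and layer widths are arbitrary placeholders, not my actual settings):

import tensorflow as tf
from tensorflow.keras import layers, Model

EMB_KID, EMB_PIC, N_CHOICES = 32, 32, 10      # placeholder embedding sizes

# the shared "twin" tower: scores a single (kid, picture) pair
pair_in = layers.Input(shape=(EMB_KID + EMB_PIC,))
h = layers.Dense(64, activation='relu')(pair_in)
h = layers.Dense(32, activation='relu')(h)
score = layers.Dense(1)(h)                    # unnormalized preference score
tower = Model(pair_in, score, name='shared_tower')

# full model: the same tower applied to each of the 10 presented pictures
kid_in = layers.Input(shape=(EMB_KID,), name='kid_embedding')
pic_ins = [layers.Input(shape=(EMB_PIC,), name=f'pic_{i}') for i in range(N_CHOICES)]
scores = [tower(layers.Concatenate()([kid_in, p])) for p in pic_ins]
logits = layers.Concatenate()(scores)         # shape (batch, 10)
probs = layers.Softmax()(logits)              # softmax over the 10 presented pictures

model = Model([kid_in] + pic_ins, probs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

The label for each training example would simply be the index (0 to 9) of the picture the kid picked.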

Any further advice or ideas would be welcome!


Get this bounty!!!

#StackBounty: #machine-learning #neural-networks #calibration Is it a good idea to continue training a model after the train/validation…

Bounty: 50

The following animated diagram shows the training statistics of a Deep Neural Network classifier at the end of each epoch:

[Training diagram]

The diagrams on the left show the accuracy (upper) and loss (lower) values on training and validation data per epoch. The diagrams on the right show the distribution of confidence values (i.e. maximum of softmax scores) in training (upper) and validation (lower) data.

As you can see, both the training and validation accuracy values stop improving after a certain epoch (i.e. epoch #25) and reach a plateau. However, the confidence of (correct) predictions keeps increasing as we continue training, and the training and validation loss values still have a decreasing trend (which is consistent with that). Now:

  • Is it safe to claim that the model’s predictions are much more confident at, say, epoch #50 compared to epoch #25, and that it is therefore a better model to use? (My own answer is yes, because the effect also appears on the held-out validation set.)
  • Is this a good approach – i.e. continuing training after reaching the accuracy plateau, while keeping an eye on the loss value – to get a model with much more confident predictions, especially in applications where not only the correctness of a prediction but also its confidence is of great importance? Or are there better alternatives? (For example, I can see a trade-off of training time/computational resources vs. higher confidence.)
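
(For reference, the confidence values in the right-hand plots are just the maximum softmax score per example; a rough sketch of how they can be logged per epoch in Keras, assuming integer class labels and a softmax output layer – not my exact logging code:)

import tensorflow as tf

class ConfidenceLogger(tf.keras.callbacks.Callback):
    """Log the max-softmax confidence on a held-out set after each epoch."""
    def __init__(self, x_val, y_val):
        super().__init__()
        self.x_val, self.y_val = x_val, y_val

    def on_epoch_end(self, epoch, logs=None):
        probs = self.model.predict(self.x_val, verbose=0)    # softmax outputs
        conf = probs.max(axis=1)                              # confidence per example
        correct = probs.argmax(axis=1) == self.y_val          # assumes integer labels
        print(f"epoch {epoch}: mean confidence (correct) = {conf[correct].mean():.3f}, "
              f"(incorrect) = {conf[~correct].mean():.3f}")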


Get this bounty!!!

#StackBounty: #neural-networks #optimization #convex #non-convex Are there any "convex neural networks"?

Bounty: 50

Are there any neural network training procedures that involve solving a convex problem?

Note that I am referring more to MLPs, rather than (multi-class) logistic regression, which is a neural network with no hidden layers.

I know that for MLPs, if there is no activation function in between (e.g. an identity activation function), then the entire model is simply $\hat y = W_n \cdots W_1 x$, where $x$ is your example and $\hat y$ is your output, and this obviously leads to a convex problem (linear regression)
$\min_W \|\hat y - y\|$.
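
As a sanity check on the identity-activation case, here is a tiny numpy snippet (row-vector convention, arbitrary sizes) showing that the stacked weight matrices collapse into a single linear map, so fitting them is just linear regression:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 5))              # 100 examples, 5 features

W1 = rng.normal(size=(5, 8))               # three "layers", identity activations
W2 = rng.normal(size=(8, 8))
W3 = rng.normal(size=(8, 1))

y_deep = x @ W1 @ W2 @ W3                  # output of the deep linear network
W_single = W1 @ W2 @ W3                    # the equivalent single weight matrix
print(np.allclose(y_deep, x @ W_single))   # True: same model as plain linear regression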

$\hat y = \text{softmax}(W_n \cdots W_1 x)$ is also convex, I believe (a composition of convex functions).

  • What about the case when there is a nonlinearity in between (or at the output)? Does adding ANY standard choice of nonlinearity automatically lead to a non-convex problem?

  • In the same vein, are there any convex models except for (multi-class) logistic regression and MLPs with no hidden layers?


Get this bounty!!!

#StackBounty: #neural-networks #python #recommender-system #metric Metrics for implicit data in the recommender system with NCF

Bounty: 100

Which metrics do you use for analysis and evaluation of implicit data in a recommender system? And which ones do you use when you are looking for the closest neighbors to make a recommendation?

I’m using the NCF model.

[Image: The architecture of a Neural Collaborative Filtering model, taken from the Neural Collaborative Filtering paper.]

First I train the model using NCF. Then I find the closest neighbors with k-means.

I found metrics like MSE, RMSE, Precision, Recall, …, and the question "What metric should I use for assessing Implicit matrix factorization recommender with ALS?".
I’m not sure which ones are best, nor how I can then determine whether the closest neighbors are good or bad.
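
(For the Precision/Recall direction, what I have in mind is the top-K variant computed against held-out interactions – a rough sketch with made-up item ids:)

def precision_recall_at_k(recommended, relevant, k=10):
    # recommended: ranked list of item ids for one user
    # relevant: set of held-out items the user actually interacted with
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# example: the user interacted with items 3 and 17 in the held-out period
p, r = precision_recall_at_k([5, 3, 42, 17, 8], {3, 17}, k=5)
print(p, r)   # 0.4 1.0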

  • What metrics are there to evaluate the model?
  • What metrics are there to evaluate whether the neighbors found are "good"?


Get this bounty!!!

#StackBounty: #machine-learning #time-series #neural-networks #python #predictive-models Time series prediction completely off using ESN

Bounty: 50

I am attempting to predict closing prices, based on closing prices extracted from OHLC data over a two-month window at 10-minute intervals (roughly 8600 data points). For this attempt, I am building an echo state network (ESN), following this tutorial.

With the code below, the prediction is fairly worthless. It looks like noise around an arbitrary average, which does not even resemble the latest data point in the training data. This is nothing close to what the ESN in the tutorial managed at this point. I have tried to alter the results by manually tweaking the hyperparameters n_reservoir, sparsity, and spectral_radius, but all to no avail. During a 4-week course last spring ESNs were briefly touched upon, but not enough for me to understand where I am at fault.

[Plot: ground truth vs. free-running ESN output]

My code:

import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from pyESN import ESN
import warnings
warnings.filterwarnings('ignore')

# load the 10-minute closing prices
df = pd.read_csv('path')
data = df['close'].to_numpy()

# reservoir hyperparameters (taken from the tutorial, then tweaked by hand)
n_reservoir = 500
sparsity = 0.2
spectral_radius = 1.2
noise = .0005

esn = ESN(n_inputs=1,
          n_outputs=1,
          n_reservoir=n_reservoir,
          sparsity=sparsity,
          spectral_radius=spectral_radius,
          noise=noise)

trainlen = 7000      # number of points used for fitting in each window
future = 10          # points predicted per step
futureTotal = 100    # total points predicted
pred_tot = np.zeros(futureTotal)

# sliding-window fit/predict: refit on the latest `trainlen` points,
# then free-run the ESN for the next `future` points
for i in range(0, futureTotal, future):
    pred_training = esn.fit(np.ones(trainlen), data[i:trainlen + i])
    prediction = esn.predict(np.ones(future))
    pred_tot[i:i + future] = prediction[:, 0]

# plot ground truth vs. the stitched-together predictions
plt.plot(range(0, trainlen + futureTotal), data[0:trainlen + futureTotal], 'b', label="Data", alpha=0.3)
plt.plot(range(trainlen, trainlen + futureTotal), pred_tot, 'k', alpha=0.8, label='Free Running ESN')

# vertical line marking where training ends and prediction begins
lo, hi = plt.ylim()
plt.plot([trainlen, trainlen], [lo + np.spacing(1), hi - np.spacing(1)], 'k:', linewidth=4)

plt.title(r'Ground Truth and Echo State Network Output')
plt.xlabel(r'Time', labelpad=10)
plt.ylabel(r'Price ($)', labelpad=10)
plt.legend(loc='best')
sns.despine()
plt.show()


Get this bounty!!!

#StackBounty: #machine-learning #neural-networks #tensorflow #keras #differential-equations On solving ode/pde with Neural Networks

Bounty: 50

Recently, I watched this video on YouTube on solving ODEs/PDEs with a neural network, and it motivated me to write some short code in Keras. Also, I believe the video is referencing this paper, found here.

I selected an example ODE
$$
\frac{\partial^2 x(t)}{\partial t^2} + 14 \frac{\partial x(t)}{\partial t} + 49x(t) = 0
$$

with initial conditions
$$
x(0) = 0, \quad \frac{\partial x(t)}{\partial t}\Big\rvert_{t=0} = -3
$$

According to the video, if I understand correctly, we let the neural network $\hat{x}(t)$ be the solution of our ODE, so $x(t) \approx \hat{x}(t)$.

Then we minimize the ODE residual, which is our custom cost function, so to speak. Since we have initial conditions, I created a step function for the individual data-point loss:

At $t=0$:
$$
loss_i = \left( \frac{\partial^2 \hat{x}(t_i)}{\partial t^2} + 14 \frac{\partial \hat{x}(t_i)}{\partial t} + 49\hat{x}(t_i) \right)^2 +
\left( \frac{\partial \hat{x}(t_i)}{\partial t} + 3 \right)^2 +
\left( \hat{x}(t_i) \right)^2
$$

else
$$
loss_i = \left( \frac{\partial^2 \hat{x}(t_i)}{\partial t^2} + 14 \frac{\partial \hat{x}(t_i)}{\partial t} + 49\hat{x}(t_i) \right)^2
$$

Then, minimize batch loss
$$
\min \frac{1}{b} \sum_{i}^{b} loss_i
$$

where $b$ is the batch size in training.

Unfortunately, the network always learns zero. This makes sense: the first and second derivatives are very small, and the $x$ coefficient is very large (i.e. $49$), so the network learns that a zero output is a good minimizer.

[Plot of the resulting solution]

Now there is a chance that I am misinterpreting the video, because I think my code is correct. If someone can shed some light, I would truly appreciate it.

Is my cost function correct? Do I need some other transformation?

Update:

I managed to improve the training by removing the conditional cost function. What was happening was that the conditions were very infrequent, so the network was not adjusting enough for the initial conditions.

By changing the cost function to the following, the network now has to satisfy the initial conditions at every step:

$$
loss_i = \left( \frac{\partial^2 \hat{x}(t_i)}{\partial t^2} + 14 \frac{\partial \hat{x}(t_i)}{\partial t} + 49\hat{x}(t_i) \right)^2 +
\left( \frac{\partial \hat{x}(t)}{\partial t}\Big\rvert_{t=0} + 3 \right)^2 +
\left( \hat{x}(0) \right)^2
$$
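
For concreteness, here is a rough sketch of how I think this modified loss can be written with tf.GradientTape (not my exact code; the network size, learning rate, and the sampling range for $t$ are arbitrary placeholders):

import numpy as np
import tensorflow as tf

# small network approximating x_hat(t)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='sigmoid', input_shape=(1,)),
    tf.keras.layers.Dense(1)
])
optimizer = tf.keras.optimizers.Adam(1e-3)

def loss_fn(t):
    # ODE residual at the sampled points
    with tf.GradientTape() as g2:
        g2.watch(t)
        with tf.GradientTape() as g1:
            g1.watch(t)
            x = model(t)
        dx = g1.gradient(x, t)
    d2x = g2.gradient(dx, t)
    residual = d2x + 14.0 * dx + 49.0 * x
    # initial-condition terms, evaluated at t = 0 on every step
    t0 = tf.zeros((1, 1))
    with tf.GradientTape() as g0:
        g0.watch(t0)
        x0 = model(t0)
    dx0 = g0.gradient(x0, t0)
    return (tf.reduce_mean(tf.square(residual))
            + tf.square(dx0[0, 0] + 3.0)
            + tf.square(x0[0, 0]))

for step in range(2000):
    t_batch = tf.constant(np.random.uniform(0.0, 1.0, (64, 1)), dtype=tf.float32)
    with tf.GradientTape() as tape:
        loss = loss_fn(t_batch)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

(The earlier conditional version would only add the two initial-condition terms for the data points where $t_i = 0$.)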

The results are not perfect, but better. I have not managed to get the loss close to zero. Deep networks have not worked at all, only a shallow one with sigmoid activations and lots of epochs.

I am surprised this works at all since the cost function depends on derivatives of non-trainable parameters.

[Plot of the improved solution]

I would appreciate any input on improving the solution. I have seen a lot of fancy methods, but this is the most straightforward. For example, in the paper referenced above, the author uses a trial solution; I do not understand how that works at all.


Get this bounty!!!

#StackBounty: #neural-networks #feature-selection #dimensionality-reduction #autoencoders #attention Does Attention Help with standard …

Bounty: 50

I understand the use of attention mechanisms in the encoder-decoder for sequence-to-sequence problem such as a language translator.

I am just trying to figure out whether it is possible to use attention mechanisms with standard auto-encoders for feature extraction, where the goal is to compress the data into a latent vector.

Suppose we had time series data with N dimensions and wanted to use an auto-encoder with an attention mechanism (I am thinking of self-attention because I think it is more appropriate in this case – I might be wrong) to better learn the interdependencies within the input sequence, and thus get a better latent vector L.
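
Roughly, this is the kind of architecture I have in mind (a minimal Keras sketch; the sequence length, feature dimensions, and layer sizes are arbitrary placeholders):

import tensorflow as tf
from tensorflow.keras import layers, Model

T, N, LATENT = 100, 8, 32                            # placeholders: length, dims, latent size

inp = layers.Input(shape=(T, N))
# self-attention over time steps: each step can attend to every other step
att = layers.MultiHeadAttention(num_heads=2, key_dim=16)(inp, inp)
x = layers.LayerNormalization()(att + inp)           # residual connection
x = layers.GlobalAveragePooling1D()(x)               # collapse the time axis
latent = layers.Dense(LATENT, name='latent')(x)      # the compressed representation L

# decoder: expand the latent vector back into a full sequence
d = layers.RepeatVector(T)(latent)
d = layers.Dense(64, activation='relu')(d)
out = layers.Dense(N)(d)

autoencoder = Model(inp, out)
autoencoder.compile(optimizer='adam', loss='mse')
# autoencoder.fit(X, X, ...)  with X of shape (samples, T, N)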

Or would it be better to use a Recurrent Neural Network or one of its variants in this case?

Does anyone have better ideas or an intuition about this?


Get this bounty!!!