## #StackBounty: #time-series #neural-networks #forecasting #predictive-models #keras what does it mean when keras+tensorflow predicts all…

### Bounty: 100

From what I understand, in supervised learning problems there is a dependent variable Y, which I included in my ANN. There should be one matching prediction per sample for Y, so the number of predictions should equal the number of true values given.

The problem I’m having is that after calling model.predict() in Keras, the ANN gives me the Y dependent variable plus the 10 timesteps of Y that I supplied alongside the predictors (I think).

My training dataset includes 10 timesteps for each variable. I assumed I could use the timesteps to insert lagged versions of each predictor variable.

Basically, I don’t understand what these 10 predicted timesteps of the Y variable are. They are not lagged versions of the predicted Y at time t.

The reason I’m asking is that I don’t know whether the model’s overall score should really include the predicted timesteps of Y. Should I ignore them or include them?

Also, in terms of prediction, which values do I use? Just the ones at time t?

Is Y(t−1) the predicted value of each predictor at timestep (t−1), in the same way as for Y?
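As an aside on shapes: one common explanation for output like this is that the final recurrent layer returns a sequence (return_sequences=True in Keras), in which case model.predict() emits one value per input timestep, and the forecast at time t is the last step. A minimal NumPy sketch of the two target layouts (all names here are illustrative, not from the original model):

```python
import numpy as np

series = np.arange(100, dtype=float)  # toy univariate series
T = 10                                # timesteps per sample, as in the question

# Samples: T lagged values per row.
X = np.stack([series[i:i + T] for i in range(len(series) - T)])  # shape (90, 10)

# Layout 1: one target per sample -> predict() returns shape (90, 1).
y_point = series[T:]                                             # shape (90,)

# Layout 2: one target per timestep (sequence output) -> predict()
# returns shape (90, 10, 1); the forecast at time t is the last step.
y_seq = np.stack([series[i + 1:i + T + 1] for i in range(len(series) - T)])  # shape (90, 10)

print(X.shape, y_point.shape, y_seq.shape)
```

Under this reading, only the last step of each predicted sequence would be the forecast you score; the earlier steps are per-timestep outputs, not extra forecasts.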

Get this bounty!!!

## #StackBounty: #neural-networks #maximum-likelihood #regularization #kullback-leibler Label smoothing and KL divergence

### Bounty: 50

I am reading the paper Regularizing Neural Networks by Penalizing Confident Output Distributions where the authors introduce label smoothing in section 3.2. For a neural network that produces a conditional distribution $$p_\theta(y|x)$$ over classes $$y$$ given an input $$x$$ through a softmax function, the label smoothing loss function is defined as:

$$\mathcal{L}(\theta) = -\sum \log p_\theta(y|x) - D_{\mathrm{KL}}(u\,\|\,p_\theta(y|x))$$

where $$D_{\mathrm{KL}}$$ refers to the KL divergence and $$u$$ to the uniform distribution. However, my understanding is that minimising this expression would in fact attempt to maximise the KL divergence and, since this is a measure of the dissimilarity between the posterior distribution and the uniform distribution, this would encourage the opposite of smoothing. Where is my understanding falling down here?

Trying to get to the bottom of this I noticed a few things. In the next line of the paper the authors mention that

By reversing the direction of the KL divergence, $$D_{\mathrm{KL}}(p_\theta(y|x)\,\|\,u)$$, we recover the confidence penalty.

where, for entropy function $$H$$ and constant $$\beta$$, the confidence penalty is defined as
$$\mathcal{L}(\theta) = -\sum \log p_\theta(y|x) - \beta H(p_\theta(y|x))$$

However, when I do the derivation myself I obtain (up to an additive constant)

$$\mathcal{L}(\theta) = -\sum \log p_\theta(y|x) + \beta H(p_\theta(y|x))$$

Since the experiments all use positive-valued $$\beta$$’s, this suggests to me that the original equation is perhaps a typo and should add the KL divergence rather than subtract it.

I have checked all the versions of the paper I could find online and the original label smoothing equation always subtracts the KL divergence.
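For what it’s worth, the reversed direction can be expanded directly. For $$K$$ classes with $$u(y) = 1/K$$:

$$D_{\mathrm{KL}}\bigl(p_\theta(y|x)\,\|\,u\bigr) = \sum_y p_\theta(y|x)\,\log\frac{p_\theta(y|x)}{1/K} = \log K - H\bigl(p_\theta(y|x)\bigr)$$

so subtracting $$\beta D_{\mathrm{KL}}(p_\theta(y|x)\,\|\,u)$$ from the negative log-likelihood gives $$-\sum \log p_\theta(y|x) + \beta H(p_\theta(y|x))$$ up to the constant $$-\beta \log K$$, whereas recovering the stated confidence penalty (which carries $$-\beta H$$) requires adding the KL term instead.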

Get this bounty!!!

## #StackBounty: #distributions #neural-networks #classification Confusion over concatenating EMG data from different muscles to a single …

### Bounty: 50

I’m trying to predict Freezing of Gait (FoG) in Parkinson’s patients using EMG signals recorded from three muscles of each subject: the tibialis anterior of the right leg, the gastrocnemius of the right leg, and the tibialis anterior of the left leg. It’s a two-class classification problem.

Should I concatenate the data from these three columns into a single column before applying a window function, or should I keep the data from the three muscles in separate columns and process them separately? Data from different muscles have different distributions, so putting them in a single column may confuse the deep learning model we are going to build.

I’ve shared signals (4000 samples) from three muscles for a particular subject below:
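One common approach is to keep the muscles as separate channels and window them jointly, so each window stays time-aligned across muscles. A hypothetical NumPy sketch (the window and hop sizes are illustrative, and random noise stands in for the real EMG):

```python
import numpy as np

rng = np.random.default_rng(0)
emg = rng.normal(size=(4000, 3))   # 4000 samples x 3 muscles (stand-in for real EMG)

win, hop = 200, 100                # illustrative window length and hop, in samples
n_win = 1 + (len(emg) - win) // hop

# Each window keeps the 3 muscles as separate, time-aligned channels.
windows = np.stack([emg[i * hop : i * hop + win] for i in range(n_win)])

print(windows.shape)  # windows x time x channels
```

Per-channel standardisation (subtracting each muscle’s mean and dividing by its standard deviation) is a common way to handle the differing distributions without collapsing the channels into one column.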

Get this bounty!!!

## #StackBounty: #machine-learning #neural-networks #moving-average #batch-normalization Why does batch norm use exponentially weighted a…

### Bounty: 50

I was watching a lecture by Andrew Ng on batch normalization. When discussing inference (prediction) on a test set, it is said that an exponentially weighted average (EWA) of the batch normalization statistics is used. My question is: why use an exponentially weighted average instead of a "simple" average without any weights (or, to be precise, with equal weights)?

I intuit that:

1. the latest batches are computed with weights closest to the final ones, so we want them to influence the test-time statistics the most;
2. at the same time, we do not want to discard a significant part of the data seen earlier in training, so we let it influence predictions too, but to a smaller degree (smaller weights).
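To make the intuition concrete, here is a hypothetical sketch (the momentum value is illustrative) comparing a running exponentially weighted average with a plain mean over a drifting sequence of batch means:

```python
import numpy as np

# Simulated batch means that drift during training, as BN statistics do
# while the network's weights are still changing.
batch_means = np.linspace(0.0, 1.0, 100)

momentum = 0.9          # illustrative EWA momentum
running = 0.0
for m in batch_means:
    running = momentum * running + (1.0 - momentum) * m

simple = batch_means.mean()

# The EWA ends up close to the late (most relevant) batches, while the
# plain mean is dragged toward the early, stale statistics.
print(round(running, 3), round(simple, 3))
```

The EWA finishes near the final batch statistics, whereas the equally weighted mean sits halfway back in the drift, which is exactly the behaviour the two intuitions above describe.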

Get this bounty!!!

## #StackBounty: #machine-learning #neural-networks #natural-language #transformers Why are the embeddings of tokens multiplied by \$sqrt …

### Bounty: 100

Why does the transformer tutorial in PyTorch have a multiplication by the square root of the embedding size? I know there is a division by sqrt(D) in multi-headed self-attention, but why is something similar applied to the output of the encoder? Especially since the original paper doesn’t seem to mention it.

In particular (https://pytorch.org/tutorials/beginner/translation_transformer.html):

``````
src = self.encoder(src) * math.sqrt(self.ninp)
``````
``````
# helper Module to convert tensor of input indices into corresponding tensor of token embeddings
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, emb_size):
        super(TokenEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: Tensor):
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)
``````

Note that I am aware that the Attention layer has this equation:

$$\alpha = \mathrm{Attention}(Q,K,V) = \mathrm{SoftMax}\!\left(\frac{QK^\top}{\sqrt{D}}\right)V$$

and they justify it in a footnote of the paper (for independent, unit-variance components, the dot product $$q^\top k$$ has variance $$D$$, so dividing by $$\sqrt{D}$$ keeps the logits at unit scale).

Is this related to that comment, and if so, how? Is it mentioned in the original paper?
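A common explanation (not stated in the tutorial itself) is that when the embedding weights are initialised with standard deviation $$1/\sqrt{D}$$, multiplying by $$\sqrt{D}$$ restores roughly unit variance, so the embeddings are not drowned out by the $$O(1)$$ sinusoidal positional encodings added immediately afterwards. A quick NumPy check of that scaling argument (the init scheme here is an assumption, not PyTorch’s nn.Embedding default):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 512                                        # embedding dimension

# Assumed init: weights ~ N(0, 1/D), i.e. std = 1/sqrt(D).
emb = rng.normal(0.0, 1.0 / np.sqrt(D), size=(1000, D))
scaled = emb * np.sqrt(D)                      # the tutorial's multiplication

print(round(float(emb.std()), 4), round(float(scaled.std()), 4))
```

Under that assumption the multiplication simply undoes the small init scale, putting token embeddings and positional encodings on comparable magnitudes.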

cross posted:

Get this bounty!!!
