#StackBounty: #machine-learning #neural-networks #natural-language #transformers Why are the embeddings of tokens multiplied by $sqrt …

Bounty: 50

Why does the PyTorch transformer tutorial multiply the token embeddings by the square root of the embedding dimension? I know there is a division by sqrt(D) in multi-head self-attention, but why is something similar applied to the output of the `encoder` (which is the embedding layer in the tutorial)? Especially because the original paper doesn’t seem to mention it.

In particular (https://pytorch.org/tutorials/beginner/transformer_tutorial.html):

src = self.encoder(src) * math.sqrt(self.ninp)

or this (https://pytorch.org/tutorials/beginner/translation_transformer.html):

import math

import torch.nn as nn
from torch import Tensor

# helper Module to convert tensor of input indices into corresponding tensor of token embeddings
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, emb_size: int):
        super(TokenEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: Tensor):
        # look up the embeddings, then scale them by sqrt(emb_size)
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)
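
To see what the multiplication does numerically, here is a minimal torch-free sketch. It assumes the lookup table is initialised from a standard normal distribution (which, as far as I know, is the default for `nn.Embedding`); `make_table`, `embed`, and `rms` are hypothetical helpers for illustration and are not part of PyTorch or the tutorials.

```python
import math
import random


def make_table(vocab_size, emb_size, seed=0):
    """Build a toy lookup table with N(0, 1) entries (assumed initialisation)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(emb_size)] for _ in range(vocab_size)]


def embed(table, tokens, scale):
    """Look up each token's row, optionally multiplying by sqrt(emb_size)."""
    factor = math.sqrt(len(table[0])) if scale else 1.0
    return [[factor * x for x in table[t]] for t in tokens]


def rms(rows):
    """Root-mean-square over all elements of a list of vectors."""
    values = [x for row in rows for x in row]
    return math.sqrt(sum(x * x for x in values) / len(values))


table = make_table(vocab_size=100, emb_size=512)
tokens = [3, 14, 15]
raw_rms = rms(embed(table, tokens, scale=False))
scaled_rms = rms(embed(table, tokens, scale=True))
print(f"raw: {raw_rms:.2f}  scaled: {scaled_rms:.2f}  ratio: {scaled_rms / raw_rms:.2f}")
```

The ratio equals sqrt(emb_size) by construction. The only point is that, under the assumed initialisation, the scaled entries have a typical magnitude of about sqrt(512) rather than about 1.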

Note that I am aware that the Attention layer has this equation:

$$
\alpha = \mathrm{Attention}(Q,K,V) = \mathrm{SoftMax}\left( \frac{Q K^\top}{\sqrt{D}} \right) V
$$

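and the paper justifies it in a footnote: if the components of the queries and keys are independent with mean 0 and variance 1, their dot product has variance $D$, so dividing by $\sqrt{D}$ brings the variance back to 1.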
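
The paper's variance argument for the attention scaling can be checked empirically. The sketch below is only an illustration, assuming independent query and key components with mean 0 and variance 1; `dot_variance` is a hypothetical helper, and the dimension and number of trials are arbitrary choices.

```python
import math
import random
import statistics


def dot_variance(dim, trials=4000, seed=0):
    """Estimate Var(q . k) and Var(q . k / sqrt(dim)) for N(0, 1) components."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        k = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    # dividing each dot product by sqrt(dim) divides the variance by dim
    scaled = [d / math.sqrt(dim) for d in dots]
    return statistics.pvariance(dots), statistics.pvariance(scaled)


raw_var, scaled_var = dot_variance(dim=64)
print(f"Var(q.k) ~ {raw_var:.1f}, Var(q.k / sqrt(D)) ~ {scaled_var:.2f}")
```

With these assumptions the unscaled variance should come out close to the dimension (64 here) and the scaled one close to 1, which is the effect the division by $\sqrt{D}$ is meant to have on the softmax inputs.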

Is the embedding scaling related to that argument, and if so, how? Is it mentioned anywhere in the original paper?

