#StackBounty: #nlp #language-model #stanford-nlp #ngrams In smoothing of n-gram model in NLP, why don't we consider start and end o…

Bounty: 50

When learning Add-1 smoothing, I noticed that we add 1 to the count of each word in our vocabulary, but we do not count the start-of-sentence and end-of-sentence markers as two words in the vocabulary. Let me give an example to explain.


Assume we have a corpus of three sentences: "John read Moby Dick", "Mary read a different book", and "She read a book by Cher".
After training our bigram model on this corpus, we need to evaluate the probability of the sentence "John read a book", i.e. to find $P(John\; read\; a\; book)$.

To differentiate John appearing anywhere in a sentence from its appearance at the beginning, and likewise for book appearing at the end, we instead try to find $P(<s>\; John\; read\; a\; book\; </s>)$ after introducing two more tokens, $<s>$ and $</s>$, indicating the start and the end of a sentence respectively.

Finally, we arrive at

$P(<s>\; John\; read\; a\; book\; </s>) = P(John|<s>)\,P(read|John)\,P(a|read)\,P(book|a)\,P(</s>|book) = \frac{1}{3} \cdot \frac{1}{1} \cdot \frac{2}{3} \cdot \frac{1}{2} \cdot \frac{1}{2}$
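For anyone who wants to check the arithmetic, here is a minimal Python sketch of the unsmoothed (maximum-likelihood) bigram estimate on the three-sentence corpus. The helper names (`tokenize`, `mle_sentence_prob`) are just illustrative choices of mine, and it assumes whitespace tokenization with `<s>` and `</s>` wrapped around each sentence.

```python
from collections import Counter
from fractions import Fraction

corpus = [
    "John read Moby Dick",
    "Mary read a different book",
    "She read a book by Cher",
]

def tokenize(sentence):
    # Wrap each sentence in start and end markers.
    return ["<s>"] + sentence.split() + ["</s>"]

bigram_counts = Counter()   # count(prev, word)
context_counts = Counter()  # number of bigrams that start with prev
for sentence in corpus:
    tokens = tokenize(sentence)
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1

def mle_sentence_prob(sentence):
    # Product of count(prev, word) / count(prev) over all bigrams.
    tokens = tokenize(sentence)
    prob = Fraction(1)
    for prev, word in zip(tokens, tokens[1:]):
        if context_counts[prev] == 0:
            return Fraction(0)  # unseen context: no estimate without smoothing
        prob *= Fraction(bigram_counts[(prev, word)], context_counts[prev])
    return prob

print(mle_sentence_prob("John read a book"))  # 1/18
print(mle_sentence_prob("Cher read a book"))  # 0
```

The second call returns 0 because the bigram `<s> Cher` never occurs in the corpus, which is the situation smoothing is meant to handle.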

My Question:
Now, to find $P(Cher\; read\; a\; book)$ using Add-1 smoothing (Laplace smoothing), shouldn’t we add 1 to the count for ‘Cher’ appearing first in a sentence, even though that never happens in the corpus? And for that, shouldn’t we include $<s>$ and $</s>$ in our vocabulary? With this, our calculation becomes

$P(Cher|<s>)\,P(read|Cher)\,P(a|read)\,P(book|a)\,P(</s>|book) = \frac{0+1}{3+13} \cdot \frac{0+1}{1+13} \cdot \frac{2+1}{3+13} \cdot \frac{1+1}{2+13} \cdot \frac{1+1}{2+13}$

The 13 added to each denominator is the number of unique words in the vocabulary: 11 English words from our 3-sentence corpus plus 2 tokens, the start and end of a sentence. In a few places, I see 11 added to the denominator instead of 13, and I am wondering what I’m missing here!
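To make the difference concrete, here is a small Python sketch that computes the Add-1 estimate with the vocabulary size passed in as a parameter, so the 13 and 11 variants can be compared side by side. The function name `add_one_sentence_prob` and the `vocab_size` parameter are illustrative assumptions of mine, and the sketch does not by itself settle which vocabulary size is the right one.

```python
from collections import Counter
from fractions import Fraction

corpus = [
    "John read Moby Dick",
    "Mary read a different book",
    "She read a book by Cher",
]

def tokenize(sentence):
    # Wrap each sentence in start and end markers.
    return ["<s>"] + sentence.split() + ["</s>"]

bigram_counts = Counter()   # count(prev, word)
context_counts = Counter()  # number of bigrams that start with prev
for sentence in corpus:
    tokens = tokenize(sentence)
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1

# Distinct English words in the corpus, without <s> and </s>.
words = {w for sentence in corpus for w in sentence.split()}

def add_one_sentence_prob(sentence, vocab_size):
    # Product of (count(prev, word) + 1) / (count(prev) + vocab_size).
    tokens = tokenize(sentence)
    prob = Fraction(1)
    for prev, word in zip(tokens, tokens[1:]):
        prob *= Fraction(bigram_counts[(prev, word)] + 1,
                         context_counts[prev] + vocab_size)
    return prob

print(len(words))                                               # 11
print(add_one_sentence_prob("Cher read a book", len(words) + 2))  # 1/67200
print(add_one_sentence_prob("Cher read a book", len(words)))      # 1/33124
```

With 13 in the denominators the product matches the expression above; with 11 every factor gets slightly larger, so the sentence probability roughly doubles.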
