# #StackBounty: #nlp #language-model #stanford-nlp #ngrams In smoothing of n-gram model in NLP, why don't we consider start and end o…

### Bounty: 50

When learning Add-1 smoothing, I found that we add 1 to the count of each word in our vocabulary, but we do not count the start-of-sentence and end-of-sentence markers as two words in the vocabulary. Let me give an example to explain.

Example:

Assume we have a corpus of three sentences: "`John read Moby Dick`", "`Mary read a different book`", and "`She read a book by Cher`".
After training our bigram model on this corpus of three sentences, we need to evaluate the probability of the sentence "John read a book", i.e., to find $$P(\text{John read a book})$$

To differentiate John appearing anywhere in a sentence from its appearance at the beginning, and likewise for book appearing at the end, we instead try to find $$P(\text{John read a book})$$ after introducing two more words, $\langle s\rangle$ and $\langle/s\rangle$, indicating the start and the end of a sentence respectively.

Finally, we arrive at

$$P(\text{John read a book}) = P(\text{John}\mid\langle s\rangle)\,P(\text{read}\mid\text{John})\,P(\text{a}\mid\text{read})\,P(\text{book}\mid\text{a})\,P(\langle/s\rangle\mid\text{book}) = \frac{1}{3}\cdot\frac{1}{1}\cdot\frac{2}{3}\cdot\frac{1}{2}\cdot\frac{1}{2}$$
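
To make these counts concrete, here is a minimal Python sketch (my own, not from any textbook or library; the token names `<s>` and `</s>` are just my convention for the markers) that pads each sentence, tallies unigram and bigram counts, and reproduces the product above:

```python
from collections import Counter

corpus = [
    "John read Moby Dick",
    "Mary read a different book",
    "She read a book by Cher",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    # Pad with start/end markers so sentence position is modeled.
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p_mle(word, prev):
    """Unsmoothed (maximum-likelihood) bigram probability P(word | prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

test = ["<s>", "John", "read", "a", "book", "</s>"]
prob = 1.0
for prev, word in zip(test, test[1:]):
    prob *= p_mle(word, prev)
print(prob)  # 1/3 * 1/1 * 2/3 * 1/2 * 1/2 = 1/18 ≈ 0.0556
```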

My Question:
Now, to find $$P(\text{Cher read a book})$$ using Add-1 smoothing (Laplace smoothing), shouldn't we account for the word 'Cher' appearing first in a sentence, a position where it was never seen in training? And for that, we must add $\langle s\rangle$ and $\langle/s\rangle$ to our vocabulary. With this, our calculation becomes

$$P(\text{Cher}\mid\langle s\rangle)\,P(\text{read}\mid\text{Cher})\,P(\text{a}\mid\text{read})\,P(\text{book}\mid\text{a})\,P(\langle/s\rangle\mid\text{book}) = \frac{0+1}{3+13}\cdot\frac{0+1}{1+13}\cdot\frac{2+1}{3+13}\cdot\frac{1+1}{2+13}\cdot\frac{1+1}{2+13}$$

The 13 added to each denominator is the size of our vocabulary: the 11 unique English words in the 3-sentence corpus plus the 2 tokens marking the start and end of a sentence. In a few places, I see 11 added to the denominator instead of 13, and I wonder what I'm missing here!
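
Extending the same sketch with Add-1 smoothing, under my assumption that the vocabulary includes both markers (so $V = 13$), the calculation looks like this:

```python
V = len(unigrams)  # 11 words + <s> + </s> = 13 types (my assumption)

def p_laplace(word, prev):
    """Add-1 (Laplace) smoothed bigram probability."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

test = ["<s>", "Cher", "read", "a", "book", "</s>"]
prob = 1.0
for prev, word in zip(test, test[1:]):
    prob *= p_laplace(word, prev)
print(prob)  # (1/16)(1/14)(3/16)(2/15)(2/15) ≈ 1.5e-5
```

Setting `V = 11` instead (i.e., excluding the two markers from the vocabulary) reproduces the other denominator I've seen, which is exactly the discrepancy I'm asking about.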
