*Bounty: 50*

When learning Add-1 smoothing, I found that we add 1 for each word in our vocabulary, but it is unclear whether the start-of-sentence and end-of-sentence markers should count as two extra words in that vocabulary. Let me give an example to explain.

**Example:**

Assume we have a corpus of three sentences: "`John read Moby Dick`", "`Mary read a different book`", and "`She read a book by Cher`".

After training our bigram model on this corpus of three sentences, we want to evaluate the probability of the sentence "John read a book", i.e. to find $P(\text{John read a book})$.

To distinguish *John* appearing anywhere in a sentence from its appearance at the beginning, and likewise for *book* appearing at the end, we instead compute $P(\langle s\rangle\,\text{John read a book}\,\langle/s\rangle)$ after introducing two more tokens, $\langle s\rangle$ and $\langle/s\rangle$, marking the start and the end of a sentence respectively.

Finally, we compute $P(\langle s\rangle\,\text{John read a book}\,\langle/s\rangle)$ as

$$P(\text{John}\mid\langle s\rangle)\,P(\text{read}\mid\text{John})\,P(\text{a}\mid\text{read})\,P(\text{book}\mid\text{a})\,P(\langle/s\rangle\mid\text{book}) = \frac{1}{3}\cdot\frac{1}{1}\cdot\frac{2}{3}\cdot\frac{1}{2}\cdot\frac{1}{2}$$
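As a quick sanity check (not part of the original question), this maximum-likelihood bigram calculation can be sketched in Python; the token names `<s>`/`</s>` are just the boundary markers described above:

```python
from collections import Counter

# Toy corpus from the question, padded with sentence-boundary tokens.
corpus = [
    "John read Moby Dick",
    "Mary read a different book",
    "She read a book by Cher",
]
sentences = [["<s>"] + s.split() + ["</s>"] for s in corpus]

# Unigram and bigram counts over the padded sentences.
unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter((s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1))

def p_mle(word, prev):
    """Unsmoothed (maximum-likelihood) bigram probability P(word | prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

# P(<s> John read a book </s>) as a product of bigram probabilities.
test = ["<s>", "John", "read", "a", "book", "</s>"]
prob = 1.0
for prev, word in zip(test, test[1:]):
    prob *= p_mle(word, prev)

print(prob)  # 1/3 * 1/1 * 2/3 * 1/2 * 1/2 = 1/18 ≈ 0.0556
```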

**My Question:**

Now, to find $P(\text{Cher read a book})$ using Add-1 (Laplace) smoothing, don't we need to account for *Cher* appearing at the start of a sentence, a position where it was never seen in training? And for that, we must include $\langle s\rangle$ and $\langle/s\rangle$ in our vocabulary. With this, our calculation becomes

$$P(\text{Cher}\mid\langle s\rangle)\,P(\text{read}\mid\text{Cher})\,P(\text{a}\mid\text{read})\,P(\text{book}\mid\text{a})\,P(\langle/s\rangle\mid\text{book}) = \frac{0+1}{3+13}\cdot\frac{0+1}{1+13}\cdot\frac{2+1}{3+13}\cdot\frac{1+1}{2+13}\cdot\frac{1+1}{2+13}$$

The 13 added to each denominator is the unique word count of the vocabulary: 11 distinct English words from our 3-sentence corpus plus the 2 boundary tokens $\langle s\rangle$ and $\langle/s\rangle$. In a few places, though, I see 11 added to the denominator instead of 13, and I wonder what I'm missing here!