While learning Add-1 smoothing, I noticed that we add 1 for each word in our vocabulary, but it isn't clear whether the start-of-sentence and end-of-sentence markers count as two additional words in that vocabulary. Let me give an example to explain.
Assume we have a corpus of three sentences: "John read Moby Dick", "Mary read a different book", and "She read a book by Cher".
After training our bigram model on this three-sentence corpus, we want to evaluate the probability of the sentence "John read a book", i.e. to find $P(\text{John read a book})$.
To distinguish John appearing anywhere in a sentence from its appearance at the beginning, and likewise book appearing at the end, we instead find $P(<s>\ \text{John read a book}\ </s>)$ after introducing two more tokens, $<s>$ and $</s>$, marking the start and the end of a sentence respectively.
Finally, applying the chain rule with the bigram assumption, we arrive at
$$P(<s>\ \text{John read a book}\ </s>) = P(John \mid <s>)\,P(read \mid John)\,P(a \mid read)\,P(book \mid a)\,P(</s> \mid book) = \frac{1}{3}\cdot\frac{1}{1}\cdot\frac{2}{3}\cdot\frac{1}{2}\cdot\frac{1}{2} = \frac{1}{18} \approx 0.06$$
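To make the counting concrete, here is a minimal Python sketch (the variable names are my own, not from any particular textbook) that collects the unigram and bigram counts from the three-sentence corpus and computes the unsmoothed sentence probability:

```python
from collections import Counter

# The three-sentence corpus from the question, with start/end markers added.
corpus = [
    "<s> John read Moby Dick </s>",
    "<s> Mary read a different book </s>",
    "<s> She read a book by Cher </s>",
]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    toks = sent.split()
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

def p_mle(w, prev):
    """Unsmoothed (maximum-likelihood) bigram probability P(w | prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

sentence = "<s> John read a book </s>".split()
p = 1.0
for prev, w in zip(sentence, sentence[1:]):
    p *= p_mle(w, prev)
print(p)  # 1/3 * 1 * 2/3 * 1/2 * 1/2 = 1/18 ≈ 0.0556
```

Running it reproduces the product above: for example, $P(a \mid read) = 2/3$ because "read a" occurs twice while "read" occurs three times.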
Now, to find $P(\text{Cher read a book})$ using Add-1 smoothing (Laplace smoothing), shouldn't we smooth the bigram $(<s>, Cher)$, since 'Cher' never appears first in a sentence in the corpus? And for that, we must add $<s>$ and $</s>$ to our vocabulary. With this, our calculation becomes
$$P(<s>\ \text{Cher read a book}\ </s>) = \frac{0+1}{3+13}\cdot\frac{0+1}{1+13}\cdot\frac{2+1}{3+13}\cdot\frac{1+1}{2+13}\cdot\frac{1+1}{2+13}$$
The 13 added to each denominator is the unique word count of the vocabulary, which has 11 English word types from our three-sentence corpus plus the two tokens marking the start and end of a sentence. In a few places, though, I see 11 added to the denominator instead of 13, so I'm wondering what I'm missing here!
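For comparison, the same sketch with Add-1 smoothing, using the question's own convention of $V = 13$ (11 word types plus the two markers); changing `V` to 11 shows exactly how much the choice matters:

```python
from collections import Counter

corpus = [
    "<s> John read Moby Dick </s>",
    "<s> Mary read a different book </s>",
    "<s> She read a book by Cher </s>",
]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    toks = sent.split()
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

V = 13  # 11 word types + <s> and </s>, as counted in the question

def p_add1(w, prev):
    # Laplace smoothing: add 1 to the bigram count, V to the context count.
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

sentence = "<s> Cher read a book </s>".split()
p = 1.0
for prev, w in zip(sentence, sentence[1:]):
    p *= p_add1(w, prev)
print(p)  # ≈ 1.49e-05 with V = 13
```

Note the unseen bigrams $(<s>, Cher)$ and $(Cher, read)$ now get nonzero probability, which is the whole point of the smoothing.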