#StackBounty: #classification #text-mining #naive-bayes #information-theory How to compute gain statistic for the multinomial Naive Bay…

Bounty: 100

I’m trying to figure out how to compute the gain statistic G(w) after fitting a multinomial Naive Bayes model. This statistic is described on p. 17 of the new edition of Jurafsky and Martin, Speech and Language Processing, Chapter 4, “Naive Bayes and Sentiment Classification”.

[Figure: screenshot of the definition of the gain statistic G(w) from Jurafsky & Martin, Chapter 4 — not reproduced here]
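
In case the screenshot does not display, here is my reading of it (an assumption on my part; please correct me if I have transcribed it wrong): the statistic appears to have the standard information-gain form used for feature selection,

G(w) = - sum_i P(c_i) log P(c_i)
       + P(w) sum_i P(c_i|w) log P(c_i|w)
       + P(bar{w}) sum_i P(c_i|bar{w}) log P(c_i|bar{w})

where the sums run over the classes c_i, and P(w) and P(bar{w}) are the probabilities that a document does or does not contain the word w.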

I can compute everything here except the P(c_i|bar{w}) term. In the text, the authors state that

bar{w} means that a document does not contain the word w

What if the word occurs in every document, as with the word “Chinese” in the example below? In the multinomial variant of Naive Bayes, word likelihoods are based on (smoothed) token counts, not on occurrence or non-occurrence. So how, in very specific terms, is P(c_i|bar{w}) computed?
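
My best guess (an assumption on my part, not something stated in the chapter) is that P(c_i|bar{w}) is estimated from document presence/absence, i.e. Bernoulli-style document counts rather than the multinomial token counts. On the four labelled training documents in the example below, that would give, for instance,

P(Y|bar{Japan}) = 3/3 = 1,    P(N|bar{Japan}) = 0/3 = 0,

since all three documents lacking “Japan” are in class Y. But for “Chinese” the denominator is zero, because every training document contains it, and that is exactly the case I don’t know how to handle (smooth it? drop the term?).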

For illustration, I show a worked example from Manning, Raghavan, & Schütze (2008), An Introduction to Information Retrieval, Cambridge University Press, Chapter 13, Table 13.1. It uses the implementation of the multinomial Naive Bayes classifier in my R package quanteda (this is not an R question, however!).

library("quanteda")
## Package version: 1.4.1

## Example from 13.1 of _An Introduction to Information Retrieval_
## https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
corp <- corpus(c(d1 = "Chinese Beijing Chinese",
                 d2 = "Chinese Chinese Shanghai",
                 d3 = "Chinese Macao",
                 d4 = "Tokyo Japan Chinese",
                 d5 = "Chinese Chinese Chinese Tokyo Japan"),
               docvars = data.frame(train = factor(c("Y", "Y", "Y", "N", NA), 
                                                   ordered = TRUE)))
dfmat <- dfm(corp, tolower = FALSE)
dfmat
## Document-feature matrix of: 5 documents, 6 features (60.0% sparse).
## 5 x 6 sparse Matrix of class "dfm"
##        features
## docs    Chinese Beijing Shanghai Macao Tokyo Japan
##   text1       2       1        0     0     0     0
##   text2       2       0        1     0     0     0
##   text3       1       0        0     1     0     0
##   text4       1       0        0     0     1     1
##   text5       3       0        0     0     1     1

## replicate IIR p261 prediction for test set (document 5)
tmod <- textmodel_nb(dfmat, y = docvars(dfmat, "train"), prior = "docfreq", smooth = 1)
predict(tmod, newdata = dfmat[5, ], type = "prob")
##               N         Y
## text5 0.3102414 0.6897586

# word (smoothed) likelihoods
tmod$PwGc
##        features
## classes   Chinese   Beijing  Shanghai     Macao      Tokyo      Japan
##       N 0.2222222 0.1111111 0.1111111 0.1111111 0.22222222 0.22222222
##       Y 0.4285714 0.1428571 0.1428571 0.1428571 0.07142857 0.07142857

# word posteriors by class
tmod$PcGw
##        features
## classes   Chinese   Beijing  Shanghai     Macao     Tokyo     Japan
##       N 0.1473684 0.2058824 0.2058824 0.2058824 0.5090909 0.5090909
##       Y 0.8526316 0.7941176 0.7941176 0.7941176 0.4909091 0.4909091
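
For completeness, here is a minimal base-R sketch of how G(w) could be computed under the presence/absence reading I guessed at above. This is only my interpretation of the formula, not anything provided by quanteda; the gain() helper is hypothetical, and it uses only the four labelled training documents.

counts <- as.matrix(dfmat[1:4, ])        # term counts for the four training documents
labels <- docvars(dfmat, "train")[1:4]   # their class labels (Y, Y, Y, N)

gain <- function(w, counts, labels) {
  present <- counts[, w] > 0                                # documents containing w
  plogp <- function(p) sum(ifelse(p > 0, p * log(p), 0))    # convention: 0 log 0 = 0
  p_c <- prop.table(table(labels))                          # P(c_i) from document frequencies
  p_w <- mean(present)                                      # P(w) as a document-level event
  term_w    <- if (any(present))  plogp(prop.table(table(labels[present])))  else 0
  term_wbar <- if (any(!present)) plogp(prop.table(table(labels[!present]))) else 0
  -plogp(p_c) + p_w * term_w + (1 - p_w) * term_wbar
}

round(sapply(colnames(counts), gain, counts = counts, labels = labels), 3)
##  Chinese  Beijing Shanghai    Macao    Tokyo    Japan 
##    0.000    0.085    0.085    0.085    0.562    0.562 

Under this reading, a word that occurs in every training document, like “Chinese”, ends up with zero gain: P(bar{w}) = 0, so the undefined P(c_i|bar{w}) term is weighted by zero. Perhaps that is how the all-documents case is meant to be handled, but I would like confirmation (and to know whether the counts should instead be smoothed).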

Any help is greatly appreciated.

