#StackBounty: #nlp #grammar-inference How to write a simple (non-English) grammar checker/comparator (given 2 sentences, return which o…

Bounty: 50

I’m new to NLP and I’m tackling a problem where I want to improve on the existing hacky solution.

tl;dr

Given a corpus of grammatically correct text in the target language (an uncommon one: Bulgarian), train a "comparator" that can take two (or more) potentially "wrong" sentences. These are generated by an algorithm and may be grammatically incorrect. The comparator should return the one that seems best grammar-wise.

Background

First things first: Bulgarian is a relatively obscure Slavic language, and I do not want to write a full Bulgarian parser. The time and resources for this project are far too limited for highly sophisticated solutions.

This task pertains to an art installation, a fun application where very high accuracy is not required.
The source sentences are generated algorithmically from news article titles, and this mutation step works more or less by search-and-replace. It can produce grammatically incorrect sentences, so further filtering is needed.

Consider the input sentence John Doe wasn’t invited to the awarding ceremony.

The aforementioned mutation part might produce the following candidates:

  1. John Doe and I wasn’t invited to the awarding ceremony.
  2. John Doe and I weren’t invited to the awarding ceremony.

(Keep in mind that in reality all of this is in Bulgarian and I’m omitting details; I’m translating to English for convenience only. The actual language also has many more particularities, since adjectives have to agree with nouns in gender, number, and definiteness. It is infeasible to "fix" the mutator itself, even though it seems tempting.)

With these two outputs, the comparator would ideally select option 2 as the grammatically correct sentence.

Current hacky approach

I have about 7 million grammatically correct sentences as my training corpus and the current approach is:

Training:

  1. Handle formatting differences and fix minor issues in the source, so that I end up with a list of sentences, each a list of words. Words that are names, numbers, dates, etc. are replaced by placeholders like @NAME, @NUMBER, @DATE (I don’t care what the name is, as long as it’s a name). E.g. the sample sentence becomes [ "@NAME", "wasn’t", "invited", "to", "the", "awarding", "ceremony" ].
  2. Consider each pair of consecutive words (a bigram) and count how often it occurs in the corpus. E.g. add the edges "@NAME"-"wasn’t", "wasn’t"-"invited", and so forth. In the end, for each word, I know the relative frequency of every word that may come next (a rough sketch of this counting follows the list).
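A minimal sketch of the counting step, assuming tokenised input with the placeholder substitution already done (the function name and data layout are my own, not part of the actual pipeline):

```python
from collections import Counter, defaultdict

def count_bigrams(sentences):
    """Build bigram and unigram counts from tokenised sentences.

    `sentences` is a list of token lists, e.g.
    ["@NAME", "wasn't", "invited", "to", "the", "awarding", "ceremony"],
    with names/numbers/dates already replaced by placeholders.
    """
    bigram_counts = defaultdict(Counter)   # bigram_counts[prev][curr] = count
    unigram_counts = Counter()             # how often each word occurs as the left side of a bigram
    for tokens in sentences:
        for prev, curr in zip(tokens, tokens[1:]):
            bigram_counts[prev][curr] += 1
            unigram_counts[prev] += 1
    return bigram_counts, unigram_counts
```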

Evaluation:

  1. Given a new sentence, split it into words as above, then score the sentence by the geometric mean of the probabilities of the bigrams in it.
  2. To compare two or more sentences, select the one with the highest score (see the sketch after this list).
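A minimal sketch of this evaluation step, assuming the counts produced above; the `unseen_prob` floor for bigrams never seen in training is my own assumption, since the original description doesn’t say how unseen pairs are handled:

```python
import math

def sentence_score(tokens, bigram_counts, unigram_counts, unseen_prob=1e-6):
    """Log of the geometric mean of the bigram relative frequencies.

    `unseen_prob` is a hypothetical floor for bigrams never seen in training,
    so one unknown pair doesn't zero out the whole sentence.
    """
    log_sum, n = 0.0, 0
    for prev, curr in zip(tokens, tokens[1:]):
        count = bigram_counts.get(prev, {}).get(curr, 0)
        total = unigram_counts.get(prev, 0)
        prob = count / total if count else unseen_prob
        log_sum += math.log(prob)
        n += 1
    # Dividing by n in log space is equivalent to taking the n-th root.
    return log_sum / n if n else float("-inf")

def pick_best(candidates, bigram_counts, unigram_counts):
    """Return the candidate token list with the highest score."""
    return max(candidates,
               key=lambda c: sentence_score(c, bigram_counts, unigram_counts))
```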

This works to some extent, and it is especially useful when the mutator produces a syntactically wrong word.
However, the approach is too local (a "peephole" view): for the two sentences above it fails and actually prefers option 1 (and it’s not hard to see why).

Question

Which approach can be used to statistically infer just enough of a language’s grammar to write a better comparator of the described type? I’m happy to invest some time, e.g. a few days, into learning NLP approaches (and neural networks, if they’ll help).

EDIT: I’m curious why people are downvoting this question while providing no feedback on why they think it’s bad. I’m new to this SE; if I’m violating some guideline, let me know and I’ll do my best to fix it.


Get this bounty!!!

#StackBounty: #nlp #language-model #stanford-nlp #ngrams In smoothing of n-gram model in NLP, why don't we consider start and end o…

Bounty: 50

While learning Add-1 smoothing, I found that we add 1 for each word in our vocabulary, but somehow do not count the start-of-sentence and end-of-sentence markers as two extra words in the vocabulary. Let me give an example to explain.

Example:

Assume we have a corpus of three sentences: "John read Moby Dick", "Mary read a different book", and "She read a book by Cher".
After training our bigram model on this corpus, we need to evaluate the probability of the sentence "John read a book", i.e. to find $P(\text{John read a book})$.

To differentiate John appearing anywhere in a sentence from its appearance at the beginning, and likewise for book appearing at the end, we instead compute $P(\texttt{<s>}\ \text{John read a book}\ \texttt{</s>})$ after introducing two more tokens, $\texttt{<s>}$ and $\texttt{</s>}$, indicating the start and end of a sentence respectively.

Finally, we arrive at

$P(\texttt{<s>}\ \text{John read a book}\ \texttt{</s>}) = P(\text{John}\mid\texttt{<s>})\,P(\text{read}\mid\text{John})\,P(\text{a}\mid\text{read})\,P(\text{book}\mid\text{a})\,P(\texttt{</s>}\mid\text{book}) = \frac{1}{3}\cdot\frac{1}{1}\cdot\frac{2}{3}\cdot\frac{1}{2}\cdot\frac{1}{2}$
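For reference, the fractions above come from the standard maximum-likelihood bigram estimate, and the smoothed version in the next step adds 1 to each count and the vocabulary size $V$ to each denominator:

$$P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\,w_i)}{C(w_{i-1})}, \qquad P_{\text{Add-1}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\,w_i) + 1}{C(w_{i-1}) + V}$$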

My Question:
Now, to find $P(\text{Cher read a book})$ using Add-1 (Laplace) smoothing, shouldn’t we account for the word "Cher" appearing first in a sentence, which never happens in the training corpus? And for that, we must add $\texttt{<s>}$ and $\texttt{</s>}$ to our vocabulary. With this, our calculation becomes

$P(\text{Cher}\mid\texttt{<s>})\,P(\text{read}\mid\text{Cher})\,P(\text{a}\mid\text{read})\,P(\text{book}\mid\text{a})\,P(\texttt{</s>}\mid\text{book}) = \frac{0+1}{3+13}\cdot\frac{0+1}{1+13}\cdot\frac{2+1}{3+13}\cdot\frac{1+1}{2+13}\cdot\frac{1+1}{2+13}$

The 13 added to each denominator is the unique word count of the vocabulary, which has 11 English words from our 3-sentence corpus plus the 2 tokens for the start and end of a sentence. In a few places, I see 11 added to the denominator instead of 13, and I’m wondering what I’m missing here!
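As a sanity check on the arithmetic, here is a short script (my own, not from the course material) that counts the bigrams in the three-sentence corpus and applies the Add-1 formula with the start and end tokens included in $V$, which is exactly the choice being asked about:

```python
from collections import Counter

# Training corpus, padded with start/end-of-sentence tokens.
corpus = [
    "<s> John read Moby Dick </s>",
    "<s> Mary read a different book </s>",
    "<s> She read a book by Cher </s>",
]
sentences = [s.split() for s in corpus]

bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
unigrams = Counter(word for sent in sentences for word in sent)

# 11 English word types plus <s> and </s> gives V = 13.
V = len(unigrams)

def p_add1(prev, curr):
    """Add-1 smoothed bigram probability."""
    return (bigrams[(prev, curr)] + 1) / (unigrams[prev] + V)

test = "<s> Cher read a book </s>".split()
prob = 1.0
for prev, curr in zip(test, test[1:]):
    prob *= p_add1(prev, curr)

# Reproduces the fractions above: 1/16 * 1/14 * 3/16 * 2/15 * 2/15
print(V, prob)
```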


Get this bounty!!!
