I’m new to NLP and I’m tackling a problem where I want to improve on the existing hacky solution.
Given a corpus of grammatically correct text in the target language (an uncommon one: Bulgarian), train a "comparator" that takes two (or more) potentially "wrong" sentences. These are generated by an algorithm and may be grammatically incorrect. The comparator should return the one that seems best grammar-wise.
First things first: Bulgarian is an obscure Slavic language, and I do not want to write a full Bulgarian parser. The time and resources for this project are far too limited to afford highly sophisticated solutions.
This task pertains to an art installation, a fun application where very high accuracy is not required.
The source sentences are generated algorithmically from news article titles, and this mutation step works more or less by search-and-replace. It can produce grammatically incorrect sentences, hence the need for further filtering.
Consider the input sentence John Doe wasn’t invited to the awarding ceremony.
The aforementioned mutation part might produce the following candidates:
- John Doe and I wasn’t invited to the awarding ceremony.
- John Doe and I weren’t invited to the awarding ceremony.
(Keep in mind that in reality this is all in the obscure language and I'm omitting details; I'm translating to English for convenience only. The actual language also has many more particularities, since adjectives have to agree with nouns in gender, number, and definiteness. It is infeasible to "fix" the mutator, tempting as that seems.)
With these two outputs, the comparator would ideally select the second candidate as the correct sentence.
Current hacky approach
I have about 7 million grammatically correct sentences as my training corpus and the current approach is:
- Handle formatting differences and fix minor issues in the source, so I end up with a list of sentences, each a list of words. Words that are names, numbers, dates, etc. are replaced by placeholders like @NAME, @NUMBER, @DATE (I don't care what the name is, as long as it's a name). E.g. the sample sentence becomes [ "@NAME", "wasn't", "invited", "to", "the", "awarding", "ceremony" ].
- Consider each pair of consecutive words (bigrams) and count their relative frequency in the corpus, e.g. add the edges "@NAME"-"wasn't", "wasn't"-"invited", and so forth. In the end, for each word, I know the relative frequency of which word may come next.
- Given a new sentence, split it into words as above, then score it by the geometric mean of the probabilities of its bigrams.
- To compare two or more sentences, select the one with the highest probability.
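For concreteness, here is a minimal Python sketch of the approach above (counting, geometric-mean scoring, comparison). The function names are mine, and I've added add-one smoothing so a single unseen bigram doesn't zero out a whole sentence; the real code differs in details:

```python
import math
from collections import Counter

def train_bigram_model(sentences):
    """Count unigram and bigram frequencies over tokenized sentences.

    Each sentence is a list of tokens, with names/numbers/dates already
    replaced by placeholders like "@NAME" (step 1 above).
    """
    unigram_counts = Counter()
    bigram_counts = Counter()
    for tokens in sentences:
        unigram_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))
    return unigram_counts, bigram_counts

def sentence_score(tokens, unigram_counts, bigram_counts, vocab_size):
    """Geometric mean of the bigram probabilities P(w2 | w1).

    Uses add-one smoothing so unseen bigrams get a small nonzero
    probability instead of collapsing the product to zero.
    Computed in log space for numerical stability.
    """
    bigrams = list(zip(tokens, tokens[1:]))
    log_prob_sum = 0.0
    for w1, w2 in bigrams:
        p = (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + vocab_size)
        log_prob_sum += math.log(p)
    return math.exp(log_prob_sum / len(bigrams))

def best_sentence(candidates, unigram_counts, bigram_counts, vocab_size):
    """Return the candidate (a list of tokens) with the highest score."""
    return max(candidates,
               key=lambda t: sentence_score(t, unigram_counts,
                                            bigram_counts, vocab_size))
```

On a toy corpus this picks the candidate whose adjacent word pairs occur more often in the training data, which is exactly why it catches outright wrong words but not long-distance agreement.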
This works to some extent, and is especially useful when the mutator produces an outright invalid word.
However, the approach is too local ("peephole"): for the two sentences above it fails, actually preferring option 1 (and it's not hard to see why).
What approach can be used to statistically infer enough of a language's grammar to build a better comparator of the described kind? I'm happy to invest some time (e.g. a few days) into learning NLP techniques and approaches, and neural networks if they'll help.
EDIT: I'm curious why people are downvoting this question while providing no feedback on why they think it's bad. I'm new to this SE, and if I'm violating some guideline, let me know; I'll do my best to fix it.