#StackBounty: #nlp #evaluation #ngrams Shouldn't ROUGE-1 precision be equal to BLEU with w=(1, 0, 0, 0) when brevity penalty is 1?

Bounty: 50

I am trying to evaluate a NLP model using BLEU and ROUGE. However, I am a bit confused about the difference between those scores. While I am aware that ROUGE is aimed at recall whilst BLEU measures precision, all ROUGE implementations I have come across also output precision and the F-score. The original ROUGE paper only briefly mentions precision and the F-score, therefore I am a bit unsure about what meaning they have to ROUGE. Is ROUGE mainly about recall and the precision and F-score are just added as a compliment, or is the ROUGE considered to be the combination of those three scores?

What confuses me even more is that to my understanding ROUGE-1 precision should be equal to BLEU when using the weights (1, 0, 0, 0), but that does not seem to be the case.
The only explanation I could have for this is the brevity penalty. However, I checked that the accumulated lengths of the references are shorter than the length of the hypothesis, which means that the brevity penalty is 1.
Nonetheless, BLEU with w = (1, 0, 0, 0) scores 0.55673 while ROUGE-1 precision scores 0.7249.

What am I getting wrong?

I am using nltk to evaluate BLEU and rouge-metric for ROUGE.


Get this bounty!!!

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.