
I am trying to evaluate an NLP model using BLEU and ROUGE. However, I am a bit confused about the difference between those scores. While I am aware that ROUGE is aimed at recall whilst BLEU measures precision, all ROUGE implementations I have come across also output precision and the F-score. The original ROUGE paper only briefly mentions precision and the F-score, so I am unsure what meaning they have for ROUGE. Is ROUGE mainly about recall, with precision and the F-score just added as a complement, or is ROUGE considered to be the combination of all three scores?

What confuses me even more is that, to my understanding, ROUGE-1 precision should be equal to BLEU with the weights (1, 0, 0, 0), but that does not seem to be the case.

The only explanation I could come up with is the brevity penalty. However, I checked that the accumulated length of the references is shorter than the length of the hypothesis, which means the brevity penalty is 1.
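For reference, this is the brevity penalty as I understand it from the original BLEU paper: with hypothesis length $c$ and (effective) reference length $r$, $BP = 1$ if $c > r$, and $e^{1 - r/c}$ otherwise. A quick stdlib check (the function name is mine, not nltk's):

```python
import math

def brevity_penalty(hyp_len, ref_len):
    """BLEU brevity penalty: 1 when the hypothesis is at least as long as
    the reference, otherwise exp(1 - r/c)."""
    if hyp_len > ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)

print(brevity_penalty(10, 8))   # hypothesis longer -> 1.0
print(brevity_penalty(8, 10))   # hypothesis shorter -> exp(-0.25) ~ 0.7788
```

Since my hypothesis is longer than the references, this should always return 1 in my case.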

Nonetheless, BLEU with w = (1, 0, 0, 0) scores 0.55673, while ROUGE-1 precision scores 0.7249.
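One possible source of the gap, if I am reading the implementations right, is aggregation: corpus BLEU micro-averages clipped n-gram counts over the whole corpus, whereas ROUGE implementations typically compute precision per sentence and then macro-average. Here is a stdlib toy (all function names are mine; this is a sketch of my understanding, not either library's actual code) where the two aggregations diverge even though the unigram matching is identical:

```python
from collections import Counter

def clipped_matches(hyp, refs):
    """Unigram counts clipped by the max count in any reference (BLEU-style)."""
    hyp_counts = Counter(hyp)
    max_ref = Counter()
    for ref in refs:
        for tok, n in Counter(ref).items():
            max_ref[tok] = max(max_ref[tok], n)
    return sum(min(n, max_ref[tok]) for tok, n in hyp_counts.items())

def bleu1_precision(corpus):
    """Corpus-level (micro-averaged) BLEU-1 modified precision, no BP."""
    matched = sum(clipped_matches(h, rs) for h, rs in corpus)
    total = sum(len(h) for h, _ in corpus)
    return matched / total

def rouge1_precision(corpus):
    """Per-sentence ROUGE-1 precision, best reference, macro-averaged."""
    scores = []
    for hyp, refs in corpus:
        scores.append(max(clipped_matches(hyp, [r]) / len(hyp) for r in refs))
    return sum(scores) / len(scores)

corpus = [
    (["the", "cat"], [["the", "cat", "sat"]]),        # 2/2 matched
    (["a", "a", "dog"], [["a", "dog"]]),              # 2/3 matched ("a" clipped)
]
print(bleu1_precision(corpus))   # micro: (2 + 2) / (2 + 3) = 0.8
print(rouge1_precision(corpus))  # macro: (1.0 + 2/3) / 2 ~ 0.8333
```

So even with identical clipping, the two averaging schemes give different numbers — but I am not sure whether this fully accounts for the gap I am seeing.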

What am I getting wrong?

I am using `nltk` to evaluate BLEU and `rouge-metric` for ROUGE.
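For the BLEU side, this is roughly how I call it (a minimal sketch with toy data, not my actual corpus):

```python
from nltk.translate.bleu_score import corpus_bleu

# corpus_bleu expects a list of reference-lists (one list of references
# per hypothesis) followed by the list of hypotheses.
references = [[["the", "cat"]]]
hypotheses = [["the", "the", "cat"]]

# weights=(1, 0, 0, 0) -> unigram-only BLEU: modified precision times BP.
# Here: clipped matches 2 / hypothesis length 3, BP = 1 since c > r.
score = corpus_bleu(references, hypotheses, weights=(1, 0, 0, 0))
print(score)  # 0.666...
```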

**Disclaimer**: I already posted this question on Data Science, however after not receiving any replies and doing some additional research on the differences between Data Science and Cross Validated, I figured that this question might be better suited for Cross Validated (correct me if I am wrong).
