I am trying to evaluate an NLP model using BLEU and ROUGE. However, I am a bit confused about the difference between those scores. I am aware that ROUGE is aimed at recall whilst BLEU measures precision, yet all ROUGE implementations I have come across also output precision and the F-score. The original ROUGE paper only briefly mentions precision and the F-score, so I am unsure what role they play in ROUGE. Is ROUGE mainly about recall, with precision and the F-score just added as a complement, or is ROUGE considered to be the combination of those three scores?
What confuses me even more is that, as I understand it, ROUGE-1 precision should be equal to BLEU when using the weights (1, 0, 0, 0), but that does not seem to be the case.
The only explanation I can think of is the brevity penalty. However, I checked that the total length of the references is shorter than the length of the hypothesis, which means that the brevity penalty is 1.
Nonetheless, BLEU with w = (1, 0, 0, 0) scores 0.55673 while ROUGE-1 precision scores 0.7249.
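To make the expectation concrete, here is a minimal sketch of the two quantities for a single hypothesis/reference pair, assuming whitespace tokenization, no stemming, and one reference. The function names and the example sentences are made up for illustration; they are not the library implementations or the data that produced the scores above.

```python
import math
from collections import Counter


def clipped_unigram_matches(hypothesis, reference):
    """Count hypothesis unigrams found in the reference, clipped by reference counts."""
    hyp_counts = Counter(hypothesis)
    ref_counts = Counter(reference)
    return sum(min(count, ref_counts[token]) for token, count in hyp_counts.items())


def bleu_unigram(hypothesis, reference):
    """BLEU with weights (1, 0, 0, 0): clipped unigram precision times brevity penalty."""
    precision = clipped_unigram_matches(hypothesis, reference) / len(hypothesis)
    # The brevity penalty only kicks in when the hypothesis is not longer
    # than the reference; for equal lengths exp(0) is still 1.
    if len(hypothesis) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(hypothesis))
    return brevity_penalty * precision


def rouge_1_precision(hypothesis, reference):
    """ROUGE-1 precision: overlapping unigrams divided by hypothesis length."""
    return clipped_unigram_matches(hypothesis, reference) / len(hypothesis)


# Hypothetical toy pair; the hypothesis is longer, so the brevity penalty is 1.
hyp = "the cat sat on the mat today".split()
ref = "the cat is on the mat".split()

print(bleu_unigram(hyp, ref))
print(rouge_1_precision(hyp, ref))
```

On a single pair like this, with the brevity penalty at 1, both functions return the same value, which is why I expected the two reported scores to match.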
What am I getting wrong?