I was going through a research paper: FINE-GRAINED ANALYSIS OF SENTENCE EMBEDDINGS USING AUXILIARY PREDICTION TASKS.
The key takeaway was a comparison of encoder-decoder sentence embeddings and averaged word embeddings, evaluated on prediction accuracy for three basic sentence properties: sentence length, word content, and word order.
I found it surprising that an averaged word embedding of a sentence is better at predicting the presence of a word in the sentence than an encoder-decoder embedding. Also, why does increasing the embedding size degrade its performance?
The same question goes for word order: how is an averaged word embedding able to predict it? The experiments explain what happens when the prediction is based on a permutation of the words, but the explanation doesn't feel intuitive to me. How can a simple averaged word embedding contain information like word order when taking the average more or less nullifies the order information?
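
To make the source of my confusion concrete, here is a minimal Python sketch of why I expect averaging to discard order. The three-dimensional vectors in `toy_vectors` and the helper `average_embedding` are made up for illustration; they are not the embeddings or code used in the paper.

```python
from itertools import permutations

# Hypothetical 3-dimensional word vectors, invented for this illustration.
toy_vectors = {
    "dog": [1.0, 0.0, 2.0],
    "bites": [0.0, 3.0, 1.0],
    "man": [2.0, 1.0, 0.0],
}


def average_embedding(words):
    """Return the element-wise mean of the toy vectors for `words`."""
    dim = len(next(iter(toy_vectors.values())))
    total = [0.0] * dim
    for word in words:
        for i, value in enumerate(toy_vectors[word]):
            total[i] += value
    return [value / len(words) for value in total]


sentence = ["dog", "bites", "man"]
reordered = ["man", "bites", "dog"]

original_avg = average_embedding(sentence)
reordered_avg = average_embedding(reordered)

# Every ordering of the same words yields the same averaged vector.
all_same = all(
    average_embedding(list(p)) == original_avg for p in permutations(sentence)
)

print(original_avg == reordered_avg, all_same)
```

Both sentences map to the same vector, so the vector alone cannot tell "dog bites man" from "man bites dog". That is why I don't understand where the order information measured in the paper comes from.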