#StackBounty: #classification #natural-language #word-embeddings Text Embeddings on a Small Dataset

Bounty: 50

I am trying to solve a binary text classification problem on academic text in a niche domain (Generative vs. Cognitive Linguistics). My target data consists of nearly 400 paper abstracts of fewer than 300 words each. I first tried Doc2Vec, but the best accuracy I could get was around 82%. I then looked into pre-trained vectors, but the consensus seems to be that pre-trained Doc2Vec models are best avoided, and the pre-trained Word2Vec models I found are usually huge, so my laptop (8 GB of RAM) cannot load them. As a result, I collected a larger source dataset myself, trained word embedding models on it, and used the resulting word vectors in the target domain.

I have collected more than 70K paper abstracts in related fields (mostly papers tagged Linguistics) and have trained FastText, Doc2Vec and Word2Vec models on this source data. But when I use these models in the target domain, the results are no better than my previous attempts with a simple Doc2Vec.

I have also tried TF-IDF and CountVectorizer features on the target domain, but the results did not improve.
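A minimal sketch of such a bag-of-words baseline follows. The post does not say which classifier or evaluation scheme was used, so the logistic regression, the stratified 5-fold cross-validation, the `tfidf_baseline_scores` helper and the toy abstracts are all assumptions made for illustration. With only about 400 documents, reporting accuracy across folds rather than from a single split gives a steadier number to compare against the 82% figure.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline


def tfidf_baseline_scores(texts, labels, n_splits=5, seed=0):
    """Return per-fold accuracy of a TF-IDF + logistic regression pipeline.

    Hypothetical helper: the classifier and settings are assumptions,
    not the configuration used in the post.
    """
    pipeline = make_pipeline(
        # The vectorizer is refit inside every fold, so no test text leaks in.
        TfidfVectorizer(sublinear_tf=True, ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")


# Toy stand-ins for the real abstracts (invented for this example).
generative = [
    "merge and move operations in minimalist syntax",
    "universal grammar and phrase structure rules",
    "island constraints on wh movement",
    "feature checking in the minimalist program",
    "binding theory and c-command relations",
] * 2
cognitive = [
    "conceptual metaphor and embodied meaning",
    "construction grammar and usage based models",
    "prototype categories and image schemas",
    "frame semantics and mental spaces",
    "conceptual blending in everyday language",
] * 2

texts = generative + cognitive
labels = [0] * len(generative) + [1] * len(cognitive)

scores = tfidf_baseline_scores(texts, labels)
print("mean accuracy:", scores.mean(), "std:", scores.std())
```

Because the vectorizer sits inside the pipeline, the vocabulary and IDF weights are learned from the training folds only.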

Yesterday I stumbled upon this implementation of getting document vectors by combining Word2Vec with TF-IDF weights, but to my surprise, the results are on par with simply averaging the word vectors of each document.
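The linked implementation is not reproduced here, so the following is only a rough sketch of the general idea: each word vector is weighted by term frequency times inverse document frequency before averaging. The function names, the smoothed IDF formula and the two-dimensional toy vectors are assumptions for illustration; in practice the `vectors` mapping would come from the trained Word2Vec or FastText model.

```python
import math
from collections import Counter


def idf_weights(tokenized_docs):
    """Smoothed IDF per token: ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each token once per document
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0
            for t, df in doc_freq.items()}


def weighted_doc_vector(tokens, vectors, idf, dim):
    """Average the word vectors of `tokens`, weighted by tf * idf.

    Tokens missing from `vectors` or `idf` are skipped; a document with
    no known tokens maps to the zero vector.
    """
    total = [0.0] * dim
    weight_sum = 0.0
    for token, tf in Counter(tokens).items():
        if token not in vectors or token not in idf:
            continue
        weight = tf * idf[token]
        for i, value in enumerate(vectors[token]):
            total[i] += weight * value
        weight_sum += weight
    if weight_sum == 0.0:
        return total
    return [value / weight_sum for value in total]


# Toy embeddings and documents (invented for this example).
vectors = {"merge": [1.0, 0.0], "metaphor": [0.0, 1.0], "the": [0.5, 0.5]}
docs = [["the", "merge", "the"], ["the", "metaphor"]]

idf = idf_weights(docs)
doc_vec = weighted_doc_vector(docs[0], vectors, idf, dim=2)
print(doc_vec)
```

In the toy example the rarer word "merge" pulls the document vector further toward its own direction than a plain average would. On short abstracts where most content words have similar IDF values, the two schemes can end up close to each other.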

I was thinking of maybe using active learning, since the bottleneck may be the very small target dataset. Or maybe generating synthetic texts similar to the target data?
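To make the active learning idea concrete, here is a minimal sketch of the query step in pool-based uncertainty sampling, one common form of active learning. It assumes a classifier has already produced a positive-class probability for each unlabeled abstract; the `select_most_uncertain` helper and the probabilities are hypothetical.

```python
def select_most_uncertain(positive_probs, k):
    """Return indices of the k examples whose probability is closest to 0.5.

    For a binary classifier these are the examples the model is least
    sure about, so labeling them first is the usual heuristic.
    """
    ranked = sorted(range(len(positive_probs)),
                    key=lambda i: abs(positive_probs[i] - 0.5))
    return ranked[:k]


# Invented probabilities for five unlabeled abstracts.
pool_probs = [0.97, 0.52, 0.10, 0.45, 0.80]
to_label = select_most_uncertain(pool_probs, k=2)
print(to_label)  # [1, 3]
```

The selected abstracts would then be labeled by hand, added to the training set, and the model retrained before the next round. This only helps if additional unlabeled abstracts from the two target classes are available to label.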

Thank you for reading this.

