#StackBounty: #python #similarities #tf-idf #latent-semantic-indexing #bag-of-words Online Document Similarity (LSI/WMD)

Bounty: 100

I’m running a gensim-based LSI similarity model that needs to be rebuilt every time a new entry is added to the corpus. Since these additions are fairly frequent (the target is multiple additions per minute), I would like to explore online options.

Is there an incremental-learning option for LSI? Would switching to WMD and adding new entries to the dictionary be more efficient? Right now my issue is that WMD takes a lot of memory, but I’m willing to pay a higher upfront cost if I can get better per-query performance, since I ultimately want to serve this from a fast-responding API.

Currently building as (and please excuse the naming conventions):

# imports from gensim, gensim.models, gensim.similarities
from gensim import corpora
from gensim.models import TfidfModel, LsiModel
from gensim.similarities import MatrixSimilarity

def build_cache(self):
    self.MODEL_CACHE = {
        'urls': [],
        'texts': []
    }

    all_articles = [retrieve from database]
    for article in all_articles:
        self.MODEL_CACHE['urls'].append(article.url)
        self.MODEL_CACHE['texts'].append(self.preprocess(article.body))
    # self.preprocess(text) runs nltk's word_tokenize and filters stop words

    self.MODEL_CACHE['dictionary'] = corpora.HashDictionary(self.MODEL_CACHE['texts'])
    self.MODEL_CACHE['corpus_gensim'] = [self.MODEL_CACHE['dictionary'].doc2bow(doc) for doc in self.MODEL_CACHE['texts']]

    self.MODEL_CACHE['corpus_tfidf'] = TfidfModel(self.MODEL_CACHE['corpus_gensim'])[self.MODEL_CACHE['corpus_gensim']]

    self.MODEL_CACHE['lsi'] = LsiModel(self.MODEL_CACHE['corpus_tfidf'], id2word=self.MODEL_CACHE['dictionary'], num_topics=100)
    self.MODEL_CACHE['lsi_index'] = MatrixSimilarity(self.MODEL_CACHE['lsi'][self.MODEL_CACHE['corpus_tfidf']])

    self.MODEL_CACHE['results'] = [self.MODEL_CACHE['lsi_index'][self.MODEL_CACHE['lsi'][self.MODEL_CACHE['corpus_tfidf'][i]]]
                                   for i in range(len(self.MODEL_CACHE['texts']))]

Most of what I’m doing closely follows, and is inspired by:

https://www.kernix.com/blog/similarity-measure-of-textual-documents_p12

If there’s a more efficient, high-performance docsim implementation out there, I’d love some pointers; I haven’t had much luck with Keras or either of its two backends.

