#StackBounty: #scikit-learn #natural-language #cosine-similarity Spectral Clustering of a skipgram model

Bounty: 100

I have a model where I’m applying Spectral Clustering to word frequencies. My pipeline consists of TF-IDF, followed by LSA down to 100 dimensions, and finally Spectral Clustering, all using sklearn.

I’m trying to replace TF-IDF and LSA with a skipgram model.

With the LSA model, the vectors had only positive values, so the cosine similarities were also positive.
However, with the skipgram model the word vectors also contain negative values, which results in negative cosine similarities when words are antonyms.

The problem is that Spectral Clustering in sklearn builds a normalized Laplacian from the cosine similarities, normalizing by the square root of each row sum. When a row sum is negative or zero, this produces nan or inf, and the Spectral Clustering does not work.
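
To make the failure concrete, here is a minimal NumPy sketch of the symmetric normalization, L = I - D^{-1/2} S D^{-1/2}. It is not sklearn's actual implementation, and the function name `normalized_laplacian` and the toy 2-D vectors are assumptions made up for illustration.

```python
import numpy as np

def normalized_laplacian(S):
    """Symmetric normalized Laplacian I - D^{-1/2} S D^{-1/2}, with D = diag(row sums)."""
    deg = S.sum(axis=1)
    # Silence NumPy's warnings so the nan/inf values can be inspected directly.
    with np.errstate(invalid="ignore", divide="ignore"):
        d_inv_sqrt = 1.0 / np.sqrt(deg)
    return np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]

# Hypothetical 2-D "word vectors": two words point one way, three the opposite way.
X = np.array([[1.0, 0.2], [1.0, -0.2], [-1.0, 0.1], [-1.0, -0.1], [-1.0, 0.0]])
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
S = X_unit @ X_unit.T  # cosine similarities in [-1, 1]

L_raw = normalized_laplacian(S)                # uses the signed similarities
L_shifted = normalized_laplacian((S + 1) / 2)  # similarities rescaled to [0, 1]

print(S.sum(axis=1))                 # the first two row sums are negative
print(np.isnan(L_raw).any())         # True
print(np.isfinite(L_shifted).all())  # True
```

The first two rows have a negative sum, so the square root yields nan and every Laplacian entry touching those rows is nan as well. Once the similarities are non-negative, the row sums are positive and the normalization is well defined.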

What is the correct way to handle this problem?

  • compute the pairwise_kernels, then rescale the matrix to the [0, 1] range with (matrix + 1) / 2. In this case, antonyms would have a similarity near 0 and synonyms near 1. Words without any relation would have a value around 0.5.
  • use the absolute value of the similarity. Antonyms would then count as similar, but that would make some sense for clustering.
  • use another similarity. But from what I have read, cosine is the best similarity for skipgram.
  • use another clustering algorithm. But in my trials, Spectral Clustering gives the best clusters.
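
The first two options in the list above could be sketched roughly as follows, passing the transformed matrix to sklearn as a precomputed affinity. The random vectors are only a stand-in for real skipgram embeddings, and the variable names are assumptions. This shows the mechanics, not which option gives better clusters.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import pairwise_kernels

# Stand-in for skipgram vectors (assumption): two groups of 20 "words"
# scattered around a random direction and around its opposite.
rng = np.random.default_rng(0)
center = rng.normal(size=10)
X = np.vstack([
    center + 0.3 * rng.normal(size=(20, 10)),
    -center + 0.3 * rng.normal(size=(20, 10)),
])

cos = pairwise_kernels(X, metric="cosine")  # signed similarities in [-1, 1]

# Option 1: shift and scale so antonyms map to ~0 and synonyms to ~1.
shifted = (cos + 1) / 2
# Option 2: absolute value, so antonyms count as similar.
absolute = np.abs(cos)

labels_shifted = SpectralClustering(
    n_clusters=2, affinity="precomputed", random_state=0
).fit_predict(shifted)
labels_abs = SpectralClustering(
    n_clusters=2, affinity="precomputed", random_state=0
).fit_predict(absolute)

print(labels_shifted)
print(labels_abs)
```

With affinity="precomputed", sklearn uses the matrix as given, so any transformation that keeps the entries non-negative and symmetric avoids the nan/inf problem. The two options encode different notions of similarity: the shift keeps antonyms far apart, while the absolute value pulls them together.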
