I have a model where I’m applying Spectral Clustering to word frequencies. My pipeline consists of TF-IDF, followed by LSA down to 100 dimensions, and finally Spectral Clustering, all of these operations using `sklearn`.
I’m trying to replace LSA with a skipgram model. With the LSA model, the vectors had only positive values, hence a cosine similarity that is also positive. However, with the skipgram model the word vectors also contain negative values, resulting in negative cosine similarities when words are antonyms.
The problem is that Spectral Clustering in `sklearn` uses a normalized Laplacian of the cosine similarities, where the square root of the row sums is used to normalize. With negative similarities a row sum can become negative, so the square root yields `nan` and Spectral Clustering does not work.
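To make the failure concrete, here is a minimal sketch of how a negative row sum turns into `nan` under a symmetric normalization of the form D^{-1/2} S D^{-1/2}. The 3x3 matrix is a made-up toy example, and the code only mimics the kind of normalization described above; it is not `sklearn`'s actual implementation.

```python
import numpy as np

# Toy cosine-similarity matrix for three hypothetical words. The third word
# is strongly "opposed" to the other two, so its row sum is negative.
S = np.array([
    [ 1.0,  0.2, -0.9],
    [ 0.2,  1.0, -0.8],
    [-0.9, -0.8,  1.0],
])

degree = S.sum(axis=1)  # row sums; the last one is negative

# The symmetric normalization needs sqrt(degree); a negative degree gives nan.
with np.errstate(invalid="ignore"):
    inv_sqrt_degree = 1.0 / np.sqrt(degree)

normalized = inv_sqrt_degree[:, None] * S * inv_sqrt_degree[None, :]
has_nan = bool(np.isnan(normalized).any())

print("row sums:", degree)
print("contains nan:", has_nan)
```

Only the row and column belonging to the word with the negative row sum become `nan`, but that is enough to break the eigendecomposition that follows.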
What is the correct way to handle this problem? The options I see are:
- compute the `pairwise_kernels` and then map the matrix into the `[0; 1]` range by doing `(matrix + 1) / 2`. In this case, antonyms would have a similarity of 0, and synonyms 1. Words without any relation would have a value of 0.5
- use the absolute value of the similarity. Antonyms would be similar, but it would make some sense for clustering.
- use another similarity. But from what I read, cosine is the best similarity for word vectors
- use another clustering algorithm. But from my trials, Spectral Clustering gives the best clusters.
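
For reference, the first two options could be sketched roughly as follows. This is only an illustration under assumptions: the random `vectors` array is a hypothetical stand-in for the skipgram embeddings, the number of clusters is arbitrary, and the sketch does not say which option is the right one. Both variants pass a non-negative matrix to `SpectralClustering` through `affinity="precomputed"`.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import pairwise_kernels

rng = np.random.default_rng(0)

# Hypothetical stand-in for skipgram vectors: 3 groups of 10 "words", 20 dims.
centers = rng.normal(size=(3, 20))
vectors = np.vstack([c + 0.1 * rng.normal(size=(10, 20)) for c in centers])

# Cosine similarities in [-1, 1]; clip to guard against rounding error.
cos = np.clip(pairwise_kernels(vectors, metric="cosine"), -1.0, 1.0)

# Option 1: shift and scale into [0, 1] (antonyms -> 0, unrelated -> 0.5).
shifted = (cos + 1.0) / 2.0

# Option 2: absolute value (antonyms become as similar as synonyms).
absolute = np.abs(cos)

def cluster(affinity_matrix, n_clusters=3):
    """Run spectral clustering on a precomputed, non-negative affinity."""
    model = SpectralClustering(
        n_clusters=n_clusters, affinity="precomputed", random_state=0
    )
    return model.fit_predict(affinity_matrix)

labels_shifted = cluster(shifted)
labels_absolute = cluster(absolute)
print(labels_shifted)
print(labels_absolute)
```

Because both affinity matrices are non-negative, every row sum is positive and the normalization no longer produces `nan`.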