I have a problem related to clustering, where i need to cluster skill set from job domain.
Let’s say, in a resume a candidate can mention they familiarity with amazon s3 bucket. But each people can mention it in any way. For example,
- amazon s3
- aws s3
For a human, we can easily understand these three are exactly equavalent. I can’t use kmeans type of clustering because it can fail in a lot of cases.
- spring framework
- Spring MVC
- Spring Boot
These may fall in same cluster which is wrong. A candidate who knows spring framework might not know sprint boot etc.,
Similarity of word based on embeddings/bow model fail here.
What are the options I have? Currently I manually collected a lot of word variations in a dict format, key is root word value is array of variations of that root word.
Any help is really appreciated?