I have a dataset with several columns.
What I want is to calculate a similarity score on a particular column (for example "fName"), but only between rows that share the same "_id".
       _id   fName    lName   age
    0  ABCD  Andrew   Schulz
    1  ABCD  Andreww          23
    2  DEFG  John     boy
    3  DEFG  Johnn    boy     14
    4  CDGH  Bob      TANNA   13
What I am looking for is whether, for the same id, I am getting similar entries, so that I can remove those entries based on a threshold score. For example, if I run it on the column "fName", I should be able to reduce this dataframe, based on a score threshold, to:
       _id   fName   lName   age
    0  ABCD  Andrew  Schulz  23
    2  DEFG  John    boy     14
    4  CDGH  Bob     TANNA   13
I intend to use pyjarowinkler.
If I had two independent columns to compare (without the group-by part), this is how I would use it:
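    from pyjarowinkler import distance

    # Score each pair of names row by row, then keep rows above the threshold
    df['score'] = [distance.get_jaro_distance(x, y) for x, y in zip(df['name_1'], df['name_2'])]
    df = df[df['score'] > 0.87]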
Can someone suggest a pythonic and fast way of doing this?
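To make the goal concrete, here is a rough sketch of the behaviour I am after. It is only an illustration under a few assumptions: `difflib.SequenceMatcher` from the standard library stands in for the Jaro score (so the 0.87 threshold is not calibrated for it), the helper `label_similar` and the `cluster` column are names I made up, and "fName" is assumed to have no missing values.

```python
from difflib import SequenceMatcher

import pandas as pd

df = pd.DataFrame(
    {
        "_id": ["ABCD", "ABCD", "DEFG", "DEFG", "CDGH"],
        "fName": ["Andrew", "Andreww", "John", "Johnn", "Bob"],
        "lName": ["Schulz", None, "boy", "boy", "TANNA"],
        "age": [None, 23, None, 14, 13],
    }
)


def label_similar(names, threshold):
    """Hypothetical helper: label each name with the first earlier name it resembles."""
    reps = []    # (label, representative name) seen so far in this group
    labels = []
    for name in names:
        for label, rep_name in reps:
            # Stand-in similarity; swap in a Jaro-Winkler score here if preferred
            if SequenceMatcher(None, name, rep_name).ratio() > threshold:
                labels.append(label)
                break
        else:
            label = len(reps)
            reps.append((label, name))
            labels.append(label)
    return labels


# Label similar names separately inside each _id group
clusters = pd.Series(0, index=df.index)
for _, group in df.groupby("_id"):
    clusters.loc[group.index] = label_similar(group["fName"], 0.87)

# first() keeps the first non-null value per column within each cluster
reduced = (
    df.assign(cluster=clusters)
    .groupby(["_id", "cluster"], sort=False)
    .first()
    .reset_index()
    .drop(columns="cluster")
)
print(reduced)
```

This compares every name against the representatives already seen in its group, so it is quadratic in the group size in the worst case, which is part of why I am asking for a faster approach.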
I have since tried using the recordlinkage library for this, and I have ended up with a dataframe called 'matches' that contains the pairs of indexes that are similar. Now I just want to combine the data.
    import recordlinkage

    # Indexation step: only pair up rows that share the same _id
    indexer = recordlinkage.Index()
    indexer.block(left_on='_id')
    candidate_links = indexer.index(df)

    # Comparison step (threshold is defined elsewhere)
    compare_cl = recordlinkage.Compare()
    compare_cl.string('fName', 'fName', method='jarowinkler', threshold=threshold, label='full_name')
    features = compare_cl.compute(candidate_links, df)

    # Classification step: keep the pairs whose comparison passed
    matches = features[features.sum(axis=1) >= 1]
    print(len(matches))
This is how matches looks:
    index1  index2  fName
    0       1       1.0
    2       3       1.0
I need a way to combine the similar rows so that the merged row takes its data from all of the matching rows (for example, filling in a missing age or lName from its match).
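As an illustration of the combining step I have in mind, here is a minimal sketch. It assumes the matched pairs are available as a `MultiIndex` like `matches.index` (hard-coded below as `pairs` so the snippet runs without recordlinkage), and the small `find` helper and `rep` mapping are names I made up. Each row is mapped to the smallest index it is linked to, and `groupby(...).first()` then takes the first non-null value per column.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "_id": ["ABCD", "ABCD", "DEFG", "DEFG", "CDGH"],
        "fName": ["Andrew", "Andreww", "John", "Johnn", "Bob"],
        "lName": ["Schulz", None, "boy", "boy", "TANNA"],
        "age": [None, 23, None, 14, 13],
    }
)

# Stand-in for matches.index (assumption: pairs of row labels from df)
pairs = pd.MultiIndex.from_tuples([(0, 1), (2, 3)], names=["index1", "index2"])

# rep maps every row label to a representative row label
rep = {i: i for i in df.index}


def find(i):
    """Hypothetical helper: follow links until reaching a representative."""
    while rep[i] != i:
        rep[i] = rep[rep[i]]  # shorten the chain as we go
        i = rep[i]
    return i


# Link the two rows of every matched pair, keeping the smaller label as representative
for a, b in pairs:
    ra, rb = find(a), find(b)
    if ra != rb:
        rep[max(ra, rb)] = min(ra, rb)

groups = pd.Series({i: find(i) for i in df.index})

# first() returns the first non-null value per column within each linked group
merged = df.groupby(groups).first()
print(merged)
```

With the sample data this keeps rows 0, 2 and 4 and fills the missing age from the matched row. Taking the first non-null value is only one possible merge rule; it also keeps the first spelling of fName, which may or may not be the preferred one.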