# #StackBounty: #c# #performance The process to calculate the Levenshtien distance of each element of a large data set with every other e…

### Bounty: 50

I am currently working on a problem to find the best matches of data in a List namely “ListA” from another List called “ListB”. Whenever I find a match of an element for “ListA” with any element in “ListB” which has a confidence and accuracy with 70% or greater I add the matched string from List B and the string in List A to a tuple which i further save in a database.

Levenshtien algorithm gives me the distance between the word in ListA and words in ListB, I use the distance to calculate the similarity between the words and compare them with my threshold of 70% and if the values returned is equal or greater than the 70 percent threshold then I add the results which are either 70% or greater to the tuple.

The code that I have written for this process works fine if the records in “ListA” and “ListB” are within thousands of values and if I increase the records to a million it takes about an hour to calculate the distance for each element of the List A.

I need to optimize the process for huge data sets. Please advise where do I need to make the improvements.

My code for the process so far looks like this

``````  public static PerformFuzzyMatch()
{
// Fetch the ListA & List B from SQL tables
var ListACount = await FuzzyMatchRepo.FetchListACount();
var ListB = await FuzzyMatchRepo.FetchListBAsync();

//Split the ListA data to smaller chunks and loop through those chunks
var splitGroupSize = 1000;
var sourceDataBatchesCount = ListACount / splitGroupSize;

// Loop through the smaller chunks of List A
for (int b = 0; b < sourceDataBatchesCount; b++)
{
var currentBatchMatchedWords = new List<Tuple<string, string, double>>();
int skipRowCount = b * splitGroupSize;
int takeRowCount = splitGroupSize;

// Get chunks of data from ListA according to the skipRowCount and takeRowCount
var currentSourceDataBatch = FuzzyMatchRepository.FetchSourceDataBatch(skipRowCount, takeRowCount);

//Loop through the ListB and parallely calculate the distance between chunks of List A and List B data
for (int i = 0; i < ListB.Count; i++)
{
Parallel.For(
0,
currentSourceDataBatch.Count,
new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount * 10 },
cntr =>
{
try
{
// call the Levenshtien Algorithm to calculate the distance between each element of ListB and the smaller chunk of List A.
int leven = LevenshteinDistance(currentSourceDataBatch[cntr], ListB[i]);
int length = Math.Max(currentSourceDataBatch[cntr].Length, ListB[i].Length);
double similarity = double similarity = 1.0 - (double)leven / length;
if ((similarity * 100) >= 70)
{
}
cntr++;
}
catch (Exception ex)
{
exceptions.Enqueue(ex);
}
});
}
}
}
``````

And the algorithm which it calls is to calculate the distance is

`````` public static int LevenshteinDistance(this string input, string comparedTo, bool caseSensitive = false)
{
if (string.IsNullOrWhiteSpace(input) || string.IsNullOrWhiteSpace(comparedTo))
{
return -1;
}

if (!caseSensitive)
{
input = Common.Hashing.InvariantUpperCaseStringExtensions.ToUpperInvariant(input);
comparedTo = Common.Hashing.InvariantUpperCaseStringExtensions.ToUpperInvariant(comparedTo);
}

int inputLen = input.Length;
int comparedToLen = comparedTo.Length;

int[,] matrix = new int[inputLen, comparedToLen];

//initialize
for (var i = 0; i < inputLen; i++)
{
matrix[i, 0] = i;
}
for (var i = 0; i < comparedToLen; i++)
{
matrix[0, i] = i;
}

//analyze
for (var i = 1; i < inputLen; i++)
{
ushort si = input[i - 1];
for (var j = 1; j < comparedToLen; j++)
{
ushort tj = comparedTo[j - 1];
int cost = (si == tj) ? 0 : 1;

int above = matrix[i - 1, j];
int left = matrix[i, j - 1];
int diag = matrix[i - 1, j - 1];
int cell = FindMinimumOptimized(above + 1, left + 1, diag + cost);

//transposition
if (i > 1 && j > 1)
{
int trans = matrix[i - 2, j - 2] + 1;
if (input[i - 2] != comparedTo[j - 1])
{
trans++;
}
if (input[i - 1] != comparedTo[j - 2])
{
trans++;
}
if (cell > trans)
{
cell = trans;
}
}
matrix[i, j] = cell;
}
}
return matrix[inputLen - 1, comparedToLen - 1];
}
``````

Find Minimum Optimized

`````` public static int FindMinimumOptimized(int a, int b, int c)
{
return Math.Min(a, Math.Min(b, c));
}
``````

Get this bounty!!!

This site uses Akismet to reduce spam. Learn how your comment data is processed.