Saturday 27 July 2013

The Levenshtein Distance Algorithm (Edit Distance)

Ever wondered how a spellchecker works? or how the song matching app on your phone works?
Well, it is not as complicated as you might think. Its pretty much a variation in the Levenshtein Distance algorithm or the Edit Distance algorithm, named after Vladimir Levenshtein.

It works like this: To tell the distance or rather the levenshtein distance between two words, all you have to do is find out how many operations are required to convert one word to another.
For example, the distance between "running" and "cunning" would be just one, as all you have to do is replace 'r' with 'c'. Similarly the distance between "roast" and "rest" would be two, one operation is removal of one of the characters 'o' or 'a' and replacing the other with 'e'. So how do we implement  this?

The Algorithm goes like this:
Given the two words(strings in this case), we calculate the number of characters in both and create a matrix having rows and columns as the sizes of the words+1. It does not matter which word is the row and which is the column. The words themselves are not in the matrix, its just a matrix of integers.
So in the example of "rest" and "roast" the matrix would be of size 5x6(or 6x5) and would look like this:

Now, lets represent the position in the matrix as a[i,j]. i-->row and j-->column.
The first element a[0,0] is always initiated as 0. to get any other element say a[i,j] we do the following.
We first compare the two characters, if they are equal then we simply bring down the value from the diagonal element i.e. a[i-1,j-1] and place it in a[i,j].
Else we see minValue = minimum( a[i-1,j] , a[i,j-1], a[i-1,j-1]) and put this minValue+1 in a[i,j].
And follow the same steps repeatedly till all the cells in the matrix are filled.

Now that we have built the matrix, here is the analysis we can make from it:
-- The last element in the matrix gives us the distance between the words.
-- Any intermediate cell gives us number of operations to convert the characters traversed in the row to the characters in the column.
-- The first row a[i,0] and the first column a[0,j] would be incremental as that's the number of operations it would take to convert a null string to the characters traversed.

The final matrix would look like this:

So a spell checker would generally use this distance to suggest the user what he might have been actually looking for by comparing the input with the database and sort the edit distance and based on a certain threshold set would give a suggestion. Similarly a song matching app would convert the recorded audio (after some signal processing) to a string which would be matched with the strings with the database, to calculate this distance and the one with the least value would most likely be the one.
Well, then again this is just a very abstract way putting things, the actual implementation is much more complicated but kinda works on the same principles.

For a simple java implementation click here.
PS: What we just did was "Dynamic Programming". =D
Cheers!

No comments:

Post a Comment