Saturday 27 July 2013

Easy Java Implementation of the Levenshtein distance (Edit Distance) Algorithm

Here I will give a few simple methods to implement the Levenshtein distance (Edit Distance) algorithm. This a good algorithm to implement for beginners in java or rather a programming language in general.
To know more about what this algorithm does and how it works, see my previous post here.

The first method takes the input the two string s to be compared, calls the matrixGenerator method and finally this returns the editDistance value.

int editDistValue(String str1, String str2) {
       
        int editRow = str1.length() + 1; //initialize row
        int editCol = str2.length() + 1; //initialize column

        int[][] editMatrix = new int[editRow][editCol];
        editMatrix = editMatrixGenerator(str1, str2, editMatrix, editRow,
                editCol);

        int editDistance = 999;
        editDistance = editMatrix[editRow - 1][editCol - 1];

        return editDistance;
    }

the editMatrix Generator creates the matrix using the algorithm and generates the matrix.

int[][] editMatrixGenerator(String str1, String str2,
            int[][] editMatrix, int editRow, int editCol) {
       
        int i, j;

        char[] rowString = str1.toCharArray();
        char[] colString = str2.toCharArray();

        for (i = 0; i < editRow; i++) {
         for (j = 0; j < editCol; j++) {
           if (i == 0) {
             editMatrix[0][j] = j;
            }
           if (j == 0) {
             editMatrix[i][0] = i;
            }
           if (i != 0 && j != 0) {
             if (rowString[i - 1] == colString[j - 1]) {
              editMatrix[i][j] = editMatrix[i - 1][j - 1];
             } else {
              editMatrix[i][j] = (Math.min(editMatrix[i - 1][j - 1],
                                   Math.min(editMatrix[i][j - 1],
                                   editMatrix[i - 1][j]))) + 1;
                    }
                }
            }
        }
        return editMatrix;
    }

Finally to get a proper confidence value we send the editDistance value obtained, along with the input to the edSimilarityCal method to get a value between [0,1].

float edSimilarityCal(String string1, String string2, int editDistance) {
       
        float maxLen=string1.length();
        if (maxLen < string2.length())
        maxLen = string2.length();
        if (maxLen == 0.0F)
        return 1.0F;
        else
        return 1.0F - editDistance/maxLen;  
    }


Well that's about it.
Cheers!

The Levenshtein Distance Algorithm (Edit Distance)

Ever wondered how a spellchecker works? or how the song matching app on your phone works?
Well, it is not as complicated as you might think. Its pretty much a variation in the Levenshtein Distance algorithm or the Edit Distance algorithm, named after Vladimir Levenshtein.

It works like this: To tell the distance or rather the levenshtein distance between two words, all you have to do is find out how many operations are required to convert one word to another.
For example, the distance between "running" and "cunning" would be just one, as all you have to do is replace 'r' with 'c'. Similarly the distance between "roast" and "rest" would be two, one operation is removal of one of the characters 'o' or 'a' and replacing the other with 'e'. So how do we implement  this?

The Algorithm goes like this:
Given the two words(strings in this case), we calculate the number of characters in both and create a matrix having rows and columns as the sizes of the words+1. It does not matter which word is the row and which is the column. The words themselves are not in the matrix, its just a matrix of integers.
So in the example of "rest" and "roast" the matrix would be of size 5x6(or 6x5) and would look like this:

Now, lets represent the position in the matrix as a[i,j]. i-->row and j-->column.
The first element a[0,0] is always initiated as 0. to get any other element say a[i,j] we do the following.
We first compare the two characters, if they are equal then we simply bring down the value from the diagonal element i.e. a[i-1,j-1] and place it in a[i,j].
Else we see minValue = minimum( a[i-1,j] , a[i,j-1], a[i-1,j-1]) and put this minValue+1 in a[i,j].
And follow the same steps repeatedly till all the cells in the matrix are filled.

Now that we have built the matrix, here is the analysis we can make from it:
-- The last element in the matrix gives us the distance between the words.
-- Any intermediate cell gives us number of operations to convert the characters traversed in the row to the characters in the column.
-- The first row a[i,0] and the first column a[0,j] would be incremental as that's the number of operations it would take to convert a null string to the characters traversed.

The final matrix would look like this:

So a spell checker would generally use this distance to suggest the user what he might have been actually looking for by comparing the input with the database and sort the edit distance and based on a certain threshold set would give a suggestion. Similarly a song matching app would convert the recorded audio (after some signal processing) to a string which would be matched with the strings with the database, to calculate this distance and the one with the least value would most likely be the one.
Well, then again this is just a very abstract way putting things, the actual implementation is much more complicated but kinda works on the same principles.

For a simple java implementation click here.
PS: What we just did was "Dynamic Programming". =D
Cheers!