Fuzzy Matching with UTL_MATCH

May 18, 2017

in Datatypes, PL/SQL, UTL packages

Fuzzy Match with UTL_MATCH
Oracle’s UTL_MATCH package contains functions to perform fuzzy matching between two strings based on one of these algorithms:

  • Levenshtein Distance
  • Jaro-Winkler Distance

Let’s understand the algorithms and see UTL_MATCH subprograms in action.

Note: It is not necessary to understand the nitty-gritties of the algorithms to use UTL_MATCH. The deep dive is for the keen. For others and for all practical purposes,  a broad ("fuzzy", shall we say) understanding of the subprogams will serve you just fine.

Understanding Levenshtein Distance

The Levenshtein distance between two strings is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into the other. Levenshtein distance is named after Russian scientist Vladimir Levenshtein, who came up with this algorithm in 1965.

Deep Dive: The Levenshtein distance between ‘Steven’ and ‘Stephen’ is 2, since the minimum number of single-character edits to convert ‘Steven’ to ‘Stephen’ is 2:

Step 1: STEVEN -> substitute P for V -> STEPEN
Step 2: STEPEN -> insert H –> STEPHEN

UTL_MATCH Subprograms based on Levenshtein Distance

Levenshtein distance is implemented in UTL_MATCH using two functions:

EDIT_DISTANCE

Numeric value of Levenshtein distance – i.e. the minimum number of single-character edits required to change one word into the other.

For example, the EDIT_DISTANCE between ‘Steven’ and ‘Stephen’ is 2.

EDIT_DISTANCE_SIMILARITY

Normalized Levenshtein distance in the range of 0 (no match) and 100 (perfect match). Calculated as:

ceiling value of
[100 – (EDIT_DISTANCE/(Length of the Longer String being Matched) * 100)]

For example, the EDIT_DISTANCE_SIMILARITY between ‘Steven’ and ‘Stephen’ is ceiling [100 – (2/7*100)] = 72.

 

UTL_MATCH.EDIT_DISTANCE in action:

select utl_match.edit_distance
         ('Steven'
        , 'Stephen') edit_distance
from dual;

When run:

SQL> select utl_match.edit_distance
  2           ('Steven'
  3          , 'Stephen') edit_distance
  4  from dual;

EDIT_DISTANCE
-------------
            2

UTL_MATCH.EDIT_DISTANCE_SIMILARITY in action:

select utl_match.edit_distance_similarity
         ('Steven'
        , 'Stephen') edit_distance_similarity
from dual;

When run:

SQL> select utl_match.edit_distance_similarity
  2           ('Steven'
  3          , 'Stephen') edit_distance_similarity
  4  from dual;

EDIT_DISTANCE_SIMILARITY
------------------------
                      72

Understanding Jaro-Winkler Distance

The Jaro-Winkler distance is a measure of agreement between two strings. This algorithm was originally proposed as Jaro distance in 1989 by Matthew Jaro, and a variant was added in 1999 by William Winkler.

Deep Dive: The formula for Jaro-Winkler distance is more involved than the one for Levenshtein distance.

Jaro distance between strings S1 and S2 is calculated as:

Jaro Distance
where

m = Number of matching characters between S1 and S2, provided the matching characters fall within a "match window" (defined below*)

|S1| = Length of string S1

|S2| = Length of string S2

t = Half the number of transpositions
e.g. if S1 = JOHNSON and S2 = JONHSON, the characters H/N and N/H are transposed leading to t = half of (2 transpositions) = 1

*match window = floor(max (|S1|, |S2|), 2) – 1

To Jaro distance, Winkler added a prefix scale p which considers strings to have a stronger match if they match from the beginning.

Jaro-Winkler distance between strings S1 and S2 is calculated as:

Jaro Winkler Distance

where

dj = Jaro score

l = length of common prefix at the start of the strings up to max 4 characters

p = constant scaling factor for how much the score is adjusted upwards for having common prefixes. Usually – and in Oracle’s UTL_MATCH implementation -  this constant is set to 0.1.

Let’s apply the Jaro-Winkler formula to our old friends ‘Steven’ and ‘Stephen’.

m [Number of matching characters] = (S,T,E,E,N) =  5
|S1| = length(‘Steven’) = 6
|S2| = length(‘Stephen’) = 7
t [Half the number of transpositions] = 0
l [length of common prefix at the start] = Ste = 3

match window = floor(max(6, 7), 2) – 1 = 2

So
Jaro score = 1/3 (5/6 +5/7 + (5 – 0)/5) 
                       = 0.8492

and
Jaro-Winkler score = Jaro score 
                                                  + (length of common prefix  * p (1 – Jaro score))
                                         = 0.8492 + (3 * 0.1 (1 – 0.8492))
                                         = 0.8492 + 0.04524
                                         = 0.89444

UTL_MATCH Subprograms based on Jaro-Winkler Distance

Jaro-Winkler distance is implemented in UTL_MATCH using two functions:

JARO_WINKLER

Value calculated by the Jaro-Winkler formula. The function returns a Binary Double type, which can CAST in SQL as Number if so needed.

For example, the JARO_WINKLER value  on matching ‘Steven’ and ‘Stephen’ is 0.89444.

JARO_WINKLER_SIMILARITY

Normalized Jaro-Winkler result in the range of 0 (no match) and 100 (perfect match). Calculated as:

floor value of (JARO_WINKLER * 100)

For example, the JARO_WINKLER_SIMILARITY between ‘Steven’ and ‘Stephen’ is floor[100 * .89444] = 89.

 

UTL_MATCH.JARO_WINKLER in action:

select cast(utl_match.jaro_winkler
         ('Steven'
        , 'Stephen') as number(5,5)) jaro_winkler
from dual;

When run:

SQL> select cast(utl_match.jaro_winkler
  2           ('Steven'
  3          , 'Stephen') as number(5,5)) jaro_winkler
  4  from dual;

JARO_WINKLER
------------
      .89444

UTL_MATCH.JARO_WINKLER_SIMILARITY in action:

select utl_match.jaro_winkler_similarity
         ('Steven'
        , 'Stephen') jaro_winkler_similarity
from dual;

When run:

SQL> select utl_match.jaro_winkler_similarity
  2           ('Steven'
  3          , 'Stephen') jaro_winkler_similarity
  4  from dual;

JARO_WINKLER_SIMILARITY
-----------------------
                     89

Deep Dive Recap: Let’s try UTL_MATCH’s superheroic fuzzy matching on SUPERMAN / SPIDERMAN

Now that we know the fuzzy match algorithms, we’ll build the expected result on a test case and then validate UTL_MATCH’s actual result.

What does it take to change SPIDERMAN to SUPERMAN?

Leaving aside musings on the lines of "stop him from swinging, make him start flying", it takes 3 steps:

Step 1: SPIDERMAN –> Insert U –> SUPIDERMAN
Step 2: SUPIDERMAN –> Remove D –> SUPERIMAN
Step 3: SUPERIMAN –> Remove I –> SUPERMAN

So the Levenshtein distance between SPIDERMAN and SUPERMAN is 3.

length(SPIDERMAN) = 9 i.e. the longer string
length(SUPERMAN) = 8

The normalized Levenshtein distance between SPIDERMAN and SUPERMAN is ceiling(100 – 3/9*100) = ceil(66.666…) = 67

How much in agreement are SPIDERMAN and SUPERMAN?

I won’t belabor a joke by saying both strive to save the world wearing blue-red costumes and go straight to metrics for Jaro-Winkler distance.

match window = floor(max(9, 8), 2) – 1 = 3

P,E,R,M,A,N are not in the same positions in the two strings but since they are displaced by not more than the match window, they are considered matching characters.

m [Number of matching characters] = (S,P,E,R,M,A,N) =  7
t [Half the number of transpositions] = 0
l [length of common prefix at the start] = S = 1

So
Jaro score = 1/3 (7/9 +7/8 + (7 – 0)/7) 
                       = 1/3 (.77778 + .875 + 1) 
                       = 0.88426

and
Jaro-Winkler score = Jaro score 
                                                  + (length of common prefix  * p (1 – Jaro score))
                                         = 0.88426 + (1 * 0.1 (1 – 0.88426))
                                         = 0.88426 + 0.011574
                                         = 0.89583

Which means that, on fuzzy matching of strings SPIDERMAN and SUPERMAN:

  • UTL_MATCH.EDIT_DISTANCE should return 3
  • UTL_MATCH_EDIT_DISTANCE_SIMILARITY should return 67
  • UTL_MATCH.JARO_WINKLER should return 0.89583
  •  UTL_MATCH.JARO_WINKLER_SIMILARITY should return 89

Do the results tally with the estimates?

Edit distance results:

SQL> select utl_match.edit_distance
  2           ('Superman'
  3          , 'Spiderman') edit_distance
  4       , utl_match.edit_distance_similarity
  5           ('Superman'
  6          , 'Spiderman') edit_distance_similarity
  7  from dual;

EDIT_DISTANCE EDIT_DISTANCE_SIMILARITY
------------- ------------------------
            3                       67

Jaro-Winkler results:

SQL> select cast(utl_match.jaro_winkler
  2           ('Superman'
  3          , 'Spiderman') as number(5,5)) jaro_winkler
  4       , utl_match.jaro_winkler_similarity
  5           ('Superman'
  6          , 'Spiderman') jaro_winkler_similarity
  7  from dual;

JARO_WINKLER JARO_WINKLER_SIMILARITY
------------ -----------------------
      .89583                      89

Yes, UTL_MATCH gives back fuzzy match results exactly as estimated.

Summary

Oracle’s UTL_MATCH package has functions that perform fuzzy matching between two strings and return a score. We saw with examples how these functions work:

EDIT_DISTANCE
EDIT_DISTANCE_SIMILARITY
JARO_WINKLER
JARO_WINKLER_SIMILARITY

UTL_MATCH functions are powerful tools for finding similarities between two records based on identity info (name, email, etc) and spotting potential duplicates.

For Further Reading

{ 0 comments… add one now }

Leave a Comment

Previous post:

Next post: