Oracle’s UTL_MATCH package contains functions to perform fuzzy matching between two strings based on one of these algorithms:

- Levenshtein Distance
- Jaro-Winkler Distance

Let’s understand the algorithms and see UTL_MATCH subprograms in action.

*Note:* It is not necessary to understand the nitty-gritties of the algorithms to use UTL_MATCH. The deep dive is for the keen. For others and for all practical purposes, a broad ("fuzzy", shall we say) understanding of the subprogams will serve you just fine.

### Understanding Levenshtein Distance

The **Levenshtein distance** between two strings is the **minimum number of single-character edits** (insertions, deletions or substitutions) required to change one string into the other. Levenshtein distance is named after Russian scientist Vladimir Levenshtein, who came up with this algorithm in 1965.

**Deep Dive:** The Levenshtein distance between ‘Steven’ and ‘Stephen’ is 2, since the minimum number of single-character edits to convert ‘Steven’ to ‘Stephen’ is 2:

Step 1: STEVEN -> substitute P for V -> STEPEN

Step 2: STEPEN -> insert H –> STEPHEN

### UTL_MATCH Subprograms based on Levenshtein Distance

Levenshtein distance is implemented in UTL_MATCH using two functions:

EDIT_DISTANCE |
Numeric value of Levenshtein distance – i.e. the minimum number of single-character edits required to change one word into the other. For example, the EDIT_DISTANCE between ‘Steven’ and ‘Stephen’ is 2. |

EDIT_DISTANCE_SIMILARITY |
Normalized Levenshtein distance in the range of 0 (no match) and 100 (perfect match). Calculated as: ceiling value of For example, the EDIT_DISTANCE_SIMILARITY between ‘Steven’ and ‘Stephen’ is ceiling [100 – (2/7*100)] = 72. |

#### UTL_MATCH.EDIT_DISTANCE in action:

select utl_match.edit_distance ('Steven' , 'Stephen') edit_distance from dual;

When run:

SQL> select utl_match.edit_distance 2 ('Steven' 3 , 'Stephen') edit_distance 4 from dual; EDIT_DISTANCE ------------- 2

#### UTL_MATCH.EDIT_DISTANCE_SIMILARITY in action:

select utl_match.edit_distance_similarity ('Steven' , 'Stephen') edit_distance_similarity from dual;

When run:

SQL> select utl_match.edit_distance_similarity 2 ('Steven' 3 , 'Stephen') edit_distance_similarity 4 from dual; EDIT_DISTANCE_SIMILARITY ------------------------ 72

### Understanding Jaro-Winkler Distance

The **Jaro-Winkler distance** is a **measure of agreement between two strings**. This algorithm was originally proposed as Jaro distance in 1989 by Matthew Jaro, and a variant was added in 1999 by William Winkler.

**Deep Dive:** The formula for Jaro-Winkler distance is more involved than the one for Levenshtein distance.

Jaro distance between strings S1 and S2 is calculated as:

where

m = Number of matching characters between S1 and S2, provided the matching characters fall within a "match window" (defined below*)

|S1| = Length of string S1

|S2| = Length of string S2

t = Half the number of transpositions

e.g. if S1 = JOHNSON and S2 = JONHSON, the characters H/N and N/H are transposed leading to t = half of (2 transpositions) = 1

*match window = floor(max (|S1|, |S2|), 2) – 1

To Jaro distance, Winkler added a prefix scale p which considers strings to have a stronger match if they match from the beginning.

Jaro-Winkler distance between strings S1 and S2 is calculated as:

where

dj = Jaro score

l = length of common prefix at the start of the strings up to max 4 characters

p = constant scaling factor for how much the score is adjusted upwards for having common prefixes. Usually – and in Oracle’s UTL_MATCH implementation - this constant is set to 0.1.

Let’s apply the Jaro-Winkler formula to our old friends ‘Steven’ and ‘Stephen’.

m [Number of matching characters] = (S,T,E,E,N) = 5

|S1| = length(‘Steven’) = 6

|S2| = length(‘Stephen’) = 7

t [Half the number of transpositions] = 0

l [length of common prefix at the start] = Ste = 3

match window = floor(max(6, 7), 2) – 1 = 2

So

Jaro score = 1/3 (5/6 +5/7 + (5 – 0)/5)

= 0.8492

and

Jaro-Winkler score = Jaro score

+ (length of common prefix * p (1 – Jaro score))

= 0.8492 + (3 * 0.1 (1 – 0.8492))

= 0.8492 + 0.04524

= 0.89444

### UTL_MATCH Subprograms based on Jaro-Winkler Distance

Jaro-Winkler distance is implemented in UTL_MATCH using two functions:

JARO_WINKLER |
Value calculated by the Jaro-Winkler formula. The function returns a Binary Double type, which can CAST in SQL as Number if so needed. For example, the JARO_WINKLER value on matching ‘Steven’ and ‘Stephen’ is 0.89444. |

JARO_WINKLER_SIMILARITY |
Normalized Jaro-Winkler result in the range of 0 (no match) and 100 (perfect match). Calculated as: floor value of (JARO_WINKLER * 100) For example, the JARO_WINKLER_SIMILARITY between ‘Steven’ and ‘Stephen’ is floor[100 * .89444] = 89. |

#### UTL_MATCH.JARO_WINKLER in action:

select cast(utl_match.jaro_winkler ('Steven' , 'Stephen') as number(5,5)) jaro_winkler from dual;

When run:

SQL> select cast(utl_match.jaro_winkler 2 ('Steven' 3 , 'Stephen') as number(5,5)) jaro_winkler 4 from dual; JARO_WINKLER ------------ .89444

#### UTL_MATCH.JARO_WINKLER_SIMILARITY in action:

select utl_match.jaro_winkler_similarity ('Steven' , 'Stephen') jaro_winkler_similarity from dual;

When run:

SQL> select utl_match.jaro_winkler_similarity 2 ('Steven' 3 , 'Stephen') jaro_winkler_similarity 4 from dual; JARO_WINKLER_SIMILARITY ----------------------- 89

### Deep Dive Recap: Let’s try UTL_MATCH’s superheroic fuzzy matching on SUPERMAN / SPIDERMAN

Now that we know the fuzzy match algorithms, we’ll build the expected result on a test case and then validate UTL_MATCH’s actual result.

**What does it take to change SPIDERMAN to SUPERMAN?**

Leaving aside musings on the lines of "stop him from swinging, make him start flying", it takes 3 steps:

Step 1: SPIDERMAN –> Insert U –> SUPIDERMAN

Step 2: SUPIDERMAN –> Remove D –> SUPERIMAN

Step 3: SUPERIMAN –> Remove I –> SUPERMAN

So the Levenshtein distance between SPIDERMAN and SUPERMAN is 3.

length(SPIDERMAN) = 9 i.e. the longer string

length(SUPERMAN) = 8

The normalized Levenshtein distance between SPIDERMAN and SUPERMAN is ceiling(100 – 3/9*100) = ceil(66.666…) = 67

**How much in agreement are SPIDERMAN and SUPERMAN?**

I won’t belabor a joke by saying both strive to save the world wearing blue-red costumes and go straight to metrics for Jaro-Winkler distance.

match window = floor(max(9, 8), 2) – 1 = 3

P,E,R,M,A,N are not in the same positions in the two strings but since they are displaced by not more than the match window, they are considered matching characters.

m [Number of matching characters] = (S,P,E,R,M,A,N) = 7

t [Half the number of transpositions] = 0

l [length of common prefix at the start] = S = 1

So

Jaro score = 1/3 (7/9 +7/8 + (7 – 0)/7)

= 1/3 (.77778 + .875 + 1)

= 0.88426

and

Jaro-Winkler score = Jaro score

+ (length of common prefix * p (1 – Jaro score))

= 0.88426 + (1 * 0.1 (1 – 0.88426))

= 0.88426 + 0.011574

= 0.89583

Which means that, on fuzzy matching of strings SPIDERMAN and SUPERMAN:

**UTL_MATCH.EDIT_DISTANCE**should return 3**UTL_MATCH_EDIT_DISTANCE_SIMILARITY**should return 67**UTL_MATCH.JARO_WINKLER**should return 0.89583-
**UTL_MATCH.JARO_WINKLER_SIMILARITY**should return 89

Do the results tally with the estimates?

Edit distance results:

SQL> select utl_match.edit_distance 2 ('Superman' 3 , 'Spiderman') edit_distance 4 , utl_match.edit_distance_similarity 5 ('Superman' 6 , 'Spiderman') edit_distance_similarity 7 from dual; EDIT_DISTANCE EDIT_DISTANCE_SIMILARITY ------------- ------------------------ 3 67

Jaro-Winkler results:

SQL> select cast(utl_match.jaro_winkler 2 ('Superman' 3 , 'Spiderman') as number(5,5)) jaro_winkler 4 , utl_match.jaro_winkler_similarity 5 ('Superman' 6 , 'Spiderman') jaro_winkler_similarity 7 from dual; JARO_WINKLER JARO_WINKLER_SIMILARITY ------------ ----------------------- .89583 89

Yes, UTL_MATCH gives back fuzzy match results exactly as estimated.

### Summary

Oracle’s UTL_MATCH package has functions that perform fuzzy matching between two strings and return a score. We saw with examples how these functions work:

EDIT_DISTANCE

EDIT_DISTANCE_SIMILARITY

JARO_WINKLER

JARO_WINKLER_SIMILARITY

UTL_MATCH functions are powerful tools for finding similarities between two records based on identity info (name, email, etc) and spotting potential duplicates.

{ 0 comments… add one now }