To tackle a real world problem of the hiring department performing a new hire instead of a rehire for seasonal contractors, I decided to implement the Jaro-Winkler algorithm in SAP. Since there is a policy to not store key information on contractors (such as SSN), there is only the ability to match on a person’s name. Unfortunately, the name is not always typed correctly which leads to the inability to find a previous hire and this leads to hiring a new contractor instead of rehiring a contractor. Using the Jaro-Winkler algorithm, we are now able to suggest possible similar contractors based on the string comparison of first and last name. Jaro-Winkler calculates the distance (a measure of similarity) between strings. The measurement scale is 0.0 to 1.0, where 0.0 is the least likely and 1.0 is a positive match. For our purposes, anything below a 0.8 is not considered useful.
“Jaro–Winkler distance (Winkler, 1990) is a measure of similarity between two strings”.
See Wiki link: http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance for more in depth information.
The Class: ZCL_JARO_WINKLER
Method: STRINGDISTANCE
method stringdistance.
data: firstlen type i, secondlen type i, halflen type i, commonmatches type i, common1 type string, common2 type string.
data: transpositions type i, i type i, j type i.
if ( not ( firstword is initial ) and not ( secondword is initial ) ).
if ( firstword eq secondword ).
stringdistance = totalmatchscore.
else.
firstlen = strlen( firstword ).
secondlen = strlen( secondword ).
halflen = zcl_jaro_winkler=>math_min( num1 = firstlen num2 = secondlen ) / 2 + 1.
common1 = zcl_jaro_winkler=>getcommoncharacters( firstword = firstword secondword = secondword distance = halflen ).
commonmatches = strlen( common1 ).
if ( commonmatches eq 0 ).
stringdistance = totalmismatchscore.
else.
common2 = zcl_jaro_winkler=>getcommoncharacters( firstword = secondword secondword = firstword distance = halflen ).
if ( commonmatches ne strlen( common2 ) ).
stringdistance = totalmismatchscore.
else.
transpositions = 0.
i = 0.
while ( i lt commonmatches ).
if ( common1+i(1) ne common2+i(1) ).
transpositions = transpositions + 1.
endif.
i = i + 1.
endwhile.
transpositions = transpositions / 2.
stringdistance
= commonmatches / ( 3 * firstlen ) + commonmatches / ( 3 * secondlen ) + ( commonmatches -transpositions ) / ( 3 * commonmatches ).
endif.
endif.
endif.
else.
stringdistance = totalmismatchscore.
endif.
endmethod.
Method: GETCOMMONCHARACTERS
method getcommoncharacters.
data: firstlen type i, secondlen type i, i type i, j type i, ch type c, foundit type c, secondword_copy type string.
data: first_half type string, second_half type string, next_start_point type i, remaining_length type i.
if ( not ( firstword is initial ) and not ( secondword is initial ) ).
secondword_copy = secondword.
firstlen = strlen( firstword ).
secondlen = strlen( secondword ).
i = 0.
while i lt firstlen.
ch = firstword+i(1).
foundit = bool_false.
j = zcl_jaro_winkler=>math_max( num1 = 0 num2 = ( i - distance ) ).
while ( ( foundit = bool_false ) and ( j lt zcl_jaro_winkler=>math_min( num1 = ( i + distance ) num2 = secondlen ) ) ).
if ( secondword_copy+j(1) eq ch ).
foundit = bool_true.
concatenate commons ch into commons.
move secondword_copy+0(j) to first_half.
remaining_length = ( secondlen - j ) - 1.
next_start_point = j + 1.
move secondword_copy+next_start_point(remaining_length) to second_half.
concatenate first_half '#' second_half into secondword_copy.
endif.
j = j + 1.
endwhile.
i = i + 1.
endwhile.
else.
clear commons.
endif.
endmethod.
Creating a test program:
This test program requires that a first name and last name be input by the user and then these names are compared against every employee for possible matches.
loop at t_pa0002 assigning <fs>.
translate <fs>-nachn to upper case.
translate <fs>-vorna to upper case.
move <fs>-nachn to l_compare.
call method zcl_jaro_winkler=>stringdistance
exporting
firstword = s_nachn
secondword = l_compare
receiving
stringdistance = <fs>-nachn_score.
move <fs>-vorna to l_compare.
call method zcl_jaro_winkler=>stringdistance
exporting
firstword = s_vorna
secondword = l_compare
receiving
stringdistance = <fs>-vorna_score.
<fs>-total_score = <fs>-nachn_score + <fs>-vorna_score.
endloop.
sort t_pa0002 by total_score descending.
format color col_normal.
write: 80 'Total Score ' color col_total.
uline.
loop at t_pa0002 assigning <fs>.
write: / <fs>-nachn(20), <fs>-nachn_score, <fs>-vorna(20), <fs>-vorna_score.
write 80 <fs>-total_score color col_total.
endloop.
Deliberately misspelling my last name and shortening my first name:
Results:
I only show the top three results. In fact, since I’m creating a combined score of two different measurements, I would likely consider anything below a 1.60 as not useful in real world. So now we have a safeguard to alleviate hiring when we should be re-hiring. Of course, there are many uses for this algorith, such as zip code verification for zip+4. Also, another use is using as a dictionary check for valid words.