Quantcast
Channel: SCN : Document List - ABAP Development
Viewing all articles
Browse latest Browse all 935

Jaro–Winkler Distance Algorithm

$
0
0

To tackle a real world problem of the hiring department performing a new hire instead of a rehire for seasonal contractors, I decided to implement the Jaro-Winkler algorithm in SAP.   Since there is a policy to not store key information on contractors (such as SSN), there is only the ability to match on a person’s name.  Unfortunately, the name is not always typed correctly which leads to the inability to find a previous hire and this leads to hiring a new contractor instead of rehiring a contractor.  Using the Jaro-Winkler algorithm, we are now able to suggest possible similar contractors based on the string comparison of first and last name.  Jaro-Winkler calculates the distance (a measure of similarity) between strings.  The measurement scale is 0.0 to 1.0, where 0.0 is the least likely and 1.0 is a positive match.  For our purposes, anything below a 0.8 is not considered useful.

 

“Jaro–Winkler distance (Winkler, 1990) is a measure of similarity between two strings”.

See Wiki link:  http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance  for more in depth information.

p.jpg

 

The Class:  ZCL_JARO_WINKLER

 

Method: STRINGDISTANCE

 

method stringdistance.

 
data: firstlen type i, secondlen type i, halflen type i, commonmatches type i, common1 type string, common2 type string.

 
data: transpositions type i, i type i, j type i.

 
if ( not ( firstword is initial ) and not ( secondword is initial ) ).

   
if ( firstword eq secondword ).

      stringdistance
= totalmatchscore.

   
else.

      firstlen
= strlen( firstword ).

      secondlen
= strlen( secondword ).

      halflen
= zcl_jaro_winkler=>math_min( num1 = firstlen num2 = secondlen ) / 2 + 1.

      common1
= zcl_jaro_winkler=>getcommoncharacters( firstword = firstword secondword = secondword distance = halflen ).

      commonmatches
= strlen( common1 ).

     
if ( commonmatches eq 0 ).

        stringdistance
= totalmismatchscore.

     
else.

        common2
= zcl_jaro_winkler=>getcommoncharacters( firstword = secondword secondword = firstword distance = halflen ).

       
if ( commonmatches ne strlen( common2 ) ).

          stringdistance
= totalmismatchscore.

       
else.

          transpositions
= 0.

         
i = 0.

         
while ( i lt commonmatches ).

           
if ( common1+i(1) ne common2+i(1) ).

              transpositions
= transpositions + 1.

           
endif.

           
i = i + 1.

         
endwhile.

          transpositions
= transpositions / 2.

          stringdistance

               = commonmatches / ( 3 * firstlen ) + commonmatches / ( 3 * secondlen ) + ( commonmatches -transpositions ) / ( 3 * commonmatches ).

       
endif.

     
endif.

   
endif.

 
else.

    stringdistance
= totalmismatchscore.

 
endif.

endmethod.

 

 

Method: GETCOMMONCHARACTERS

 

method getcommoncharacters.

 
data: firstlen type i, secondlen type i, i type i, j type i, ch type c, foundit type c, secondword_copy type string.

 
data: first_half type string, second_half type string, next_start_point type i, remaining_length type i.

 
if ( not ( firstword is initial ) and not ( secondword is initial ) ).

    secondword_copy
= secondword.

    firstlen
= strlen( firstword ).

    secondlen
= strlen( secondword ).

   
i = 0.

   
while i lt firstlen.

      ch
= firstword+i(1).

      foundit
= bool_false.

      j
= zcl_jaro_winkler=>math_max( num1 = 0 num2 = ( i - distance ) ).

     
while ( ( foundit = bool_false ) and ( j lt zcl_jaro_winkler=>math_min( num1 = ( i + distance ) num2 = secondlen ) ) ).

       
if ( secondword_copy+j(1) eq ch ).

          foundit
= bool_true.

         
concatenate commons ch into commons.

         
move secondword_copy+0(j) to first_half.

          remaining_length
= ( secondlen - j ) - 1.

          next_start_point
= j + 1.

         
move secondword_copy+next_start_point(remaining_length) to second_half.

         
concatenate first_half '#' second_half into secondword_copy.

       
endif.

        j
= j + 1.

     
endwhile.

     
i = i + 1.

   
endwhile.

 
else.

   
clear commons.

 
endif.

endmethod.

 

 

Creating a test program:

 

This test program requires that a first name and last name be input by the user and then these names are compared against every employee for possible matches.

 

    loop at t_pa0002 assigning <fs>.

     
translate <fs>-nachn to upper case.

     
translate <fs>-vorna to upper case.

     
move <fs>-nachn to l_compare.

     
call method zcl_jaro_winkler=>stringdistance

       
exporting

          firstword     
= s_nachn

          secondword    
= l_compare

        receiving

          stringdistance
= <fs>-nachn_score.

     
move <fs>-vorna to l_compare.

     
call method zcl_jaro_winkler=>stringdistance

       
exporting

          firstword     
= s_vorna

          secondword    
= l_compare

        receiving

          stringdistance
= <fs>-vorna_score.

      <fs>
-total_score = <fs>-nachn_score + <fs>-vorna_score.

   
endloop.

   
sort t_pa0002 by total_score descending.

   
format color col_normal.

   
write: 80 'Total Score      ' color col_total.

   
uline.

   
loop at t_pa0002 assigning <fs>.

     
write: / <fs>-nachn(20), <fs>-nachn_score, <fs>-vorna(20), <fs>-vorna_score.

     
write 80 <fs>-total_score color col_total.

   
endloop.

 

Deliberately misspelling my last name and shortening my first name:

p.jpg

Results:

p.jpg

 

I only show the top three results.  In fact, since I’m creating a combined score of two different measurements, I would likely consider anything below a 1.60 as not useful in real world.  So now we have a safeguard to alleviate hiring when we should be re-hiring.  Of course, there are many uses for this algorith, such as zip code verification for zip+4.  Also, another use is using as a dictionary check for valid words.


Viewing all articles
Browse latest Browse all 935

Trending Articles



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>