Microsoft KB Archive/100365: Difference between revisions
(importing KB archive) |
m (Text replacement - ">" to ">") |
||
(2 intermediate revisions by the same user not shown) | |||
Line 54: | Line 54: | ||
== SUMMARY == | == SUMMARY == | ||
The '''DIFFERENCE'''() and '''SOUNDEX'''() functions in Transact-SQL allow searches on character strings that | The '''DIFFERENCE'''() and '''SOUNDEX'''() functions in Transact-SQL allow searches on character strings that "sound similar." '''SOUNDEX'''() converts each string into a 4-digit code. '''DIFFERENCE'''() can then be used to evaluate the level of similarity between the soundexes for two strings as returned by '''SOUNDEX'''(). For example, these functions could be used in a case where you wanted to look at all rows that sound like "Erickson," so it should find those with "Erickson," "Erikson," "Ericson," "Ericksen," "Ericsen," and so on. | ||
</div> | </div> | ||
Line 67: | Line 67: | ||
sx2 = soundex(a2) | sx2 = soundex(a2) | ||
where soundex returns: | where soundex returns: | ||
<alpha><numeric><numeric><numeric> | |||
</pre> | </pre> | ||
<br /> | <br /> | ||
Line 76: | Line 76: | ||
Otherwise:<br /> | Otherwise:<br /> | ||
* if | * if <alpha> of sx1 is same as <alpha> of sx2, then starting level is 1; otherwise starting level is 0. Now, looping through sx1 and sx2, the level starts to "grow" by comparing one character in sx2 to all characters in sx1. If there is a match, then we increment the level and the next scan on sx1 will start from the location of the match. If no match exists, we compare the next character in sx2 to the entire 4 character list of sx1. Thus, our pointer in sx2 moves along one character at a time for every iteration and the pointer to sx1 always starts at the beginning of sx1 except in the case of a match and the location of the match becomes the starting point in sx1 for the next iteration. Whenever there is a match the level is increased. A 4 means all of the soundex characters are the same in both strings (location and value). A value of 0 means that there was no value match. | ||
The following example illustrates this process:<br /> | The following example illustrates this process:<br /> | ||
Line 82: | Line 82: | ||
<pre class="fixed_text">sx1 sx2 | <pre class="fixed_text">sx1 sx2 | ||
------------ | ------------ | ||
A120 A102 | A120 A102 <-- soundex values | ||
.... .... | .... .... | ||
0123 0123 | 0123 0123 <-- character position | ||
</pre> | </pre> | ||
<br /> | <br /> |
Latest revision as of 09:16, 20 July 2020
Article ID: 100365
Article Last Modified on 6/29/2004
APPLIES TO
- Microsoft SQL Server 7.0 Standard Edition
- Microsoft SQL Server 4.21a Standard Edition
- Microsoft SQL Server 6.0 Standard Edition
- Microsoft SQL Server 6.5 Standard Edition
- Microsoft SQL Server 6.5 Service Pack 1
- Microsoft SQL Server 6.5 Service Pack 2
- Microsoft SQL Server 6.5 Service Pack 3
- Microsoft SQL Server 6.5 Service Pack 4
- Microsoft SQL Server 6.5 Service Pack 5a
- Microsoft SQL Server 6.5 Service Pack 5a
- Microsoft SQL Server 7.0 Service Pack 1
This article was previously published under Q100365
SUMMARY
The DIFFERENCE() and SOUNDEX() functions in Transact-SQL allow searches on character strings that "sound similar." SOUNDEX() converts each string into a 4-digit code. DIFFERENCE() can then be used to evaluate the level of similarity between the soundexes for two strings as returned by SOUNDEX(). For example, these functions could be used in a case where you wanted to look at all rows that sound like "Erickson," so it should find those with "Erickson," "Erikson," "Ericson," "Ericksen," "Ericsen," and so on.
MORE INFORMATION
The algorithm for determining the level of similarity between two character strings is outlined below:
given: level = difference(a1, a2) then sx1 = soundex(a1) sx2 = soundex(a2) where soundex returns: <alpha><numeric><numeric><numeric>
The algorithm first generates the soundex of a1 (sx1) and a2 (sx2) then:
- if all characters in sx2 match all characters in sx1 (position respective) then level = 4.
Otherwise:
- if <alpha> of sx1 is same as <alpha> of sx2, then starting level is 1; otherwise starting level is 0. Now, looping through sx1 and sx2, the level starts to "grow" by comparing one character in sx2 to all characters in sx1. If there is a match, then we increment the level and the next scan on sx1 will start from the location of the match. If no match exists, we compare the next character in sx2 to the entire 4 character list of sx1. Thus, our pointer in sx2 moves along one character at a time for every iteration and the pointer to sx1 always starts at the beginning of sx1 except in the case of a match and the location of the match becomes the starting point in sx1 for the next iteration. Whenever there is a match the level is increased. A 4 means all of the soundex characters are the same in both strings (location and value). A value of 0 means that there was no value match.
The following example illustrates this process:
sx1 sx2 ------------ A120 A102 <-- soundex values .... .... 0123 0123 <-- character position
iteration 1: level starts at 1 because sx2[0] == sx1[0]
iteration 2:
start compare with sx2[1] and sx1[1] because of match
sx2[1] == sx1[1] so level is now 2
iteration 3:
start compare with sx2[2] and sx1[2] because of match
sx2[2] == sx1[3] so level is now 3
We have now run out of characters in sx1 as a match was on the last
character of sx1 so difference returns a value of 3.
Some more examples (remember we are always comparing sx2 to sx1):
sx1 sx2 difference -------------------------- A123 A123 4 A123 B123 3 A321 A123 2 ^ ^ ^^ |--+----|| (1)|-----| (2)
Additional query words: Windows NT
Keywords: kbinfo kbother KB100365