At the beginning of the last century, the US census needed an efficient and intelligent system to index information about people. Soundex, NYSIIS and Metaphone are some of the algorithms that were used to store information about people by indexing their last names. The problems listed below exist in almost all phonetic algorithms.
In simple terms, Soundex is a phonetic hashing system that encodes words written with the 26 English letters as a combination of one letter plus three digits. This code describes what the original word sounds like. Soundex was invented and patented by Robert C. Russell of Pittsburgh, Pennsylvania, who received U.S. patent 1,261,167 for it on April 2, 1918 (the patent has since expired). A lot of literature covering the details of Soundex can be found on the web. Soundex has been incorporated into almost all major databases as a built-in function. Other algorithms may not have implementations in commercial relational databases such as Oracle, SQL Server, UDB etc.
But with the simplicity of Soundex did not come the sophistication needed in today's world. Soundex worked well a century ago and for years afterwards, until people started noticing problems. First of all, Soundex was designed only for the 26 letters of English, and its sound mapping covers Anglo-Saxon names only. It fails for many other names, such as Latin and Chinese names. Other algorithms such as NYSIIS, Metaphone and Double Metaphone came into existence later. These took care of some of the phonetic issues but still lacked serious quality. As described below with examples, Soundex, NYSIIS and Metaphone are not accurate and should not be used for "Professional Grade" applications.
Consider the results of studies:
- Only 33% of the matches that would be returned by Soundex would be correct. Even more significant was the finding that fully 25% of correct matches would fail to be discovered by Soundex. (Alan Stanier, September 1990, Computers in Genealogy, Vol. 3, No. 7)
- Only 36.37% of Soundex returns were correct, while more than 60% of correct names were never returned by Soundex. (A.J. Lait and B. Randell, 1996)
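The one-letter-plus-three-digit code described above can be sketched in a few lines of Python. This is a minimal illustration of the commonly documented American Soundex rules, not the code used by any particular database. The names `soundex` and `CODES` are chosen for this example, and dropping every character outside A-Z is an assumption.

```python
# Letter groups of American Soundex; vowels, H, W and Y carry no digit.
CODES = {}
for digit, group in enumerate(("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1):
    for letter in group:
        CODES[letter] = str(digit)


def soundex(name: str) -> str:
    """Return a 4-character key: first letter plus three digits."""
    # Assumption: anything outside A-Z (spaces, apostrophes, accented letters) is dropped.
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")  # the first letter's own code suppresses an equal neighbour
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code after it counts again
    # Pad with zeros and truncate to one letter plus three digits.
    return (first + "".join(digits) + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Names that sound alike, such as Robert and Rupert, receive the same key, which is what makes the code usable as an index.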
Soundex-based name matching solutions have major problems; a few of them are listed below:
- Only 26 English Letters!
Soundex is English-oriented. It does not support any characters beyond the 26 basic letters of English. Extended character sets are not supported, so names with unusual letters (like æ, ø, or Ð) may not be retrieved correctly. Different operators may map these characters in completely different ways. This problem is very common with northern European names.
- First Letter in the name
Algorithms like Soundex and NYSIIS depend on the first letter of the "word" or "token" to generate the key. Someone looking for Phiffer could hear the name and spell it Fiffer. Tsunami and Sunami should each be retrieved by a search for the other. The same is true of Phillip and Fillip. There would be a lot of false negatives in these cases.
- Intolerance of Typos and Noise
It is impossible to guarantee that an $800/hour operator will type every name correctly, let alone an $8/hour data entry clerk. Typos and noise are a fact of system data input. If the operator typed "Robetr" instead of "Robert", a key-based approach will not be able to fetch the "Robert" that we are looking for. At this point it does not matter whether you are using Soundex or NYSIIS.
- Same Sound Different Spelling
Soundex and NYSIIS fail for names that use silent letters and silent sounds. Some examples would be:
- Crieghton ~ Kreiton ~ Cryton
- Mc Laughlin ~ MacLaulin
- DeSouza ~ D'Souza
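The first-letter, typo and silent-letter failures described above are easy to reproduce. The sketch below restates the standard American Soundex rules in compact form so that it runs on its own, then compares the keys of some of the example names. The helper names `soundex` and `same_key` are illustrative and not part of any library, and dropping characters outside A-Z is an assumption.

```python
import re

# Letter-to-digit table of American Soundex; vowels, H, W and Y have no digit.
_DIGITS = str.maketrans("BFPVCGJKQSXZDTLMNR", "111122222222334556")


def soundex(name: str) -> str:
    """Compact American Soundex (assumption: characters outside A-Z are dropped)."""
    letters = re.sub(r"[^A-Z]", "", name.upper())
    if not letters:
        return ""
    # Remove H and W after the first letter so they do not separate equal digits.
    coded = (letters[0] + re.sub(r"[HW]", "", letters[1:])).translate(_DIGITS)
    # Collapse runs of the same digit, then discard the first letter's own code.
    coded = re.sub(r"(\d)\1+", r"\1", coded)[1:]
    digits = re.sub(r"\D", "", coded)  # drop the remaining vowels
    return (letters[0] + digits + "000")[:4]


def same_key(a: str, b: str) -> bool:
    """True when a key-based lookup for one name would find the other."""
    return soundex(a) == soundex(b)


pairs = [
    ("Phiffer", "Fiffer"),     # first letter differs: P160 vs F160
    ("Tsunami", "Sunami"),     # silent first letter: T255 vs S550
    ("Robert", "Robetr"),      # transposition typo: R163 vs R136
    ("Crieghton", "Kreiton"),  # C623 vs K635
    ("Kreiton", "Cryton"),     # K635 vs C635
]
for a, b in pairs:
    print(f"{a} -> {soundex(a)}   {b} -> {soundex(b)}   match: {same_key(a, b)}")
```

Every pair prints `match: False`, so none of these spellings would retrieve the other through a Soundex key lookup.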
- False Positives
Look up the Soundex code for "BUSH" in the census last name list. You will see that 200 other last names show up. Now imagine that some of those 200 last names are popular. Your name search result will be huge, and users will end up wasting a lot of their valuable time scanning through these false positives. Soundex, NYSIIS and Metaphone all end up with these precision problems.
- Name Sequence Variation
In today's global economy, with its globalized name database systems, your applications should be able to support multicultural and ethnically diverse names. The American "First Name", "Middle Initial", "Last Name" style is not followed throughout the world. There are parts of the world where people don't have middle names, and other cultures where names contain more than seven components. Some cultures put the last name first and the first name last. Soundex and NYSIIS not only fail to address the problem but may create bigger problems in your application. Precision may suffer heavily if the code is generated for the entire name.
- Multi-Cultural Diverse Name Databases
A name spelled one way in one country may be spelled and pronounced very differently in a neighboring country. These problems exist even among different tribes living in the same state. The problem is compounded by a system user or operator who already knows a third spelling of the name. A simple key-based approach cannot effectively address this huge problem. Imagine adding a character set transliteration solution to this problem! Not all names are written as strings of 26-letter ASCII characters going left to right.
- Long Names with lots of Tokens/Components
Many cultures in the world have names that can be more than seven words long. Some of these words are very common and can be found in almost every name. Lots of names include a title and a suffix. Solutions with a key-based approach become extremely inefficient, if not useless. Consider these examples:
- De Luca ~ DeLuca ~ D'Luca
- Abdul Rahim ~ Abd'al Rahim ~ Abderrahem
- Von Der Thun ~ Vanderthun
- Ali bin Ahmed bin saleh bin talal Al-watani
A Soundex-based approach would drive the operator up the wall.
- Unranked results with no match score
Soundex and NYSIIS at most provide you with similar names. They have no built-in mechanism to intelligently rank the results based on phonetic similarity or string similarity. As a result, the operator gives up looking for the name after a couple of pages. So even if the right record was fetched, the operator never looked at it because it was on the third page.
- Nick Names and Variants
Nicknames are a very important part of everyday life. People very frequently use alternate names without giving it a lot of thought. The Soundex codes for "John" and "Jack" are not the same. In addition to nicknames, variants of these names are often used. Soundex and NYSIIS fail to retrieve these equivalent names.
- Initials, Abbreviations, Prefixes, Suffixes, Titles, Qualifiers
Initials and abbreviations are used in names every day, sometimes to conserve space and other times as a preference of the client. If the database record for a person has been entered as W. H. Perry, the key-based application will not be able to fetch William Harry Perry. Mohammad can be abbreviated as Md., Mmd., Mhd. or Mohd. There are numerous such examples of abbreviations. Titles and qualifiers may occur at a much higher frequency, and in such scenarios the key-based approach is overwhelmed.
- Company Names
Corporate names are very common in contact databases. Soundex and NYSIIS work more effectively on a single token (one word), so searching for a company name in a huge database may fetch a lot of false positives. At the same time, very relevant names could be missed by the mail search. Some words repeat in almost every company name (e.g. Company, Co., Inc., Corporation, Corp, LLC, Pvt Ltd. etc.). Imagine the false positives when you search for a company name with one of the above words in it.
- Aliases
People are often known by their aliases as well as by their full formal names. Soundex and other key-based algorithms do not have the intelligence to figure out aliases in name strings (e.g. Francis Hernandez A.K.A. Paco).
- Joint Names
We often find the names of a husband and wife on the same line (e.g. Mrs Jane and Mr John Doe). In this scenario, Soundex does not understand that there are two people in the name field and that searching for either or both of them should bring up the above record.
The above is a small subset of the problems that are obvious. There are numerous such examples showing that Soundex/NYSIIS are not the solution that should be implemented by "Professional Grade" applications.
idMatch™ technology addresses the problems mentioned above and is extremely flexible, able to incorporate rules that can address unforeseen problems and issues. For more information about idMatch technology please visit http://www.idmatchsystems.com/products/idmatch