- Implement the Soundex algorithm (detailed below)
- Within the CRB::SpellCheck namespace, implement the SoundexSpellCheck class. This class will populate its dictionary from http://www.tartarus.org/martin/PorterStemmer/voc.txt and then have a method called spell_check(word) which searches the dictionary for whether the word exists or not. Return true if the word exists.
The Soundex algorithm, as explained by Knuth's The Art of Computer Programming, Volume 3, Sorting and Searching, goes like this:
- Retain the first letter of the name and drop all occurrences of a, e, h, i, o, u, w, y in other positions.
- Assign the following numbers to the remaining letters after the
first:
Letter(s) Value b, f, p, v 1 c, g, j, k, q, s, x, z 2 d, t 3 l 4 m, n 5 r 6 - If two or more letters with the same code were adjacent in the original name (before step 1), or adjacent except for intervening h's and w's, omit all but the first.
- Convert to the form "letter, digit, digit, digit" by adding trailing zeros (if there are less than three digits), or by dropping rightmost digits (if there are more than three).
For example, the names Euler, Gauss, Hilbert, Knuth, LLoyd, Lukasiewicz, and Wachs have the respective codes E460, G200, H416, K530, L300, L222, W200. Of course, this system will bring together names that are somewhat different, as well as names that are similar; the same seven codes would be obtained for Ellery, Ghosh, Heilbronn, Kant, Liddy, Lissajous, and Waugh. And on the otherhand a few names like Rogers and Rodgers, or Sinclair and St. Clair, or Tchebysheff and Chebyshev, remain separate.
In this writeup, we'll walk through a couple of implementations of the soundex algorithm in ruby and then how to use the algorithm as a spellchecker.
Let's start with a simple set of tests, the ones specified by Knuth:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
require 'test/unit' require 'soundex' class SoundexTest < Test::Unit::TestCase def test_soundex_cases test_cases = {"Euler" => "E460", "Gauss" => "G200", "Hilbert" => "H416", "Knuth" => "K530", "LLoyd" => "L300", "Lukasiewicz" => "L222", "Wachs" => "W200", "Ellery" => "E460", "Ghosh" => "G200", "Heilbronn" => "H416", "Kant" => "K530", "Liddy" => "L300", "Lissajous" => "L222", "Waugh" => "W200" } test_cases.keys.each do |key| assert_equal test_cases[key], key.soundex end end end |
We'll come back to the tests later and develop some better tests, but these will do for a start.
Here is a first implementation of the Soundex algorithm (liberal comments added):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
class String def soundex # convert ourself to uppercase and remove any non alpha characters s = self.upcase.gsub(/[^A-Z]/,'') # save the first character out = s.slice(0,1) # remove all instances of H & W s.gsub!(/[HW]/,'') # substitue our codes and remove duplicates s.tr!("BFPVCGJKQSXZDTLMNR","111122222222334556").squeeze! # if the first letter is a vowel, then adjust accordingly... n = (s.slice(0,1) =~ /[AEIOUY]/) ? 0 : 1 # remove vowels s.gsub!(/[AEIOUY]/,'') # add the first three digits to our output out<<s.slice(n,3) # while out output < 4 characters, then pad it to until out.length == 4 out << '0' end out end end |
Parts of this are a bit awkward, but it does pass the tests.
One thing to note is that the methods ending in '!' only return the
reference to the object if they change the object. If the object
was not changed, then the method returns nil. This means that you
need to be very careful when chaining methods with '!' in their name.
That said, this method will fail, even though it passes our test were
we to pass in "A".soundex:
>> "A".soundex
NoMethodError: undefined method `squeeze!' for nil:NilClass
from (irb):11:in `soundex'
from (irb):26
from :0
This is because the tr! returns nil.
We need some better tests!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 |
require 'test/unit' require 'soundex' class SoundexTest < Test::Unit::TestCase def test_a assert_equal "A000", "a".soundex, "'a' failed the soundex check" end def test_case_insensitivity assert_equal "test".soundex, "TEST".soundex, "Failed case insensitivity check." end def test_empty_string assert_equal "", "".soundex, "Failed empty string" end def test_soundex_cases test_cases = {"Euler" => "E460", "Gauss" => "G200", "Hilbert" => "H416", "Knuth" => "K530", "LLoyd" => "L300", "Lukasiewicz" => "L222", "Wachs" => "W200", "Ellery" => "E460", "Ghosh" => "G200", "Heilbronn" => "H416", "Kant" => "K530", "Liddy" => "L300", "Lissajous" => "L222", "Waugh" => "W200" } test_cases.keys.each do |key| assert_equal test_cases[key], key.soundex, "#{key} failed the soundex check." end end end |
I've gone in and manually calculated the Soundex value of 'A' -- A000 and used that as a test. Now that we know our initial code doesn't work, let's work on the code: (again, I've gone in and added liberal comments)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
class String def soundex # self removed; don't need it. Due to state, self is assumed. # upcase ourself and globally remove any non-alpa s = upcase.gsub(/[^A-Z]/,'') # If we're empty, then return an empty string return "" if s == "" # save our first character first = s[0,1] # delete all H & W s.delete!("HW") # Remap all characters and squeeze # X should *never* happen. It's a placeholder s.tr_s!("ABCDEFGHIJKLMNOPQRSTUVWXYZ", "0123012X02245501262301X202") # remove the first character; we don't want it s.slice!(0,1) # Now remove the vowels s.delete!("0") # output, padding as necessary "#{first}#{s}000"[0,4] end end |
Let's try the tests:
matt@hunin$ ruby soundex2_test.rb Loaded suite soundex2_test Started .... Finished in 0.003817 seconds. 4 tests, 17 assertions, 0 failures, 0 errors
I like this version a lot better. Not only does it work, but it just feels a lot better. There's less special cases -- we aren't checking for special cases, we've cut down on regular expressions and we've also handled the case where an "empty" string exists -- one in which there are no alpha characters. All in all, I think this is a lot cleaner.
So now on to the spellchecker. We'd decided on using a bucket hash to implement the spellchecker -- that is a hash table where when there are duplicate keys the duplicate entries are held in a data structure such as an array or linked list.
Let's write some tests of our spellchecker:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 |
require 'test/unit' require 'crb/soundex_spellcheck' require 'singleton' class Factory include Singleton attr_reader :spellchecker def initialize @spellchecker = CRB::SoundexSpellcheck.new end end class SoundexSpellcheckTest < Test::Unit::TestCase def test_empty assert_equal false, Factory.instance.spellchecker.spell_check(""), "Failed empty string check" end def test_cat_lowercase assert_equal true, Factory.instance.spellchecker.spell_check("cat"), "Failed to find cat" end def test_cat_capitalized assert_equal true, Factory.instance.spellchecker.spell_check("Cat"), "Failed to find Cat" end def test_cat_capitalized assert_equal true, Factory.instance.spellchecker.spell_check("CAT"), "Failed to find CAT" end def test_cat_misspelled assert_equal false, Factory.instance.spellchecker.spell_check("CTA"), "Thought CTA was a word." end def test_multiple_words_at_once assert_equal false, Factory.instance.spellchecker.spell_check("This is a test"), "Thought 'this is a test' was a word." end end |
Since I wrote these tests I've read Dan Manges' Fixing Fixtures with Factory, so these could be cleaned up a bit to simplify, but, that said, let's go on.
Here's the spellchecker implementation -- Ruby's hash and array make the bucket hash very easy to implement. I've added comments to explain what I'm doing:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 |
require 'soundex' require 'crb/word_list' module CRB class SoundexSpellcheck # this is an array of the words in our dictionary include CRB::WordList def initialize # Here's our hash @dictionary = { } # step through the array of words wordlist.each do |word| # get the soundex value of the word key = word.soundex # get the bucket; if the bucket does not exist, # create an empty bucket bucket = @dictionary[key] || [] # add the word to the bucket bucket << word # associate the bucket with the key @dictionary[key] = bucket end end def spell_check(word="") # get the soundex value for the word key = word.soundex # get the bucket associated with the soundex value bucket = @dictionary[key] || [] # is our word in the bucket (dictionary?) bucket.include? word.downcase end end end |
Pretty simple, eh? Now, the missing module looks something like this; I've truncated it because, well, it's rather large:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
module CRB module WordList WordList_words = %w[a aaron abaissiez abandon abandoned abase abash abate # ... zenith zephyrs zir zo zodiac zodiacs zone zounds zwagger] def wordlist WordList_words end end end |
Ok, so why did I do it this way? Well.... a couple of reasons.
First, this wasn't so much an exercise in file input/output as for
working with data structures. Also I didn't want to have dependencies
on a file which may or may not be available. Of course, this makes it
difficult to add words to the dictionary. That can be easily remedied
with an add_word method.
