Spellchecker pt 1: Soundex 0

Posted by Matt Williams

Our challenge is to:
  1. Implement the Soundex algorithm (detailed below)
  2. Within the CRB::SpellCheck namespace, implement the SoundexSpellCheck class. This class will populate its dictionary from http://www.tartarus.org/martin/PorterStemmer/voc.txt and then have a method called spell_check(word) which searches the dictionary for whether the word exists or not. Return true if the word exists.

The Soundex algorithm, as explained by Knuth's The Art of Computer Programming, Volume 3, Sorting and Searching, goes like this:

  1. Retain the first letter of the name and drop all occurrences of a, e, h, i, o, u, w, y in other positions.
  2. Assign the following numbers to the remaining letters after the first:
    Letter(s)Value
    b, f, p, v 1
    c, g, j, k, q, s, x, z 2
    d, t3
    l4
    m, n5
    r6
  3. If two or more letters with the same code were adjacent in the original name (before step 1), or adjacent except for intervening h's and w's, omit all but the first.
  4. Convert to the form "letter, digit, digit, digit" by adding trailing zeros (if there are less than three digits), or by dropping rightmost digits (if there are more than three).

For example, the names Euler, Gauss, Hilbert, Knuth, LLoyd, Lukasiewicz, and Wachs have the respective codes E460, G200, H416, K530, L300, L222, W200. Of course, this system will bring together names that are somewhat different, as well as names that are similar; the same seven codes would be obtained for Ellery, Ghosh, Heilbronn, Kant, Liddy, Lissajous, and Waugh. And on the otherhand a few names like Rogers and Rodgers, or Sinclair and St. Clair, or Tchebysheff and Chebyshev, remain separate.

In this writeup, we'll walk through a couple of implementations of the soundex algorithm in ruby and then how to use the algorithm as a spellchecker.

Let's start with a simple set of tests, the ones specified by Knuth:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27

require 'test/unit'
require 'soundex'

class SoundexTest < Test::Unit::TestCase
  def test_soundex_cases
    test_cases = {"Euler" => "E460",
      "Gauss" => "G200",
      "Hilbert" => "H416",
      "Knuth" => "K530",
      "LLoyd" => "L300",
      "Lukasiewicz" => "L222",
      "Wachs" => "W200",
      "Ellery" => "E460",
      "Ghosh" => "G200",
      "Heilbronn" => "H416",
      "Kant" => "K530",
      "Liddy" => "L300",
      "Lissajous" => "L222",
      "Waugh" => "W200"
    }

    test_cases.keys.each do |key|  
      assert_equal test_cases[key], key.soundex
    end
  end
end

We'll come back to the tests later and develop some better tests, but these will do for a start.

Here is a first implementation of the Soundex algorithm (liberal comments added):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24

class String
  def soundex
    # convert ourself to uppercase and remove any non alpha characters
    s = self.upcase.gsub(/[^A-Z]/,'')
    # save the first character
    out = s.slice(0,1)
    # remove all instances of H & W
    s.gsub!(/[HW]/,'')
    # substitue our codes and remove duplicates
    s.tr!("BFPVCGJKQSXZDTLMNR","111122222222334556").squeeze!
    # if the first letter is a vowel, then adjust accordingly...
    n = (s.slice(0,1) =~ /[AEIOUY]/) ? 0 : 1
    # remove vowels
    s.gsub!(/[AEIOUY]/,'')
    # add the first three digits to our output
    out<<s.slice(n,3)
    # while out output < 4 characters, then pad it to 
    until out.length == 4
      out << '0'
    end
    out
  end
end

Parts of this are a bit awkward, but it does pass the tests.

One thing to note is that the methods ending in '!' only return the reference to the object if they change the object. If the object was not changed, then the method returns nil. This means that you need to be very careful when chaining methods with '!' in their name. That said, this method will fail, even though it passes our test were we to pass in "A".soundex:

>> "A".soundex
NoMethodError: undefined method `squeeze!' for nil:NilClass
        from (irb):11:in `soundex'
        from (irb):26
        from :0

This is because the tr! returns nil.

We need some better tests!

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40

require 'test/unit'
require 'soundex'

class SoundexTest < Test::Unit::TestCase

  def test_a
    assert_equal "A000", "a".soundex, "'a' failed the soundex check"
  end
  
  def test_case_insensitivity
    assert_equal "test".soundex, "TEST".soundex, "Failed case insensitivity check."
  end
  
  def test_empty_string
    assert_equal "", "".soundex, "Failed empty string"
  end
  
  def test_soundex_cases
    test_cases = {"Euler" => "E460",
      "Gauss" => "G200",
      "Hilbert" => "H416",
      "Knuth" => "K530",
      "LLoyd" => "L300",
      "Lukasiewicz" => "L222",
      "Wachs" => "W200",
      "Ellery" => "E460",
      "Ghosh" => "G200",
      "Heilbronn" => "H416",
      "Kant" => "K530",
      "Liddy" => "L300",
      "Lissajous" => "L222",
      "Waugh" => "W200"
    }

    test_cases.keys.each do |key|  
      assert_equal test_cases[key], key.soundex, "#{key} failed the soundex check."
    end
  end
end

I've gone in and manually calculated the Soundex value of 'A' -- A000 and used that as a test. Now that we know our initial code doesn't work, let's work on the code: (again, I've gone in and added liberal comments)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24

class String
  def soundex
    # self removed; don't need it.  Due to state, self is assumed.
    # upcase ourself and globally remove any non-alpa
    s = upcase.gsub(/[^A-Z]/,'')
    # If we're empty, then return an empty string
    return "" if s == ""
    # save our first character
    first = s[0,1]
    # delete all H & W
    s.delete!("HW")
    # Remap all characters and squeeze
    # X should *never* happen.  It's a placeholder
    s.tr_s!("ABCDEFGHIJKLMNOPQRSTUVWXYZ",
            "0123012X02245501262301X202")
    # remove the first character; we don't want it
    s.slice!(0,1) 
    # Now remove the vowels
    s.delete!("0")
    # output, padding as necessary
    "#{first}#{s}000"[0,4]
  end
end

Let's try the tests:

matt@hunin$ ruby soundex2_test.rb
Loaded suite soundex2_test
Started
....
Finished in 0.003817 seconds.

4 tests, 17 assertions, 0 failures, 0 errors

I like this version a lot better. Not only does it work, but it just feels a lot better. There's less special cases -- we aren't checking for special cases, we've cut down on regular expressions and we've also handled the case where an "empty" string exists -- one in which there are no alpha characters. All in all, I think this is a lot cleaner.

So now on to the spellchecker. We'd decided on using a bucket hash to implement the spellchecker -- that is a hash table where when there are duplicate keys the duplicate entries are held in a data structure such as an array or linked list.

Let's write some tests of our spellchecker:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43

require 'test/unit'
require 'crb/soundex_spellcheck'
require 'singleton'
class Factory
  include Singleton
  attr_reader :spellchecker
  def initialize
    @spellchecker = CRB::SoundexSpellcheck.new
  end
end

class SoundexSpellcheckTest < Test::Unit::TestCase
  def test_empty
    assert_equal false, Factory.instance.spellchecker.spell_check(""),
 "Failed empty string check"
  end
  
  def test_cat_lowercase
    assert_equal true, Factory.instance.spellchecker.spell_check("cat"),
 "Failed to find cat"
  end
  
  def test_cat_capitalized
    assert_equal true, Factory.instance.spellchecker.spell_check("Cat"),
 "Failed to find Cat"
  end

  def test_cat_capitalized
    assert_equal true, Factory.instance.spellchecker.spell_check("CAT"),
 "Failed to find CAT"
  end

  def test_cat_misspelled
    assert_equal false, Factory.instance.spellchecker.spell_check("CTA"),
 "Thought CTA was a word."
  end

  def test_multiple_words_at_once
    assert_equal false, Factory.instance.spellchecker.spell_check("This is a test"),
 "Thought 'this is a test' was a word."
  end
end

Since I wrote these tests I've read Dan Manges' Fixing Fixtures with Factory, so these could be cleaned up a bit to simplify, but, that said, let's go on.

Here's the spellchecker implementation -- Ruby's hash and array make the bucket hash very easy to implement. I've added comments to explain what I'm doing:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35

require 'soundex'
require 'crb/word_list'
module CRB
  class SoundexSpellcheck
    # this is an array of the words in our dictionary
    include CRB::WordList

    def initialize
      # Here's our hash
      @dictionary = { }
      # step through the array of words
      wordlist.each do |word|
        # get the soundex value of the word
        key = word.soundex
        # get the bucket; if the bucket does not exist, 
        # create an empty bucket
        bucket = @dictionary[key] || []
        # add the word to the bucket
        bucket << word
        # associate the bucket with the key
        @dictionary[key] = bucket
      end
    end
    
    def spell_check(word="")
      # get the soundex value for the word
      key = word.soundex
      # get the bucket associated with the soundex value
      bucket = @dictionary[key] || []
      # is our word in the bucket (dictionary?)
      bucket.include? word.downcase
    end
  end
end

Pretty simple, eh? Now, the missing module looks something like this; I've truncated it because, well, it's rather large:

1
2
3
4
5
6
7
8
9
10
11
12
13
14

module CRB
  module WordList

    WordList_words = %w[a aaron abaissiez abandon abandoned abase abash abate
                       # ...
                      zenith zephyrs zir zo zodiac zodiacs zone zounds zwagger]

    def wordlist
      WordList_words
    end
  
  end
end

Ok, so why did I do it this way? Well.... a couple of reasons. First, this wasn't so much an exercise in file input/output as for working with data structures. Also I didn't want to have dependencies on a file which may or may not be available. Of course, this makes it difficult to add words to the dictionary. That can be easily remedied with an add_word method.

Comments

Leave a response

Comment