Honing in on Dollar Words and Hidden Costs of Code

Posted by Matt Williams

My first solution is as follows:

1
2
3
4
5
6
7
8
9
10
11

#!/usr/bin/env ruby

b = ('a' .. 'z').to_a

open(ARGV[0]).each { |s|
  sum = 0
  s.chomp.downcase.each_byte{|x| sum = sum + b.index(x.chr) + 1}
  puts s if sum == 100
}

I didn't remember that "A"[0] returns the ascii value of 'A', so I started out by creating an array. Then I step through and for each byte in the word, I'm checking the array for the value of the letter. There's actually a bug here -- it doesn't handle non-alpha characters cleanly.

Also, this doesn't look to be very clean, elegant or exhibit much ruby-foo. So, let's revise it.

Let's look at version two.

1
2
3
4

sum = 0
$_.chomp.downcase.each_byte{|x| sum = sum + x - 96}
puts $_ if sum == 100

Eeek! That's pretty perl-esque. It was an experiment, of sorts, for doing ruby on the commandline:

1
2

ruby -n -e 'n=0;$_.chomp.downcase.each_byte{|x|n=n+x-96};puts $_ if n==100' FILE

I'm looking at it now and, again, this one suffers from the same issue as the previous one (not handling non-alpha) -- a simple gsub(/^[a-z]/,"") might make the difference. But, in a simple, ideal world, this takes the following amount of time to process 50x a ~190k file containing 23530 words, dumping the results to /dev/null on my laptop:

matt@hunin$ ~/bin/run50.sh "./dollarwords.sh voc.txt >/dev/null"
16.27 user 
2.02 system 
0:18.31 elapsed 

Hmmmmm. Well, it runs, aside from the one problem. But, it's not very pretty. Let's try this again. The third time, after all, is the charm. Let's start with some tests!

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24

require 'test/unit'
require 'dollarwords3'

class TestDollarwords < Test::Unit::TestCase
  def test_known_true
    assert_equal true, "buzzy".dollarword?
  end
  
  def test_known_false
    assert_equal false, "bee".dollarword?
  end
  
  def test_case_insensitive
    assert_equal true, "bUzZy".dollarword?
    assert_equal true, "Buzzy".dollarword?
    assert_equal true, "BUZZY".dollarword?
    assert_equal false, "BEE".dollarword?
  end
  
  def test_case_ignore_non_alpha
    assert_equal true, "b9u 43z$&@&@$&@$$&@$zy".dollarword?
  end
end

Ok, they're not anything special in the way of tests, but they do test boundary cases, that is, words that we know add to 100, words that do not, case insensitivity, and ignoring non-alpha. We might choose to error out on non-alpha or decide that spaces split words, but there are some words which have spaces in them.

So.... How about some ruby code now?

1
2
3
4
5
6

class String
  def dollarword?
    self.upcase.scan(/[A-Z]/).inject(0){ |sum,s| sum + s[0] - ?A + 1 } == 100
  end
end

That's a mouthful, eh? Let's break it up.

First off, we're extending the String class and adding a new method to it, which means that we can then type "dog".dollarword? (which would, of course, be false, since 4 + 7 + 15 = 26).

The self refers to the string in question. In our "dog".dollarword? example, it would refer to "dog".

The upcase method transforms the characters to their uppercase values and returns a copy of the string. We haven't modified the original string. So, "dog" becomes "DOG"

The scan method looks for the pattern passed into it as an argument, in this case /[A-Z]/ matches any single capital character, and each time it finds the pattern it places that portion of the string into an array. This has the effect of turning "DOG" into ["D", "O", "G"] and returns this array.

The inject method iterates through the array ["D", "O", "G"] element by element and keeping track of a sum (which happens to be called sum) we are going to be taking the current element's ascii value (s[0]), subtracting the value of capital 'A' (?A), and adding one, which gets added to our previous sum and then carried forward. One potential "gotcha" with inject. If you don't pass an argument to inject, then it will use the first element as the accumulator. In this case, if we had said inject{|sum,s| ...} instead, it would first store "D" as sum and then attempt to perform arithmetic operations upon it. Which, to put it mildly, doesn't work very well.

So, at the end of our injection, the value of our sum is returned, which is (as I am recalling) 26, and compared with 100. 26 != 100, so, the entire expression resolves as false, and since that is the last result of the method, that is what is returned.

Let's run our tests:

matt@hunin$  ruby test_dollarwords.rb 
Loaded suite test_dollarwords
Started
....
Finished in 0.002836 seconds.

4 tests, 7 assertions, 0 failures, 0 errors

They passed. That's reassuring. That said, let's run this against our test file and see how it compares against the previous version, which, admittedly, was flawed.

Here's the commandline version of it (so as to have *something* along the same lines for our comparison):

1
2

ruby -n -e 'puts $_ if $_.upcase.chomp.scan(/[A-Z]/).inject(0){|sum,s|sum + s[0] - ?A + 1} ==  100' $1 
matt@hunin$ ~/bin/run50.sh "./dollarwords3.sh voc.txt >/dev/null"
40.94 user 
3.01 system 
0:44.01 elapsed

Ouch. That's rather slower. 2.5x slower, to be precise. Now here comes the analysis of the tradeoffs.

The main tradeoffs in terms of costs are speed and maintainability. I'll address maintainability first.

Maintainance is one of the hidden costs of any projects, but can be up to 90% of the cost of a project over its lifetime. Therefore ease of maintenance is likely to be pretty important in terms of tradeoffs.

The second version of the code has the number 96 in it. What's that number? Why was that number chosen? It happens to be the ASCII value of 'a', but unless one were familiar with the ASCII table, it would be difficult to maintain. The third version uses ?A to obtain the ASCII value of a letter, which to someone familiar with the language is far more understandable.

From a maintainability standpoint, the third version wins.

We've processed 1176500 words in anywhere from 16 to 40 seconds. How fast is fast enough? If this were a batch system, unless we had Sagans of words (Billions and Billions) which had to be processed daily, then it is likely that the slower version of the code is fast enough. In a realtime system, the question to ask is how often is this called? If it is a core component which is called all the time and where the difference in speed would make a difference, then the second version might be a good choice. However, in that case, I'd advocate cleaning up the second version somewhat -- changing 96 to ?a and untainting the input somehow.

Comments

Leave a response