My first solution is as follows:
1 2 3 4 5 6 7 8 9 10 11 |
#!/usr/bin/env ruby b = ('a' .. 'z').to_a open(ARGV[0]).each { |s| sum = 0 s.chomp.downcase.each_byte{|x| sum = sum + b.index(x.chr) + 1} puts s if sum == 100 } |
I didn't remember that "A"[0] returns the ascii value of 'A', so I started out by creating an array. Then I step through and for each byte in the word, I'm checking the array for the value of the letter. There's actually a bug here -- it doesn't handle non-alpha characters cleanly.
Also, this doesn't look to be very clean, elegant or exhibit much ruby-foo. So, let's revise it.
Let's look at version two.
1 2 3 4 |
sum = 0 $_.chomp.downcase.each_byte{|x| sum = sum + x - 96} puts $_ if sum == 100 |
Eeek! That's pretty perl-esque. It was an experiment, of sorts, for doing ruby on the commandline:
1 2 |
ruby -n -e 'n=0;$_.chomp.downcase.each_byte{|x|n=n+x-96};puts $_ if n==100' FILE |
I'm looking at it now and, again, this one suffers from the same issue
as the previous one (not handling non-alpha) -- a simple
gsub(/^[a-z]/,"") might make the difference. But, in a simple, ideal
world, this takes the following amount of time to process 50x a ~190k
file containing 23530 words, dumping the results to /dev/null on my
laptop:
matt@hunin$ ~/bin/run50.sh "./dollarwords.sh voc.txt >/dev/null" 16.27 user 2.02 system 0:18.31 elapsed
Hmmmmm. Well, it runs, aside from the one problem. But, it's not very pretty. Let's try this again. The third time, after all, is the charm. Let's start with some tests!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
require 'test/unit' require 'dollarwords3' class TestDollarwords < Test::Unit::TestCase def test_known_true assert_equal true, "buzzy".dollarword? end def test_known_false assert_equal false, "bee".dollarword? end def test_case_insensitive assert_equal true, "bUzZy".dollarword? assert_equal true, "Buzzy".dollarword? assert_equal true, "BUZZY".dollarword? assert_equal false, "BEE".dollarword? end def test_case_ignore_non_alpha assert_equal true, "b9u 43z$&@&@$&@$$&@$zy".dollarword? end end |
Ok, they're not anything special in the way of tests, but they do test boundary cases, that is, words that we know add to 100, words that do not, case insensitivity, and ignoring non-alpha. We might choose to error out on non-alpha or decide that spaces split words, but there are some words which have spaces in them.
So.... How about some ruby code now?
1 2 3 4 5 6 |
class String def dollarword? self.upcase.scan(/[A-Z]/).inject(0){ |sum,s| sum + s[0] - ?A + 1 } == 100 end end |
That's a mouthful, eh? Let's break it up.
First off, we're extending the String class and adding a new method to
it, which means that we can then type "dog".dollarword? (which would,
of course, be false, since 4 + 7 + 15 = 26).
The self refers to the string in question. In our "dog".dollarword?
example, it would refer to "dog".
The upcase method transforms the characters to their uppercase
values and returns a copy of the string. We haven't modified the
original string. So, "dog" becomes "DOG"
The scan method looks for the pattern passed into it as an argument,
in this case /[A-Z]/ matches any single capital character, and each
time it finds the pattern it places that portion of the string into an
array. This has the effect of turning "DOG" into ["D", "O", "G"] and
returns this array.
The inject method iterates through the array ["D", "O", "G"] element
by element and keeping track of a sum (which happens to be called sum)
we are going to be taking the current element's ascii value (s[0]),
subtracting the value of capital 'A' (?A), and adding one, which gets
added to our previous sum and then carried forward. One potential
"gotcha" with inject. If you don't pass an argument to inject, then
it will use the first element as the accumulator. In this case, if we
had said inject{|sum,s| ...} instead, it would first store "D" as sum
and then attempt to perform arithmetic operations upon it. Which, to
put it mildly, doesn't work very well.
So, at the end of our injection, the value of our sum is returned, which is (as I am recalling) 26, and compared with 100. 26 != 100, so, the entire expression resolves as false, and since that is the last result of the method, that is what is returned.
Let's run our tests:
matt@hunin$ ruby test_dollarwords.rb Loaded suite test_dollarwords Started .... Finished in 0.002836 seconds. 4 tests, 7 assertions, 0 failures, 0 errors
They passed. That's reassuring. That said, let's run this against our test file and see how it compares against the previous version, which, admittedly, was flawed.
Here's the commandline version of it (so as to have *something* along the same lines for our comparison):
1 2 |
ruby -n -e 'puts $_ if $_.upcase.chomp.scan(/[A-Z]/).inject(0){|sum,s|sum + s[0] - ?A + 1} == 100' $1 |
matt@hunin$ ~/bin/run50.sh "./dollarwords3.sh voc.txt >/dev/null" 40.94 user 3.01 system 0:44.01 elapsed
Ouch. That's rather slower. 2.5x slower, to be precise. Now here comes the analysis of the tradeoffs.
The main tradeoffs in terms of costs are speed and maintainability. I'll address maintainability first.
Maintainance is one of the hidden costs of any projects, but can be up to 90% of the cost of a project over its lifetime. Therefore ease of maintenance is likely to be pretty important in terms of tradeoffs.
The second version of the code has the number 96 in it. What's that
number? Why was that number chosen? It happens to be the ASCII value
of 'a', but unless one were familiar with the ASCII table, it would be
difficult to maintain. The third version uses ?A to obtain the ASCII
value of a letter, which to someone familiar with the language is far
more understandable.
From a maintainability standpoint, the third version wins.
We've processed 1176500 words in anywhere from 16 to 40 seconds. How
fast is fast enough? If this were a batch system, unless we had
Sagans of words (Billions and Billions) which had to be processed
daily, then it is likely that the slower version of the code is fast
enough. In a realtime system, the question to ask is how often is
this called? If it is a core component which is called all the time
and where the difference in speed would make a difference, then the
second version might be a good choice. However, in that case, I'd
advocate cleaning up the second version somewhat -- changing 96 to ?a
and untainting the input somehow.
