data = data.gsub("\xA0", " ")
data = data.gsub("\x99", "(TM)")
data = data.gsub("\x97", "-")
data = data.gsub(/\xE9/, "e")
data = data.gsub(/\xBD/, "1/2")
data = data.gsub(/\xBE/, "3/4")
data = data.gsub(/\xBC/, "1/4")
# Continue with another 15 search/replace pairs
Refactorings
No refactoring yet !
chrismo
August 27, 2008, August 27, 2008 03:40, permalink
I took a stab at 'optimizing' this, by only going through the data string once - but this new version performs terribly - even just going over it once takes much longer than your version 200 times in a row -- probably because the gsub method would call into C code, whereas my new version is a bunch of Ruby calls.
def process_orig(data)
data = data.gsub("\xA0", " ")
data = data.gsub("\x99", "(TM)")
data = data.gsub("\x97", "-")
data = data.gsub(/\xE9/, "e")
data = data.gsub(/\xBD/, "1/2")
data = data.gsub(/\xBE/, "3/4")
data = data.gsub(/\xBC/, "1/4")
return data
end
def process_one_pass(data)
subs = {
"\xA0" => " ",
"\x99" => "(TM)",
"\x97" => "-",
"\xE9" => "e",
"\xBD" => "1/2",
"\xBE" => "3/4",
"\xBC" => "1/4",
}
result = ""
data.each_byte { |ch|
append = nil
subs.each_pair { |find, repl|
append = repl if (ch.chr == find)
}
result << ((append.nil?) ? ch : append)
}
return result
end
if __FILE__ == $0
require 'profile'
require 'test/unit'
class TestProcess < Test::Unit::TestCase
def test_process_orig
filler = "*" * 10000
(1..200).each do
assert_equal(" (TM)-e1/23/41/4" + filler, process_orig("\xA0\x99\x97\xE9\xBD\xBE\xBC" + filler))
end
end
def test_process_one_pass
filler = "*" * 10000
(1..1).each do
assert_equal(" (TM)-e1/23/41/4" + filler, process_one_pass("\xA0\x99\x97\xE9\xBD\xBE\xBC" + filler))
end
end
end
end
chrismo
August 27, 2008, August 27, 2008 03:41, permalink
Probably then just a version that, as you said, puts the search/replace values into a hash and calls gsub in a loop would be a bit cleaner.
Adam
August 27, 2008, August 27, 2008 04:07, permalink
Why not use ActiveSupport::Multibyte::Chars instead?
data.chars.normalize!
John Sietsma
August 27, 2008, August 27, 2008 05:29, permalink
Use the speed of the dictionary hash lookup rather then an incremental search.
Your code:
Finished in 32.422 seconds.
With these changes:
Finished in 6.25 seconds.
result = ""
data.each_byte { |ch|
result << ( subs[ch.chr] || ch )
}
return result
kenr
August 27, 2008, August 27, 2008 11:16, permalink
Thanks Guys!
The multibyte characters is a good idea, but I was trying to avoid that because the data is being indexed for text searching. I think we will try processing each character/byte with the hash.
Adam
August 27, 2008, August 27, 2008 13:42, permalink
Perhaps I wasn't clear with my previous post. The normalize method does the same thing you are doing with gsub.
"\xBD + \xBC".chars.normalize.to_s => "1/2 + 1/4"
kenr
August 27, 2008, August 27, 2008 16:01, permalink
Thanks Adam. I should have read your post more carefully and checked the API.
chrismo
August 27, 2008, August 27, 2008 21:21, permalink
Thanks, John - I knew I was being stupid when I got to that portion. However, this version still performs much worse than just looping over gsubs.
John Sietsma
August 28, 2008, August 28, 2008 01:45, permalink
It depends what you're after kenr. Do you want to:
- Transform the escape string into an ascii representation. Your gsub method does this. "\xBD" becomes "1/2".
- Transform the escape string into utf-8. normalize does this. "\xBD" becomes the byte 0xBD.
btw, the normalise function takes 60.997 seconds to finish the test!
kenr
August 28, 2008, August 28, 2008 16:45, permalink
The normalizing is really what we want, but there is a limited set of input characters that need to be fixed. Here is what my current code looks like:
@@lookup = {}
@@lookup[0x92] = "'"
@@lookup[0x91] = "'"
@@lookup[0x96] = "-"
@@lookup[0xAE] = "(R)"
@@lookup[0x99] = "(TM)"
@@lookup[0xBD] = "1/2"
@@lookup[0xBE] = "3/4"
@@lookup[0xBC] = "1/4"
@@lookup[0xB0] = " degree "
@@lookup[0xBA] = " degrees "
# etc
def fix_text data
updated = ""
(0..(data.length)).each { |i| c = data[i] || " "; updated << (@@lookup[c] || c) }
end
Adam
August 28, 2008, August 28, 2008 18:50, permalink
class Normalizer < Struct.new(:string)
REPLACEMENTS = {
0x92 => "'",
0x91 => "'",
0x96 => "-",
0xAE => "(R)",
0x99 => "(TM)",
0xBD => "1/2",
0xBE => "3/4",
0xBC => "1/4",
0xB0 => " degree ",
0xBA => " degrees "
}
def normalize
string.enum_for(:each_byte).map { |char| REPLACEMENTS[char] || char.chr }.join
end
def normalize!
string.replace(normalize)
end
end
# Normalizer.new(data).normalize!
I could put the search/replace values into a hash and call gsub in a loop. But is there a more efficient way?