def convert_umlaut_to_base(input)
  text = input.dup
  %w[áäa ÁÄÅA óöo ÓÖO íi ÍI úüu ÚÜU ée ÉE ßs].each do |set|
    text.gsub!(/[#{set[0..-2]}]/, set[-1..-1])
  end
  text
end
Refactorings
Tj Holowaychuk
March 15, 2009 17:58
Is it even working right at the moment? I got:
puts convert_umlaut_to_base('Éáöiu')
# => aEaaaoiu
# Bit cleaner IMO
def convert_umlaut_to_base input
  %w( áäa ÁÄÅA óöo ÓÖO íi ÍI úüu ÚÜU ée ÉE ßs ).inject input.dup do |input, set|
    input.gsub!(/[#{set[0..-2]}]/, set[-1..-1])
    input
  end
end
Tj Holowaychuk
March 15, 2009 18:01
def convert_umlaut_to_base input
  %w( áäa ÁÄÅA óöo ÓÖO íi ÍI úüu ÚÜU ée ÉE ßs ).inject input.dup do |input, set|
    input.gsub! /[#{set[0..-2]}]/, set[-1..-1]
    input
  end
end
pragmatig
March 15, 2009 19:25
$KCODE was missing... (it works in the application, because it is already set there.)
I adopted your idea. Normally I am no fan of inject since it adds some complexity, but here it seems to be OK.
So that's already a bit better, but I am not yet satisfied, since it is still too complex :(
def convert_umlaut_to_base(input)
  $KCODE='u'
  %w( áäa ÁÄÅA óöo ÓÖO íi ÍI úüu ÚÜU ée ÉE ßs ).inject(input.dup) do |input, set|
    input.gsub(/[#{set[0..-2]}]/, set[-1..-1])
  end
end
Tj Holowaychuk
March 15, 2009 22:20
Nothing wrong with inject; it helps eliminate the ugliness of having memo vars. I don't have much experience working with multiple character sets, but I would imagine that ideally everything in %w() would be removed and defined with some sort of DSL so that others can extend it, maybe something like below (of course not using globals, and naming the methods something less ambiguous):
$KCODE='u'
def convert chars
  ($conversions ||= []) << chars.split
end
convert 'a á ä'
convert 'u ú ü'
def convert_umlaut_to_base input
  $conversions.inject(input.dup) do |input, (to, *from)|
    # gsub rather than gsub!: gsub! returns nil when nothing matches
    input.gsub(/[#{from.join}]/, to)
  end
end
puts convert_umlaut_to_base('äú') # => au
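The same idea without globals might look roughly like the following. This is only a sketch: the AccentMap module and its register / to_base methods are names invented here for illustration, and it assumes Ruby 2.0 or later (UTF-8 source encoding by default, so no $KCODE).

```ruby
# Hypothetical module; the names are illustrative and not from any library.
module AccentMap
  @conversions = []

  # Register one replacement rule: the first word is the target letter,
  # the remaining words are the characters that should map to it.
  def self.register(chars)
    @conversions << chars.split
  end

  # Apply every registered rule in turn and return a new string.
  def self.to_base(input)
    @conversions.inject(input.dup) do |text, (to, *from)|
      text.gsub(/[#{Regexp.escape(from.join)}]/, to)
    end
  end
end

AccentMap.register 'a á ä'
AccentMap.register 'u ú ü'
puts AccentMap.to_base('äú') # => au
```

Keeping the rules in a module-level instance variable means other code can still extend the list through register, without touching a global.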
pragmatig
March 16, 2009 08:48
I really like the "to, *from", I didn't know this trick...
For the time being I won't make it extendable. The general idea is not that users have to add something, since it should include all weird characters.
It would be interesting to have something like this on Array, e.g. Array.sort_alphabetically ...
So my current solution looks like below. Thanks for your help, tj!
def convert_umlaut_to_base(input)
  $KCODE='u'
  %w(aáä AÁÄÅ oóö OÓÖ ií IÍ uúü UÚÜ eé EÉ sß).inject(input.dup) do |input, set|
    to, *from = set.split('')
    input.gsub(/[#{from.join}]/, to)
  end
end
Marc-Andre
March 17, 2009 14:48
There are way more accents than these, like ç, ñ, ê, è, and variations you have not included, like ë.
Here's a nifty way to strip them which will take care of all cases except for your ß. You decompose the code points (so é becomes e+´) and go from there. You could remove all complex code points (like +´, see strip_to_ascii), but then you also strip any complex letters like ß or ø. Instead I build ALL_ACCENTS from an example of all the accents I could find. That list is then used to strip all accents after they've been split from their underlying letters.
I don't speak German, but I believe the ß should be substituted with 'ss', not just a single 's' (see http://en.wikipedia.org/wiki/ß )
Note that there are ligatures to take care of (e.g. œ -> oe, ﬁ -> fi, see http://en.wikipedia.org/wiki/Typographical_ligature ). I don't know if there's a quick way to deal with all of them without listing them all.
Here's the resulting code with both options. Your convert_umlaut_to_base should probably be renamed to convert_to_ascii or something.
class String
  # returns an array of unicode codepoints, in canonical order
  def split_codepoints
    ActiveSupport::Multibyte::Handlers::UTF8Handler.normalize(self, :d).split(//u)
  end
  ALL_ACCENTS = "ùúûüūůűŭũųģėě".split_codepoints.reject { |e| e.length == 1 }
  def strip_accents
    (split_codepoints - ALL_ACCENTS).join
  end
  def strip_to_ascii
    split_codepoints.reject { |e| e.length > 1 }.join
  end
end
# "Måŗç-Λñ∂řé".strip_to_ascii => "Marc-nre"
# "Måŗç-Λñ∂řé".strip_accents => "Marc-Λn∂re"
def convert_umlaut_to_base(input)
  input.to_s.gsub(/ß/, 'ss').strip_accents
end
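For readers on newer Rubies, the decomposition step can probably be done without ActiveSupport. The following is a minimal sketch, assuming Ruby 2.2 or later, where String#unicode_normalize is part of the standard library. The helper names strip_accents_nfd and to_sortable are made up here and are not part of the code above. Instead of a hand-built accent list, it drops every combining mark (\p{Mn}) after decomposition, so base letters such as ß, ø, Λ or ∂ are left alone.

```ruby
# Hypothetical helpers, assuming Ruby >= 2.2 (String#unicode_normalize).
def strip_accents_nfd(input)
  # Canonical decomposition: "é" becomes "e" followed by U+0301.
  decomposed = input.to_s.unicode_normalize(:nfd)
  # Remove the combining marks that were split off from their letters.
  decomposed.gsub(/\p{Mn}/, '')
end

# ß has no decomposition, so it is replaced explicitly, as in the post above.
def to_sortable(input)
  strip_accents_nfd(input.to_s.gsub('ß', 'ss'))
end

puts to_sortable('Éáöiu')  # => Eaoiu
puts to_sortable('Straße') # => Strasse
```

Ligatures such as œ are not touched by this sketch and would still need their own handling.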
Tj Holowaychuk
March 17, 2009 15:30
I was going to name it to_ascii as well, but like I said, I don't know much about the multilingual world. I am sure you could just do some bit shifting or something to account for all of them programmatically.
pragmatig
March 17, 2009 18:33
This looks really interesting @Marc. I thought about something like this, but decided on a quick&dirty&simple solution ;)
Since I am using it for sorting and not for display, these ligatures should not be a big problem. I'll give it a try. I'll build a version that does not enhance String and, if possible, does not rely on ActiveSupport.
And you're both right, convert to ascii is a far better name...
Marc-Andre
March 17, 2009 19:52
@Tj: Bit-shifting? For all 14 accents, with variations of letter + lower/uppercase? Mmmm, I'll bet, say, $20 it's not possible! :-)
@pragmatig: Thanks! You probably won't be able to do the normalization without ActiveSupport. Even Ruby 1.9 doesn't give you a way to normalize strings (yet!)
Tj Holowaychuk
March 17, 2009 21:23
Like I said, I don't know much about the various character encodings :P but usually these things are formulated by a pattern, hence they can be reversed via a logical pattern :P
pragmatig
March 18, 2009 10:10
I put the sorting logic into a small gem that adds sort_alphabetical / sort_alphabetical_by to Enumerable:
http://github.com/grosser/sort_alphabetical
The only new problem I ran into was that Marc's code no longer worked with ActiveSupport 2.3, so I had to do a small if/else...
Hope you like it!
This is used to sort a list alphabetically. Normally ab äb c would sort as ab c äb, so every ä is converted to a so that the correct sort order is preserved.
At first I tried it with String.tr, but this is not Unicode-aware.
I am curious if there is a cleaner solution...
A set, e.g. áäa, is converted to gsub(/[áä]/, 'a').
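One possible answer to the "cleaner solution" question, sketched under the assumption of Ruby 2.0 or later (UTF-8 source by default, and String#gsub accepting a Hash of replacements): build a single lookup table from the sets and do one gsub pass instead of one per set. The names SETS, BASE_FOR, ACCENTED and convert_to_base are invented for this illustration and are not taken from the sort_alphabetical gem.

```ruby
# Illustrative only; assumes Ruby >= 2.0. Each set is "target + sources".
SETS = %w(aáä AÁÄÅ oóö OÓÖ ií IÍ uúü UÚÜ eé EÉ sß)

# Build a lookup table such as { "á" => "a", "ä" => "a", ... }.
BASE_FOR = SETS.each_with_object({}) do |set, map|
  to, *from = set.split('')
  from.each { |char| map[char] = to }
end

# One regexp that matches any character present in the table.
ACCENTED = Regexp.union(BASE_FOR.keys)

def convert_to_base(input)
  # With a Hash argument, gsub replaces each match by its table entry.
  input.gsub(ACCENTED, BASE_FOR)
end

words = %w(c äb ab)
p words.sort                                               # => ["ab", "c", "äb"]
p words.sort_by { |word| [convert_to_base(word), word] }   # => ["ab", "äb", "c"]
```

The second element of the sort key is only a tie-breaker, so that "ab" and "äb" (which both convert to "ab") come out in a predictable order.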