
This is used to sort a list alphabetically. A plain sort would order ab äb c as ab c äb, so every ä is converted to a (and likewise for the other accented characters) so that the correct sort order is preserved.

At first I tried String#tr, but it is not Unicode-aware.
I am curious whether there is a cleaner solution...

Each set, e.g. áäa, is converted to gsub(/[áä]/, 'a').

def convert_umlaut_to_base(input)
  text = input.dup
  %w[áäa ÁÄÅA óöo ÓÖO íi ÍI úüu ÚÜU ée ÉE ßs].each do |set|
    text.gsub!(/[#{set[0..-2]}]/, set[-1..-1])
  end
  text
end
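
A minimal sketch of the sorting use case described above, assuming Ruby 1.9 or later, where strings carry an encoding and String#tr works on whole characters. The names ACCENTED, BASE and fold_for_sort are hypothetical and not part of the original post. The original string is used as a tie-breaker so that the result is deterministic when two entries fold to the same key.

```ruby
# Hypothetical helper, assuming Ruby 1.9+ (encoding-aware strings).
# Each character in ACCENTED maps to the character at the same position in BASE.
ACCENTED = 'áäÁÄÅóöÓÖíÍúüÚÜéÉß'
BASE     = 'aaAAAooOOiIuuUUeEs'

def fold_for_sort(input)
  input.tr(ACCENTED, BASE)
end

words = %w[äb c ab]

# Sort by the folded form first, then by the original string as a tie-breaker.
sorted = words.sort_by { |w| [fold_for_sort(w), w] }
# => ["ab", "äb", "c"]
```

A plain words.sort compares bytes and therefore places äb after c.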

Refactorings


Tj Holowaychuk

March 15, 2009 17:58

Is it even working right at the moment? I got:

puts convert_umlaut_to_base('Éáöiu')
# => aEaaaoiu


# Bit cleaner IMO

def convert_umlaut_to_base input
  %w( áäa ÁÄÅA óöo ÓÖO íi ÍI úüu ÚÜU ée ÉE ßs ).inject input.dup do |input, set|
    input.gsub!(/[#{set[0..-2]}]/,set[-1..-1])
    input
  end
end

Tj Holowaychuk

March 15, 2009 18:01

def convert_umlaut_to_base input
  %w( áäa ÁÄÅA óöo ÓÖO íi ÍI úüu ÚÜU ée ÉE ßs ).inject input.dup do |input, set|
    input.gsub! /[#{set[0..-2]}]/, set[-1..-1] 
    input
  end
end

pragmatig

March 15, 2009 19:25

$KCODE was missing... (It works in the application because it is already set there.)

I adopted your idea. Normally I am no fan of inject since it adds some complexity, but here it seems to be OK.
So that's already a bit better, but I am not yet satisfied, since it is still too complex :(

def convert_umlaut_to_base(input)
  $KCODE='u'
  %w( áäa ÁÄÅA óöo ÓÖO íi ÍI úüu ÚÜU ée ÉE ßs ).inject(input.dup) do |input, set|
    input.gsub(/[#{set[0..-2]}]/, set[-1..-1])
  end
end

Tj Holowaychuk

March 15, 2009 22:20

Nothing wrong with inject; it helps eliminate the ugliness of memo variables. I don't have much experience working with multiple character sets, but I would imagine that ideally everything in %w() would be removed and defined with some sort of DSL so that others can extend it, maybe something like the code below (of course without globals, and with less ambiguous method names).

$KCODE='u'

def convert chars
  ($conversions ||= []) << chars.split
end

convert 'a á ä'
convert 'u ú ü'

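def convert_umlaut_to_base input
  $conversions.inject(input.dup) do |text, (to, *from)|
    # gsub rather than gsub!: gsub! returns nil when nothing matches,
    # which would turn the memo into nil for the next iteration.
    # from.join builds the character class without relying on Array#to_s.
    text.gsub(/[#{from.join}]/, to)
  end
end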

puts convert_umlaut_to_base('äú') # => au
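
The "not using globals" variant mentioned above could be sketched roughly as follows. This is only an illustration: the module name AccentFolding and the methods register and fold are hypothetical, and Regexp.union is used in place of the interpolated character class so that an entry with no variants simply matches nothing.

```ruby
# Hypothetical registry; module and method names are illustrative only.
module AccentFolding
  @conversions = []

  class << self
    # Register a base character followed by the variants that fold into it,
    # e.g. register('a á ä').
    def register(chars)
      to, *from = chars.split
      @conversions << [to, Regexp.union(from)]
    end

    # Apply every registered conversion in turn and return a new string.
    def fold(input)
      @conversions.inject(input) do |text, (to, pattern)|
        text.gsub(pattern, to)
      end
    end
  end
end

AccentFolding.register 'a á ä'
AccentFolding.register 'u ú ü'

puts AccentFolding.fold('äú') # => au
```

Keeping the table in a module-level instance variable leaves the extension point open while avoiding $conversions.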

pragmatig

March 16, 2009 08:48

I really like the "to, *from"; I didn't know this trick...
For the time being I won't make it extendable. The general idea is not that users have to add something, since it should include all weird characters.
It would be interesting to have something like this on Array, e.g. Array#sort_alphabetically ...

So my current solution looks like the code below. Thanks for your help, tj!

def convert_umlaut_to_base(input)
  $KCODE='u'
  %w(aáä AÁÄÅ oóö OÓÖ ií IÍ uúü UÚÜ eé EÉ sß).inject(input.dup) do |input, set|
    to, *from = set.split('')
    # join explicitly: interpolating the array relies on Ruby 1.8's Array#to_s
    input.gsub(/[#{from.join}]/, to)
  end
end

Marc-Andre

March 17, 2009 14:48

There are way more accents than these, like ç, ñ, ê, è, and variations you have not included, like ë.
Here's a nifty way to strip them that takes care of all cases except your ß. You decompose the code points (so é becomes e + ´) and go from there. You could remove all complex code points (like the ´; see strip_to_ascii), but then you also strip complex letters like ß or ø. Instead, I build ALL_ACCENTS from an example of all the accents I could find. That list is then used to strip the accents after they have been split from their underlying letters.

I don't speak German, but I believe the ß should be substituted with 'ss', not just a single 's' (see http://en.wikipedia.org/wiki/ß ).

Note that there are also ligatures to take care of (e.g. œ -> oe, ﬁ -> fi; see http://en.wikipedia.org/wiki/Typographical_ligature ). I don't know if there's a quick way to deal with them without listing them all.

Here's the resulting code with both options. Your convert_umlaut_to_base should probably be renamed to convert_to_ascii or something.

class String
  # returns an array of unicode codepoints, in canonical order
  def split_codepoints
    ActiveSupport::Multibyte::Handlers::UTF8Handler.normalize(self,:d).split(//u)
  end
  
  ALL_ACCENTS = "ùúûüūůűŭũųģėě".split_codepoints.reject { |e| e.length == 1 }

  def strip_accents
    (split_codepoints - ALL_ACCENTS).join
  end
  
  def strip_to_ascii
    split_codepoints.reject { |e| e.length > 1 }.join
  end
end

# "Måŗç-Λñ∂řé".strip_to_ascii  => "Marc-nre"
# "Måŗç-Λñ∂řé".strip_accents  => "Marc-Λn∂re"

def convert_umlaut_to_base(input)
  input.to_s.gsub(/ß/, 'ss').strip_accents
end

Tj Holowaychuk

March 17, 2009 15:30

I was going to name it to_ascii as well, but like I said, I don't know much about the multilingual world. I am sure you could just do some bit shifting or something to account for all of them programmatically.


pragmatig

March 17, 2009 18:33

This looks really interesting @Marc. I thought about something like this, but decided on a quick & dirty & simple solution ;)
Since I am using it for sorting and not for display, these ligatures should not be a big problem. I'll give it a try and build a version that does not extend String and, if possible, does not rely on ActiveSupport.

And you're both right, convert to ascii is a far better name...


Marc-Andre

March 17, 2009 19:52

@Tj: Bit-shifting? For all 14 accents, with variations of letter + lower/uppercase? Mmmm, I'll bet, say, $20 it's not possible! :-)

@pragmatig: Thanks! You probably won't be able to do the normalization without ActiveSupport. Even Ruby 1.9 doesn't give you a way to normalize strings (yet!)
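
For readers on a later Ruby: String#unicode_normalize was added to the core library in Ruby 2.2, so the decompose-then-strip approach can be sketched without ActiveSupport. This is a sketch under that assumption, not code from the thread. The method name fold_to_base is hypothetical, and dropping every character in the Unicode category Mn (nonspacing mark) stands in for the hand-built ALL_ACCENTS list.

```ruby
# Hypothetical helper, assuming Ruby 2.2+ and UTF-8 input.
def fold_to_base(input)
  input.to_s
       .gsub('ß', 'ss')          # ß has no decomposition, so handle it explicitly
       .unicode_normalize(:nfd)  # split é into e + combining acute accent
       .gsub(/\p{Mn}/, '')       # drop the combining marks, keep base letters
end

fold_to_base('Måŗç-Λñ∂řé') # => "Marc-Λn∂re"
fold_to_base('Straße')     # => "Strasse"
```

Letters without a canonical decomposition, such as ø, pass through unchanged, which matches the strip_accents behaviour described earlier.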


Tj Holowaychuk

March 17, 2009 21:23

Like I said, I don't know much about various character encodings :P but usually these things follow a pattern, and hence can be reversed via a logical pattern :P


pragmatig

March 18, 2009 10:10

I put the sorting logic into a small gem that adds sort_alphabetical / sort_alphabetical_by to Enumerable:
http://github.com/grosser/sort_alphabetical

The only new problem I ran into was that Marc's code no longer worked with ActiveSupport 2.3, so I had to add a small if/else...
Hope you like it!
