Created
June 23, 2011 13:33
Unaccent method for a Ruby string
class String
  UNACCENT_HASH = {
    'A' => 'ÀÁÂÃÄÅĀĂǍẠẢẤẦẨẪẬẮẰẲẴẶǺĄ',
    'a' => 'àáâãäåāăǎạảấầẩẫậắằẳẵặǻą',
    'C' => 'ÇĆĈĊČ',
    'c' => 'çćĉċč',
    'D' => 'ÐĎĐ',
    'd' => 'ďđ',
    'E' => 'ÈÉÊËĒĔĖĘĚẸẺẼẾỀỂỄỆ',
    'e' => 'èéêëēĕėęěẹẻẽếềểễệ',
    'G' => 'ĜĞĠĢ',
    'g' => 'ĝğġģ',
    'H' => 'ĤĦ',
    'h' => 'ĥħ',
    'I' => 'ÌÍÎÏĨĪĬĮİǏỈỊ',
    'J' => 'Ĵ',
    'j' => 'ĵ',
    'K' => 'Ķ',
    'k' => 'ķ',
    'L' => 'ĹĻĽĿŁ',
    'l' => 'ĺļľŀł',
    'N' => 'ÑŃŅŇ',
    'n' => 'ñńņňŉ',
    'O' => 'ÒÓÔÕÖØŌŎŐƠǑǾỌỎỐỒỔỖỘỚỜỞỠỢ',
    'o' => 'òóôõöøōŏőơǒǿọỏốồổỗộớờởỡợð',
    'R' => 'ŔŖŘ',
    'r' => 'ŕŗř',
    'S' => 'ŚŜŞŠ',
    's' => 'śŝşš',
    'T' => 'ŢŤŦ',
    't' => 'ţťŧ',
    'U' => 'ÙÚÛÜŨŪŬŮŰŲƯǓǕǗǙǛỤỦỨỪỬỮỰ',
    'u' => 'ùúûüũūŭůűųưǔǖǘǚǜụủứừửữự',
    'W' => 'ŴẀẂẄ',
    'w' => 'ŵẁẃẅ',
    'Y' => 'ÝŶŸỲỸỶỴ',
    'y' => 'ýÿŷỹỵỷỳ',
    'Z' => 'ŹŻŽ',
    'z' => 'źżž',
    # Ligatures
    'AE' => 'Æ',
    'ae' => 'æ',
    'OE' => 'Œ',
    'oe' => 'œ'
  }

  # Returns a copy of the string with each accented character replaced
  # by its plain counterpart from UNACCENT_HASH.
  def unaccent
    _str = self.clone
    UNACCENT_HASH.each do |k, v|
      # "[#{v}]" is a character class matching any one of the accented characters
      _str.gsub!(/[#{v}]/, k)
    end
    _str
  end
end
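To see the replacement loop in action without patching `String`, here is a minimal sketch that applies the same `gsub!` approach to a much smaller table. The names `SAMPLE_UNACCENT` and `unaccent_sample` are hypothetical and exist only for this illustration; the table is a reduced stand-in, not the full `UNACCENT_HASH`.

```ruby
# Reduced stand-in for UNACCENT_HASH (hypothetical, for illustration only).
SAMPLE_UNACCENT = {
  'a'  => 'àáâãäå',
  'c'  => 'ç',
  'e'  => 'èéêë',
  'u'  => 'ùúûü',
  'oe' => 'œ'
}

# Same loop as the unaccent method, written as a plain method.
def unaccent_sample(str)
  result = str.dup # work on a copy so the caller's string is untouched
  SAMPLE_UNACCENT.each do |plain, accented|
    # Each value becomes a character class; any match is replaced by the key.
    result.gsub!(/[#{accented}]/, plain)
  end
  result
end

puts unaccent_sample('français') # prints "francais"
puts unaccent_sample('déjà vu')  # prints "deja vu"
puts unaccent_sample('cœur')     # prints "coeur"
```

Because the keys are ordinary strings, a ligature such as `œ` can expand to two characters, which a one-to-one method like `String#tr` could not do.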
# old unaccent method (without regexp)
# Note: Hash#index (Ruby 1.8) is called Hash#key in Ruby 1.9 and later,
# and Array#to_s no longer concatenates there, so this uses key and join.
# ACCENTS is assumed to be a table shaped like UNACCENT_HASH.
def self.unaccent(str)
  new_chars = []
  str.each_char do |cstr|
    replaced = false
    ACCENTS.values.each do |accent_string|
      if accent_string.include?(cstr.to_s)
        new_chars << ACCENTS.key(accent_string)
        replaced = true
      end
    end
    new_chars << cstr unless replaced
  end
  new_chars.join
end
ð is definitely not an accented o. It is the lowercase variant of Ð, a letter used in Icelandic and Faroese. See http://en.wikipedia.org/wiki/%C3%90.
Maybe have a look at unac; it includes an application called unaccent. Here is the home page: http://www.nongnu.org/unac/unaccent-man1.en.html
The lowercase i entry is missing.
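The two table issues raised in the comments could be addressed roughly as follows. This is only a sketch: `CORRECTED_ENTRIES` and `unaccent_with` are hypothetical names, only the affected rows are shown, and mapping ð to `d` is an assumption chosen to mirror the existing `'D' => 'Ð…'` row rather than an established transliteration rule.

```ruby
# Hypothetical corrected rows; the rest of the table would stay unchanged.
CORRECTED_ENTRIES = {
  'd' => 'ďđð',          # assumption: ð moved here from the 'o' row
  'i' => 'ìíîïĩīĭįǐỉị',  # lowercase counterparts of the 'I' row
  'o' => 'òóôõöøōŏőơǒǿọỏốồổỗộớờởỡợ' # same as before, without ð
}

# Applies a replacement table to a copy of str, one row at a time.
def unaccent_with(table, str)
  table.reduce(str.dup) do |acc, (plain, accented)|
    acc.gsub(/[#{accented}]/, plain)
  end
end

puts unaccent_with(CORRECTED_ENTRIES, 'Reykjavík') # prints "Reykjavik"
puts unaccent_with(CORRECTED_ENTRIES, 'góður')     # prints "godur"
```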