Created
November 7, 2011 14:05
Regex for Japanese
Regex for matching ALL Japanese common & uncommon Kanji (4e00 – 9faf) ~ The Big Kahuna!
([一-龯])
Regex for matching Hiragana or Katakana
([ぁ-んァ-ン])
Regex for matching characters that are neither Hiragana nor Katakana
([^ぁ-んァ-ン])
Regex for matching Hiragana or Katakana or basic punctuation (、。’)
([ぁ-んァ-ン、。’])
Regex for matching Hiragana or Katakana and a few other characters (!:/)
([ぁ-んァ-ン!:/])
Regex for matching Hiragana
([ぁ-ん])
Regex for matching full-width Katakana (zenkaku 全角)
([ァ-ン])
Regex for matching half-width Katakana (hankaku 半角)
([ｧ-ﾝﾞﾟ])
Regex for matching full-width Numbers (zenkaku 全角)
([０-９])
Regex for matching full-width Letters (zenkaku 全角)
([Ａ-Ｚａ-ｚ])
Regex for matching Hiragana codespace characters (includes non phonetic characters)
([ぁ-ゞ])
Regex for matching full-width (zenkaku) Katakana codespace characters (includes non phonetic characters)
([ァ-ヶ])
Regex for matching half-width (hankaku) Katakana codespace characters (this is an old character set so the order is inconsistent with the hiragana)
([ｦ-ﾟ])
Regex for matching Japanese Post Codes
/^\d{3}-\d{4}$/
/^\d{3}-\d{4}$|^\d{3}-\d{2}$|^\d{3}$/
Regex for matching Japanese mobile phone numbers (keitai bangou)
/^\d{3}-\d{4}-\d{4}$|^\d{11}$/
/^0\d0-\d{4}-\d{4}$/
Regex for matching Japanese fixed line phone numbers
/^[0-9-]{6,9}$|^[0-9-]{12}$/
/^\d{1,4}-\d{4}$|^\d{2,5}-\d{1,4}-\d{4}$/
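As a rough illustration of how the post code and mobile number patterns might be used, here is a minimal Python sketch. It assumes the patterns are written with a backslash as the escape character (the yen sign is how a backslash is displayed in some Japanese fonts and encodings). The function and constant names are hypothetical and not part of the list above. `re.ASCII` is an added assumption: without it, Python's `\d` also accepts full-width digits such as `１`.

```python
import re

# Strict 7-digit post code (NNN-NNNN) and the 0N0-NNNN-NNNN mobile form.
# re.ASCII restricts \d to 0-9; this flag is an assumption, not part of the
# original patterns. The names below are illustrative only.
POSTAL_CODE = re.compile(r"\d{3}-\d{4}", re.ASCII)
MOBILE_NUMBER = re.compile(r"0\d0-\d{4}-\d{4}", re.ASCII)


def is_postal_code(text: str) -> bool:
    """Return True if the whole string looks like NNN-NNNN."""
    # fullmatch plays the role of the ^...$ anchors in the original patterns.
    return POSTAL_CODE.fullmatch(text) is not None


def is_mobile_number(text: str) -> bool:
    """Return True if the whole string looks like 0N0-NNNN-NNNN."""
    return MOBILE_NUMBER.fullmatch(text) is not None
```

Using `fullmatch` rather than `^...$` also rejects a trailing newline, which `$` would otherwise tolerate in Python.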
@cb372, your list comes close to covering all the kana, but a few characters are still missing. You got 「ゞ」 but missed 「ゝ」 and 「ゟ」, and a few others. I believe this would cover all Hiragana and Katakana separately:
Hiragana = [ぁ-ゖ゛-ゟー]
Katakana = [゠-ヿ]
Combined Hiragana & Katakana would be:
Hiragana+Katakana = [ぁ-ゖ゛-ゟ゠-ヿ]
I used the above hiragana+katakana regex to validate the kana portions of the downloadable version of JMDICT and can confirm that apart from a few errors in the JMDICT data, the kana validation works.
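A check along those lines could be sketched in Python as follows. This is not the script used for the JMDICT validation, only a minimal example under the assumption that each kana field is tested as a whole string. The ranges are spelled as code point escapes so they survive copy and paste; the function names are hypothetical.

```python
import re

# Hiragana = [ぁ-ゖ゛-ゟー]  (U+3041-U+3096, U+309B-U+309F, plus U+30FC)
HIRAGANA = re.compile(r"[\u3041-\u3096\u309B-\u309F\u30FC]+")

# Hiragana+Katakana = [ぁ-ゖ゛-ゟ゠-ヿ]  (adds the Katakana block U+30A0-U+30FF)
KANA = re.compile(r"[\u3041-\u3096\u309B-\u309F\u30A0-\u30FF]+")


def is_hiragana(text: str) -> bool:
    """Return True if every character falls in the Hiragana class above."""
    return HIRAGANA.fullmatch(text) is not None


def is_kana(text: str) -> bool:
    """Return True if every character is Hiragana or Katakana."""
    return KANA.fullmatch(text) is not None
```

Both functions return False for the empty string, because the patterns require at least one character.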
To be fair, those kanji are extremely rare and effectively unused (they do not show up in dictionaries or in Rikaichan-like extensions), and 99.99% of Japanese people would not know them:
https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_Extension_B
Now you can match them with: [𠀀-𪛟]
and to match everything you would simply do: [𠀀-𪛟]|[一-龯]
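For what it is worth, here is a small Python sketch of the combined match. It assumes the two ranges correspond to U+4E00 to U+9FAF and U+20000 to U+2A6DF, and merges them into one character class, which should be equivalent to the alternation. The `find_kanji` name is hypothetical. Python 3 matches by code point, so the Extension B range works as written; engines that operate on UTF-16 code units (for example JavaScript without the `u` flag) would likely need extra care.

```python
import re

# [一-龯] is U+4E00-U+9FAF; [𠀀-𪛟] is assumed to be U+20000-U+2A6DF
# (CJK Unified Ideographs Extension B). Escapes avoid copy/paste damage.
KANJI = re.compile(r"[\u4E00-\u9FAF\U00020000-\U0002A6DF]")


def find_kanji(text: str) -> list[str]:
    """Return every kanji in text, in order of appearance."""
    return KANJI.findall(text)
```

For example, `find_kanji("日本語のテスト")` keeps only the three kanji and drops the kana.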