
@KL-7
Created July 11, 2012 20:46
TwitterCLDR notes.

Tailoring specs results (diff with ICU)

  1. JA:
tw:  failures: [["x ", "x"], ["X ", "X"], ["xゞx", "xヽ"]]
icu: failures: [["x ", "x"], ["X ", "X"]]

The character 'ゞ' (code point 0x309E) is not in NFD: its normalized form is 0x309D 0x3099. The FCE table, however, has an entry for the non-normalized form of this string: 309E; [0E 25, 05, 05][, DA 95, 05]. Since all strings are normalized first, we never use this entry and instead build the collation elements for this character from the CEs for 0x309D and 0x3099, which are [0E 25, 05, 05] and [, DA 95, 05]. That causes no issue in the default locale, because the results are identical. But when 'ゝ' (code point 0x309D) is tailored from [0E 25, 05, 05] to [0E 29, 05, 05] in the JA locale, we get the wrong collation elements [0E 29, 05, 05][, DA 95, 05] for 'ゞ'.
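The mismatch can be reproduced with a minimal Python sketch. It is not TwitterCLDR code: the tables below are toy stand-ins that hold only the three entries quoted above, the names `DEFAULT_TABLE`, `JA_TAILORING`, and `collation_elements` are hypothetical, and the per-code-point lookup is a simplification of a real collation element lookup.

```python
import unicodedata

# Toy stand-in for the FCE table, holding only the entries quoted above.
# Keys are code point sequences; values are lists of collation elements.
DEFAULT_TABLE = {
    "\u309D": ["[0E 25, 05, 05]"],
    "\u3099": ["[, DA 95, 05]"],
    "\u309E": ["[0E 25, 05, 05]", "[, DA 95, 05]"],
}

# Hypothetical JA tailoring: only U+309D is overridden.
JA_TAILORING = {"\u309D": ["[0E 29, 05, 05]"]}


def collation_elements(text, table):
    """Normalize to NFD, then look up each code point on its own.

    Simplification: a real implementation matches the longest sequence
    present in the table rather than single code points.
    """
    elements = []
    for ch in unicodedata.normalize("NFD", text):
        elements.extend(table[ch])
    return elements


ja_table = {**DEFAULT_TABLE, **JA_TAILORING}

# U+309E decomposes to U+309D U+3099, so its own table entry is never consulted.
print(collation_elements("\u309E", DEFAULT_TABLE))  # same as the 309E entry
print(collation_elements("\u309E", ja_table))       # picks up the tailored U+309D
print(ja_table["\u309E"])                           # untailored entry, unused
```

In the default table the composed and decomposed paths agree, so the unused 309E entry goes unnoticed; once U+309D is tailored, the two paths diverge.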

This is only one test failure, but in practice there might be more cases like it. The problem is that the FCE table contains non-normalized code points, and because we normalize all strings before collation, we never find the collation elements stored for them. It's a bit unexpected and I'm not sure how we can fix it.
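One way to estimate how many such cases exist is to scan the table keys for entries that are not in NFD and whose decomposition contains a tailored code point. The sketch below assumes the table keys and the set of tailored code points are available as plain Python strings; the function name `affected_by_tailoring` is hypothetical, and the sample data is limited to the code points mentioned above.

```python
import unicodedata


def affected_by_tailoring(table_keys, tailored):
    """Return keys that are not in NFD and decompose to a tailored code point."""
    hits = []
    for key in table_keys:
        nfd = unicodedata.normalize("NFD", key)
        # Entries already in NFD are reachable after normalization, so skip them.
        if nfd != key and any(ch in tailored for ch in nfd):
            hits.append(key)
    return hits


keys = ["\u309D", "\u3099", "\u309E"]  # keys of the entries quoted above
tailored_in_ja = {"\u309D"}            # code points overridden by the JA rules

for key in affected_by_tailoring(keys, tailored_in_ja):
    print(" ".join(f"{ord(ch):04X}" for ch in key))
```

This only lists candidates; it does not decide what the collation elements for those entries should be.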

Test failures for all other locales are identical to those of ICU, which might be considered a good result if we treat ICU as a reference implementation.

@camertron

Hey @KL-7, I've got a few small corrections for this (awesome) writeup:

  1. Under "Summary", #3 JS should be JA.
  2. Under "Summary", #4 should be prefixed with ZH-HANT like the other ones.
  3. The links to the CLDR Trac repo seem to be broken...

Otherwise, this rocks. Thanks!

@KL-7

KL-7 commented Jul 14, 2012

@camertron, I made the corrections, thanks. The links should be working, though. I believe they have some network issues today, because links from the official site are not opening either.

@camertron

Uppercase-first sorting for Danish is finished - can you update this gist?

@KL-7

KL-7 commented Jul 25, 2012

Thanks for mentioning that. I removed Danish from the list completely, because we now have only three failures with it and all of them are identical to the ICU failures.
