@mranney
Created January 30, 2012 23:05
Why we can't process Emoji anymore
From: Chris DeSalvo <[email protected]>
Subject: Why we can't process Emoji anymore
Date: Thu, 12 Jan 2012 18:49:20 -0800
Message-Id: <[email protected]>
--Apple-Mail=_6DEAA046-886A-4A03-8508-6FD077D18F8B
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
charset=utf-8
If you are not interested in the technical details of why Emoji currently =
do not work in our iOS client, you can stop reading now.
Many many years ago a Japanese cell phone carrier called SoftBank came =
up with the idea for emoji and built it into the cell phones that it =
sold for their network. The problem they had was in deciding how to =
represent the characters in electronic form. They decided to use =
Unicode code points in the private use areas. This is a perfectly valid =
thing to do as long as your data stays completely within your product. =
However, with text messages the data has to interoperate with other =
carriers' phones.
Unfortunately SoftBank decided to copyright their entire set of images, =
their encoding, etc etc etc and refused to license them to anyone. So, =
when NTT and KDDI (two other Japanese carriers) decided that they wanted =
emoji they had to do their own implementations. To make things even =
more sad they decided not to work with each other and gang up on =
SoftBank. So, in Japan, there were three competing emoji standards that =
did not interoperate.
In 2008 Apple released iOS 2.2 and added support for the SoftBank =
implementation of emoji. Since SoftBank would not license their emoji =
out for use on networks other than their own Apple agreed to only make =
the emoji keyboard visible on iPhones that were on the SoftBank network. =
That's why you used to have to run an ad-ware app to make that keyboard =
visible.
In October 2010 the Unicode Consortium released version 6.0 of the Unicode =
standard. (In case anyone cares, Unicode originated in 1987 as a joint =
research project between Xerox and Apple.) The smart Unicode folks =
added all of emoji (about 740 glyphs) plus the new Indian Rupee sign, =
more symbols needed for several African languages, and hundreds more CJK =
symbols for, well, Chinese/Japanese/Korean (CJK also covers Vietnamese, =
but now, like then, nobody gives Vietnam any credit).
With iOS 5.0 Apple (wisely) decided to adopt Unicode 6.0. The emoji =
keyboard was made available to all users and generates code points from =
their new Unicode 6.0 locations. Apple also added this support to OS X =
Lion.
You may be asking, "So this all sounds great. Why can't I type a smiley =
in Voxer and have the damn thing show up?" Glad you asked. Consider =
the following glyph:
=F0=9F=98=84
SMILING FACE WITH OPEN MOUTH AND SMILING EYES
Unicode: U+1F604 (U+D83D U+DE04), UTF-8: F0 9F 98 84
You can get this info for any character that OS X can render by bringing =
up the Character Viewer panel and right-clicking on a glyph and =
selecting "Copy Character Info". So, what this shows us is that for =
this smiley face the Unicode code point is 0x1F604. For those of you =
who are not hex-savvy that is the decimal number 128,516. That's a =
pretty big number.
The code point that SoftBank had used was 0xFB55 (or 64,341 decimal). =
That's a pretty tiny number. You can represent 64,341 with just 16 =
bits. Dealing with 16 bits is something computers do really well. To =
represent 0x1F604 you need 17 bits. Since bits come in 8-packs you end =
up using 24 total. Computers hate odd numbers and dealing with a group =
of 3 bytes is a real pain.
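The bit counts above are easy to check. Here is a minimal Python sketch that uses only the two values quoted in this email; the helper name `bytes_needed` is made up for illustration.

```python
# Bit widths of the two code values discussed above.
softbank = 0xFB55   # SoftBank's code for the smiley, as quoted in the text
unicode6 = 0x1F604  # Unicode 6.0 code point for the same smiley


def bytes_needed(value: int) -> int:
    """Smallest whole number of 8-bit bytes that can hold value."""
    return (value.bit_length() + 7) // 8


for name, value in (("SoftBank", softbank), ("Unicode 6.0", unicode6)):
    print(f"{name}: {value:#x} = {value:,} decimal, "
          f"{value.bit_length()} bits, {bytes_needed(value)} bytes")
```

The first value fits in 16 bits (2 bytes); the second needs 17 bits, which rounds up to 3 bytes.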
I have to make a side-trip now and explain Unicode character encodings. =
Different kinds of computer systems, and the networks that connect them, =
think of data in different ways. Inside of the computer the processor =
thinks of data in terms defined by its physical properties. An old =
Commodore 64 operated on one byte, 8 bits, at a time. Later computers =
had 16-bit hardware, then 32, and now most of the computers you will =
encounter on your desk prefer to operate on data 64-bits (8 bytes) at a =
time. Networks still like to think of data as a string of individual =
bytes and try to ignore any such logical groupings. To represent the =
entire Unicode code space you need 21 bits. That is a frustrating size. =
Also, if you tend to work in Latin script (English, French, Italian, =
etc) where all of the codes you'll ever use fit neatly in 8 bits (the =
ISO Latin-1 set) then it is wasteful to have to use 24 bits (21 rounded =
up to the next byte boundary) because those top 16 bits will always be =
unused. So what do you do? You make alternate encodings.
There are many encodings, the most common being UTF-8 and UTF-16. There =
is also a UTF-32, but it isn't very popular since it's not =
space-friendly. UTF-8 has the nice property that all of the original =
ASCII characters preserve their encoding. So far in this email every =
single character I've typed (other than the smiley) has been an ASCII =
character and fits neatly in 7 bits. One byte per character is really =
friendly to work with, fits nicely in memory, and doesn't take much =
space on disk. If you sometimes need to represent a big character, like =
that smiley up there, then you do that with a multi-byte sequence. As =
we can see in the info above the UTF-8 for that smiley is the 4-byte =
sequence [F0 9F 98 84]. Make a file with those four bytes in it and open =
it in any editor that is UTF-8 aware and you'll get that smiley.
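To show where those four bytes come from, here is a rough hand-rolled UTF-8 encoder in Python. It is only a sketch: the function name `utf8_encode` is invented for this example, and it skips the validation a real encoder needs (it does not reject surrogates or values above U+10FFFF). In practice you would just call `str.encode("utf-8")`.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand (no validation)."""
    if cp < 0x80:        # up to 7 bits: plain ASCII, one byte, unchanged
        return bytes([cp])
    if cp < 0x800:       # up to 11 bits: two bytes
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # up to 16 bits: three bytes
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # up to 21 bits: four bytes
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


smiley = utf8_encode(0x1F604)
print(" ".join(f"{b:02X}" for b in smiley))   # prints: F0 9F 98 84
print(utf8_encode(0x61))                      # prints: b'a'
```

The lead byte says how many bytes the sequence has, and every continuation byte starts with the bits `10`, which is why ASCII text passes through untouched.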
Some Unicode-aware programming languages such as Java, Objective-C, and =
(most) JavaScript systems use the UTF-16 encoding internally. UTF-16 =
has some really good properties of its own that I won't digress into =
here. The thing to note is that it uses 16 bits for most characters. =
So, whereas a small letter 'a' would be the single byte 0x61 in ASCII or =
UTF-8, in UTF-16 it is the 2-byte 0x0061. Note that the SoftBank 0xFB55 =
fits nicely in that 16-bit space. Hmm, but our smiley has a Unicode =
value of U+1F604 (we use U+ when throwing Unicode values around in =
hexadecimal) and that will NOT fit in 16 bits. Remember, we need 17. =
So what do we do? Well, the Unicode guys are really smart (UTF-8 is =
fucking brilliant, no, really!) and they invented a thing called a =
"surrogate pair". With a surrogate pair you can use two 16-bit values =
to encode that code point that is too big to fit into a single 16-bit =
field. Surrogate pairs have a specific bit pattern in their top bits =
that lets UTF-16 compliant systems know that they are a surrogate pair =
that represents a single code point and not two separate UTF-16 code =
units. In the example smiley above we find that the UTF-16 surrogate =
pair that encodes U+1F604 is [U+D83D U+DE04]. Put those four bytes into =
a file and open it in any program that understands UTF-16 and you'll see =
that smiley. He really is quite cheery.
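The surrogate pair arithmetic is short enough to sketch in Python. The function names below are made up for this example; the constants 0xD800, 0xDC00 and 0x10000 are the ones UTF-16 defines.

```python
def to_surrogate_pair(cp: int) -> tuple:
    """Split a code point above U+FFFF into a UTF-16 (high, low) pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = cp - 0x10000               # leaves a 20-bit value
    high = 0xD800 | (offset >> 10)      # top 10 bits go in the high surrogate
    low = 0xDC00 | (offset & 0x3FF)     # bottom 10 bits go in the low surrogate
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F604)
print(f"U+1F604 -> {high:04X} {low:04X}")          # prints: U+1F604 -> D83D DE04
print(f"{from_surrogate_pair(high, low):X}")       # prints: 1F604
```

Any 16-bit value in the range D800 to DBFF is a high surrogate and anything in DC00 to DFFF is a low surrogate; that fixed bit pattern is how a UTF-16 reader knows it is looking at half of a pair.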
So, I've already said that Objective-C and Java and (most) JavaScript =
systems use UTF-16 internally so we should be all cool, right? Well, =
see, it was that "(most)" that is the problem.
Before there was UTF-16 there was another encoding used by Java and =
JavaScript called UCS-2. UCS-2 is a strict 16-bit encoding. You get 16 =
bits per character and no more. So how do you represent U+1F604 which =
needs 17 bits? You don't. Period. UCS-2 has no notion of surrogate =
pairs. Through most of time this was ok because the Unicode consortium =
hadn't defined many code points beyond the 16 bit range so there was =
nothing out there to encode. But in 1996 it was clear that to encode =
all the CJK languages (and Vietnamese!) that we'd start needing those =
17+ bit code points. SUN updated Java to stop using UCS-2 as its =
default encoding and switched to UTF-16. NeXT did the same thing with =
NeXTSTEP (the precursor to iOS). Many JavaScript systems updated as =
well.
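To make the UCS-2 limitation concrete, here is a hypothetical strict UCS-2 encoder sketched in Python. The name `ucs2_encode` and its error handling are assumptions for illustration only; this is not how V8 or any particular runtime is implemented.

```python
def ucs2_encode(text: str) -> bytes:
    """Hypothetical strict UCS-2 encoder: exactly 16 bits per character, big-endian."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # UCS-2 has no surrogate pairs, so there is nothing to fall back on.
            raise ValueError(f"U+{cp:04X} does not fit in UCS-2")
        out += cp.to_bytes(2, "big")
    return bytes(out)


print(ucs2_encode("a"))                  # prints: b'\x00a'

smiley = chr(0x1F604)
try:
    ucs2_encode(smiley)
except ValueError as err:
    print(err)                           # prints: U+1F604 does not fit in UCS-2

# UTF-16 handles the same character by spending two 16-bit units on it.
print(len(smiley.encode("utf-16-be")))   # prints: 4
```

For text that stays inside the 16-bit range the two encodings produce identical bytes, which is why the difference went unnoticed for so long.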
Now, here's what you've all been waiting for: the V8 runtime for =
JavaScript, which is what our node.js servers are built on, uses UCS-2 =
internally as its encoding and is not capable of handling any code =
point outside the base 16 bit range (we call that the BMP, or Basic =
Multilingual Plane). V8 fundamentally has no ability to represent the =
U+1F604 that we need to make that smiley.
Danny confirmed this with the node guys today. Matt Ranney is going to =
talk to the V8 guys about it and see what they want to do about it.
Wow, you read through all of that? You rock. I'm humbled that you gave =
me so much of your attention. I feel that we've accomplished something =
together. Together we are now part of the tiny community of people who =
actually know anything about Unicode. You may have guessed by now that =
I am a text geek. I have had to implement java.lang.String for three =
separate projects. I love this stuff. If you have any questions about =
anything I've written here, or want more info so that you don't have to =
read the 670 page Unicode 6.0 core specification (there are many, many =
addenda as well) then please don't hesitate to hit me up.
Love,
Chris
p.s. Remember that this narrative is almost all ASCII characters, and =
ASCII is a subset of UTF-8. That smiley is the only non-ASCII =
character. In UTF-8 this email (everything up to, but not including my =
signature) is 8,553 bytes. In UTF-16 it is 17,102 bytes. In UTF-32 it =
would be 34,204 bytes. These space considerations are one of the many =
reasons we have multiple encodings.
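If you want to compare sizes yourself, a small Python sketch like the one below will do it. The helper name `encoded_sizes` and the sample string are made up for this example. It uses the big-endian codecs so that no byte order mark is added to the counts.

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of text in each encoding, without a byte order mark."""
    codecs = (("UTF-8", "utf-8"), ("UTF-16", "utf-16-be"), ("UTF-32", "utf-32-be"))
    return {name: len(text.encode(codec)) for name, codec in codecs}


# A mostly-ASCII sample with one character outside the 16-bit range.
sample = "Why can't I type a smiley? " + chr(0x1F604)

for name, size in encoded_sizes(sample).items():
    print(f"{name}: {size} bytes")
```

For mostly-ASCII text, UTF-16 comes out at roughly twice the UTF-8 size and UTF-32 at roughly four times.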
@al45tair

> UTF16, UCS2 and BMP are insane crap produced by crazy committees. I always use a UTF8 editor for all my programs or web pages. It supports full unicode. It is easier to handle UTF8 than any of the other encodings (except ascii or iso8859-*). The "wide" chars in C or java were a stupid mistake that makes the life of developers a lot more difficult.

Sorry, but that’s complete rubbish. UTF-16 is the preferred representation for Unicode text for a variety of very good reasons, which is why it’s used by the canonical reference implementation, the ICU project, as well as the Java and Objective-C runtimes.

As for C, the wide characters (and the associated wide and multibyte string routines) in C were never intended for use with Unicode; they were intended for use in East Asian countries with pre-existing standards. They were designed with the intent that a single wide character represented something that the end user would regard as a character (i.e. something that could be processed as an individual unit); this is not true even with UCS-4, and so using the wide character routines and wchar_t for Unicode (whether your wchar_t is 16 or 32-bit) is and always has been a mistake.

@dustin

dustin commented Nov 28, 2012

> Sorry, but that’s complete rubbish. UTF-16 is the preferred representation for Unicode text for a variety of very good reasons, which is why it’s used by the canonical reference implementation, the ICU project, as well as the Java and Objective-C runtimes.

Can you name any of them? UCS-2 may have been justifiable at some point, but I can't think of any good reason for UTF-16 to exist anywhere.

@xdamman

xdamman commented Nov 29, 2012

@isaacs you are right, this issue has actually been solved with the migration from node 0.6.x to 0.8.x.

@apk

apk commented Nov 29, 2012

The only reason people are using UTF-16 (esp. as programmer-visible internal representation) is that it was originally UCS-2 in the same language, and we are stuck with the strange java codepoint/index APIs that most people forget to use properly because that causes bugs that only appear in a few fringe languages (from a western-centric viewpoint), as opposed to utf-8 that has effects practically everywhere. Ironic that the emojis bring that problem back to the western world. :-)

@jonathanwcrane

So psyched to learn about the history of Unicode and character encodings, especially the historical anomaly of the battle between competing Japanese wireless providers!

@MarcusJohnson91

Hey Chris, I just stumbled upon this page while trying to understand the UTF-8 encoding.

I'm trying to write a basic UTF-8 string handling library in C (my idea is to basically define utf8_t as an unsigned char pointer).

Anyway, I've skimmed through the Unicode PDF a few times, and tried googling it about a dozen times, and I can't find any good information on how the more complex features of this encoding are represented.

So, here goes.

How are emoji (like, flag emoji especially) represented? Are they 2 code points? How do you know there's a following code point? I know that usually the leading byte of a code point will set the top 4 bits depending on how many bytes are in the code point; does this work for emoji flags too?

ALSO why is there sometimes a leading flag byte? how do you know when the flag byte will be separate, or part of the first coding byte?
