The char Type in Java is Broken

Umer Mansoor · May 08, 2016 · 4 mins read

If I may be so brash, it is my opinion that the char type in Java is broken and should be avoided. It isn’t suitable for modern needs, makes debugging a nightmare, and should be retired. char is for representing characters (e.g. ‘a’, ‘b’, ‘c’) and has been part of Java since it was released about 20 years ago. When Java first came out, the world was a simpler place: Windows 95 was the latest, greatest operating system, the world’s first flip phone had just gone on sale, and Unicode had fewer than 40,000 characters, all of which fit perfectly into the 16-bit space that char provides. But things have changed drastically. Unicode has outgrown the 16-bit space and now requires 21 bits to cover all of its 120,737 characters.

Java has supported Unicode since its first release, and strings are internally represented using the UTF-16 encoding. UTF-16 is a variable-length encoding scheme: characters that fit into the 16-bit space are stored as a single 2-byte code unit, and all other characters are stored as a pair of code units, 4 bytes in total (a surrogate pair). This is great. Every Unicode character in existence, plus room for about a million more, can be represented in UTF-16 and thus as a String in Java.
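
To make this concrete, here’s a quick sketch (the variable names and string literals are just examples I made up) showing how a single emoji ends up occupying two char units inside a Java String:

```java
public class Utf16Demo {
    public static void main(String[] args) {
        String letter = "a";   // BMP character: one 16-bit code unit
        String emoji = "👦";   // U+1F466: needs a surrogate pair, i.e. two 16-bit code units

        System.out.println(letter.length());                          // 1
        System.out.println(emoji.length());                           // 2 char units...
        System.out.println(emoji.codePointCount(0, emoji.length()));  // ...but only 1 character
    }
}
```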

But char is a different story altogether. Let’s look at its definition from the official source:

char: The char data type is a single 16-bit Unicode character. It has a minimum value of ‘\u0000’ (or 0) and a maximum value of ‘\uffff’ (or 65,535 inclusive).

“16-bit Unicode character”? I guess Joel was right:

Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don’t feel bad.

There is no such thing as a “16-bit Unicode character”. Please read Joel’s article if you don’t understand the last statement.

char uses 16 bits to store Unicode characters that fall in the range 0 – 65,535, which isn’t enough to hold all Unicode characters anymore. You might think: Gee, 65,535 is plenty already. I’ll never use that many. That’s true. But your users will. And when they send you a character that requires more than 16 bits, like these emojis 👦👩, char-based methods like someString.charAt(0) or someString.substring(0, 1) will break and hand you only half of the code point. And the worst part is that the compiler won’t even complain. Recently, a fellow developer told me that their “North American users” started complaining that chat nicknames and messages “aren’t displaying properly”. After a lot of grief, they found the issue and had to undo all the char manipulation in their software to handle emojis and other cool characters. (Use codePointAt(index) instead, which returns an int big enough to hold any Unicode character in existence.)
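
Here’s a small, made-up sketch of what that failure looks like, and how codePointAt behaves on the same string:

```java
public class HalfACharacter {
    public static void main(String[] args) {
        String nickname = "👩coder";   // starts with U+1F469, which takes two char units

        // charAt(0) hands back only the high surrogate -- half a code point
        char half = nickname.charAt(0);
        System.out.println((int) half);                // 55357 (0xD83D), not a real character
        System.out.println(half);                      // garbled output, and no error anywhere

        // substring(0, 1) splits the surrogate pair the same way
        System.out.println(nickname.substring(0, 1));  // garbled

        // codePointAt(0) returns the whole code point as an int
        int codePoint = nickname.codePointAt(0);
        System.out.println(codePoint);                                 // 128105 (0x1F469)
        System.out.println(new String(Character.toChars(codePoint)));  // 👩
    }
}
```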

I have heard people say things like “if internationalization isn’t a concern, you’d probably be fine using char” or “don’t worry about it unless your program is going to be released in China or Japan”. This advice is wrong on two levels. First, I rarely come across applications where internationalization isn’t a concern anymore. Second, emoji are supported by all popular applications these days. Unicode isn’t just about internationalization anymore.

The char data type is broken. It might work for small or test applications, but if you start depending on it for string and character manipulation, it will disappoint. The application will ‘break’ silently and your users will see garbled text. Maybe a character type from Oracle that can hold a full Unicode code point is the answer, or, in the interim, at least a runtime exception when the JVM detects that something bad is about to happen. Until then, we should probably avoid the char type. Even its official JavaDocs don’t sound all that convincing:

The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode Standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value. (Refer to the definition of the U+n notation in the Unicode Standard.)



Written by Umer Mansoor
Hi, I am Umer. I live and work in the San Francisco Bay Area. Since 2011, I have worked at startups, mobile gaming, and internet companies. If you have any questions, feel free to message me.

Comments (4)


Bwd

Ok, what should a developer use instead?


Umer Mansoor

Avoid using the `char` data type and the methods that return it, like `charAt(index)`. `substring(i, j)` is safe to use if the indexes are obtained using the `indexOf(ch)` method.

To get the Unicode code point, use the `String.codePointAt(index)` method. It returns an `int` that contains the code point value of the character.
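
For example (a rough sketch, the string is made up), you can count and walk over code points instead of char units:

```java
public class CodePointExample {
    public static void main(String[] args) {
        String msg = "hi 👦👩";

        System.out.println(msg.length());                         // 7 char units
        System.out.println(msg.codePointCount(0, msg.length()));  // 5 actual characters

        // Java 8+: iterate over whole code points instead of char units
        msg.codePoints().forEach(cp ->
                System.out.println(new String(Character.toChars(cp))));
    }
}
```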


Codecompile

100% truth!


Jan Larres

Found this gem https://docs.oracle.com/jav....
Oracle promoting disastrously bad char-based string operations.


