The char Type in Java is Broken?

Umer · May 08, 2016 · 4 mins read

If I may be so brash, it is my opinion that the char type in Java is dangerous and should be avoided if you are going to use Unicode characters. char is used for representing characters (e.g. ‘a’, ‘b’, ‘c’) and has been part of Java since it was released about 20 years ago. When Java first came out, the world was a simpler place. Windows 95 was the latest, greatest operating system, the world’s first flip phone had just gone on sale, and Unicode had fewer than 40,000 characters, all of which fit perfectly into the 16-bit space that char provides. But things have changed drastically. Unicode has outgrown the 16-bit space and now requires 21 bits to cover all of its 120,737 characters.

Java has supported Unicode since its first release, and strings are internally represented using the UTF-16 encoding. UTF-16 is a variable-length encoding scheme. Characters that fit into the 16-bit space are stored as 2 bytes; all other characters are stored as 4 bytes, a surrogate pair of two 16-bit code units. This is great: every Unicode code point, over a million of them, can be represented in UTF-16 and therefore in a Java String.
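Here is a minimal sketch of what that looks like in practice (the class name and strings are made up for illustration): an ASCII letter occupies one 16-bit code unit, while an emoji outside the 16-bit range takes two.

```java
public class Utf16Demo {
    public static void main(String[] args) {
        String ascii = "a";
        String emoji = "👦";   // U+1F466, outside the 16-bit range

        // length() counts 16-bit code units, not characters
        System.out.println(ascii.length());   // 1
        System.out.println(emoji.length());   // 2 (a surrogate pair)

        // codePointCount() counts actual Unicode characters
        System.out.println(emoji.codePointCount(0, emoji.length()));   // 1
    }
}
```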

But char is a different story altogether. Let’s look at its definition from the official source:

char: The char data type is a single 16-bit Unicode character. It has a minimum value of ‘\u0000’ (or 0) and a maximum value of ‘\uffff’ (or 65,535 inclusive).

“16-bit Unicode character”? I guess Joel was right:

Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don’t feel bad.

There is no such thing as a “16-bit Unicode character”. Please read Joel’s article if you don’t understand the last statement.

char uses 16 bits to store Unicode characters that fall in the range 0 - 65,535, which isn’t enough to cover all of Unicode anymore. You might think: Gee, 65,535 is plenty already. I’ll never use that many. That’s true. But your users will. And when they send you a character that needs more than 16 bits, like these emojis 👦👩, char-based methods like someString.charAt(0) or someString.substring(0,1) will break and hand you only half of the character: a lone surrogate. The worst part is that the compiler won’t even complain, as the sketch below shows. Recently, a fellow developer told me that their “North American users” started complaining that chat nicknames and messages “aren’t displaying properly”. After a lot of grief, they found the issue and had to undo all the char manipulation in their software to handle emojis and other cool characters. (Use codePointAt(index) instead; it returns an int, which is big enough to hold every Unicode character in existence.)
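A minimal sketch of the failure, using a made-up one-character string; any code point above U+FFFF behaves the same way:

```java
public class BrokenChar {
    public static void main(String[] args) {
        String s = "👩";   // U+1F469, stored internally as the surrogate pair \uD83D \uDC69

        char half = s.charAt(0);                    // only the high surrogate
        System.out.printf("%04X%n", (int) half);    // prints D83D -- not a valid character on its own
        System.out.println(s.substring(0, 1));      // prints garbage: half of the pair

        int codePoint = s.codePointAt(0);           // the whole character as an int
        System.out.printf("%X%n", codePoint);       // prints 1F469
    }
}
```

Nothing here fails to compile and nothing throws; the damage only shows up when the mangled text reaches a user.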

I have heard people say things like: “if internationalization isn’t a concern, you’d probably be fine using char” or “don’t worry about it unless your program is going to be released in China or Japan”.

First, I rarely come across applications where internationalization isn’t a concern anymore. My last three jobs all required internationalization at their core. Second, emoji characters are supported by all popular applications these days. Unicode isn’t just about internationalization anymore.

To be fair to char, it will work fine most of the time for many applications. It isn’t broken, but it has a flaw that can ‘break’ your application silently and make your users see garbled text. Maybe a character type from Oracle that can hold any code point is the answer. Or, in the interim, at least a runtime exception when a string operation is about to split a character in half. Until then, we should probably avoid the char type.
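If you do need to walk through the characters of a string, iterating over code points sidesteps char entirely. A minimal sketch, assuming Java 8 or later (the class name and message string are made up):

```java
public class CodePointLoop {
    public static void main(String[] args) {
        String message = "hi 👦👩";   // made-up chat message with supplementary characters

        // codePoints() yields one int per character, surrogate pairs already combined
        message.codePoints().forEach(cp -> {
            String ch = new StringBuilder().appendCodePoint(cp).toString();
            System.out.println(ch + " -> U+" + Integer.toHexString(cp).toUpperCase());
        });
    }
}
```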

Even its official JavaDocs don’t sound all that convincing to me:

The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode Standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value. (Refer to the definition of the U+n notation in the Unicode Standard.)

🤷

#java #popular



Comments (6)


Bwd

Ok, what should a developer use instead ?


Umer Mansoor

Avoid the char data type and the methods that return it, like charAt(index). substring(i, j) is safe to use if the indexes are obtained using the indexOf(ch) method.

To get the Unicode code point, use the String.codePointAt(index) method. It returns an int that contains the code point value of the character.
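For example, a small sketch putting both suggestions together (the class name and text are made up):

```java
public class SafeIndexes {
    public static void main(String[] args) {
        String text = "nickname: 👩";                      // made-up chat string
        int colon = text.indexOf(':');                      // a char index, but it lands on a character boundary
        System.out.println(text.substring(0, colon));       // "nickname" -- safe

        int i = text.indexOf("👩");                         // index of the emoji's first char unit
        System.out.printf("U+%X%n", text.codePointAt(i));   // U+1F469, the full code point
    }
}
```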


Jan Larres

Found this gem https://docs.oracle.com/jav….
Oracle promoting disastrously bad char based string operations.


Codecompile

100% truth!


DarnellKes

This information is true


Kevinphows

Certainly is not present.

