Unicode and You

I'm a Unicode newbie. But like many newbies, I had an urge to learn once my interest was piqued by an introduction to Unicode.


This is a companion discussion topic for the original entry at http://betterexplained.com/articles/unicode/

[…] Binary files can get confusing. Problems happen when computers have different ways of reading data. There’s something called the “NUXI” or byte-order problem, which happens when two computers with different architectures (Mac and PC, for example) try to transfer binary data. Regular text stored in single bytes is unambiguous (but be careful with Unicode). […]
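
The byte-order problem mentioned above is easy to demonstrate. A minimal Python sketch (my addition, not from the original article): the same 16-bit value serializes to different bytes depending on endianness, and a machine that guesses wrong reads back a different number entirely.

```python
import struct

# The same 16-bit integer serialized under each byte order.
value = 0x1234
big = struct.pack(">H", value)     # big-endian: most significant byte first
little = struct.pack("<H", value)  # little-endian: least significant byte first

print(big.hex())     # "1234"
print(little.hex())  # "3412"

# A reader that assumes the wrong order gets a different number entirely.
print(hex(struct.unpack(">H", little)[0]))  # 0x3412
```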

Great guide!!!

That was a wonderful article!!!

Thanks rc, glad it was helpful for you.

The “this program can break” example didn’t work for me. It opened fine in my hex editor. Am I missing a step?

Hi Steven, I was unclear in my instructions (just updated). Try opening the file in notepad, rather than a hex editor to see the result :).

Thanks, I see what you mean now. I still haven’t figured out why it happens. Maybe it’s a Windows joke? :slight_smile:

Yeah, I believe it’s an issue with Notepad trying to figure out which encoding the file uses – it guesses based on the distribution of bytes! That particular English sentence has a byte structure that resembles the UTF-16 encoding of characters from other languages. More here: http://blogs.msdn.com/michkap/archive/2006/06/14/631016.aspx
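
To make the guessing game concrete, here is a minimal Python sketch (my addition, assuming the file holds the bare ASCII sentence with no BOM): the same bytes, read back as UTF-16 little-endian, turn into non-ASCII characters, which is roughly the trap Notepad’s detection heuristic falls into.

```python
# The ASCII sentence from the example, as raw bytes (22 bytes, an even count).
data = b"this program can break"

# Read correctly as ASCII/UTF-8:
print(data.decode("utf-8"))  # this program can break

# Misread as UTF-16 little-endian: every pair of ASCII bytes
# collapses into one 16-bit code unit outside the ASCII range.
garbled = data.decode("utf-16-le")
print(len(garbled))                          # 11 characters instead of 22
print(all(ord(ch) > 127 for ch in garbled))  # True: none are ASCII
```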

It is really a great job.
To simplify concepts that way is not easy at all.
I am a computer engineer working with telecommunication engineers who have no previous background in programming, and I suffer when discussing such subjects with them.

Excellent overview of Unicode. And a great site.

One correction: “8-bit clean” does not mean the encoding only uses 7 bits. It’s a property of the system handling the bytes: an 8-bit-clean system can transparently handle bytes with the high bit set (hence an encoding running over it is allowed to use all 8 bits). UTF-7 is designed for systems which are not 8-bit clean.
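
As an illustration (my addition, a minimal Python sketch): UTF-7 keeps every byte in the 7-bit ASCII range, so it survives transports that are not 8-bit clean, while UTF-8 needs high-bit bytes for non-ASCII characters.

```python
text = "café"

utf8 = text.encode("utf-8")
utf7 = text.encode("utf-7")

print(utf8)  # b'caf\xc3\xa9' - the e-acute needs bytes with the high bit set
print(utf7)  # b'caf+AOk-'   - pure ASCII, safe on 7-bit transports

print(any(b > 0x7F for b in utf8))  # True
print(any(b > 0x7F for b in utf7))  # False
```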

Also you haven’t described UTF-16, which is not the same as UCS-2 – it is a superset, able to encode the entire repertoire, not just the first 65536 characters as UCS-2 can.
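
For the curious, a minimal Python sketch of that difference (my addition): characters above U+FFFF need a surrogate pair in UTF-16 – two 16-bit code units – which UCS-2 simply cannot represent.

```python
# U+1D11E MUSICAL SYMBOL G CLEF lies outside the first 65536 code points.
clef = "\U0001D11E"

# In UTF-16 it becomes a surrogate pair: two 16-bit code units, four bytes.
encoded = clef.encode("utf-16-be")
print(encoded.hex())  # d834dd1e

# The pair is computed from the code point minus 0x10000:
offset = ord(clef) - 0x10000
high = 0xD800 + (offset >> 10)   # high (lead) surrogate
low = 0xDC00 + (offset & 0x3FF)  # low (trail) surrogate
print(hex(high), hex(low))  # 0xd834 0xdd1e
```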

Great article. Very well explained.

There is a little typo here:

“Why does XML store data in UCS-8 instead of UCS-2? Is space or processing power more important when reading XML documents?”

I think you meant UTF-8 instead of UCS-8.

@Nihal: Thank you! Glad you liked it.

@Matt: Awesome, thanks for the correction – I was definitely off in my understanding. I wonder if there is a name for data that has the highest bit cleared on every byte? (I think UTF-16 would be a good addition to the discussion also).

@Alonso: Thanks for the correction, I’ll change that now.

Short and handy. Thanks!

A nice kick-off to get a better understanding :slight_smile:

Very good article - the clearest article I have encountered on the web.
Keep up the good work Kalid!
Two things I didn’t understand:

  1. “Purists probably didn’t like this, because the full Latin character sets were defined elsewhere, and now one letter had 2 codepoints”. “Defined elsewhere”, “2 codepoints” - I didn’t get it.
  2. Regarding the first byte, whose leading bits indicate the number of bytes in the sequence: I wrote “avi” and saved it as UTF-8. I looked in a hex editor and saw that the first byte started with 1111…
    The bytes there were: EF BB BF 61 76 69, so it amounts to 6 bytes - can somebody explain this issue to me?

@Kai: Thanks, glad you liked it.

@Avi: Thank you! Great questions

  1. Unicode gives a number (called a code point) to every symbol, so “a” “b” and “c” each have their own number. Sometimes the same exact symbol (like a) will have two different numbers that represent it, to be compatible with the old formats like ASCII.

From a purist point of view, it’d be nice for every symbol to have exactly one number, but from a practical standpoint the system needs to be backwards compatible. So Latin characters appear where they are today, in the ASCII range under 127, and also in another “proper” location defined by the Unicode standard.
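
A concrete case of one symbol with two code points (my addition, a minimal Python sketch): the angstrom sign U+212B and the letter Å at U+00C5 look identical, and Unicode normalization maps the duplicate onto the preferred letter.

```python
import unicodedata

angstrom = "\u212B"  # ANGSTROM SIGN
a_ring = "\u00C5"    # LATIN CAPITAL LETTER A WITH RING ABOVE

print(angstrom, a_ring)    # Å Å - visually identical
print(angstrom == a_ring)  # False: different code points

# NFC normalization collapses the duplicate onto the preferred letter.
print(unicodedata.normalize("NFC", angstrom) == a_ring)  # True
```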

  2. Great question. You’re right, in UTF-8, avi should only have 3 bytes (61 76 69). The preceding ones (EF BB BF) are the BOM (byte order mark) – in UTF-8 it’s just a signature marking the file as Unicode, while in UTF-16 the BOM also tells you whether the data is big or little endian. You can read more about it here:

http://betterexplained.com/articles/understanding-big-and-little-endian-byte-order/

(Scroll down for the part about Unicode).
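
The observation above can be reproduced directly (my addition, a minimal Python sketch): plain UTF-8 gives the 3 bytes, while the “utf-8-sig” codec prepends the EF BB BF signature that the editor wrote.

```python
import codecs

plain = "avi".encode("utf-8")       # no BOM
signed = "avi".encode("utf-8-sig")  # with the UTF-8 signature

print(plain.hex(" "))   # 61 76 69
print(signed.hex(" "))  # ef bb bf 61 76 69
print(signed[:3] == codecs.BOM_UTF8)  # True
```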