Unicode and You

I'm a Unicode newbie. But like many newbies, I had an urge to learn once my interest was piqued by an introduction to Unicode.


This is a companion discussion topic for the original entry at http://betterexplained.com/articles/unicode/

[…] Binary files can get confusing. Problems happen when computers have different ways of reading data. There’s something called the “NUXI” or byte-order problem, which happens when two computers with different architectures (Mac and PC, for example) try to transfer binary data. Regular text stored in single bytes is unambiguous (but be careful with Unicode). […]
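
The byte-order problem mentioned above is easy to demonstrate. A minimal Python sketch (my addition, not from the original article): the same 16-bit value serializes to different bytes depending on endianness, and a machine that guesses wrong reads back a different number entirely.

```python
import struct

# The same 16-bit integer serialized under each byte order.
value = 0x1234
big = struct.pack(">H", value)     # big-endian: most significant byte first
little = struct.pack("<H", value)  # little-endian: least significant byte first

print(big.hex())     # "1234"
print(little.hex())  # "3412"

# A reader that assumes the wrong order gets a different number entirely.
print(hex(struct.unpack(">H", little)[0]))  # 0x3412
```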

Great guide!!!

That was a wonderful article!!!

Thanks rc, glad it was helpful for you.

The “this program can break” example didn’t work for me. It opened fine in my hex editor. Am I missing a step?

Hi Steven, I was unclear in my instructions (just updated). Try opening the file in notepad, rather than a hex editor to see the result :).

Thanks, I see what you mean now. I still haven’t figured out why it happens. Maybe it’s a Windows joke? :slight_smile:

Yeah, I believe it’s an issue with Notepad trying to figure out which encoding the file uses – it guesses based on the distribution of bytes! That particular English sentence has a byte structure that resembles the UTF-16 encoding of characters from other languages. More here: http://blogs.msdn.com/michkap/archive/2006/06/14/631016.aspx
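
To make the guessing game concrete, here is a minimal Python sketch (my addition, assuming the file holds the bare ASCII sentence with no BOM): the same bytes, read back as UTF-16 little-endian, turn into non-ASCII characters, which is roughly the trap Notepad’s detection heuristic falls into.

```python
# The ASCII sentence from the example, as raw bytes (22 bytes, an even count).
data = b"this program can break"

# Read correctly as ASCII/UTF-8:
print(data.decode("utf-8"))  # this program can break

# Misread as UTF-16 little-endian: every pair of ASCII bytes
# collapses into one 16-bit code unit outside the ASCII range.
garbled = data.decode("utf-16-le")
print(len(garbled))                          # 11 characters instead of 22
print(all(ord(ch) > 127 for ch in garbled))  # True: none are ASCII
```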

It is really a great job.
To simplify concepts that way is not easy at all.
I am a computer engineer working with telecommunication engineers who have no previous background in programming, and I suffer when discussing such subjects with them.

Excellent overview of Unicode. And a great site.

One correction: “8-bit clean” does not mean the encoding only uses 7 bits. It’s a property of the system handling the bytes: an 8-bit-clean system can transparently handle bytes with the high bit set (hence an encoding running over it is allowed to use all 8 bits). UTF-7 is designed for systems which are not 8-bit clean.
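
As an illustration (my addition, a minimal Python sketch): UTF-7 keeps every byte in the 7-bit ASCII range, so it survives transports that are not 8-bit clean, while UTF-8 needs high-bit bytes for non-ASCII characters.

```python
text = "café"

utf8 = text.encode("utf-8")
utf7 = text.encode("utf-7")

print(utf8)  # b'caf\xc3\xa9' - the e-acute needs bytes with the high bit set
print(utf7)  # b'caf+AOk-'   - pure ASCII, safe on 7-bit transports

print(any(b > 0x7F for b in utf8))  # True
print(any(b > 0x7F for b in utf7))  # False
```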

Also you haven’t described UTF-16, which is not the same as UCS-2 – it is a superset, able to encode the entire repertoire, not just the first 65536 characters as UCS-2 can.
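
For the curious, a minimal Python sketch of that difference (my addition): characters above U+FFFF need a surrogate pair in UTF-16 – two 16-bit code units – which UCS-2 simply cannot represent.

```python
# U+1D11E MUSICAL SYMBOL G CLEF lies outside the first 65536 code points.
clef = "\U0001D11E"

# In UTF-16 it becomes a surrogate pair: two 16-bit code units, four bytes.
encoded = clef.encode("utf-16-be")
print(encoded.hex())  # d834dd1e

# The pair is computed from the code point minus 0x10000:
offset = ord(clef) - 0x10000
high = 0xD800 + (offset >> 10)   # high (lead) surrogate
low = 0xDC00 + (offset & 0x3FF)  # low (trail) surrogate
print(hex(high), hex(low))  # 0xd834 0xdd1e
```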

Great article. Very well explained.

There is a little typo here:

“Why does XML store data in UCS-8 instead of UCS-2? Is space or processing power more important when reading XML documents?”

I think you meant UTF-8 instead of UCS-8.

@Nihal: Thank you! Glad you liked it.

@Matt: Awesome, thanks for the correction – I was definitely off in my understanding. I wonder if there is a name for data that has the highest bit cleared on every byte? (I think UTF-16 would be a good addition to the discussion also).

@Alonso: Thanks for the correction, I’ll change that now.

Short and handy. Thanks!

A nice kick-off to get a better understanding :slight_smile:

Very good article - the clearest article I have encountered on the web.
Keep up the good work Kalid!
Two things I didn’t understand:

  1. “Purists probably didn’t like this, because the full Latin character sets were defined elsewhere, and now one letter had 2 codepoints”. “Defined elsewhere”, “2 codepoints” - I didn’t get it.
  2. Regarding the first byte, whose leading bits indicate the number of bytes in the sequence: I wrote “avi” and saved it as UTF-8. I looked in a hex editor and saw that the first byte started with 1111…
    The bytes there were: EF BB BF 61 76 69, so it amounts to 6 bytes - can somebody explain this issue to me?

@Kai: Thanks, glad you liked it.

@Avi: Thank you! Great questions

  1. Unicode gives a number (called a code point) to every symbol, so “a” “b” and “c” each have their own number. Sometimes the same exact symbol (like a) will have two different numbers that represent it, to be compatible with the old formats like ASCII.

From a purist point of view, it’d be nice for every symbol to have exactly one number, but from a practical standpoint the system needs to be backwards compatible. So Latin characters appear where they are today, in the ASCII range under 127, and also in another “proper” location defined by the Unicode standard.
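
A concrete case of one symbol with two code points (my addition, a minimal Python sketch): the angstrom sign U+212B and the letter Å at U+00C5 look identical, and Unicode normalization maps the duplicate onto the preferred letter.

```python
import unicodedata

angstrom = "\u212B"  # ANGSTROM SIGN
a_ring = "\u00C5"    # LATIN CAPITAL LETTER A WITH RING ABOVE

print(angstrom, a_ring)    # Å Å - visually identical
print(angstrom == a_ring)  # False: different code points

# NFC normalization collapses the duplicate onto the preferred letter.
print(unicodedata.normalize("NFC", angstrom) == a_ring)  # True
```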

  2. Great question. You’re right, in UTF-8, avi should only have 3 bytes (61 76 69). The preceding ones (EF BB BF) are the BOM (byte order mark) – in UTF-8 it’s just a signature marking the file as Unicode, while in UTF-16 the BOM also tells you whether the data is big or little endian. You can read more about it here:

http://betterexplained.com/articles/understanding-big-and-little-endian-byte-order/

(Scroll down for the part about Unicode).
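
The observation above can be reproduced directly (my addition, a minimal Python sketch): plain UTF-8 gives the 3 bytes, while the “utf-8-sig” codec prepends the EF BB BF signature that the editor wrote.

```python
import codecs

plain = "avi".encode("utf-8")       # no BOM
signed = "avi".encode("utf-8-sig")  # with the UTF-8 signature

print(plain.hex(" "))   # 61 76 69
print(signed.hex(" "))  # ef bb bf 61 76 69
print(signed[:3] == codecs.BOM_UTF8)  # True
```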