I'm a Unicode newbie. But like many newbies, I had an urge to learn once my interest was piqued by an introduction to Unicode.
This is a companion discussion topic for the original entry at http://betterexplained.com/articles/unicode/
[…] Binary files can get confusing. Problems happen when computers have different ways of reading data. There’s something called the “NUXI” or byte-order problem, which happens when 2 computers with different architectures (Mac and PC, for example) try to transfer binary data. Regular text stored in single bytes is unambiguous (but be careful with unicode). […]
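The "NUXI" name in the excerpt above can be reproduced in a few lines. This is a minimal sketch, not code from the linked article: the helper `swap_16bit_words` is a hypothetical name, and it assumes the data is a sequence of 16-bit words written by a big-endian machine and read back by a little-endian one.

```python
import struct

def swap_16bit_words(data: bytes) -> bytes:
    """Read data as big-endian 16-bit words, then write them little-endian.

    Hypothetical helper for illustration; it mimics what happens when one
    machine stores 16-bit values and another with the opposite byte order
    reads them without converting.
    """
    if len(data) % 2 != 0:
        raise ValueError("expected an even number of bytes")
    count = len(data) // 2
    words = struct.unpack(f">{count}H", data)   # interpret as big-endian
    return struct.pack(f"<{count}H", *words)    # emit as little-endian

original = b"UNIX"
garbled = swap_16bit_words(original)
print(garbled)  # b'NUXI'
```

Each pair of bytes swaps places, so the same four bytes read as "NUXI". Applying the swap a second time restores the original, which is why byte-order bugs are fixable once both sides agree on an order.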
Great guide!!!
[…] Unicode and You […]
[…] At least, if you haven’t yet had “a chance” to deal with this subject the way I would have liked, you should read Unicode and You. I’m confident it will help you understand the basics. […]
That was a wonderful article!!!
Thanks rc, glad it was helpful for you.
The “this program can break” example didn’t work for me. It opened fine in my hex editor. Am I missing a step?
Hi Steven, I was unclear in my instructions (just updated). Try opening the file in Notepad, rather than a hex editor, to see the result :).
Thanks, I see what you mean now. I still haven’t figured out why it happens. Maybe it’s a Windows joke?
Yeah, I believe it’s an issue with Notepad trying to decipher what encoding the file uses: it guesses based on the distribution of bytes! That particular English sentence has a byte structure similar to the Unicode characters of other languages. More here: http://blogs.msdn.com/michkap/archive/2006/06/14/631016.aspx
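To see why such a guess is even possible, it helps to reinterpret plain ASCII bytes as 16-bit characters. The sketch below does not reproduce Notepad's detection heuristic (that is not shown in this thread); it only demonstrates that an even-length ASCII string is also a valid UTF-16 byte sequence. The sample sentence is a stand-in chosen for illustration.

```python
# Stand-in sentence: 18 ASCII characters, so 18 bytes (an even count).
sample = "this app can break"
raw = sample.encode("ascii")

# Deliberately decode the same bytes with the wrong encoding.
# Every pair of ASCII bytes becomes one 16-bit code unit.
misread = raw.decode("utf-16-le")

print(len(raw), len(misread))            # 18 bytes become 9 characters
print([hex(ord(ch)) for ch in misread])  # code points well above the ASCII range
```

For this sample, all nine resulting code points fall in the CJK ideograph range (U+4E00 to U+9FFF), so a detector that only looks at byte patterns has no hard evidence that the file is ASCII rather than 16-bit text.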
It is really a great job.
To simplify concepts that way is not easy at all.
I am a computer engineer working with telecommunication engineers who have no previous background in programming, and I suffer when discussing such subjects with them.
Excellent overview of Unicode. And a great site.
One correction: “8-bit clean” does not mean the encoding only uses 7 bits. It’s a property of the system handling the bytes, and it means the system can transparently handle 8 bit bytes (hence the encoding is ALLOWED to use 8-bit encodings). So UTF-7 is designed for systems which are not 8-bit clean.
Also you haven’t described UTF-16, which is not the same as UCS-2 – it is a superset, able to encode the entire repertoire, not just the first 65536 characters as UCS-2 can.
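The difference between UCS-2 and UTF-16 shows up as soon as a character sits above U+FFFF. The following is a small sketch of the surrogate-pair arithmetic UTF-16 uses for those characters; `utf16_code_units` is a hypothetical helper written for this example, and it assumes its input is a single character that is not itself a surrogate.

```python
def utf16_code_units(ch: str) -> list:
    """Return the 16-bit code units UTF-16 uses for one character.

    Hypothetical helper for illustration. Characters up to U+FFFF take
    one unit (the same as UCS-2); anything higher is split into a
    high/low surrogate pair, which UCS-2 cannot express.
    """
    cp = ord(ch)
    if cp < 0x10000:
        return [cp]
    offset = cp - 0x10000                 # 20-bit value to split in two
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return [high, low]

print([hex(u) for u in utf16_code_units("A")])           # one unit
print([hex(u) for u in utf16_code_units("\U0001F600")])  # two units: 0xd83d, 0xde00

# Cross-check against Python's built-in codec.
print("\U0001F600".encode("utf-16-be").hex())  # d83dde00
```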
Great article. Very well explained.
There is a little typo here:
“Why does XML store data in UCS-8 instead of UCS-2? Is space or processing power more important when reading XML documents?”
I think you meant UTF-8 instead of UCS-8.
@Nihal: Thank you! Glad you liked it.
@Matt: Awesome, thanks for the correction – I was definitely off in my understanding. I wonder if there is a name for data that has the highest bit cleared on every byte? (I think UTF-16 would be a good addition to the discussion also).
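Whatever the property ends up being called, it is easy to test for. The sketch below checks whether every byte has its highest bit cleared, and compares UTF-8 with UTF-7 on the same text. `is_seven_bit` is a hypothetical helper name; the `utf-7` codec is part of Python's standard library.

```python
def is_seven_bit(data: bytes) -> bool:
    """True if every byte is below 0x80, i.e. the highest bit is never set."""
    return all(byte < 0x80 for byte in data)

text = "caf\u00e9"              # "café": one character outside ASCII

utf8 = text.encode("utf-8")     # the non-ASCII character uses bytes >= 0x80
utf7 = text.encode("utf-7")     # the same text, using only 7-bit bytes

print(is_seven_bit(utf8))  # False
print(is_seven_bit(utf7))  # True
```

So UTF-8 output needs an 8-bit clean channel once the text leaves the ASCII range, while UTF-7 output survives a channel that strips the top bit. In Python 3.7 and later, `bytes.isascii()` performs the same check.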
@Alonso: Thanks for the correction, I’ll change that now.
Short and handy. Thanks!
A nice kick-off to get a better understanding
Very good article - the most clear article I have encountered on the web.
Keep up the good work Kalid!
Two things I didn’t understand:
@Kai: Thanks, glad you liked it.
@Avi: Thank you! Great questions
From a purist point of view, it’d be nice to have every symbol have exactly 1 number, but from a practical standpoint the system needs to be backwards compatible. So Latin characters appear where they are today, in the ASCII range (0 to 127), and also in another “proper” location defined by the Unicode standard.
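A short sketch of both halves of that answer. The first part shows the backwards compatibility: an ASCII letter keeps its number in Unicode and its single byte in UTF-8. The second part uses the fullwidth forms as one example of a Latin letter with a second code point; that choice is an assumption made for illustration, since the reply does not say which "other location" it has in mind.

```python
import unicodedata

ascii_a = "A"
fullwidth_a = "\uFF21"   # assumed example of a second code point for "A"

# Backwards compatibility: same number, same single byte as in ASCII.
print(ord(ascii_a))                                          # 65
print(ascii_a.encode("ascii") == ascii_a.encode("utf-8"))    # True

# A second code point for what is, to a reader, the same letter.
print(hex(ord(fullwidth_a)), unicodedata.name(fullwidth_a))

# Compatibility normalization folds the second form back onto the ASCII one.
print(unicodedata.normalize("NFKC", fullwidth_a) == ascii_a)  # True
```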
http://betterexplained.com/articles/understanding-big-and-little-endian-byte-order/
(Scroll down for the part about Unicode).