Unicode and You

Great article. Very well explained.

There is a little typo here:

“Why does XML store data in UCS-8 instead of UCS-2? Is space or processing power more important when reading XML documents?”

I think you meant UTF-8 instead of UCS-8.

@Nihal: Thank you! Glad you liked it.

@Matt: Awesome, thanks for the correction – I was definitely off in my understanding. I wonder if there is a name for data that has the highest bit cleared on every byte? (I think UTF-16 would be a good addition to the discussion also).
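
Side note: data where every byte has the high bit cleared is just 7-bit ASCII – often called “7-bit clean”. A quick Python check (the helper name is my own):

    # True only when every byte stays below 0x80, i.e. the high bit is cleared
    def is_seven_bit_clean(data: bytes) -> bool:
        return all(b < 0x80 for b in data)

    print(is_seven_bit_clean(b"avi"))               # True – plain ASCII
    print(is_seven_bit_clean("ß".encode("utf-8")))  # False – high bits set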

@Alonso: Thanks for the correction, I’ll change that now.

Short and handy. Thanks!

A nice kick-off to get a better understanding :slight_smile:

Very good article – the clearest I have encountered on the web.
Keep up the good work Kalid!
Two things I didn’t understand:

  1. “Purists probably didn’t like this, because the full Latin character sets were defined elsewhere, and now one letter had 2 codepoints”. “Defined elsewhere”, “2 codepoints” – I didn’t get it.
  2. Regarding the first byte, which indicates the number of bytes in the sequence: I wrote “avi” and saved it as UTF-8. I looked at it in a hex editor and saw that the first byte was 1111…
    The bytes there were: EF BB BF 61 76 69, so it amounts to 6 bytes – can somebody explain this to me?

@Kai: Thanks, glad you liked it.

@Avi: Thank you! Great questions

  1. Unicode gives a number (called a code point) to every symbol, so “a”, “b”, and “c” each have their own number. Sometimes the exact same symbol (like “a”) will have two different numbers that represent it, to stay compatible with old formats like ASCII.

From a purist point of view, it’d be nice to have every symbol have exactly 1 number, but from a practical standpoint the system needs to be backwards compatible. So Latin characters appear where they are today, in the ASCII range under 127, and also in another “proper” location defined by the Unicode standard.
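
To make the “one letter, two code points” idea concrete, here’s a minimal Python sketch (my example uses the fullwidth compatibility duplicates, not anything from the article):

    import unicodedata

    # “a” has code point 97 (0x61), inside the old ASCII range
    print(ord("a"))  # 97

    # The same letter also exists at a second code point, kept so older
    # East Asian encodings can round-trip: U+FF41 FULLWIDTH LATIN SMALL LETTER A
    wide_a = unicodedata.lookup("FULLWIDTH LATIN SMALL LETTER A")
    print(hex(ord(wide_a)))                       # 0xff41
    print(unicodedata.normalize("NFKC", wide_a))  # a – folds back to code point 0x61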

  2. Great question. You’re right – in UTF-8, “avi” should only be 3 bytes (61 76 69). The preceding ones (EF BB BF) are the BOM (byte order mark). For UTF-16 it defines whether the data is big or little endian; in UTF-8 it simply marks the file as Unicode text. You can read more about it here:

http://betterexplained.com/articles/understanding-big-and-little-endian-byte-order/

(Scroll down for the part about Unicode).
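
If you want to reproduce exactly the bytes Avi saw, here’s a minimal Python sketch (“utf-8-sig” is Python’s codec for UTF-8 with a BOM prepended, which is what Notepad writes):

    # Plain UTF-8: just the three character bytes
    print("avi".encode("utf-8").hex(" "))      # 61 76 69

    # UTF-8 with a BOM up front, as Notepad saves it
    print("avi".encode("utf-8-sig").hex(" "))  # ef bb bf 61 76 69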

Cool, thanks for the first answer, Kalid. I’ll look over the article that you mentioned.

Avi

I remember my problems with Unicode. There wasn’t any easy help for reading about this!

Thank you for the information…

Hi everybody,
very good article, but I’d like to ask: where does the ANSI encoding fit into this whole scheme? Is it one byte like ASCII? How does one know whether it’s ANSI? Thanks

I cannot open my file in Notepad once I saved it as Unicode big endian. Can you please explain how to do this?

@Diane: Hrm, not sure about Notepad – you might try another editor like Notepad++ (http://notepad-plus-plus.org/)

Good stuff, but just a heads up: the Notepad thing was fixed in Windows Vista and 7.

Thanks Alex – this article was written when using Win XP ;).

I had a 20 MB logfile created with log4net on a Windows server, and after I ran a simple filter rule (from within PowerShell):

    PS E:\> type Web.log | findstr /v "community edition" > web1.log

I got a file with fewer lines (nicely stripped out), but larger in size.
I was puzzled until I found out that the original file was written as ASCII and the PowerShell environment changed it to UCS-2 Little Endian.
Now I also know why it is bigger, and what to watch out for when using similar files in what you would expect are the same environments.
With batch processing and large files this could be a nasty surprise.
Thanks for enlightening me!
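
The doubling makes sense once you count bytes per character – a minimal Python sketch of the effect (illustrative string of my choosing):

    text = "community edition"
    print(len(text.encode("ascii")))      # 17 bytes – one byte per character
    print(len(text.encode("utf-16-le")))  # 34 bytes – two bytes per character

In Windows PowerShell, > redirection writes UTF-16LE (what tools label “UCS-2 LE”) by default, so piping through Out-File -Encoding ascii is one way to keep the original size.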

Awesome, glad it helped Paul!

Man… you are gooood… I really appreciate the work you are doing…
One question though: suppose my code point turns out to be 223 in decimal representation – how will this be represented in UTF-8?

Good question. So, 223 in decimal is 0x00df in hex, or 11011111 in binary.

Unfortunately, this is too large to store in a single UTF-8 byte (which goes from 0 to 7f), but it will fit in 2 bytes. The format for UTF-8 with 2 bytes is:

110xxxxx 10xxxxxx

The x’s are where our number needs to fit: 6 bits go into the last byte and 5 into the first, padding our 8-bit number out to 11 bits. So

11011111

becomes

[00011] [011111]

and then we fit this into our UTF-8 bytes:

11000011 10011111
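
You can verify the bit-shuffling directly in Python:

    # Code point 223 (0xDF) encodes to two UTF-8 bytes
    encoded = chr(223).encode("utf-8")
    print(encoded.hex(" "))                       # c3 9f
    print(" ".join(f"{b:08b}" for b in encoded))  # 11000011 10011111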

Hope that helps!

Awesome… :slight_smile:
This was what I was expecting, thank you for the verification.

Success – understanding achieved!!!

Hi,
I don’t know how elementary of a problem this is, but: lots of Unicode has E as part of the code… what on earth does that mean?
I am trying to get a list of handy Unicode characters I can use without having to go through the character map, namely the ‘lemniscate’ key, which is listed as U+221E.
There are countless others, and I’m sure it is similar to the F-coding you list here for different symbols.
I don’t know how old this forum might be, but if you can respond… it is very much appreciated.
Thanks