Great article. Very well explained.
There is a little typo here:
“Why does XML store data in UCS-8 instead of UCS-2? Is space or processing power more important when reading XML documents?”
I think you meant UTF-8 instead of UCS-8.
@Nihal: Thank you! Glad you liked it.
@Matt: Awesome, thanks for the correction – I was definitely off in my understanding. I wonder if there is a name for data that has the highest bit cleared on every byte? (I think UTF-16 would be a good addition to the discussion also).
@Alonso: Thanks for the correction, I’ll change that now.
Short and handy. Thanks!
A nice kick-off to get a better understanding
Very good article - the clearest article I have encountered on the web.
Keep up the good work Kalid!
Two things I didn’t understand:
@Kai: Thanks, glad you liked it.
@Avi: Thank you! Great questions
From a purist point of view, it’d be nice for every symbol to have exactly one number, but from a practical standpoint the system needs to be backwards compatible. So Latin characters appear where they are today, in the ASCII range under 127, and also in another “proper” location defined by the Unicode standard.
http://betterexplained.com/articles/understanding-big-and-little-endian-byte-order/
(Scroll down for the part about Unicode).
Cool, thanks for the first answer, Kalid. I’ll look over the article that you mentioned.
Avi
Thank you for the information…
Hi everybody,
very good article, but I’d like to ask where the ANSI encoding fits in this whole scheme. Is it stored in 1 byte, like ASCII? How does one get to know whether a file is ANSI? Thanks
@Diane: Hrm, not sure about notepad – you might try another editor like Notepad++ (http://notepad-plus-plus.org/)
I cannot open my Notepad file once I saved it as Unicode big endian. Can you please explain how to do this?
Good stuff, but just a heads up: the Notepad thing was fixed in Windows Vista and 7
Thanks Alex – this article was written when using Win XP ;).
I had a 20 MB logfile created with log4net on a Windows server, and after I ran a simple filter rule (from within PowerShell):
PS E:>type Web.log | findstr /v "community edition" > web1.log
I had a file with fewer lines (nicely stripped out), but larger in size.
I was puzzled until I found out that the original file was written as ASCII and the PowerShell environment changed it to UCS-2 Little Endian, which uses two bytes per character.
Now I also know why it is bigger and what I need to watch out for when using similar files in what you would expect to be the same environments.
With batch processing and large files this could be a nasty surprise.
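For anyone who wants to see the size difference for themselves, here is a minimal Python sketch. The sample log line is made up for illustration, and "utf-16-le" is used as a stand-in for what the editor labels UCS-2 Little Endian (for ASCII-range text the bytes are the same, apart from any byte order mark at the start of the file).

```python
# Hypothetical sample line; any ASCII-only log text behaves the same way.
line = "2011-01-01 12:00:00 INFO request handled\r\n"

ascii_bytes = line.encode("ascii")      # 1 byte per character
utf16_bytes = line.encode("utf-16-le")  # 2 bytes per character for ASCII-range text

print(len(ascii_bytes), len(utf16_bytes))

# Every ASCII character gains a zero byte, so the size doubles.
assert len(utf16_bytes) == 2 * len(ascii_bytes)
```

So a filtered file can lose lines and still come out bigger, simply because each remaining character now takes two bytes.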
Thanks for enlightening me!
Awesome, glad it helped Paul!
Man… u are gooood… I really appreciate the work you are doing…
one question though, suppose my code point turns out to be 223 in decimal representation, so how will this be represented in UTF-8??
Good question. So, 223 in decimal is 0x00df in hex, or 11011111 in binary.
Unfortunately, this is too large to store in a single UTF-8 byte (which goes from 0 to 0x7f), but it will fit in 2 bytes. The format for UTF-8 with 2 bytes is:
110xxxxx 10xxxxxx
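The x’s are where our number needs to fit. We can fit 6 bits into the last byte and 5 bits into the first, for 11 bits in total, so we pad our 8-bit number with leading zeros to 11 bits (00011011111) and split it into a group of 5 and a group of 6. So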
11011111
becomes
[00011] [011111]
and then we fit this into our UTF-8 bytes:
11000011 10011111
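If you want to double-check the bit shuffling, the steps above can be sketched in Python. The function name encode_two_byte_utf8 is just something made up for this example, and it only handles the 2-byte case (code points 0x80 through 0x7FF); the last line compares the result against Python's built-in encoder.

```python
def encode_two_byte_utf8(code_point: int) -> bytes:
    """Encode a code point in the range 0x80..0x7FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("two-byte UTF-8 covers only 0x80 through 0x7FF")
    first = 0b11000000 | (code_point >> 6)         # 110xxxxx: top 5 bits
    second = 0b10000000 | (code_point & 0b111111)  # 10xxxxxx: low 6 bits
    return bytes([first, second])


encoded = encode_two_byte_utf8(223)
print(" ".join(f"{b:08b}" for b in encoded))  # 11000011 10011111

# Cross-check against Python's own UTF-8 encoder.
assert encoded == chr(223).encode("utf-8")
```

In hex the two bytes are C3 9F, which is what you would see for code point 223 in a hex editor.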
Hope that helps!
Awesome…
This was what I was expecting, thank you for the verification.
successive understanding achieved!!!
hi,
i don’t know how elementary of a problem this is but:
lots of unicode has E as a part of the code… what on earth does that mean?
i am trying to get a list of handy unicode script i can have without having to go through the character map, namely the ‘lemniscate’ key which is listed as U+221E
there are countless others and i’m sure it is similar to the F-coding you list here for different symbols
i don’t know how old this forum might be, but if you can respond… it is very much appreciated
thanks