Unicode and You

Cool, thanks for the first answer, Kalid. I’ll look over the article that you mentioned.

Avi

I remember my problems with Unicode. There wasn’t any essay help for reading about this!

Thank you for the information…

Hi everybody,
Very good article, but I’d like to ask where the ANSI encoding fits into this scheme. Is it 1 byte per character, like ASCII? How does one know whether a file is ANSI? Thanks

@Diane: Hrm, not sure about notepad – you might try another editor like Notepad++ (http://notepad-plus-plus.org/)

I cannot open my Notepad file once I saved it as Unicode big endian. Can you please explain how to do this?

Good stuff, but just a heads up: the Notepad thing was fixed in Windows Vista and 7

Thanks Alex – this article was written when using Win XP ;).

I had a 20 MB logfile created with log4net on a Windows server, and I ran a simple filter rule on it (from within PowerShell):
PS E:>type Web.log | findstr /v "community edition" > web1.log
I ended up with a file with fewer lines (nicely stripped out), but larger in size.
I was puzzled until I found out that the original file was written as ASCII and the PowerShell environment changed it to UCS-2 Little Endian.
Now I also know why it is bigger and what I need to watch out for when using similar files in what you would expect to be the same environments.
With batch processing and large files this could be a nasty surprise.
Thanks for enlightening me!
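
For anyone curious about the size jump, here is a minimal Python sketch (not Paul's actual setup; the sample log line is made up for illustration) that compares the byte length of the same ASCII-only text in ASCII and in UTF-16 little endian, which editors often label UCS-2 Little Endian:

```python
# Hypothetical log line; any ASCII-only text behaves the same way.
line = "2011-01-01 12:00:00 INFO request handled\r\n"

ascii_bytes = line.encode("ascii")      # 1 byte per character
utf16_bytes = line.encode("utf-16-le")  # 2 bytes per character for this text

print(len(ascii_bytes), len(utf16_bytes))
print(len(utf16_bytes) == 2 * len(ascii_bytes))
```

A file saved this way typically also starts with a 2-byte byte order mark (FF FE), so a filtered copy can end up larger than the original even though it has fewer lines.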

Awesome, glad it helped Paul!

Man… u are gooood… I really appreciate the work you are doing…
One question though: suppose my code point turns out to be 223 in decimal representation. How will this be represented in UTF-8?

Good question. So, 223 in decimal is 0x00df in hex, or 11011111 in binary.

Unfortunately, this is too large to store in a single UTF-8 byte (which only covers 0x00 to 0x7F), but it will fit in 2 bytes. The format for UTF-8 with 2 bytes is:

110xxxxx 10xxxxxx

The x’s are where our number needs to fit. We can fit 6 bits into the last byte and 5 bits into the first, padding the first group with leading zeros. So

11011111

becomes

[00011] [011111]

and then we fit this into our UTF-8 bytes:

11000011 10011111
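
The same steps can be checked with a short Python sketch. The function name `encode_two_byte_utf8` is made up for illustration, and it only handles code points that need exactly 2 bytes (0x80 through 0x7FF); it is not a general UTF-8 encoder:

```python
def encode_two_byte_utf8(code_point: int) -> bytes:
    """Encode a code point in the range 0x80..0x7FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("this sketch only handles the 2-byte range")
    first = 0b11000000 | (code_point >> 6)         # 110xxxxx: upper 5 bits
    second = 0b10000000 | (code_point & 0b111111)  # 10xxxxxx: lower 6 bits
    return bytes([first, second])


encoded = encode_two_byte_utf8(223)
print(" ".join(f"{b:08b}" for b in encoded))  # 11000011 10011111
print(encoded == chr(223).encode("utf-8"))    # True
```

The last line compares the hand-built bytes against Python's built-in encoder, so you can try other code points in the 2-byte range the same way.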

Hope that helps!

Awesome… :)
This was what I was expecting, thank you for the verification.

successive understanding achieved!!!

hi,
I don’t know how elementary a problem this is, but:
lots of Unicode has E as a part of the code… what on earth does that mean?
I am trying to get a list of handy Unicode codes I can have without having to go through the character map, namely the ‘lemniscate’ symbol, which is listed as U+221E.
There are countless others, and I’m sure it is similar to the F-coding you list here for different symbols.
I don’t know how old this forum might be, but if you can respond… it is very much appreciated.
thanks