Unicode and You

Anonymous_User · June 23, 2009, 1:31pm

Cool, thanks for the first answer, Kalid. I’ll look over the article that you mentioned.

Avi

larah · August 17, 2010, 10:11am

I remember my problems with Unicode. There wasn’t any help to read about this!

anitha · August 18, 2010, 8:57am

Thank you for the information…

jivko · October 11, 2010, 3:57pm

Hi everybody,
very good article, but I’d like to ask where the ANSI encoding fits into this scheme. Is it 1 byte per character, like ASCII? How does one know whether text is ANSI? Thanks

kalid · February 12, 2012, 3:47am

@Diane: Hrm, not sure about notepad – you might try another editor like Notepad++ (http://notepad-plus-plus.org/)

dianehalstead · February 10, 2012, 6:16pm

I cannot open my Notepad file since I saved it as Unicode big endian. Can you please explain how to do this?

alex · December 22, 2012, 6:20am

Good stuff, but just a heads up: the Notepad thing was fixed in Windows Vista and 7

kalid · June 13, 2013, 8:14pm

Thanks Alex – this article was written when using Win XP ;).

paulfijma · October 25, 2013, 1:34pm

I had a 20 MB log file created with log4net on a Windows server, and after I applied a simple filter (from within PowerShell):
PS E:> type Web.log | findstr /v "community edition" > web1.log
I had a file with fewer lines (nicely stripped out), but larger in size.
I was puzzled until I found out that the original file was written as ASCII and the PowerShell environment changed it to UCS-2 Little Endian.
Now I also know why it is bigger and what I need to watch out for when using similar files in what you would expect to be the same environments.
With batch processing and large files this could be a nasty surprise.
Thanks for enlightening me!
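
For anyone who wants to see the size difference for themselves, here is a minimal Python sketch. The sample log line is made up for illustration, and "utf-16-le" is used as a stand-in for what the editor reports as UCS-2 Little Endian (for plain ASCII text the two produce the same bytes). It only compares byte counts and does not reproduce the PowerShell pipeline itself.

```python
# Hypothetical log content: 1000 identical ASCII-only lines.
text = "INFO request served\r\n" * 1000

# One byte per character when the text is stored as ASCII.
ascii_bytes = text.encode("ascii")

# Two bytes per character when the same text is stored as UTF-16 LE.
utf16_bytes = text.encode("utf-16-le")

print(len(ascii_bytes))   # size in bytes as ASCII
print(len(utf16_bytes))   # size in bytes as UTF-16 LE
```

For ASCII-only text the second number is exactly double the first, which would explain a filtered file ending up larger than the original.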

kalid · October 25, 2013, 5:24pm

Awesome, glad it helped Paul!

pulkitbhardwaj · December 11, 2014, 7:36pm

Man… u are gooood… I really appreciate the work you are doing…
One question though: suppose my code point turns out to be 223 in decimal. How would this be represented in UTF-8?

kalid · December 11, 2014, 7:50pm

Good question. So, 223 in decimal is 0x00df in hex, or 11011111 in binary.

Unfortunately, this is too large to store in a single UTF-8 byte (which covers 0x00 to 0x7f), but it will fit in 2 bytes. The format for UTF-8 with 2 bytes is:

110xxxxx 10xxxxxx

The x’s are where our number needs to fit. We can fit 6 bits into the last byte and 5 bits into the first, padding the first group with leading zeros. So

11011111

becomes

[00011] [011111]

and then we fit this into our UTF8 bytes:

11000011 10011111
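
If you want to check the bit-shuffling yourself, here is a small Python sketch of the same steps. The function name `utf8_two_byte` is just something made up for this example. It only handles the 2-byte case (code points 0x80 to 0x7FF) and compares the result against Python's built-in encoder.

```python
def utf8_two_byte(code_point):
    """Encode a code point in the 2-byte UTF-8 range by hand."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point does not use the 2-byte form")
    # First byte: marker 110 followed by the top 5 bits.
    first = 0b11000000 | (code_point >> 6)
    # Second byte: marker 10 followed by the low 6 bits.
    second = 0b10000000 | (code_point & 0b00111111)
    return bytes([first, second])

encoded = utf8_two_byte(223)
print(" ".join(f"{b:08b}" for b in encoded))  # 11000011 10011111

# Cross-check against the built-in UTF-8 encoder.
assert encoded == chr(223).encode("utf-8")
```

In hex those two bytes are c3 9f.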

Hope that helps!

pulkitbhardwaj · December 11, 2014, 7:53pm

Awesome…
This was what I was expecting, thank you for the verification.

abhijeetkharat · January 18, 2015, 7:26am

successive understanding achieved!!!

bryan · August 16, 2015, 6:08am

hi,
i don’t know how elementary of a problem this is but:
lots of unicode has E as a part of the code… what on earth does that mean?
i am trying to get a list of handy unicode codes i can have without having to go through the character map, namely the ‘lemniscate’ symbol which is listed as U+221E
there are countless others and i’m sure it is similar to the F-coding you list here for different symbols
i don’t know how old this forum might be, but if you can respond… it is very much appreciated
thanks