python - What is the difference between a UTF-8 literal and a Unicode code point?

- April 15, 2012

I came across a website that shows a Unicode table.

When I print the word 'ספר':

>>> x = 'ספר'
>>> x
'\xd7\xa1\xd7\xa4\xd7\xa8'

I get the characters '\xd7\xa1\xd7\xa4\xd7\xa8'.

I think Python encodes the word 'ספר' as UTF-8 because that's the default, right?

But when I run this code:

>>> x = u'ספר'
>>> x
u'\u05e1\u05e4\u05e8'

I get u'\u05e1\u05e4\u05e8', which are the Unicode code points, right?

How do I convert a UTF-8 literal to Unicode code points?

In your first sample you created a byte string (type str). Your terminal determined the encoding (UTF-8 in this case).

In your second sample, you created a unicode string (type unicode). Python auto-detected the encoding your terminal uses (from sys.stdin.encoding) and decoded the bytes from UTF-8 to Unicode code points.
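To make the difference between the two samples concrete, here is a small sketch that counts what each value holds. It uses the b'...' and u'...' prefixes with escapes so the same lines should run on both Python 2.7 and Python 3 (where the two types are called bytes and str); the variable names are only for illustration.

```python
# The UTF-8 bytes from the first sample and the code points from the second.
raw = b'\xd7\xa1\xd7\xa4\xd7\xa8'   # byte string: encoded data
text = u'\u05e1\u05e4\u05e8'        # unicode string: code points

byte_count = len(raw)          # UTF-8 uses two bytes for each of these letters
code_point_count = len(text)   # one code point per letter

# The numeric value of each code point, as hex strings.
code_points = [hex(ord(ch)) for ch in text]

print(byte_count, code_point_count, code_points)
```

So the same word is six bytes in its UTF-8 form but three code points once decoded.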

You can make the same conversion from byte string to unicode string by decoding:

unicode_x = bytestring_x.decode('utf8')

To go in the other direction, you need to encode:

bytestring_x = unicode_x.encode('utf8')
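As a rough illustration, the round trip with the bytes from the question might look like the sketch below. The latin-1 line is an addition of mine, included only to show why the codec has to match the one the bytes were written with.

```python
# Start from the UTF-8 bytes the terminal produced in the question.
bytestring_x = b'\xd7\xa1\xd7\xa4\xd7\xa8'

# bytes -> code points
unicode_x = bytestring_x.decode('utf8')

# code points -> bytes
encoded_again = unicode_x.encode('utf8')

# Decoding with the wrong codec does not fail here, but it yields
# six unrelated characters instead of three Hebrew letters.
wrong = bytestring_x.decode('latin-1')

print(repr(unicode_x), encoded_again == bytestring_x, len(wrong))
```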

You specified your literals using the actual UTF-8 bytes for your characters. That works fine in your terminal, but not in Python source code: by default, Python 2 source code is loaded as ASCII text only. You can change this by adding a source code encoding declaration. This is specified in PEP 263; the declaration has to be on the first or second line of your source file. For example:

# encoding: utf-8

Or you can stick to \uhhhh and \xhh escape sequences to represent non-ASCII characters.
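A source file written only with escapes stays pure ASCII, so it needs no encoding declaration. The sketch below spells the word from the question that way and uses the standard unicodedata module to confirm which letters the escapes stand for; the variable names are illustrative.

```python
import unicodedata

# ASCII-only source: \uhhhh escapes in a unicode literal,
# \xhh escapes in a byte string literal.
word = u'\u05e1\u05e4\u05e8'
utf8_bytes = b'\xd7\xa1\xd7\xa4\xd7\xa8'

# Look up the official Unicode name of each code point.
names = [unicodedata.name(ch) for ch in word]

# Both spellings describe the same text.
same_text = utf8_bytes.decode('utf-8') == word

print(names, same_text)
```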

You may want to read up on the difference between Unicode and encoded (binary) byte strings, and how that relates to Python:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
