python - What is the difference between a UTF-8 literal and a Unicode code point?


I came across a website that shows a Unicode table.

When I print the word 'ספר':

>>> x = 'ספר'
>>> x
'\xd7\xa1\xd7\xa4\xd7\xa8'

I get the characters '\xd7\xa1\xd7\xa4\xd7\xa8'.

I think Python encodes the word 'ספר' as UTF-8 because that is the default, right?

But when I run this code:

>>> x = u'ספר'
>>> x
u'\u05e1\u05e4\u05e8'

I get u'\u05e1\u05e4\u05e8', which are the Unicode code points, right?

How do I convert the UTF-8 literal to Unicode code points?

In the first sample you created a byte string (type str). Your terminal determined the encoding (UTF-8 in this case).

In the second sample, you created a Unicode string (type unicode). Python auto-detected the encoding your terminal uses (from sys.stdin.encoding) and decoded the bytes from UTF-8 to Unicode code points.

You can make the same conversion from a byte string to a Unicode string by decoding:
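A rough sketch of what that auto-detection amounts to. The helper name decode_terminal_bytes is hypothetical, and the fallback to UTF-8 is an assumption made so the snippet still works when input is piped and no terminal encoding is reported:

```python
import sys

def decode_terminal_bytes(raw, encoding=None):
    """Hypothetical helper: decode raw bytes using the terminal's encoding."""
    # sys.stdin.encoding can be None (for example when input is piped),
    # so fall back to UTF-8 as an assumed default.
    encoding = encoding or getattr(sys.stdin, 'encoding', None) or 'utf-8'
    return raw.decode(encoding)

# The encoding Python detected for the terminal, if any.
terminal_encoding = getattr(sys.stdin, 'encoding', None)
print(terminal_encoding)

# With a UTF-8 terminal, the typed bytes decode to the three code points.
decoded = decode_terminal_bytes(b'\xd7\xa1\xd7\xa4\xd7\xa8', encoding='utf-8')
print(decoded == u'\u05e1\u05e4\u05e8')
```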

unicode_x = bytestring_x.decode('utf8') 
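As a minimal sketch of that decoding step, using the bytes from the question. The b'' prefix is only there so the snippet also runs on Python 3, and code_points is an illustrative name:

```python
# The UTF-8 bytes the terminal produced for the word in the question.
bytestring_x = b'\xd7\xa1\xd7\xa4\xd7\xa8'

# Decoding interprets the bytes as UTF-8 and yields Unicode code points.
unicode_x = bytestring_x.decode('utf8')

# Six bytes become three code points, because each of these letters
# takes two bytes in UTF-8.
code_points = [hex(ord(ch)) for ch in unicode_x]
print(len(bytestring_x))
print(len(unicode_x))
print(code_points)
```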

To go in the other direction, you need to encode:

bytestring_x = unicode_x.encode('utf8') 
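A short sketch of the encoding direction, again with the word from the question. The UTF-16 line is an added illustration (not part of the original answer) showing that the same code points produce different bytes under a different encoding:

```python
# The three code points from the question, written with escapes.
unicode_x = u'\u05e1\u05e4\u05e8'

# Encoding turns code points into bytes in a specific encoding.
bytestring_x = unicode_x.encode('utf8')
print(bytestring_x == b'\xd7\xa1\xd7\xa4\xd7\xa8')

# Decoding those bytes again gives back the original Unicode string.
round_trip = bytestring_x.decode('utf8')
print(round_trip == unicode_x)

# The same code points encoded as little-endian UTF-16 give other bytes.
utf16_bytes = unicode_x.encode('utf-16-le')
print(utf16_bytes == b'\xe1\x05\xe4\x05\xe8\x05')
```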

You specified your literals using the actual UTF-8 bytes for the characters; that works fine in the terminal, but not in Python source code, because Python 2 source code is loaded as ASCII text only. You can change this by setting a source code encoding declaration. This is specified in PEP 263; it has to be on the first or second line of the source file. For example:

# encoding: utf-8 

Or you can stick to \uhhhh and \xhh escape sequences to represent non-ASCII characters.

You may want to read up on the difference between Unicode and encoded (binary) byte strings, and how that relates to Python.
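A small sketch putting both options side by side, assuming the file itself is saved as UTF-8. The variable names are illustrative only:

```python
# encoding: utf-8
# The declaration above only takes effect on line 1 or 2 of the file.

# Option 1: a non-ASCII literal, which relies on the declaration in Python 2.
literal = u'ספר'

# Option 2: pure-ASCII source using \uhhhh escapes for the code points...
escaped = u'\u05e1\u05e4\u05e8'

# ...or \xhh escapes for the individual UTF-8 bytes.
raw_bytes = b'\xd7\xa1\xd7\xa4\xd7\xa8'

print(literal == escaped)
print(escaped.encode('utf8') == raw_bytes)
```

The escaped forms keep the source file plain ASCII, so they behave the same whether or not the declaration is present.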

