python - What is the difference between a UTF-8 literal and a Unicode code point?
I came across a website that shows a Unicode table.
When I print the word 'ספר':
>>> x = 'ספר'
>>> x
'\xd7\xa1\xd7\xa4\xd7\xa8'
I get the characters '\xd7\xa1\xd7\xa4\xd7\xa8'.
I think Python encodes the word 'ספר' as UTF-8 because that is the default, right?
But when I run this code:
>>> x = u'ספר'
>>> x
u'\u05e1\u05e4\u05e8'
I get u'\u05e1\u05e4\u05e8', which are Unicode code points, right?
How do I convert a UTF-8 literal to Unicode code points?
In the first sample you created a byte string (type str). Your terminal determined the encoding (UTF-8 in this case).
In the second sample, you created a Unicode string (type unicode). Python auto-detected the encoding your terminal uses (from sys.stdin.encoding) and decoded the bytes from UTF-8 to Unicode code points.
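To make the difference between the two samples concrete, here is a minimal sketch that compares them by length. It is written so that it runs on Python 3 (the b'' and u'' prefixes are also accepted by Python 2.7); the variable names are only illustrative.

```python
# The two values from the samples, spelled with escapes so the file stays ASCII-only.
byte_string = b'\xd7\xa1\xd7\xa4\xd7\xa8'   # UTF-8 encoded bytes (first sample)
text = u'\u05e1\u05e4\u05e8'                # Unicode code points (second sample)

# Each of these Hebrew letters takes two bytes in UTF-8.
byte_count = len(byte_string)

# The Unicode string holds one code point per letter.
char_count = len(text)

# ord() gives the numeric code point behind each \uhhhh escape.
code_points = [hex(ord(ch)) for ch in text]

print(byte_count)
print(char_count)
print(code_points)
```

The byte string is twice as long as the Unicode string because it stores the encoded form, not the characters themselves.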
You can make the same conversion from a byte string to a Unicode string by decoding:
unicode_x = bytestring_x.decode('utf8')
To go in the other direction, you need to encode:
bytestring_x = unicode_x.encode('utf8')
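As a rough illustration of both directions, the sketch below round-trips the value from the question. The helper names to_unicode and to_bytes are hypothetical wrappers introduced here, not part of Python; the code is meant to run on Python 3 and should also work on Python 2.7.

```python
def to_unicode(bytestring_x, encoding='utf8'):
    """Decode encoded bytes into a Unicode string."""
    return bytestring_x.decode(encoding)


def to_bytes(unicode_x, encoding='utf8'):
    """Encode a Unicode string into bytes."""
    return unicode_x.encode(encoding)


bytestring_x = b'\xd7\xa1\xd7\xa4\xd7\xa8'

unicode_x = to_unicode(bytestring_x)   # bytes -> code points
round_trip = to_bytes(unicode_x)       # code points -> bytes again

# Decoding with the wrong codec fails: these bytes are not valid ASCII.
try:
    bytestring_x.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

print(round_trip == bytestring_x)
print(ascii_failed)
```

The codec you pass to decode has to match the one that produced the bytes, which is why the ASCII attempt raises UnicodeDecodeError.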
You specified your literals using the actual UTF-8 bytes for your characters. That works fine in your terminal, but not in Python source code: Python 2 source code is loaded as ASCII text only. You can change this by setting a source code encoding declaration. This is specified in PEP 263; it has to be on the first or second line of your source file. For example:
# encoding: utf-8
Or you can stick to \uhhhh and \xhh escape sequences to represent non-ASCII characters.
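If you go the escape-sequence route, a small sketch like the following can confirm which characters the escapes stand for and can produce the escaped form from an existing string. It uses the standard unicodedata module and the unicode_escape codec, and assumes Python 3 (it should behave the same on Python 2.7).

```python
import unicodedata

# ASCII-only source: the word is written purely with \uhhhh escapes.
text = u'\u05e1\u05e4\u05e8'

# Look up the official Unicode name of each code point.
names = [unicodedata.name(ch) for ch in text]

# The unicode_escape codec produces the escaped spelling as ASCII bytes,
# which you can paste into a source file that has no encoding declaration.
escaped = text.encode('unicode_escape')

print(names)
print(escaped)
```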
You may want to read up on the difference between Unicode and encoded (binary) byte strings, and how that relates to Python.