unicode - How can I handle mal-encoded characters with Python 2?
The HTML file I am fetching contains characters that are not supported by the encoding specified in the HTML header.
I found that the following ones are not supported by the shift_jis encoding the page uses, although a browser displays them correctly:
- ∑ N-ARY SUMMATION (U+2211)
- ゚ HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK (U+FF9F)
- Д CYRILLIC CAPITAL LETTER DE (U+0414)
When I try to read the HTML file and decode it for processing, I get a UnicodeDecodeError.
import urllib2

url = 'http://matsucon.net/material/dic/kao09.html'
response = urllib2.urlopen(url)
response.read().decode('shift_jis_2004')
Is there any way to process HTML that has mal-encoded characters without getting an error?
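Try this:
response.read().decode('shift_jis_2004', 'ignore')
The 'ignore' error handler drops any bytes that cannot be decoded instead of raising UnicodeDecodeError. It is passed positionally here because str.decode only accepts keyword arguments from Python 2.7 onward.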
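If silently losing characters is a concern, here is a minimal sketch of the available error handlers. It uses a small made-up byte string rather than the real page, and decode_lenient is a hypothetical helper name, not part of any library. It shows how to find the offending bytes from the exception, and how 'replace' differs from 'ignore' by leaving a U+FFFD marker where the bad byte was.

```python
# -*- coding: utf-8 -*-
# Sketch only: the sample bytes below are an assumption for illustration,
# not content taken from the page in the question.

def decode_lenient(raw, encoding='shift_jis_2004', errors='replace'):
    """Decode raw bytes without raising on undecodable input (hypothetical helper)."""
    return raw.decode(encoding, errors)

# Two hiragana characters encoded correctly...
good = u'\u3042\u3044'
raw = good.encode('shift_jis_2004')
# ...with one byte inserted that is not valid in this encoding.
broken = raw[:2] + b'\xff' + raw[2:]

# Strict decoding (the default) raises, and the exception records
# exactly which bytes could not be decoded.
try:
    broken.decode('shift_jis_2004')
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    bad_bytes = exc.object[exc.start:exc.end]

# 'ignore' drops the bad byte; 'replace' substitutes U+FFFD for it.
ignored = decode_lenient(broken, errors='ignore')
replaced = decode_lenient(broken, errors='replace')
```

With 'replace' you can later search the decoded text for u'\ufffd' to see where data was lost, which 'ignore' does not allow.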