unicode - How can I handle mal-encoded characters with Python 2?


The HTML file I am fetching contains characters that are not supported by the encoding specified in its HTML header.

I found that the following characters are not supported by the Shift_JIS encoding the page uses, although a browser displays them correctly:

  • ∑ N-ary summation (U+2211)
  • ゚ Halfwidth katakana semi-voiced sound mark (U+FF9F)
  • Д Cyrillic capital letter De (U+0414)

When I try to read the HTML file and decode it, I get a UnicodeDecodeError.

    import urllib2

    url = 'http://matsucon.net/material/dic/kao09.html'
    response = urllib2.urlopen(url)
    response.read().decode('shift_jis_2004')

Is there any way to process HTML that contains mal-encoded characters without getting an error?

Try this:

    response.read().decode('shift_jis_2004', 'ignore')
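
To see what the error handler changes, here is a minimal sketch that needs no network access. The sample bytes and the helper name decode_with are made up for illustration; the single 0xFF byte only stands in for whatever bytes the real page contains that the codec rejects. It compares the default 'strict' handler with 'ignore' and 'replace':

```python
from __future__ import print_function

# Hypothetical sample: ASCII text with one byte (0xFF) that the
# shift_jis_2004 codec cannot decode. The real page's bad bytes may differ.
raw = b'kao\xffmoji'


def decode_with(handler):
    """Decode the sample bytes using the given codec error handler."""
    return raw.decode('shift_jis_2004', handler)


# 'strict' is the default handler and raises on the first bad byte.
try:
    decode_with('strict')
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print('strict failed at byte offset', exc.start)

ignored = decode_with('ignore')    # undecodable bytes are dropped silently
replaced = decode_with('replace')  # undecodable bytes become U+FFFD

print(repr(ignored))
print(repr(replaced))
```

Note that 'ignore' discards data without leaving a trace, so 'replace' may be the better choice if you want to see where the page was damaged.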
