Patches item #1862810, was opened at 2008-01-03 01:41
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=603140&aid=1862810...

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Nicolas Dumazet (nicdumz)
Assigned to: Nobody/Anonymous (nobody)
Summary: wikipedia.py:html2unicode : html chars from #128 to #159

Initial Comment: Code points #128 to #159 hold no printable characters in either ISO-8859-1 or Unicode (both reserve the range for C1 control codes), hence HTML entities numbered in this range *are* illegal.
In practice, though, a lot of websites do use these entities, our browsers all render them now, and some of them can be found on our wikis.
I found this while working on a page containing #155 ( › ): html2unicode would convert it to unichr(155), which is an invisible control character and far less suitable than the original. (Yes, that really is the result you get on a wiki page.)
#128, which produces €, was likewise being converted to unichr(128), another control character.
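
For illustration only, here is a minimal sketch of the kind of remapping described above. It is not the attached patch and not the actual html2unicode code: the names entity_to_char and replace_numeric_entities are hypothetical, it is written for Python 3 (chr instead of unichr), and it assumes that values in the #128 to #159 range are meant as Windows-1252 bytes, which is how browsers treat them.

```python
import re


def entity_to_char(codepoint):
    """Return the character for a numeric HTML entity value.

    Assumption: values 128-159 are intended as Windows-1252 bytes
    rather than as the C1 control codes Unicode assigns there.
    """
    if 128 <= codepoint <= 159:
        try:
            return bytes([codepoint]).decode("cp1252")
        except UnicodeDecodeError:
            # A few bytes in this range are undefined in Python's cp1252
            # codec; fall back to the plain code point for those.
            return chr(codepoint)
    return chr(codepoint)


def replace_numeric_entities(text):
    """Replace decimal entities such as '&#155;' in text (hypothetical helper)."""
    return re.sub(r"&#(\d+);", lambda m: entity_to_char(int(m.group(1))), text)
```

With this mapping, the entity numbered 155 becomes › (U+203A) and the one numbered 128 becomes € (U+20AC), while values outside the range are passed to chr() unchanged.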
Cheers,
Nicolas Dumazet.
----------------------------------------------------------------------
pywikipedia-l@lists.wikimedia.org