[Pywikipedia-l] [ pywikipediabot-Patches-1862810 ] wikipedia.py:html2unicode : html chars from #128 to #159
SourceForge.net
noreply at sourceforge.net
Thu Jan 3 00:41:12 UTC 2008
Patches item #1862810, was opened at 2008-01-03 01:41
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603140&aid=1862810&group_id=93107
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Nicolas Dumazet (nicdumz)
Assigned to: Nobody/Anonymous (nobody)
Summary: wikipedia.py:html2unicode : html chars from #128 to #159
Initial Comment:
Codepoints from #128 to #159 are non-printable C1 control characters in both ISO-8859-1 and Unicode, hence html entities numbered in this range *are* illegal.
But in practice a lot of websites do use these entities, our browsers now all render them (as the corresponding windows-1252 characters), and some of these entities can be found on our wikis.
I found this while working on a page containing #155 ( › ): html2unicode would convert it to unichr(155), which is an unprintable control character and far less suitable than the original entity. (Yes, that is actually the result you get on a wiki page.)
Likewise, #128, which browsers render as €, was being converted to unichr(128), another unprintable control character.
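
For illustration, the behaviour described above could be handled roughly as follows. This is only a sketch, not the attached patch: it is written for Python 3 (so chr() stands in for unichr()), the helper names numeric_entity_to_char and replace_numeric_entities are made up for this example, and it assumes that values 128 to 159 should be read as windows-1252 bytes, which is what browsers do.

```python
import re

def numeric_entity_to_char(codepoint):
    """Hypothetical helper: turn the number of a &#NNN; entity into a character.

    Assumption: values 128-159 are interpreted as windows-1252 bytes,
    as browsers do, instead of as C1 control characters.
    """
    if 128 <= codepoint <= 159:
        try:
            return bytes([codepoint]).decode("cp1252")
        except UnicodeDecodeError:
            # A few bytes (e.g. 129, 141) are undefined in Python's cp1252
            # codec; fall back to the plain codepoint for those.
            return chr(codepoint)
    return chr(codepoint)

def replace_numeric_entities(text):
    """Hypothetical stand-in for the decimal-entity part of html2unicode."""
    return re.sub(r"&#(\d+);",
                  lambda m: numeric_entity_to_char(int(m.group(1))),
                  text)

print(replace_numeric_entities("&#155; and &#128;"))  # prints "› and €"
```

Entities outside the 128 to 159 range are converted exactly as before, so only the illegal range changes behaviour.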
Cheers,
Nicolas Dumazet.
----------------------------------------------------------------------