[Pywikipedia-l] [ pywikipediabot-Patches-1862810 ] wikipedia.py:html2unicode : html chars from #128 to #159

SourceForge.net noreply at sourceforge.net
Thu Jan 3 00:41:12 UTC 2008


Patches item #1862810, was opened at 2008-01-03 01:41
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=603140&aid=1862810&group_id=93107

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Nicolas Dumazet (nicdumz)
Assigned to: Nobody/Anonymous (nobody)
Summary: wikipedia.py:html2unicode : html chars from #128 to #159

Initial Comment:
Codepoints from #128 to #159 are non-printing C1 control characters in both ISO-8859-1 and Unicode, hence HTML entities numbered in this range *are* illegal.

But in practice a lot of websites use these entities, all current browsers display them (reading the numbers as Windows-1252 bytes), and some of them can be found on our wikis.

I found this while working on a page containing #155 ( › ): html2unicode would convert it to unichr(155), which is the invisible control character U+009B, far less suitable than the original entity. (Yes, that is actually the result you get on a wiki page.)

#128, which browsers render as €, was likewise being converted to unichr(128), the control character U+0080.
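
The behaviour described above can be sketched roughly as follows. This is only an illustration, not the submitted patch: the function name html2unicode_sketch is hypothetical, it handles decimal numeric entities only, and it is written for Python 3 (where chr replaces Python 2's unichr). It assumes, as browsers do, that numbers 128 to 159 should be read as Windows-1252 bytes.

```python
import re

# Matches decimal numeric entities such as "&#155;".
NUMERIC_ENTITY = re.compile(r"&#(\d+);")


def html2unicode_sketch(text):
    """Replace decimal numeric entities with the characters they denote.

    Hypothetical helper for illustration; not the real wikipedia.py code.
    """
    def replace(match):
        number = int(match.group(1))
        if 128 <= number <= 159:
            # Read the number as a Windows-1252 byte, like browsers do.
            try:
                return bytes([number]).decode("cp1252")
            except UnicodeDecodeError:
                # Python's cp1252 codec leaves 129, 141, 143, 144 and 157
                # undefined; keep those entities unchanged.
                return match.group(0)
        if number > 0x10FFFF:
            # Outside the Unicode range: leave the entity as it is.
            return match.group(0)
        return chr(number)

    return NUMERIC_ENTITY.sub(replace, text)
```

With this sketch, "&#155;" becomes U+203A ( › ) and "&#128;" becomes U+20AC ( € ), while entities outside the 128 to 159 range are converted as before.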

Cheers,

Nicolas Dumazet.

----------------------------------------------------------------------
