http://www.mediawiki.org/wiki/Special:Code/pywikipedia/11679
Revision: 11679
Author: legoktm
Date: 2013-06-21 18:18:10 +0000 (Fri, 21 Jun 2013)
Log Message:
-----------
Add a few more ignore values to html2unicode, contributed by Betacommand
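For readers unfamiliar with the ignore list, here is a minimal sketch of what ignoring a code point means during entity decoding. It is a simplified Python 3 illustration, not the pywikipedia implementation (which targets Python 2 and also handles HTML-illegal entities). The name html2unicode_sketch and the DEFAULT_IGNORE constant are hypothetical; only the regular expression and the eight code points come from this revision.

```python
import re
from html.entities import name2codepoint

# Same pattern as in the diff: an optional "amp;" prefix, followed by a
# decimal, hexadecimal, or named entity.
ENTITY_R = re.compile(
    r'&(?:amp;)?(#(?P<decimal>\d+)|#x(?P<hex>[0-9a-fA-F]+)|(?P<name>[A-Za-z]+));')

# The code points this revision adds to the ignore list.
DEFAULT_IGNORE = (38, 39, 60, 62, 91, 93, 124, 160)


def html2unicode_sketch(text, ignore=DEFAULT_IGNORE):
    """Decode HTML entities in text, leaving ignored code points untouched."""
    def replace(match):
        if match.group('decimal'):
            code = int(match.group('decimal'))
        elif match.group('hex'):
            code = int(match.group('hex'), 16)
        else:
            # Unknown names yield None and are left as written.
            code = name2codepoint.get(match.group('name'))
        if code is None or code in ignore:
            return match.group(0)
        try:
            return chr(code)
        except (ValueError, OverflowError):
            # Out-of-range numeric entity: keep the original text.
            return match.group(0)
    return ENTITY_R.sub(replace, text)


print(html2unicode_sketch('&#91;caf&eacute;&#93; &lt;b&gt; &#x41;'))
```

Entities for brackets, angle brackets, the ampersand, the vertical bar, and code point 160 stay encoded, so wikitext that relies on them being escaped is not altered; everything else is decoded.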
Modified Paths:
--------------
trunk/pywikipedia/wikipedia.py
Modified: trunk/pywikipedia/wikipedia.py
===================================================================
--- trunk/pywikipedia/wikipedia.py 2013-06-21 07:51:56 UTC (rev 11678)
+++ trunk/pywikipedia/wikipedia.py 2013-06-21 18:18:10 UTC (rev 11679)
@@ -5655,6 +5655,16 @@
# also entities that might be named entities.
entityR = re.compile(
r'&(?:amp;)?(#(?P<decimal>\d+)|#x(?P<hex>[0-9a-fA-F]+)|(?P<name>[A-Za-z]+));')
+
+ ignore.extend((38, # Ampersand (&)
+ 39, # Bugzilla 24093
+ 60, # Less than (<)
+ 62, # Greater than (>)
+ 91, # Opening bracket - sometimes used intentionally inside links
+ 93, # Closing bracket - sometimes used intentionally inside links
+ 124, # Vertical bar (|) - used intentionally in navigation bar templates on de:
+ 160,)) # Non-breaking space
+
# These characters are Html-illegal, but sadly you *can* find some of
# these and converting them to unichr(decimal) is unsuitable
convertIllegalHtmlEntities = {