http://www.mediawiki.org/wiki/Special:Code/pywikipedia/11679
Revision: 11679
Author:   legoktm
Date:     2013-06-21 18:18:10 +0000 (Fri, 21 Jun 2013)
Log Message:
-----------
Add a few more ignore values to html2unicode, contributed by Betacommand

Modified Paths:
--------------
    trunk/pywikipedia/wikipedia.py
Modified: trunk/pywikipedia/wikipedia.py
===================================================================
--- trunk/pywikipedia/wikipedia.py	2013-06-21 07:51:56 UTC (rev 11678)
+++ trunk/pywikipedia/wikipedia.py	2013-06-21 18:18:10 UTC (rev 11679)
@@ -5655,6 +5655,16 @@
     # also entities that might be named entities.
     entityR = re.compile(
         r'&(?:amp;)?(#(?P<decimal>\d+)|#x(?P<hex>[0-9a-fA-F]+)|(?P<name>[A-Za-z]+));')
+
+    ignore.extend((38,     # Ampersand (&)
+                   39,     # Bugzilla 24093
+                   60,     # Less than (<)
+                   62,     # Great than (>)
+                   91,     # Opening bracket - sometimes used intentionally inside links
+                   93,     # Closing bracket - sometimes used intentionally inside links
+                   124,    # Vertical bar (??) - used intentionally in navigation bar templates on de:
+                   160,))
+
     # These characters are Html-illegal, but sadly you *can* find some of
     # these and converting them to unichr(decimal) is unsuitable
     convertIllegalHtmlEntities = {
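
For readers unfamiliar with the effect of the change, here is a minimal Python 3 sketch of how an ignore list of code points might interact with the entityR pattern shown in the diff. It is not the code in wikipedia.py: the function name html2unicode_sketch, the DEFAULT_IGNORE constant, and the use of html.entities.name2codepoint are assumptions made for illustration, and the real function (Python 2, using unichr) also handles illegal HTML entities, which this sketch omits.

```python
import re
from html.entities import name2codepoint

# Same pattern as entityR in the diff: an optional "amp;" prefix followed by
# a decimal, hexadecimal, or named entity.
ENTITY_R = re.compile(
    r'&(?:amp;)?(#(?P<decimal>\d+)|#x(?P<hex>[0-9a-fA-F]+)|(?P<name>[A-Za-z]+));')

# Code points added to the ignore list in this revision (hypothetical constant).
DEFAULT_IGNORE = (38, 39, 60, 62, 91, 93, 124, 160)


def html2unicode_sketch(text, ignore=DEFAULT_IGNORE):
    """Replace HTML entities with characters, except ignored code points."""

    def replace(match):
        if match.group('decimal'):
            code = int(match.group('decimal'))
        elif match.group('hex'):
            code = int(match.group('hex'), 16)
        else:
            # Unknown entity names yield None and are left untouched.
            code = name2codepoint.get(match.group('name'))
        if code is None or code in ignore:
            return match.group(0)
        try:
            return chr(code)
        except (ValueError, OverflowError):
            # Code point outside the Unicode range: keep the original text.
            return match.group(0)

    return ENTITY_R.sub(replace, text)
```

Under these assumptions, html2unicode_sketch('&eacute; &#91;x&#93;') converts the first entity to a character but leaves the bracket entities as written, which is the behaviour the new ignore values are meant to preserve inside links.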
pywikipedia-svn@lists.wikimedia.org