Xqt has submitted this change and it was merged.
Change subject: Change title whitelist to title blacklist
......................................................................
Change title whitelist to title blacklist
Titles containing characters outside the BMP [1] (code points above \uFFFF) are no longer detected as illegal. See this thread: [2]
[1] https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane
[2] http://thread.gmane.org/gmane.comp.python.pywikipediabot.general/13197/
The list of blacklisted characters was generated by running the old regular expression over every code point below 0x80:
import re
m = re.compile(u'''[^ %!"$&'()*,\-.\/0-9:;=?@A-Z\\^_`a-z~\u0080-\uFFFF+]''')
for x in range(0,0x80):
    if m.match(unichr(x)): print "%x" % x,
0 1 2 3 4 5 6 7 8 9 a b c d e f 10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f 23 3c 3e 5b 5d 7b 7c 7d 7f
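For readers on Python 3, the enumeration above can be sketched roughly as follows. This is not part of the change: the names OLD_PATTERN, NEW_PATTERN and rejected_ascii are illustrative only, and the old pattern is transcribed as a raw string, which is an assumption about how its escapes were meant to be read. Under that reading, the old whitelist and the new blacklist reject the same code points below 0x80.

```python
import re

# Old whitelist pattern, transcribed as a raw string. Assumption: the
# escapes are read the way a raw string reads them (backslash is legal).
OLD_PATTERN = re.compile(
    r'''[^ %!"$&'()*,\-.\/0-9:;=?@A-Z\\^_`a-z~\u0080-\uFFFF+]''')

# New blacklist character class from the diff, without the
# percent-encoding alternative that follows it in page.py.
NEW_PATTERN = re.compile(
    r'[\x00-\x1f\x23\x3c\x3e\x5b\x5d\x7b\x7c\x7d\x7f]')


def rejected_ascii(pattern):
    """Return the code points below 0x80 that the pattern matches."""
    return [x for x in range(0x80) if pattern.match(chr(x))]


old_rejected = rejected_ascii(OLD_PATTERN)
new_rejected = rejected_ascii(NEW_PATTERN)

# Print the rejected code points in hex, as in the commit message.
print(" ".join("%x" % x for x in old_rejected))
```

If the two lists ever differed, the blacklist would change behaviour for ASCII titles, which this change does not intend.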
Change-Id: I02c26be9ad814ce11d9adf2f997d3d1e05764fd1
---
M pywikibot/page.py
1 file changed, 2 insertions(+), 2 deletions(-)
Approvals:
  Xqt: Looks good to me, approved
  jenkins-bot: Verified
diff --git a/pywikibot/page.py b/pywikibot/page.py
index e51977c..58debb7 100644
--- a/pywikibot/page.py
+++ b/pywikibot/page.py
@@ -2853,8 +2853,8 @@
     """
     illegal_titles_pattern = re.compile(
-        # Matching titles will be held as illegal.
-        u'''[^ %!"$&'()*,\-.\/0-9:;=?@A-Z\\^_`a-z~\u0080-\uFFFF+]'''
+        # Matching titles will be held as illegal.
+        ur'''[\x00-\x1f\x23\x3c\x3e\x5b\x5d\x7b\x7c\x7d\x7f]'''
         # URL percent encoding sequences interfere with the ability
         # to round-trip titles -- you can't link to them consistently.
         u'|%[0-9A-Fa-f]{2}'
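
The effect of the hunk above can be illustrated with a small Python 3 sketch. It is a hedged approximation, not the pywikibot code: only the two alternatives visible in the diff are used (the real pattern may continue past the hunk), the old character class is transcribed as a raw string, and OLD, NEW, PERCENT and is_illegal are hypothetical names introduced here.

```python
import re

# Percent-encoding alternative shown in the diff context.
PERCENT = r'|%[0-9A-Fa-f]{2}'

# Old whitelist class (raw-string transcription, an assumption) and the
# new blacklist class, each combined with the percent-encoding branch.
OLD = re.compile(
    r'''[^ %!"$&'()*,\-.\/0-9:;=?@A-Z\\^_`a-z~\u0080-\uFFFF+]''' + PERCENT)
NEW = re.compile(
    r'[\x00-\x1f\x23\x3c\x3e\x5b\x5d\x7b\x7c\x7d\x7f]' + PERCENT)


def is_illegal(pattern, title):
    """Return True if any part of the title matches the pattern."""
    return pattern.search(title) is not None


samples = [
    "Main Page",    # plain ASCII title
    "A[B]",         # square brackets are blacklisted
    "100%25",       # contains a percent-encoding sequence
    "\U00020000",   # one character outside the BMP (U+20000)
]

# Map each sample title to (old verdict, new verdict).
results = {t: (is_illegal(OLD, t), is_illegal(NEW, t)) for t in samples}
for title, verdicts in results.items():
    print(ascii(title), verdicts)
```

Only the last sample changes verdict: the old whitelist tops out at \uFFFF and so rejects U+20000, while the blacklist lists only specific ASCII characters and lets it through.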