https://bugzilla.wikimedia.org/show_bug.cgi?id=73124
Bug ID: 73124
Summary: additional space causes crash
Product: Pywikibot
Version: core (2.0)
Hardware: All
OS: All
Status: NEW
Severity: major
Priority: Unprioritized
Component: interwiki.py
Assignee: Pywikipedia-bugs@lists.wikimedia.org
Reporter: jan.dudik@gmail.com
Web browser: ---
Mobile Platform: ---

When an interwiki link contains a space between the namespace and the page name, the bot crashes:
pwb.py interwiki -async -family:wiktionary -cleanup -continue
...
Retrieving pages from wiktionary:fr.
WARNING: loadpageinfo: Query on [[fr:Categorie: Abreviations en italien]] returned data on 'Categorie:Abreviations en italien'
Dump cs (wiktionary) written.
Traceback (most recent call last):
  File "D:\Py\rewrite\pwb.py", line 178, in <module>
    run_python_file(fn, argv, argvu)
  File "D:\Py\rewrite\pwb.py", line 75, in run_python_file
    exec(compile(source, filename, "exec"), main_mod.__dict__)
  File "D:\Py\rewrite\scripts\interwiki.py", line 2646, in <module>
    main()
  File "D:\Py\rewrite\scripts\interwiki.py", line 2621, in main
    bot.run()
  File "D:\Py\rewrite\scripts\interwiki.py", line 2365, in run
    self.queryStep()
  File "D:\Py\rewrite\scripts\interwiki.py", line 2338, in queryStep
    self.oneQuery()
  File "D:\Py\rewrite\scripts\interwiki.py", line 2334, in oneQuery
    subject.batchLoaded(self)
  File "D:\Py\rewrite\scripts\interwiki.py", line 1305, in batchLoaded
    if not page.exists():
  File "D:\Py\rewrite\pywikibot\page.py", line 564, in exists
    return self.site.page_exists(self)
  File "D:\Py\rewrite\pywikibot\site.py", line 2288, in page_exists
    return page._pageid > 0
AttributeError: 'Page' object has no attribute '_pageid'
<type 'exceptions.AttributeError'>
CRITICAL: Waiting for 1 network thread(s) to finish. Press ctrl-c to abort
Because the dump file cannot be changed (https://bugzilla.wikimedia.org/show_bug.cgi?id=72943 ), I modified this page: https://cs.wiktionary.org/w/index.php?title=Kategorie:Italsk%C3%A9_zkratky&a...
Anyone who wants to reproduce the crash therefore has to edit another page.
John Mark Vandenberg jayvdb@gmail.com changed:
What    |Removed |Added
----------------------------------------------------------------------------
CC      |        |CommodoreFabianus@gmx.de, jayvdb@gmail.com

--- Comment #1 from John Mark Vandenberg jayvdb@gmail.com ---
The hint here is "Query on [[fr:Categorie: Abreviations en italien]] returned data on 'Categorie:Abreviations en italien'". That is only a warning in _update_page, and it is because of Site.sametitle.
I set up a test case:
https://en.wikipedia.org/wiki/User:John_Vandenberg/test is:
fooo
[[fr:Catégorie: Pantonyme]]
https://pt.wikipedia.org/wiki/Usu%C3%A1rio:John_Vandenberg/test is:
fooo
[[en:User:John Vandenberg/test]]
Then:
$ python pwb.py interwiki -page:"Usuário:John_Vandenberg/test" -family:wikipedia -lang:pt
NOTE: Number of pages queued is 0, trying to add 50 more.
Retrieving 1 pages from wikipedia:pt.
[[pt:Usuário(a):John Vandenberg/test]]: [[pt:Usuário(a):John Vandenberg/test]] gives new interwiki [[en:User:John Vandenberg/test]]
Retrieving 1 pages from wikipedia:en.
WARNING: [[pt:Usuário(a):John Vandenberg/test]] is in namespace 2, but [[fr:Catégorie: Abréviations en italien]] is in namespace 14. Follow it anyway? ([y]es, [n]o, [a]dd an alternative, [g]ive up) y
[[pt:Usuário(a):John Vandenberg/test]]: [[en:User:John Vandenberg/test]] gives new interwiki [[fr:Catégorie: Abréviations en italien]]
Retrieving 1 pages from wikipedia:fr.
WARNING: preloadpages: Query returned unexpected title'Catégorie:Abréviations en italien'
WARNING: loadpageinfo: Query on [[fr:Catégorie: Abréviations en italien]] returned data on 'Catégorie:Abréviations en italien'
Dump pt (wikipedia) appended.
Traceback (most recent call last):
  File "pwb.py", line 178, in <module>
    run_python_file(fn, argv, argvu)
  File "pwb.py", line 75, in run_python_file
    exec(compile(source, filename, "exec"), main_mod.__dict__)
  File "scripts/interwiki.py", line 2646, in <module>
    main()
  File "scripts/interwiki.py", line 2621, in main
    bot.run()
  File "scripts/interwiki.py", line 2365, in run
    self.queryStep()
  File "scripts/interwiki.py", line 2338, in queryStep
    self.oneQuery()
  File "scripts/interwiki.py", line 2334, in oneQuery
    subject.batchLoaded(self)
  File "scripts/interwiki.py", line 1305, in batchLoaded
    if not page.exists():
  File ".../pywikibot/page.py", line 564, in exists
    return self.site.page_exists(self)
  File ".../pywikibot/site.py", line 2306, in page_exists
    return page._pageid > 0
AttributeError: 'Page' object has no attribute '_pageid'
<type 'exceptions.AttributeError'>
CRITICAL: Waiting for 1 network thread(s) to finish. Press ctrl-c to abort
Using this patch fixes the problem for me, so I'll review it now:
https://gerrit.wikimedia.org/r/#/c/151809/
And https://gerrit.wikimedia.org/r/172108/ is also needed because of a recently created bug.
John Mark Vandenberg jayvdb@gmail.com changed:
What    |Removed                       |Added
----------------------------------------------------------------------------
Summary |additional space causes crash |additional space in langlinks data causes crash

--- Comment #2 from John Mark Vandenberg jayvdb@gmail.com ---
API langlinks data retains the space.
https://en.wikipedia.org/w/api.php?action=query&prop=langlinks&title...
{
    "query": {
        "pages": {
            "40071800": {
                "pageid": 40071800,
                "ns": 2,
                "title": "User:John Vandenberg/test",
                "langlinks": [
                    {
                        "lang": "fr",
                        "*": "Cat\u00e9gorie: Pantonyme"
                    }
                ]
            }
        }
    }
}
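The retained space can also be checked programmatically. The following is only a minimal sketch using the Python standard library on the response quoted above; the helper names langlink_titles and has_space_after_namespace are hypothetical and are not part of Pywikibot.

```python
import json

# Response body copied from the API output quoted in this comment.
RESPONSE = r'''
{"query": {"pages": {"40071800": {"pageid": 40071800, "ns": 2,
 "title": "User:John Vandenberg/test",
 "langlinks": [{"lang": "fr", "*": "Cat\u00e9gorie: Pantonyme"}]}}}}
'''


def langlink_titles(response_text):
    """Yield (lang, title) pairs from a prop=langlinks query response."""
    data = json.loads(response_text)
    for page in data['query']['pages'].values():
        for link in page.get('langlinks', []):
            yield link['lang'], link['*']


def has_space_after_namespace(title):
    """Return True if whitespace directly follows the first colon."""
    _, sep, rest = title.partition(':')
    return bool(sep) and rest != rest.lstrip()


links = list(langlink_titles(RESPONSE))
for lang, title in links:
    print(lang, repr(title), has_space_after_namespace(title))
```

The check only looks at the first colon, so it assumes the text before it is a namespace prefix.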
update_page in api.py uses pywikibot.Link.langlinkUnsafe to create a Link object, and that method doesn't remove the space:
>>> import pywikibot
>>> s = pywikibot.Site()
>>> l = pywikibot.Link.langlinkUnsafe('fr', 'Catégorie: Pantonyme', source=s)
>>> l.title
' Pantonyme'
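One possible way to guard against this is to strip the whitespace before the Link object is built. The following is only a sketch under stated assumptions, not the change made in the Gerrit patches mentioned in this thread: normalize_langlink_title is a hypothetical helper, and it assumes the caller passes in the namespace names known for the target site.

```python
def normalize_langlink_title(title, namespaces):
    """Strip whitespace around the namespace separator of a langlink title.

    `namespaces` is an iterable of namespace names for the target site
    (an assumption of this sketch). Titles whose prefix is not a known
    namespace are only stripped at both ends, because a colon followed
    by a space can be a legitimate part of a page name.
    """
    prefix, sep, name = title.partition(':')
    known = {ns.lower() for ns in namespaces}
    if sep and prefix.strip().lower() in known:
        return prefix.strip() + ':' + name.strip()
    return title.strip()


print(normalize_langlink_title('Catégorie: Pantonyme', ['Catégorie']))
```

Restricting the normalization to known namespaces avoids rewriting main-namespace titles that happen to contain ": ".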
John Mark Vandenberg jayvdb@gmail.com changed:
What    |Removed |Added
----------------------------------------------------------------------------
Blocks  |        |70936

--- Comment #3 from JAn Dudík jan.dudik@gmail.com ---
See also https://bugzilla.wikimedia.org/show_bug.cgi?id=73415
pywikipedia-bugs@lists.wikimedia.org