Hi all,
I'm getting a strange InvalidTitle error while iterating through each of the articles in the English Wikipedia's "Unprintworthy redirects" category using the .articles() function.
In particular, if you run this code:
import pywikibot
site = pywikibot.Site("en", "wikipedia"); site.login()
cat = pywikibot.Category(site, "Category:Unprintworthy redirects")
for each_article in cat.articles(namespaces=(0)):
    print(each_article.title(withNamespace=True), each_article.pageid)
Then it'll run for a while, printing out a bunch of titles and page IDs, and then crash:
Traceback (most recent call last):
  File "/data/project/apersonbot/test-redir-bann.py", line 5, in <module>
    print(each_article.title(withNamespace=True), each_article.pageid)
  File "/shared/pywikipedia/core/pywikibot/tools/__init__.py", line 1446, in wrapper
    return obj(*__args, **__kw)
  File "/shared/pywikipedia/core/pywikibot/page.py", line 322, in title
    title = self._link.canonical_title()
  File "/shared/pywikipedia/core/pywikibot/page.py", line 5737, in canonical_title
    if self.namespace != Namespace.MAIN:
  File "/shared/pywikipedia/core/pywikibot/page.py", line 5698, in namespace
    self.parse()
  File "/shared/pywikipedia/core/pywikibot/page.py", line 5669, in parse
    raise pywikibot.InvalidTitle("The link does not contain a page "
pywikibot.exceptions.InvalidTitle: The link does not contain a page title
CRITICAL: Closing network session.
Any ideas? I don't think this is expected behavior, but I could be wrong.
- Daniel
The page title could be invalid. Unfortunately, we don't know the title yet and have to find out. I have had wrong database entries in the past that caused this kind of exception, I guess. It would be best to file a Phabricator task for this issue.
Best xqt
(In reply to the message from Daniel Glus danielhglus@gmail.com of 18 June 2018, 22:22.)
pywikibot mailing list pywikibot@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/pywikibot
Hi Daniel,
Changing the loop to the one below tells me the first problematic pageid is 28644448 (https://en.wikipedia.org/wiki/Special:Redirect/page/28644448), whose title is the single character \x85 (U+0085, NEXT LINE).
for each_article in cat.articles(namespaces=(0)):
    try:
        print(each_article.title(withNamespace=True), each_article.pageid)
    except pywikibot.exceptions.InvalidTitle:
        print(each_article.pageid)
        raise
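If the goal is to keep iterating instead of stopping at the first bad page, the same try/except pattern can collect the failing page IDs and move on. The following is only a rough sketch: FakePage and the local InvalidTitle class are hypothetical stand-ins so that the example runs without pywikibot or network access, and collect_titles is an illustrative helper name, not part of pywikibot.

```python
class InvalidTitle(Exception):
    """Stand-in for pywikibot.exceptions.InvalidTitle (assumption for this sketch)."""


class FakePage:
    """Hypothetical page object that fails like the traceback in this thread."""

    def __init__(self, raw_title, pageid):
        self._raw_title = raw_title
        self.pageid = pageid

    def title(self):
        # Mimic the reported failure: a title that strips to nothing is rejected.
        if not self._raw_title.strip():
            raise InvalidTitle("The link does not contain a page title")
        return self._raw_title


def collect_titles(pages, error=InvalidTitle):
    """Return (valid, invalid_ids): titles that resolved, and pageids that did not."""
    valid = []
    invalid_ids = []
    for page in pages:
        try:
            valid.append((page.title(), page.pageid))
        except error:
            # Record the pageid and continue instead of re-raising.
            invalid_ids.append(page.pageid)
    return valid, invalid_ids


pages = [FakePage("Example redirect", 1), FakePage("\x85", 28644448)]
valid, invalid_ids = collect_titles(pages)
print(valid)
print(invalid_ids)
```

With real pywikibot objects, the error argument would presumably be pywikibot.exceptions.InvalidTitle and pages would be the generator returned by cat.articles(); that wiring is untested here.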
str.strip() removes the \x85 character, leaving an empty string, so the exception is raised (see page.py#L5666-L5670: https://github.com/wikimedia/pywikibot/blob/16a31c88b67c7af1966ca00ed998db01f76c2adb/pywikibot/page.py#L5666-L5670 ).
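The stripping behaviour can be checked in plain Python 3 without pywikibot. This is a minimal sketch; has_page_title is a hypothetical helper that only imitates the "empty after strip" check described above, not the actual pywikibot code.

```python
# U+0085 (NEXT LINE) is the character reported for pageid 28644448.
raw_title = "\x85"

# Python 3 classifies U+0085 as whitespace, so strip() with no
# arguments removes it and nothing is left of the title.
is_whitespace = raw_title.isspace()
stripped = raw_title.strip()
print(is_whitespace, repr(stripped))


def has_page_title(text):
    """Hypothetical check: True if anything remains after stripping whitespace."""
    return bool(text.strip())


print(has_page_title(raw_title))
print(has_page_title("Example redirect"))
```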
Regards, JJ
https://phabricator.wikimedia.org/T197642 created for follow-up.