I'm new to pywikibot and trying to use weblinkchecker to find problematic links in my wiki, http://horawiki.org
After installing and configuring it, I run "python pwb.py weblinkchecker -start:!", which processes several dozen pages and then stops with:
Traceback (most recent call last):
  File "/home/larrydenenberg7/pywikibot/pwb.py", line 40, in <module>
    sys.exit(main())
  ...(many frames omitted)...
  File "/home/larrydenenberg7/pywikibot/scripts/weblinkchecker.py", line 573, in treat_page
    thread.name = removeprefix(
TypeError: descriptor 'removeprefix' for 'str' objects doesn't apply to a 'NoneType' object
CRITICAL: Exiting due to uncaught exception TypeError: descriptor 'removeprefix' for 'str' objects doesn't apply to a 'NoneType' object
I'm not 100% sure (how do I tell?) but this may be the page being processed: https://horawiki.org/page/Folk_Dance_Problem_Solver or it might be this one: https://horawiki.org/page/First_Steps (I tried running with max_external_links=1 and starting at specific pages but I'm still not sure which page caused the exception.)
I find nothing in phabricator that seems related. Any wisdom gratefully appreciated.
And as long as I'm asking, many pages give me this sequence:

WARNING: Unknown or invalid encoding 'ISO-8859-1,utf-8;q=0.7,*;q=0.7'
WARNING: Http response status 403
*[[en:<page name>]] links to http://israelidances.com/ - Forbidden.

This arises for almost all (but definitely not all) links to the site israelidances.com, and those links are certainly working. Example page: https://horawiki.org/page/Eten_Bamidbar Where is this "unknown or invalid" encoding coming from?
Again, thanks for any information that may help.
Update: A little more experimentation makes it more likely that the page causing the exception is https://horawiki.org/page/Folk_Dance_Problem_Solver and not the other link in my original post.
It appears to me that the mailto: link on that page is causing the problem. Python's urllib.parse.urlparse returns no hostname for a mailto: URL, which triggers the error that is thrown; in any case the actual checking code only handles http(s), so it would not handle mailto: links anyway.
I can't see any way to ignore it using the script parameters.
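For what it's worth, the failure mode is easy to reproduce in a plain Python session, independent of pywikibot (a sketch; the exact call in weblinkchecker.py may differ, but the traceback suggests it applies str.removeprefix to the parsed hostname):

```python
from urllib.parse import urlparse

# A mailto: URL has no host component, so .hostname is None.
print(urlparse('mailto:someone@example.com').hostname)   # None
print(urlparse('http://www.example.com/page').hostname)  # www.example.com

# Passing that None to the unbound str.removeprefix descriptor
# reproduces the exact TypeError from the traceback.
try:
    str.removeprefix(urlparse('mailto:someone@example.com').hostname, 'www.')
except TypeError as e:
    print(e)
```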
Perhaps add an entry to the "ignorelist", by adding a line to the script after line 172, such as:
re.compile(r'mailto:'),
(Note there is no "//" after the scheme in a mailto: URL, so a pattern like r'mailto://' would never match.) I haven't tested that. There are also other possible protocols that should be ignored, unless code is added to handle them.
On Sun, 16 Mar 2025, 02:11 , larry@denenberg.com wrote:
Update: A little more experimentation makes it more likely that the page causing the exception is https://horawiki.org/page/Folk_Dance_Problem_Solver and not the other link in my original post.
You've definitely got it. Proof: I removed the mailto: link (by rewriting it in words: "send email to SFDhist at gmail.com") and reran weblinkchecker, which sailed right on past that page and crashed again on another page with a mailto: link.
I should have realized this since the mailto: link is the only unusual thing on that page.
Open questions:
1) Is there a way to find all mailto: links on a wiki, other than fixing them one by one like I'm doing? (Please don't say "use pywikibot" even though that's the right answer.)
2) Should this be reported as a bug in Phabricator? Wiki pages should be able to have mailto: links, no?
3) Any idea about the "Unknown or invalid encoding" warning, then the 403, then the "Forbidden" link business?
Many thanks for the help.
On Mar 16, 2025, at 1:22 PM, larry@denenberg.com wrote:
- Is there a way to find all mailto: links on a wiki, other than fixing them one by one like I'm doing? (Please don't say "use pywikibot" even though that's the right answer.)
My first thought was a search for insource:mailto, but I just tried that on your wiki and didn't get anything useful.
So maybe an SQL query against the underlying database? Or use [[Special:Export]] to dump all pages as XML and grep the dump for "mailto". Neither will be very efficient, but I'm assuming your wiki is small enough that it won't matter.
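If you go the Special:Export route, the dump is ordinary XML, so a small script can pull out the offending page titles rather than grepping blindly. A sketch; the tag names follow the MediaWiki export format, and the sample input here is made up:

```python
import re
import xml.etree.ElementTree as ET

def pages_with_mailto(export_xml: str):
    """Return titles of pages whose wikitext contains a mailto: link."""
    root = ET.fromstring(export_xml)
    titles = []
    # Export XML is namespaced; match tags by local name so the
    # script doesn't break when the export schema version changes.
    for page in root.iter():
        if page.tag.rsplit('}', 1)[-1] != 'page':
            continue
        title, text = None, ''
        for el in page.iter():
            local = el.tag.rsplit('}', 1)[-1]
            if local == 'title':
                title = el.text
            elif local == 'text':
                text = el.text or ''
        if title and re.search(r'mailto:', text):
            titles.append(title)
    return titles

# Made-up two-page dump for illustration:
sample = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">
  <page>
    <title>Folk Dance Problem Solver</title>
    <revision><text>[mailto:info@example.com write to us]</text></revision>
  </page>
  <page>
    <title>First Steps</title>
    <revision><text>No external mail links here.</text></revision>
  </page>
</mediawiki>"""

print(pages_with_mailto(sample))
```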
- Should this be reported as a bug in Phabricator? Wiki pages should be able to have mailto: links, no?
This is definitely a bug. I looked at the code a little earlier today. All it uses the hostname for is naming a thread, presumably for debugging or logging. It shouldn't crash on malformed or host-less input. At worst, it could invent a thread name like "unknown host" or fall back to some other part of the parsed URL.
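An untested sketch of what a defensive version of that naming logic could look like (the function name is mine, and the real code in weblinkchecker.py may be structured differently):

```python
from urllib.parse import urlparse

def thread_name_for(url: str) -> str:
    """Pick a thread name from a URL without assuming it has a host.

    urlparse() returns hostname=None for mailto: and other host-less
    schemes, which is what crashes the current code.
    """
    host = urlparse(url).hostname
    if host is None:
        return 'unknown-host'  # invented placeholder name
    return host.removeprefix('www.')
```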
You can list the mailto links using MediaWiki external links search.
https://horawiki.org/page/Special:LinkSearch?target=mailto%3A*&namespace...
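The same list is also available through the MediaWiki API (list=exturlusage, the API counterpart of Special:LinkSearch), which is handy if you want the titles in a script rather than on a web page. A sketch that only builds the request URL; I'm assuming the default api.php location for your wiki, which may differ:

```python
from urllib.parse import urlencode

# list=exturlusage enumerates pages by external link;
# euprotocol restricts the search to one URL scheme.
params = {
    'action': 'query',
    'list': 'exturlusage',
    'euprotocol': 'mailto',  # only mailto: links
    'eulimit': 'max',
    'format': 'json',
}
url = 'https://horawiki.org/w/api.php?' + urlencode(params)  # assumed path
print(url)
```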
On Sun, 16 Mar 2025, 17:23 , larry@denenberg.com wrote:
- Is there a way to find all mailto: links on a wiki, other than fixing them one by one like I'm doing?