Hi PWB friends,
I was about to write a bot script but figured I should ask here first. On
fawiki, I occasionally run into pages where the infobox template at the top
of the page is squished into one very long line, as opposed to being
multiline. Look at this example
<https://test.wikipedia.org/w/index.php?title=Infobox_tidy&action=edit> I
made on testwiki; I want to turn it into something that looks like this
<https://test.wikipedia.org/w/index.php?title=Infobox_tidy&action=edit&oldid…>.
I have a reasonable sense of how to write the bot script and what possible
edge cases to watch for, but I wonder if someone may have already written a
script for this purpose.
LMK,
Huji
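For reference, the infobox tidy-up described above could be sketched roughly as follows. This is only a minimal, standard-library illustration, not an existing Pywikibot script: the helper names `split_top_level` and `tidy_template` are hypothetical, the sample wikitext is made up, and edge cases such as `<nowiki>` sections, HTML comments, or "=" signs inside unnamed parameters are not handled.

```python
def split_top_level(body):
    """Split a template body on '|' characters that are not nested
    inside {{...}} or [[...]]."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    return parts


def tidy_template(template):
    """Rewrite a one-line template call so each parameter is on its own line."""
    inner = template.strip()[2:-2]  # drop the outer {{ and }}
    name, *params = split_top_level(inner)
    lines = ["{{" + name.strip()]
    for param in params:
        key, sep, value = param.partition("=")
        if sep:
            lines.append(f"| {key.strip()} = {value.strip()}")
        else:
            # Unnamed (positional) parameter: keep it as-is.
            lines.append(f"| {param.strip()}")
    lines.append("}}")
    return "\n".join(lines)


# Hypothetical sample input, not taken from the testwiki page.
squished = ("{{Infobox person|name=Ali|birth_place=[[Tehran|Tehran, Iran]]"
            "|website={{URL|example.org}}}}")
print(tidy_template(squished))
```

A real bot would still need to locate the infobox within the page text and would probably be better served by a proper wikitext parser; the point here is only the top-level split on pipes.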
I'm new to pywikibot and trying to use weblinkchecker to find problematic links in my wiki, http://horawiki.org
Having installed and configured it, I run "python pwb.py weblinkchecker -start:! "
which processes several dozen pages, then stops with:
Traceback (most recent call last):
  File "/home/larrydenenberg7/pywikibot/pwb.py", line 40, in <module>
    sys.exit(main())
  ...(many frames omitted)...
  File "/home/larrydenenberg7/pywikibot/scripts/weblinkchecker.py", line 573, in treat_page
    thread.name = removeprefix(
TypeError: descriptor 'removeprefix' for 'str' objects doesn't apply to a 'NoneType' object
CRITICAL: Exiting due to uncaught exception TypeError: descriptor 'removeprefix' for 'str' objects doesn't apply to a 'NoneType' object
I'm not 100% sure (how do I tell?) but this may be the page being processed: https://horawiki.org/page/Folk_Dance_Problem_Solver
or it might be this one: https://horawiki.org/page/First_Steps
(I tried running with max_external_links=1 and starting at specific pages but I'm still not sure which page caused the exception.)
I find nothing in Phabricator that seems related. Any wisdom would be greatly appreciated.
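One way to narrow this down, sketched under an assumption: the TypeError says that `str.removeprefix` received `None` instead of a string. The traceback line is cut off at `removeprefix(`, so the actual argument is not visible; the sketch below assumes it is the hostname parsed from an external link, which `urllib.parse.urlparse` reports as `None` when a URL has no host part. The helper `links_without_hostname` and the two malformed sample links are hypothetical.

```python
from urllib.parse import urlparse


def links_without_hostname(urls):
    """Return the URLs for which urlparse() yields no hostname."""
    return [url for url in urls if urlparse(url).hostname is None]


candidates = [
    "https://horawiki.org/page/First_Steps",
    "http:///missing-host",        # hypothetical malformed link
    "mailto:someone@example.org",  # hypothetical non-HTTP link
]
suspects = links_without_hostname(candidates)
print(suspects)

# The same TypeError appears when None is handed to str.removeprefix
# (available in Python 3.9 and later).
try:
    str.removeprefix(None, "www.")
except TypeError as exc:
    message = str(exc)
    print(message)
```

Running a check like this over the external links of the two suspected pages might show which link, if any, has no hostname.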
And as long as I'm asking, many pages give me this sequence:
WARNING: Unknown or invalid encoding 'ISO-8859-1,utf-8;q=0.7,*;q=0.7'
WARNING: Http response status 403
*[[en:<page name>]] links to http://israelidances.com/ - Forbidden.
This seems to arise for almost all (but definitely not all) links to the site israelidances.com, and these links are certainly working.
Example page: https://horawiki.org/page/Eten_Bamidbar
Where is this "unknown or invalid" encoding coming from?
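A small check that may help frame the question: the reported string has the shape of an HTTP Accept-Charset request header value (a comma-separated list with q-weights) rather than a single charset name, so Python cannot resolve it as a codec. Where the string enters the response handling is not established here; one guess, and only a guess, is that the remote server echoes a request header back as the charset of its reply. The helper name `is_known_encoding` is hypothetical.

```python
import codecs


def is_known_encoding(name):
    """Return True if Python can resolve `name` to a codec."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


reported = "ISO-8859-1,utf-8;q=0.7,*;q=0.7"
print(is_known_encoding("ISO-8859-1"))
print(is_known_encoding(reported))

# Each comma-separated entry, stripped of its q-weight, is a charset
# name or the wildcard '*'.
entries = [item.split(";")[0].strip() for item in reported.split(",")]
print(entries)
```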
Again, thanks for any information that may help.
--
/Larry D
I recently discovered pyquery <https://www.pyquery.org/>. It looks like an interesting alternative to using straight lxml. Does anybody have any experience using it?