Dear Pywikipedia team,
I have pushed a few of my coding projects that use pywikipedia (the compat version) to GitHub, and I thought some of you might be interested in the code. I recently had some time to clean the code up and bring it into a (hopefully) usable format, and I would be willing to make further adjustments if you think a few changes would make it more useful to you.
Ultimately, my hope is that the code will find a home in the pywikipedia repository. Some of what I wrote may duplicate functionality already present in the core; if so, I apologize, and you can happily ignore it.
== Template parser ==
https://github.com/hroest/pywikibot-compat/tree/feature/template_parser
For one bot project on the German Wikipedia I had to parse rather complex templates and replace specific fields. The templates contained nested templates, math formulas, and references. I therefore wrote a template parser that parses these templates and returns them as key-value pairs, which makes it easy to query specific keys and replace their values. The code worked well on several thousand templates of the German chemistry project and should be rather straightforward to use. This is library code, so there is no bot associated with it; see templateparser.py and tests/test_templateparser.py.
In order to correctly handle nesting and to differentiate equals signs belonging to key-value pairs from those in mathematical formulas etc., I also had to write a partial MediaWiki syntax parser that recognizes such constructs in wikitext. This code is in textrange_parser.py and makes it possible to extract specific parts of a text (e.g. wikitables, templates, wikilinks, weblinks); tests are in tests/test_textrange_parser.py.
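To give an idea of the intended workflow, here is a hypothetical sketch. The names below are made up for illustration; the actual interface lives in templateparser.py and its tests:

    # Hypothetical usage sketch: function and method names are assumed
    # for illustration; see templateparser.py and
    # tests/test_templateparser.py for the real API.
    import templateparser

    text = u"{{Infobox Chemikalie|Name = Ethanol|Summenformel = C2H6O}}"

    template = templateparser.parse_template(text)  # name assumed
    print(template["Name"])               # -> "Ethanol"
    template["Name"] = u"Ethanol (rein)"  # replace a specific field
    text = template.to_wikitext()         # serialize back (assumed)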
== Spellchecking ==
https://github.com/hroest/pywikibot-compat/tree/feature/spellcheck
I added two new spellchecking bots: one based on hunspell, the same spellchecker that LibreOffice uses (spellcheck_hunspell.py), and another based on a negative list (spellcheck_blacklist.py). They run from the command line. Both parse the given wikitext, skip text ranges that usually do not contain plain human-readable text (templates, tables, etc.), and check each word against a spellchecking engine, which is again either a simple blacklist or a full-blown spellchecker with stemming and morphological analysis like hunspell. These spellcheckers may prove useful because they understand part of the wiki markup and know which parts of a text to spellcheck and which to skip.
The misspelled words can be processed interactively: each word can be confirmed individually and then sent to Wikipedia to be corrected. I have a bot with which I do this semi-automatically, and I have so far corrected 3000+ spelling mistakes on the German Wikipedia: https://de.wikipedia.org/wiki/Spezial:Beitr%C3%A4ge/HRoestTypo
For large-scale processing, one can work through a complete Wikipedia XML dump; for small-scale processing, one can use the Wikipedia web search to find articles containing a specific spelling error and then process only those pages.
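Here is a minimal sketch of the two-step approach described above. The real bots use textrange_parser.py to find the ranges to skip; the crude regexes here are only for illustration:

    # Minimal sketch: strip non-prose ranges, then check the remaining
    # words against a pluggable spellchecking engine.
    import re

    def strip_non_prose(wikitext):
        """Drop ranges that usually contain no plain prose."""
        text = re.sub(r'\{\{[^{}]*\}\}', ' ', wikitext)               # simple templates
        text = re.sub(r'<ref[^>]*>.*?</ref>', ' ', text, flags=re.S)  # references
        text = re.sub(r'\{\|.*?\|\}', ' ', text, flags=re.S)          # wikitables
        return text

    def find_misspellings(wikitext, is_known_word):
        """Yield every word the spellchecking engine does not accept."""
        for word in re.findall(r'\w+', strip_non_prose(wikitext), re.UNICODE):
            if not is_known_word(word):
                yield word

    # Engine 1: a negative list of known misspellings (the blacklist
    # bot); engine 2 would plug in a hunspell checker instead.
    blacklist = {u'Standart'}
    text = u'Der Standart{{Infobox|x=1}} ist falsch.<ref>Quelle</ref>'
    print(list(find_misspellings(text, lambda w: w not in blacklist)))
    # -> ['Standart']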
== Review edits ==
https://github.com/hroest/pywikibot-compat/tree/feature/review_pages
On the German Wikipedia, considerable work goes into reviewing individual edits and marking them as reviewed. The feature/review_pages branch above contains a script called review_pages that performs reviews of revisions semi-automatically. For a given page, it fetches the revision history up to the last reviewed change and displays the diff between the current and the last reviewed version of the article on the command line. The user can then interactively decide to accept the revision, undo the change, or go to the next unreviewed change.
This bot uses the MediaWiki API directly, so it may not actually be suitable for the compat version of pywikipedia. Reviewing, undoing, and retrieving full version histories are all done through the API and can be performed fully asynchronously, which allows a relatively fast interactive response while the bot fetches revision histories and performs the requested review/undo actions in the background.
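To make the API side concrete, here is a rough sketch of the calls involved, written synchronously for clarity (the actual bot issues them asynchronously). It assumes a requests.Session that already carries login cookies and a wiki running the FlaggedRevs extension; it is an illustration, not the code in the branch:

    import requests

    API = 'https://de.wikipedia.org/w/api.php'
    session = requests.Session()  # assumed to be logged in already

    def revision_history(title, limit=50):
        """Fetch recent revisions of a page, newest first."""
        r = session.get(API, params={
            'action': 'query', 'prop': 'revisions', 'titles': title,
            'rvprop': 'ids|user|comment', 'rvlimit': limit,
            'format': 'json',
        }).json()
        page = next(iter(r['query']['pages'].values()))
        return page.get('revisions', [])

    def csrf_token():
        r = session.get(API, params={
            'action': 'query', 'meta': 'tokens', 'format': 'json',
        }).json()
        return r['query']['tokens']['csrftoken']

    def mark_reviewed(revid):
        """Mark one revision as reviewed (FlaggedRevs 'review' action)."""
        return session.post(API, data={
            'action': 'review', 'revid': revid,
            'token': csrf_token(), 'format': 'json',
        }).json()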
== Summary ==
I provide this code in the hope that it is useful. If somebody thinks the described functionality could be provided by the pywikibot project, I would be willing to make the adjustments necessary for the code to be merged.
Best regards
Hannes
On 04/07/2014 05:35 AM, Hannes Röst wrote:
[...]
Have you looked into using mwparserfromhell[1]? It's a true parser which even has C speedups. Support for it is already in pywikibot, it's just not turned on by default.
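For example, the kind of key-value search-and-replace described above takes only a few lines with it:

    import mwparserfromhell

    code = mwparserfromhell.parse(u"{{Infobox|Name=Ethanol|Formel=C2H6O}}")
    for template in code.filter_templates():
        if template.has("Name"):
            template.get("Name").value = "Ethanol (rein)"
    print(code)  # serializes back to wikitext with the replaced value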
[1] https://github.com/earwig/mwparserfromhell
-- Legoktm
Hi Legoktm
No, I haven't looked at it; I actually didn't know it existed. It looks interesting and well tested, so it is definitely better than what I wrote. It probably didn't exist at the time I wrote my code, so back then there was a need for it. As I said, what I have is a partial parser that suited my needs, and I hoped it might be useful for some people. It seemed to work reasonably well for my application of searching and replacing template parameters, but mwparserfromhell appears to have much more functionality.
Hannes
On 12 April 2014 04:07, Legoktm legoktm.wikipedia@gmail.com wrote:
[...]
I'm very interested in the spellchecking: does it allow loading Mozilla/LibreOffice dictionaries/spellcheckers in other languages too?
Nemo
Hi Federico
Sorry for the late reply; I forgot to check the answers to this thread.
Yes, the spellchecking module should be able to load any hunspell dictionary; these are (I think) the same dictionaries used by Mozilla and LibreOffice (see also https://en.wikipedia.org/wiki/Hunspell). You can do this, for example, with the following command:
$ python spellcheck_hunspell.py Wikipedia -dictionary:/usr/share/hunspell/de_DE
which will check the page "Wikipedia" against the given hunspell dictionary (note that two files must exist for this to work: /usr/share/hunspell/de_DE.aff and /usr/share/hunspell/de_DE.dic).
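If you want to call the spellchecker from Python directly, loading a dictionary pair essentially boils down to the following, assuming the pyhunspell bindings (the actual wrapping in spellcheck_hunspell.py may differ):

    # Loading a hunspell dictionary pair via the pyhunspell bindings.
    import hunspell

    checker = hunspell.HunSpell('/usr/share/hunspell/de_DE.dic',
                                '/usr/share/hunspell/de_DE.aff')
    print(checker.spell('Haus'))     # True  -> known word
    print(checker.spell('Hauus'))    # False -> would be flagged
    print(checker.suggest('Hauus'))  # correction candidates, e.g. ['Haus', ...]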
The advantage compared to loading the text into your LibreOffice word processor or letting Mozilla do the spellchecking is that the Python script will attempt to recognize which sections of the text are actually prose and skip things like templates and references. This should reduce the number of false positives. On the German Wikipedia, for the page "Wikipedia" I get 97 hits for words that the hunspell checker does not know (out of a total of 8061 words), over 30 of which are names that appear after the "==Literatur==" section. Most of the rest are also names or English words, which I do not expect hunspell to know. However, all of them are in fact spelled correctly, so on this article the checker flags about 1.2% of all words as false positives. That is probably far lower than if you checked *all* words, but it is still a rather high number of falsely flagged words.
This is why I also provide an implementation that uses a list of "known false" words instead.
Hannes
On 12 April 2014 10:35, Federico Leva (Nemo) nemowiki@gmail.com wrote:
[...]