Dear Pywikipedia team,
I have pushed a few of my coding projects that use pywikipedia (the compat version) to GitHub, and I thought some of you might be interested in the code. I recently had some time to clean the code up and bring it into a (hopefully) usable format, and I would be willing to make further adjustments if you think a few changes would make it more useful to you.
Ultimately, my hope is that the code will find a home in the pywikipedia repository. Some of what I wrote may duplicate functionality already present in the core; if so, I apologize, and you can happily ignore it.
== Template parser ==
https://github.com/hroest/pywikibot-compat/tree/feature/template_parser
For one bot project on the German Wikipedia I had to parse rather complex templates and replace specific fields. The templates contained nested templates, math formulas, and references. I therefore wrote a template parser that parses these templates and returns them as key-value pairs, which makes it easy to query specific keys and replace their values. The code worked well on several thousand templates of the German chemistry project and should be rather straightforward to use. This is library code, so there is no bot associated with it; see templateparser.py and tests/test_templateparser.py.
In order to correctly handle nesting and to differentiate equals signs belonging to key-value pairs from those in mathematical formulas etc., I also had to write a partial MediaWiki syntax parser that recognizes such constructs in wikitext. This code is in textrange_parser.py and makes it possible to extract specific parts of a text (e.g. wikitables, templates, wikilinks, weblinks); tests are in tests/test_textrange_parser.py.
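To give an idea of the intended workflow, here is a hypothetical sketch. The names below are made up for illustration; the actual interface lives in templateparser.py and its tests:

    # Hypothetical usage sketch: function and method names are assumed
    # for illustration; see templateparser.py and
    # tests/test_templateparser.py for the real API.
    import templateparser

    text = u"{{Infobox Chemikalie|Name = Ethanol|Summenformel = C2H6O}}"

    template = templateparser.parse_template(text)  # name assumed
    print(template["Name"])               # -> "Ethanol"
    template["Name"] = u"Ethanol (rein)"  # replace a specific field
    text = template.to_wikitext()         # serialize back (assumed)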
== Spellchecking ==
https://github.com/hroest/pywikibot-compat/tree/feature/spellcheck
I added two new spellchecking bots: one based on hunspell, the same spellchecker that LibreOffice uses (spellcheck_hunspell.py), and another based on a negative list (spellcheck_blacklist.py). They run from the command line. Both parse the given wikitext, skip text ranges that usually do not contain plain human-readable text (templates, tables, etc.), and check each word against a spellchecking engine, which is again either a simple blacklist or a full-blown spellchecker with stemming and morphological analysis like hunspell. These spellcheckers may prove useful because they understand part of the wiki markup and know which parts of a text to spellcheck and which to skip.
The misspelled words can be processed interactively: each word can be confirmed individually and then sent to Wikipedia to be corrected. I have a bot with which I do this semi-automatically, and I have so far corrected 3000+ spelling mistakes on the German Wikipedia: https://de.wikipedia.org/wiki/Spezial:Beitr%C3%A4ge/HRoestTypo
For large-scale processing, one can work through a complete Wikipedia XML dump; for small-scale processing, one can use the Wikipedia web search to find articles containing a specific spelling error and then process only those pages.
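Here is a minimal sketch of the two-step approach described above. The real bots use textrange_parser.py to find the ranges to skip; the crude regexes here are only for illustration:

    # Minimal sketch: strip non-prose ranges, then check the remaining
    # words against a pluggable spellchecking engine.
    import re

    def strip_non_prose(wikitext):
        """Drop ranges that usually contain no plain prose."""
        text = re.sub(r'\{\{[^{}]*\}\}', ' ', wikitext)               # simple templates
        text = re.sub(r'<ref[^>]*>.*?</ref>', ' ', text, flags=re.S)  # references
        text = re.sub(r'\{\|.*?\|\}', ' ', text, flags=re.S)          # wikitables
        return text

    def find_misspellings(wikitext, is_known_word):
        """Yield every word the spellchecking engine does not accept."""
        for word in re.findall(r'\w+', strip_non_prose(wikitext), re.UNICODE):
            if not is_known_word(word):
                yield word

    # Engine 1: a negative list of known misspellings (the blacklist
    # bot); engine 2 would plug in a hunspell checker instead.
    blacklist = {u'Standart'}
    text = u'Der Standart{{Infobox|x=1}} ist falsch.<ref>Quelle</ref>'
    print(list(find_misspellings(text, lambda w: w not in blacklist)))
    # -> ['Standart']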
== Review edits ==
https://github.com/hroest/pywikibot-compat/tree/feature/review_pages
On the German Wikipedia, considerable work goes into reviewing individual edits and marking them as reviewed. The feature/review_pages branch above contains a script called review_pages that performs reviews of revisions semi-automatically. For a given page, it fetches the revision history up to the last reviewed change and displays the diff between the current and the last reviewed version of the article on the command line. The user can then interactively decide to accept the revision, undo the change, or go to the next unreviewed change.
This bot uses the MediaWiki API directly, so it may not actually be suitable for the compat version of pywikipedia. Reviewing, undoing, and retrieving full version histories are all done through the API and can be performed fully asynchronously, which allows a relatively fast interactive response while the bot fetches revision histories and performs the requested review/undo actions in the background.
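To make the API side concrete, here is a rough sketch of the calls involved, written synchronously for clarity (the actual bot issues them asynchronously). It assumes a requests.Session that already carries login cookies and a wiki running the FlaggedRevs extension; it is an illustration, not the code in the branch:

    import requests

    API = 'https://de.wikipedia.org/w/api.php'
    session = requests.Session()  # assumed to be logged in already

    def revision_history(title, limit=50):
        """Fetch recent revisions of a page, newest first."""
        r = session.get(API, params={
            'action': 'query', 'prop': 'revisions', 'titles': title,
            'rvprop': 'ids|user|comment', 'rvlimit': limit,
            'format': 'json',
        }).json()
        page = next(iter(r['query']['pages'].values()))
        return page.get('revisions', [])

    def csrf_token():
        r = session.get(API, params={
            'action': 'query', 'meta': 'tokens', 'format': 'json',
        }).json()
        return r['query']['tokens']['csrftoken']

    def mark_reviewed(revid):
        """Mark one revision as reviewed (FlaggedRevs 'review' action)."""
        return session.post(API, data={
            'action': 'review', 'revid': revid,
            'token': csrf_token(), 'format': 'json',
        }).json()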
== Summary ==
I provide this code in the hope that it is useful. If somebody thinks the described functionality could be provided by the pywikibot project, I would be willing to make the adjustments necessary for the code to be merged.
Best regards
Hannes
On 04/07/2014 05:35 AM, Hannes Röst wrote:
[...]
Have you looked into using mwparserfromhell[1]? It's a true parser which even has C speedups. Support for it is already in pywikibot, it's just not turned on by default.
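For example, the kind of key-value search-and-replace described above takes only a few lines with it:

    import mwparserfromhell

    code = mwparserfromhell.parse(u"{{Infobox|Name=Ethanol|Formel=C2H6O}}")
    for template in code.filter_templates():
        if template.has("Name"):
            template.get("Name").value = "Ethanol (rein)"
    print(code)  # serializes back to wikitext with the replaced value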
[1] https://github.com/earwig/mwparserfromhell
-- Legoktm
Hi Legoktm
No, I haven't looked at it; I actually didn't know it existed. It looks interesting and well tested, so it is definitely better than what I wrote. It probably didn't exist at the time I wrote my code, so back then there was a need for it. As I said, what I have is a partial parser that suited my needs, and I hoped it might be useful for some people. It seemed to work reasonably well for my application of searching and replacing template parameters, but mwparserfromhell appears to have much more functionality.
Hannes
On 12 April 2014 04:07, Legoktm legoktm.wikipedia@gmail.com wrote:
[...]
I'm very interested in the spellchecking: does it allow loading Mozilla/LibreOffice dictionaries/spellcheckers in other languages too?
Nemo
Hi Federico
Sorry for the late reply; I forgot to check the answers to this thread.
Yes, the spellchecking module should be able to load any hunspell dictionary; these are (I think) the same dictionaries used by Mozilla and LibreOffice (see also https://en.wikipedia.org/wiki/Hunspell). You can do this, for example, with the following command:
$ python spellcheck_hunspell.py Wikipedia -dictionary:/usr/share/hunspell/de_DE
which will check the page "Wikipedia" against the given hunspell dictionary (note that two files must exist for this to work: /usr/share/hunspell/de_DE.aff and /usr/share/hunspell/de_DE.dic).
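If you want to call the spellchecker from Python directly, loading a dictionary pair essentially boils down to the following, assuming the pyhunspell bindings (the actual wrapping in spellcheck_hunspell.py may differ):

    # Loading a hunspell dictionary pair via the pyhunspell bindings.
    import hunspell

    checker = hunspell.HunSpell('/usr/share/hunspell/de_DE.dic',
                                '/usr/share/hunspell/de_DE.aff')
    print(checker.spell('Haus'))     # True  -> known word
    print(checker.spell('Hauus'))    # False -> would be flagged
    print(checker.suggest('Hauus'))  # correction candidates, e.g. ['Haus', ...]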
The advantage compared to loading the text into your LibreOffice word processor or letting Mozilla do the spellchecking is that the Python script will attempt to recognize which sections of the text are actually prose and skip things like templates and references. This should reduce the number of false positives. On the German Wikipedia, for the page "Wikipedia" I get 97 hits for words that the hunspell checker does not know (out of a total of 8061 words), over 30 of which are names that appear after the "==Literatur==" section. Most of the rest are also names or English words, which I do not expect hunspell to know. However, all of them are in fact spelled correctly, so on this article the checker flags about 1.2% of all words as false positives. That is probably far lower than if you checked *all* words, but it is still a rather high number of falsely flagged words.
This is why I also provide an implementation that uses a list of "known false" words instead.
Hannes
On 12 April 2014 10:35, Federico Leva (Nemo) nemowiki@gmail.com wrote:
[...]