Luigi,
Here's an example where I use mwparserfromhell to extract links, see the analyse() method, particularly lines 24 and 36–44: https://github.com/nettrom/Wiki-Class/blob/master/wikiclass/features/metrics...
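To give a rough idea of what link extraction with mwparserfromhell can look like, here is a minimal sketch. It is not the code from the linked file. The function name `extract_links` and the regex fallback are made up for illustration, and treating the English `Category:` prefix as category membership is an assumption that will not hold on every wiki. Note that the parser works on a plain string, so no network access is involved.

```python
import re

try:
    import mwparserfromhell  # third-party: pip install mwparserfromhell
except ImportError:
    mwparserfromhell = None

# Rough fallback patterns, far less robust than a real wikitext parser.
WIKILINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")
EXTLINK_RE = re.compile(r"(?<!\[)\[(https?://[^\s\]]+)")


def extract_links(wikitext):
    """Split the links in one page's wikitext into internal, category and external."""
    if mwparserfromhell is not None:
        code = mwparserfromhell.parse(wikitext)
        titles = [str(link.title) for link in code.filter_wikilinks()]
        urls = [str(link.url) for link in code.filter_external_links()]
    else:
        titles = WIKILINK_RE.findall(wikitext)
        urls = EXTLINK_RE.findall(wikitext)

    links = {"internal": [], "category": [], "external": urls}
    for title in titles:
        title = title.strip()
        # Assumption: the English "Category:" prefix marks category membership.
        kind = "category" if title.startswith("Category:") else "internal"
        links[kind].append(title)
    return links
```

Calling `extract_links(text)` on a revision's text returns a dict with the three lists, which you can then store per page title.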
You can download the dumps and process them with Aaron's mwxml library, starting from its example code. For instance, you could modify that code to parse the revision text with mwparserfromhell instead of regular expressions, although that requires far more processing time.
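As a rough offline illustration of the dump-processing idea, the sketch below streams pages out of a dump with only the Python standard library and follows redirect chains. It is a stand-in, not mwxml and not Aaron's example. The helper names (`local`, `iter_pages`, `resolve`) and the tiny inline dump are made up, and it assumes redirect pages carry a `<redirect title="..."/>` element, as in the export files I have looked at.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Drop the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, redirect_target_or_None, text) for each <page>, one at a time."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title, target, text = None, None, ""
        for node in elem.iter():
            name = local(node.tag)
            if name == "title":
                title = node.text
            elif name == "redirect":
                target = node.get("title")
            elif name == "text":
                text = node.text or ""  # last revision wins if there are several
        yield title, target, text
        elem.clear()  # release the page's children once the caller is done with it


def resolve(title, redirects):
    """Follow A -> B -> C to the final target, stopping if a loop is detected."""
    seen = set()
    while title in redirects and title not in seen:
        seen.add(title)
        title = redirects[title]
    return title


# Hypothetical three-page dump, just to exercise the functions.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>A</title><redirect title="B" /><revision><text>#REDIRECT [[B]]</text></revision></page>
  <page><title>B</title><redirect title="C" /><revision><text>#REDIRECT [[C]]</text></revision></page>
  <page><title>C</title><revision><text>Real article linking to [[A]].</text></revision></page>
</mediawiki>"""

redirects = {t: target for t, target, _ in iter_pages(io.BytesIO(SAMPLE)) if target}
final_target = resolve("A", redirects)
```

For a real dump you would pass something like `bz2.open(path, "rb")` instead of the `io.BytesIO` object, and hand each page's text to whatever link extractor you choose.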
Cheers, Morten
On 18 January 2016 at 16:23, Luigi Assom itsawesome.yes@gmail.com wrote:
hi, thank you.
Where can I find documentation or an example for extracting links with https://github.com/earwig/mwparserfromhell or
https://github.com/wikimedia/pywikibot-core/blob/master/pywikibot/xmlreader.... ?
I'd be very grateful if you could point me to an example of link extraction and redirect handling. Should I run them against the XML dump, or as a bot against api.wikimedia? I would like to work offline, but mwparserfromhell seems to work online against api.wikipedia.
Where is the documentation for the scripts on mediawiki.org?
https://www.mediawiki.org/w/index.php?search=xmlparser&title=Special%3AS...
thank you!
On Mon, Jan 18, 2016 at 8:05 PM, Morten Wang nettrom@gmail.com wrote:
An alternative is to use Aaron Halfaker's mediawiki-utilities (https://pypi.python.org/pypi/mediawiki-utilities) together with mwparserfromhell (https://github.com/earwig/mwparserfromhell) to parse the wikitext and extract the links. The latter is already a part of pywikibot, though.
Cheers, Morten
On 18 January 2016 at 10:45, Amir Ladsgroup ladsgroup@gmail.com wrote:
Hey,
There is a really good module in pywikibot called xmlreader.py: https://github.com/wikimedia/pywikibot-core/blob/master/pywikibot/xmlreader.py. There is also documentation generated from the source code: https://doc.wikimedia.org/pywikibot/api_ref/pywikibot.html#module-pywikibot.xmlreader
You can read the source code and write your own script. Some scripts also support xmlreader; see their manuals on mediawiki.org.
Best
On Mon, Jan 18, 2016 at 10:00 PM Luigi Assom itsawesome.yes@gmail.com wrote:
Hello! About pywikibot: is it possible to use it to parse the XML dump?
I am interested in extracting links from pages (internal and external, distinguishing those that belong to categories). I would also like to handle transitive redirects. I would like to process the dump without accessing the wiki, or else access the wiki in batches with proper rate limits.
Is there maybe something in the package that already takes care of this? I've seen that https://www.mediawiki.org/wiki/Manual:Pywikibot/Scripts lists a "ghost" extracting_links.py script. I wanted to ask before reinventing the wheel, and whether pywikibot is a suitable tool for the purpose.
Thank you, L.
_______________________________________________
pywikibot mailing list pywikibot@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/pywikibot