Hi all,
thank you for your suggestions.
I will have a look at all of them. I started with 'mwparserfromhell' - helluva
name!
I found my way with:
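    import json
    from urllib.parse import urlencode
    from urllib.request import urlopen

    import mwparserfromhell

    API_URL = "https://en.wikipedia.org/w/api.php"

    def parse(title):
        # Ask the API for the wikitext of the latest revision of `title`.
        data = {"action": "query", "prop": "revisions", "rvlimit": 1,
                "rvprop": "content", "format": "json", "titles": title}
        raw = urlopen(API_URL, urlencode(data).encode()).read()
        res = json.loads(raw)
        # list() is needed because dict views cannot be indexed in Python 3.
        text = list(res["query"]["pages"].values())[0]["revisions"][0]["*"]
        return mwparserfromhell.parse(text)

    test = parse('DNA')
    # and
    test.filter_wikilinks()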
Some links look like [[Gunther Stent|Stent, Gunther Siegmund]],
with a '|' in the middle. The first part is the canonical form, but what does
the second token represent?
For example, if I try:
    parse('Stent, Gunther Siegmund')
I get:
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<stdin>", line 6, in parse
    KeyError: 'revisions'
What does this error mean? Is it a redirect?
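
For reference, in wikitext a link of the form [[A|B]] is a piped link: A is the target page and B is the text displayed to the reader. The sketch below splits such a link and reads the wikitext out of a query response defensively. It assumes that the API returns a page entry without a "revisions" key when no page exists under the requested title, which would explain the KeyError above. The helper names split_wikilink and extract_wikitext and the two sample responses are made up for illustration.

```python
def split_wikilink(link):
    """Split '[[target|label]]' into (target, label); label is None if absent."""
    inner = link.strip()
    if inner.startswith("[[") and inner.endswith("]]"):
        inner = inner[2:-2]
    target, sep, label = inner.partition("|")
    return target.strip(), (label.strip() if sep else None)


def extract_wikitext(response):
    """Return the wikitext of the first page that has revisions, else None."""
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        revisions = page.get("revisions")
        if revisions:
            return revisions[0]["*"]
    return None


# Hand-written sample responses (illustrative, not captured API output).
found = {"query": {"pages": {"1": {"title": "X",
                                   "revisions": [{"*": "text of X"}]}}}}
missing = {"query": {"pages": {"-1": {"title": "No such page",
                                      "missing": ""}}}}

print(split_wikilink("[[Gunther Stent|Stent, Gunther Siegmund]]"))
print(extract_wikitext(found))
print(extract_wikitext(missing))
```

With a check like this, parse() could return None (or raise a clearer error) for a missing title instead of failing on the dictionary lookup.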
A few more questions about these tools:
1. Where can I find documentation on the methods in
mwparserfromhell?
e.g. the filter_wikilinks() method takes arguments; which ones can I use?
I could not find much here:
http://mwparserfromhell.readthedocs.org/en/latest/api/mwparserfromhell.html…
2. I would like to use it with a generator over a dump.
As I understand it, this module would do the job of fetching pages and piping
them to mwh [aka mwparserfromhell :)]:
https://tools.wmflabs.org/paws/public/EpochFail/examples/mwxml.py.ipynb
Is that correct?
Is there any documentation for this as well?
3. How do I handle redirects and/or curid?
For example, DBpedia analysed recursive redirects; they call them "transitive
redirects".
Is there any module to handle transitive redirects, or redirects in general
(I would do it recursively)?
E.g. if I use the example above:
    parse('dna')
    u'#REDIRECT [[DNA]] {{R from other capitalisation}}'
I would like to obtain 'DNA' directly, or even better a module returning the
_ID of the page (so far I build it myself with a dictionary; I'd like to ask
the MW team whether they could suggest a tool to handle recursive
redirects more efficiently).
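
The dictionary approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the regular expression for '#REDIRECT [[...]]' lines, the helper names redirect_target and resolve, and the sample pages are all hypothetical, and no title normalisation (capitalisation, underscores) is attempted.

```python
import re

# Captures the target of a '#REDIRECT [[Target]]' line (case-insensitive).
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*:?\s*\[\[([^\]|#]+)", re.IGNORECASE)


def redirect_target(wikitext):
    """Return the redirect target named in wikitext, or None if not a redirect."""
    match = REDIRECT_RE.match(wikitext)
    return match.group(1).strip() if match else None


def resolve(title, redirects):
    """Follow redirects until a non-redirect title is reached.

    Returns None if the chain loops back on itself.
    """
    seen = set()
    while title in redirects:
        if title in seen:
            return None
        seen.add(title)
        title = redirects[title]
    return title


# Hand-written sample pages (illustrative only).
pages = {
    "dna": "#REDIRECT [[DNA]] {{R from other capitalisation}}",
    "Dna molecule": "#REDIRECT [[dna]]",
    "DNA": "Deoxyribonucleic acid is a molecule ...",
}

# Build the title -> target map once, then resolve chains against it.
redirects = {}
for page_title, wikitext in pages.items():
    target = redirect_target(wikitext)
    if target is not None:
        redirects[page_title] = target

print(resolve("Dna molecule", redirects))
```

The loop is iterative rather than recursive so that long chains cannot hit the recursion limit, and the seen set guards against redirect loops. When querying the live API rather than a dump, the query action also accepts a redirects parameter that resolves redirects on the server side, which may make a local map unnecessary.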
On Wed, Jan 20, 2016 at 8:27 AM, Morten Wang <nettrom(a)gmail.com> wrote:
Luigi,
Here's an example where I use mwparserfromhell to extract links, see the
analyse() method, particularly lines 24 and 36–44:
https://github.com/nettrom/Wiki-Class/blob/master/wikiclass/features/metric…
You can download the dumps and use Aaron's mwxml library example to
process them, for instance by modifying the code so that it uses
mwparserfromhell to parse the revision text (although that requires far
more processing time) instead of regular expressions.
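
As a rough illustration of the regular-expression variant, here is a minimal sketch that streams pages out of dump-style XML and pulls out link targets. It uses the standard library's ElementTree instead of mwxml so that it runs without extra packages; the helper names, the regular expression, and the tiny embedded sample are assumptions made for the example, and the pattern ignores nested constructs such as image captions.

```python
import io
import re
import xml.etree.ElementTree as ET

# Captures the target part of [[target]] or [[target|label]].
WIKILINK_RE = re.compile(r"\[\[([^\[\]|#]+)")


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) for each <page>, clearing elements as we go."""
    title, text = None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # free the memory held by the finished page


def link_targets(wikitext):
    """Return the targets of all wikilinks found in wikitext."""
    return [target.strip() for target in WIKILINK_RE.findall(wikitext)]


# Tiny hand-written sample in the shape of a dump (illustrative only).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>DNA</title>
    <revision>
      <text>See [[Gunther Stent|Stent, Gunther Siegmund]] and [[Category:Genetics]].</text>
    </revision>
  </page>
</mediawiki>"""

for title, text in iter_pages(io.BytesIO(SAMPLE)):
    print(title, link_targets(text))
```

To switch to the slower but more thorough parser, link_targets() could call mwparserfromhell.parse(text).filter_wikilinks() instead of the regular expression.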
Cheers,
Morten
On 18 January 2016 at 16:23, Luigi Assom <itsawesome.yes(a)gmail.com> wrote:
Hi, thank you.
Where can I find documentation or an example for extracting links with
https://github.com/earwig/mwparserfromhell
or
https://github.com/wikimedia/pywikibot-core/blob/master/pywikibot/xmlreader…
?
I'd be very grateful if you could point me to an example of link
extraction and redirect handling.
Should I use them against the XML dump, or as a bot against api.wikimedia?
I would like to work offline, but mwparserfromhell seems to work online
against api.wikipedia.
Where is the documentation for the scripts on mediawiki.org?
https://www.mediawiki.org/w/index.php?search=xmlparser&title=Special%3A…
Thank you!
On Mon, Jan 18, 2016 at 8:05 PM, Morten Wang <nettrom(a)gmail.com> wrote:
An alternative is to use Aaron Halfaker's
mediawiki-utilities (
https://pypi.python.org/pypi/mediawiki-utilities) together with
mwparserfromhell (https://github.com/earwig/mwparserfromhell) to parse the
wikitext and extract the links. The latter is already a part of pywikibot,
though.
Cheers,
Morten
On 18 January 2016 at 10:45, Amir Ladsgroup <ladsgroup(a)gmail.com> wrote:
Hey,
There is a really good module in pywikibot called
xmlreader.py
<https://github.com/wikimedia/pywikibot-core/blob/master/pywikibot/xmlreader.py>.
Documentation built from the source code is also available:
<https://doc.wikimedia.org/pywikibot/api_ref/pywikibot.html#module-pywikibot.xmlreader>
You can read the source code and write your own script. Some scripts also
support xmlreader; read their manuals on
mediawiki.org.
Best
On Mon, Jan 18, 2016 at 10:00 PM Luigi Assom <itsawesome.yes(a)gmail.com>
wrote:
> Hello hello!
> About the use of pywikibot:
> is it possible to use it to parse the XML dump?
>
> I am interested in extracting links from pages (internal and external,
> distinguishing the ones that belong to a category).
> I would also like to handle transitive redirects.
> I would like to process the dump without accessing the wiki, or else
> access the wiki in batches with proper limits.
>
> Is there maybe something in the package that already takes care of this?
> I've seen that in
> https://www.mediawiki.org/wiki/Manual:Pywikibot/Scripts
> there is a "ghost" "extracting_links.py" script.
> I wanted to ask before reinventing the wheel whether pywikibot is a
> suitable tool for the purpose.
>
> Thank you,
> L.
> _______________________________________________
> pywikibot mailing list
> pywikibot(a)lists.wikimedia.org
>
https://lists.wikimedia.org/mailman/listinfo/pywikibot
>