Hello hello! About the use of pywikibot: is it possible to use it to parse the XML dump?
I am interested in extracting links from pages (internal and external, with a distinction for links belonging to a category). I would also like to handle transitive redirects. I would like to process the dump without accessing the wiki, or else access the wiki in batches with proper rate limits.
Is there perhaps something in the package that already takes care of this? I've seen at https://www.mediawiki.org/wiki/Manual:Pywikibot/Scripts that there is a "ghost" extracting_links.py script; I wanted to ask before re-inventing the wheel, and whether pywikibot is a suitable tool for the purpose.
Thank you, L.
Hey,
There is a really good module in pywikibot called xmlreader.py: https://github.com/wikimedia/pywikibot-core/blob/master/pywikibot/xmlreader.py. API documentation generated from the source code is also available: https://doc.wikimedia.org/pywikibot/api_ref/pywikibot.html#module-pywikibot.xmlreader. You can read the source code and write your own script. Some scripts also support xmlreader; see their manuals on mediawiki.org.
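To give a rough idea of what such a dump reader does, here is a minimal stand-alone sketch using only the standard library. The element names follow the MediaWiki XML export schema; this is an illustration of the streaming approach, not pywikibot's actual implementation:

```python
import io
import xml.etree.ElementTree as ET

# MediaWiki export namespace (the version number varies between dumps)
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def iter_pages(source):
    """Stream <page> elements from a dump, yielding (title, text, is_redirect)."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            text = elem.findtext("%srevision/%stext" % (NS, NS)) or ""
            is_redirect = elem.find(NS + "redirect") is not None
            yield title, text, is_redirect
            elem.clear()  # free memory as we go; real dumps are huge

# tiny inline example dump
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>DNA</title>
    <revision><text>[[Gene]]s are made of DNA.</text></revision>
  </page>
  <page>
    <title>Dna</title>
    <redirect title="DNA"/>
    <revision><text>#REDIRECT [[DNA]]</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.StringIO(SAMPLE)))
```

The same loop works on a real dump file by passing a file object (or a bz2 stream) instead of the StringIO.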
Best
An alternative is Aaron Halfaker's mediawiki-utilities (https://pypi.python.org/pypi/mediawiki-utilities) together with mwparserfromhell (https://github.com/earwig/mwparserfromhell) for parsing the wikitext to extract the links; the latter is already a part of pywikibot, though.
Cheers, Morten
Hi, thank you.
Where can I find documentation with an example of extracting links, for https://github.com/earwig/mwparserfromhell or https://github.com/wikimedia/pywikibot-core/blob/master/pywikibot/xmlreader.... ?
I'd be very grateful if you could point me to an example of link extraction and redirect handling. Should I use these against the XML dump, or as a bot against api.wikimedia? I would like to work offline, but mwparserfromhell seems to work online against api.wikipedia.
Where is the documentation for the scripts on mediawiki.org? https://www.mediawiki.org/w/index.php?search=xmlparser&title=Special%3AS...
thank you!
Here's an example using regular expressions and `mwxml` (a new offshoot of the mediawiki-utilities package referenced above): https://tools.wmflabs.org/paws/public/EpochFail/examples/mwxml.py.ipynb
The example extracts image links from English Wikipedia, but I imagine it would work for you with little modification.
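In the same spirit, here is a simplified regex sketch of link extraction (my own illustration, not the notebook's actual code; note that the "Category:" prefix check below is English-specific, since other languages use different namespace names):

```python
import re

# [[Target]] or [[Target|display text]]; capture only the target
WIKILINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")
# [http://... label] external links; capture only the URL
EXTLINK_RE = re.compile(r"\[(https?://[^\s\]]+)")

def extract_links(wikitext):
    """Split links found in wikitext into internal, category and external."""
    internal, categories = [], []
    for target in WIKILINK_RE.findall(wikitext):
        target = target.strip()
        if target.lower().startswith("category:"):
            categories.append(target)
        else:
            internal.append(target)
    external = EXTLINK_RE.findall(wikitext)
    return internal, categories, external

text = "See [[DNA]] and [[Gene|genes]], [[Category:Genetics]], [http://example.org a site]"
internal, categories, external = extract_links(text)
```

Regexes like these miss edge cases (nested links in image captions, links inside templates), which is why a real parser such as mwparserfromhell is safer when speed is not critical.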
-Aaron
Well, other languages have different namespace names, so that script is English-only.
Surely the regular expressions are editable ;)
Your walkthrough uses Dutch Wikipedia as its example, and currently it does not support Dutch Wikipedia. Would you please fix that?
Luigi,
Here's an example where I use mwparserfromhell to extract links; see the analyse() method, particularly lines 24 and 36–44: https://github.com/nettrom/Wiki-Class/blob/master/wikiclass/features/metrics...
You can download the dumps and use Aaron's mwxml example to process them, for instance by modifying the code so that it uses mwparserfromhell to parse the revision text instead of regular expressions (although that requires far more processing time).
Cheers, Morten
Hi all, and thank you for your suggestions.
I will have a look at all of them. I started with 'mwparserfromhell' (helluva name!).
I found my way with:
import json
from urllib import urlencode
from urllib2 import urlopen

import mwparserfromhell

API_URL = "https://en.wikipedia.org/w/api.php"

def parse(title):
    data = {"action": "query", "prop": "revisions", "rvlimit": 1,
            "rvprop": "content", "format": "json", "titles": title}
    raw = urlopen(API_URL, urlencode(data).encode()).read()
    res = json.loads(raw)
    text = res["query"]["pages"].values()[0]["revisions"][0]["*"]
    return mwparserfromhell.parse(text)
test = parse('DNA') # and
test.filter_wikilinks()
Some links look like [[Gunther Stent|Stent, Gunther Siegmund]], i.e. with a '|' in the middle. The first part is the canonical form, but what does the second token represent? E.g. when I try:

*parse('Stent, Gunther Siegmund')*

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 6, in parse
KeyError: 'revisions'

What does this error mean? Is it a redirect?
A few more questions about these tools:
1. Where can I find documentation on the use of the methods in mwparserfromhell? E.g. the filter_wikilinks() method takes arguments; which ones can I use? I could not find much here: http://mwparserfromhell.readthedocs.org/en/latest/api/mwparserfromhell.html?...
2. I would like to use it with a generator over the dump: I understand this module would do the job of fetching pages and piping them to mwh [aka mwparserfromhell :)]: https://tools.wmflabs.org/paws/public/EpochFail/examples/mwxml.py.ipynb Correct? Is there any documentation for this as well?
3. How do I handle redirects and/or curid? As an example, dbpedia analysed recursive redirects, which they call "transitive redirects". Is there a module to handle transitive redirects (I would do it recursively)?
E.g. if I use the example above, *parse('dna')* returns
u'#REDIRECT [[DNA]] {{R from other capitalisation}}'
I would like to obtain 'DNA' directly, or even better a module returning the _ID of the page (so far I have built this myself with a dictionary; I'd like to ask the MW team whether they can already suggest a tool to handle recursive redirects more efficiently).
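For clarity, here is a minimal sketch of the dictionary approach I mean (the names and the redirect pattern are my own illustration; the redirect table would be collected while streaming the dump):

```python
import re

# crude pattern for "#REDIRECT [[Target]]" wikitext (illustrative only;
# localized redirect keywords exist in other languages)
REDIRECT_RE = re.compile(r"#REDIRECT\s*\[\[([^\]|#]+)", re.IGNORECASE)

def redirect_target(wikitext):
    """Return the redirect target of a page's wikitext, or None."""
    m = REDIRECT_RE.match(wikitext)
    return m.group(1).strip() if m else None

def resolve(title, redirects, max_hops=10):
    """Follow a title -> target mapping transitively, guarding against loops."""
    seen = {title}
    while title in redirects and max_hops > 0:
        title = redirects[title]
        if title in seen:  # redirect cycle: stop and return the last title
            break
        seen.add(title)
        max_hops -= 1
    return title

target = redirect_target(u"#REDIRECT [[DNA]] {{R from other capitalisation}}")
redirects = {"dna": "Dna", "Dna": "DNA"}
```

With this, resolve("dna", redirects) follows the chain dna -> Dna -> DNA in one call, which is the "transitive redirect" behaviour I'm after.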
pywikibot mailing list pywikibot@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/pywikibot