It's completely broken: https://code.google.com/p/wikiteam/issues/detail?id=56 It will download only a fraction of the wiki, 500 pages at most per namespace.
Let me reiterate that https://code.google.com/p/wikiteam/issues/detail?id=44 is a very urgent bug and we've seen no work on it in many months. We need an actual programmer with some knowledge of python to fix it and make the script work properly; I know there are several on this list (and elsewhere), please please help. The last time I, as a non-coder, tried to fix a bug, I made things worse (https://code.google.com/p/wikiteam/issues/detail?id=26).
Only after the API support is implemented/fixed will I be able to re-archive the 4-5 thousand wikis we've recently archived on archive.org (https://archive.org/details/wikiteam), and possibly many more. Many of those dumps contain errors and/or are only partial because of the script's unreliability, and wikis die on a daily basis. (So, quoting emijrp, there IS a deadline.)
Nemo
P.s.: Cc'ing some lists out of desperation; sorry for cross-posting.
Hi all,
I am beginning work on a port to PHP due to some issues regarding unit testing for another project of mine (if you follow me on GitHub, you will know). I hope to help out with fixing the script, but it is a good idea to get someone who knows python (pywikipedia-l people) and the MediaWiki API (mediawiki-api people) to help.
On Fri, Nov 9, 2012 at 6:27 PM, Federico Leva (Nemo) nemowiki@gmail.comwrote:
It's completely broken: https://code.google.com/p/wikiteam/issues/detail?id=56 It will download only a fraction of the wiki, 500 pages at most per namespace.
Let me reiterate that https://code.google.com/p/wikiteam/issues/detail?id=44 is a very urgent bug and we've seen no work on it in many months. We need an actual programmer with some knowledge of python to fix it and make the script work properly; I know there are several on this list (and elsewhere), please please help. The last time I, as a non-coder, tried to fix a bug, I made things worse (https://code.google.com/p/wikiteam/issues/detail?id=26).
Only after the API support is implemented/fixed will I be able to re-archive the 4-5 thousand wikis we've recently archived on archive.org (https://archive.org/details/wikiteam), and possibly many more. Many of those dumps contain errors and/or are only partial because of the script's unreliability, and wikis die on a daily basis. (So, quoting emijrp, there IS a deadline.)
Nemo
P.s.: Cc'ing some lists out of desperation; sorry for cross-posting.
You're searching for the continue parameter as "apfrom", but this was changed to "apcontinue" a while back. Changing line 162 to something like this should probably do it:
m = re.findall(r'<allpages (?:apfrom|apcontinue)="([^>]+)" />', xml)
Note that for full correctness, you probably should omit both apfrom and apcontinue entirely from params the first time around, and send back whichever of the two is found by the above line in subsequent queries.
Also, why in the world aren't you using an XML parser (or a JSON parser with format=json) to process the API response instead of trying to parse the XML using regular expressions?!
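A minimal sketch of that approach, assuming format=json and only the standard library (the API URL is a placeholder and error handling is omitted):

import json
import urllib
import urllib2

api = 'http://example.org/w/api.php'  # placeholder wiki URL
params = {'action': 'query', 'list': 'allpages', 'aplimit': '500', 'format': 'json'}
while True:
    data = json.loads(urllib2.urlopen(api, urllib.urlencode(params)).read())
    for page in data['query']['allpages']:
        print page['title']
    cont = data.get('query-continue', {}).get('allpages')
    if not cont:
        break
    # Feed back whichever key the server returned (apfrom on old wikis,
    # apcontinue on 1.20+) without hard-coding either name.
    params.update(cont)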
On Fri, Nov 9, 2012 at 2:27 AM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
It's completely broken: https://code.google.com/p/wikiteam/issues/detail?id=56 It will download only a fraction of the wiki, 500 pages at most per namespace.
Hi Brad,
You mentioned "a while back" for "apcontinue"; how recent was it? This dump generator is attempting to archive all sorts of MediaWiki versions, so we may need to write a backward-compatibility handler in the script itself.
...and I agree, the code is a total mess. We need to get someone to rewrite the whole thing, soon.
On Fri, Nov 9, 2012 at 11:50 PM, Brad Jorsch bjorsch@wikimedia.org wrote:
You're searching for the continue parameter as "apfrom", but this was changed to "apcontinue" a while back. Changing line 162 to something like this should probably do it:
m = re.findall(r'<allpages (?:apfrom|apcontinue)="([^>]+)" />', xml)
Note that for full correctness, you probably should omit both apfrom and apcontinue entirely from params the first time around, and send back whichever of the two is found by the above line in subsequent queries.
Also, why in the world aren't you using an XML parser (or a JSON parser with format=json) to process the API response instead of trying to parse the XML using regular expressions?!
On Fri, Nov 9, 2012 at 2:27 AM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
It's completely broken: https://code.google.com/p/wikiteam/issues/detail?id=56 It will download only a fraction of the wiki, 500 pages at most per namespace.
On Fri, Nov 9, 2012 at 9:59 AM, Hydriz Wikipedia admin@alphacorp.tk wrote:
Hi Brad,
You mentioned "a while back" for "apcontinue"; how recent was it? This dump generator is attempting to archive all sorts of MediaWiki versions, so we may need to write a backward-compatibility handler in the script itself.
Mayish[1], so 1.20 onwards.
1 - https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/core.git;a=commit;h=2b3f4d...
On Fri, Nov 9, 2012 at 7:59 AM, Hydriz Wikipedia admin@alphacorp.tk wrote:
You mentioned "a while back" for "apcontinue"; how recent was it? This dump generator is attempting to archive all sorts of MediaWiki versions, so we may need to write a backward-compatibility handler in the script itself.
July 2012: http://lists.wikimedia.org/pipermail/mediawiki-api-announce/2012-July/000030...
Any wiki running version 1.19, or a 1.20 snapshot from before mid-July, would be returning the old parameter. If you do it right, though, there's little you have to do. Just use whichever keys are given to you inside the <query-continue> node. Even with your regular expression mess, just capture which key is given as well as the value and use it as the key for your params dict.
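For example, a rough sketch of that capture-the-key approach (params and xml stand for the request dict and raw response already used in dumpgenerator.py):

import re

# Capture both the continuation key (apfrom or apcontinue) and its value from
# the <allpages .../> element, then feed the pair straight back into the
# params dict for the next request. The value may still need HTML entities
# unescaped before reuse.
m = re.search(r'<allpages (apfrom|apcontinue)="([^"]+)" />', xml)
if m:
    params[m.group(1)] = m.group(2)  # e.g. params['apcontinue'] = 'Next_title'
else:
    pass  # no continuation element: this was the last batch of titles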
Brad Jorsch, 09/11/2012 17:30:
On Fri, Nov 9, 2012 at 7:59 AM, Hydriz Wikipedia admin@alphacorp.tk wrote:
You mentioned "a while back" for "apcontinue"; how recent was it? This dump generator is attempting to archive all sorts of MediaWiki versions, so we may need to write a backward-compatibility handler in the script itself.
July 2012: http://lists.wikimedia.org/pipermail/mediawiki-api-announce/2012-July/000030...
Any wiki running version 1.19, or a 1.20 snapshot from before mid-July, would be returning the old parameter. If you do it right, though, there's little you have to do. Just use whichever keys are given to you inside the <query-continue> node. Even with your regular expression mess, just capture which key is given as well as the value and use it as the key for your params dict.
Thank you again for your useful suggestions! However, as already noted, https://www.mediawiki.org/wiki/API:Query#Continuing_queries doesn't give any info about supported releases.
Nemo
P.s.: Small unreliable "temporary" things in MediaWiki, like the "powered by MediaWiki" sentence we grep for, are usually the most permanent ones, although I don't like it.
Question: why don't you use the Pywikipedia framework? I can see about 90% of your code becoming obsolete if you just use the existing framework, and it handles the differences in MediaWiki versions automatically (it can even fall back to screen scraping on sites that have an ancient or missing API).
If you can write up a doc of how dumpgenerator.py should work (ignoring how it currently does, and just focusing on the ideal process) and what you want the outcome to be, writing up a replacement will be easy. I just need specifics on exactly what you want the dump creator to do and how.
On Fri, Nov 9, 2012 at 11:52 AM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
Brad Jorsch, 09/11/2012 17:30:
On Fri, Nov 9, 2012 at 7:59 AM, Hydriz Wikipedia admin@alphacorp.tk wrote:
You mentioned "a while back" for "apcontinue"; how recent was it? This dump generator is attempting to archive all sorts of MediaWiki versions, so we may need to write a backward-compatibility handler in the script itself.
July 2012: http://lists.wikimedia.org/pipermail/mediawiki-api-announce/2012-July/000030...
Any wiki running version 1.19, or a 1.20 snapshot from before mid-July, would be returning the old parameter. If you do it right, though, there's little you have to do. Just use whichever keys are given to you inside the <query-continue> node. Even with your regular expression mess, just capture which key is given as well as the value and use it as the key for your params dict.
Thank you again for your useful suggestions! However, as already noted, https://www.mediawiki.org/wiki/API:Query#Continuing_queries doesn't give any info about supported releases.
Nemo
P.s.: Small unreliable "temporary" things in MediaWiki, like the "powered by MediaWiki" sentence we grep for, are usually the most permanent ones, although I don't like it.
Hydriz Wikipedia, 09/11/2012 16:59:
You mentioned "a while back" for "apcontinue"; how recent was it? This dump generator is attempting to archive all sorts of MediaWiki versions, so we may need to write a backward-compatibility handler in the script itself.
+1 https://www.mediawiki.org/wiki/API:Allpages , https://www.mediawiki.org/wiki/API:Lists and https://www.mediawiki.org/wiki/API:Query#Continuing_queries don't really shed any light.
...and I agree, the code is a total mess. We need to get someone to rewrite the whole thing, soon.
Well, that's in an ideal world. In this one, the best would probably be suggestions for simple libraries to solve such small problems? (Which can become very big if one doesn't follow API evolution very closely or know its history from the beginning of time.)
Nemo
On Fri, Nov 9, 2012 at 11:50 PM, Brad Jorsch wrote:
You're searching for the continue parameter as "apfrom", but this was changed to "apcontinue" a while back. Changing line 162 to something like this should probably do it:
m = re.findall(r'<allpages (?:apfrom|apcontinue)="([^>]+)" />', xml)
Note that for full correctness, you probably should omit both apfrom and apcontinue entirely from params the first time around, and send back whichever of the two is found by the above line in subsequent queries.
Also, why in the world aren't you using an XML parser (or a JSON parser with format=json) to process the API response instead of trying to parse the XML using regular expressions?!
On Fri, Nov 9, 2012 at 2:27 AM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
It's completely broken: https://code.google.com/p/wikiteam/issues/detail?id=56 It will download only a fraction of the wiki, 500 pages at most per namespace.
-- Regards, Hydriz
We've created the greatest collection of shared knowledge in history. Help protect Wikipedia. Donate now: http://donate.wikimedia.org
On Fri, Nov 9, 2012 at 8:48 AM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
Well, that's in an ideal world. In this one, the best would probably be suggestions for simple libraries to solve such small problems?
Since you're using Python, pywikipedia is usually the go-to library. https://www.mediawiki.org/wiki/Manual:Pywikipediabot
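For what it's worth, a rough sketch of what the allpages traversal could look like with the compat-era framework; the module layout (a top-level wikipedia module plus pagegenerators.py) and the exact generator arguments are from memory and may differ between releases:

import wikipedia        # pywikipedia "compat" entry point
import pagegenerators

site = wikipedia.getSite()  # site configured in user-config.py
for page in pagegenerators.AllpagesPageGenerator(start='!', namespace=0, site=site):
    text = page.get()       # wikitext of the latest revision
    # ... feed the title and text into the XML dump writer here ...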
On Fri, Nov 9, 2012 at 8:52 AM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
However, as already noted, https://www.mediawiki.org/wiki/API:Query#Continuing_queries doesn't give any info about supported releases.
Perhaps it could be made more clear in the doc (I think I'll go fix that now), but clients shouldn't be depending on the particular keys given inside the query-continue node beyond identifying which one belongs to the generator.
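A module-agnostic sketch of that principle (ElementTree and the placeholder URL are my assumptions, not part of the original script): whatever elements appear under <query-continue>, copy their attributes verbatim into the next request, without caring whether the key is apfrom, apcontinue, or anything else.

import urllib
import urllib2
import xml.etree.ElementTree as ET

def api_query_all(api, params):
    # Yield each <api> response root, following continuation until exhausted.
    params = dict(params, format='xml')
    while True:
        data = urllib2.urlopen(api, urllib.urlencode(params)).read()
        root = ET.fromstring(data)
        yield root
        cont = root.find('query-continue')
        if cont is None:
            break
        for node in cont:
            params.update(node.attrib)  # e.g. {'apcontinue': 'Next_title'}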
Brad Jorsch, 09/11/2012 18:01:
On Fri, Nov 9, 2012 at 8:48 AM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
Well, that in an ideal world. In this one, the best would probably be suggestions for simple libraries to be used to solve such small problems?
Since you're using Python, pywikipedia is usually the go-to library. https://www.mediawiki.org/wiki/Manual:Pywikipediabot
Thank you, looks like they can indeed help us.
On Fri, Nov 9, 2012 at 8:52 AM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
However, as already noted, https://www.mediawiki.org/wiki/API:Query#Continuing_queries doesn't give any info about supported releases.
Perhaps it could be made more clear in the doc (I think I'll go fix that now), but clients shouldn't be depending on the particular keys given inside the query-continue node beyond identifying which one belongs to the generator.
Thank you very much for expanding the page. Has the query-continue node been the same since the beginning of the API? It may be obvious to you, but I don't think it's written anywhere; please stick a {{MW 1.12}} or whatever there if that's the case.
Nemo