-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Hi all,
I want to crawl around 800,000 flagged revisions from the German Wikipedia, in order to make a dump containing only flagged revisions. For this, I obviously need to spider Wikipedia. What are the limits (rate!) here, what User-Agent should I use, and what caveats do I have to take care of?
Thanks, Marco
PS: I already have a revisions list, created with the Toolserver. I used the following query: "select fp_stable,fp_page_id from flaggedpages where fp_reviewed=1;". Is it correct that this gives me a list of all articles with flagged revs, with fp_stable being the revid of the most recent flagged revision for each article?
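In case it helps, my plan is to batch the revids: api.php seems to accept multiple IDs per request via the revids parameter (50 per request is my assumption about the limit for normal accounts), which would cut 800,000 fetches down to roughly 16,000 requests. A rough sketch, endpoint and limit being my assumptions:

```python
from itertools import islice

API = "https://de.wikipedia.org/w/api.php"  # assumed endpoint

def batch(revids, size=50):
    """Split the revid list into chunks of at most `size`
    (my guess at api.php's revids limit for non-bot accounts)."""
    it = iter(revids)
    while chunk := list(islice(it, size)):
        yield chunk

def request_url(chunk):
    """Build one api.php query URL for a batch of revision IDs."""
    revids = "|".join(str(r) for r in chunk)
    return (f"{API}?action=query&prop=revisions&rvprop=ids|content"
            f"&format=xml&revids={revids}")

# e.g. 119 revids collapse into 3 requests
urls = [request_url(c) for c in batch(range(1, 120))]
```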
Marco Schuster wrote:
I want to crawl around 800,000 flagged revisions from the German Wikipedia, in order to make a dump containing only flagged revisions.
[...]
flaggedpages where fp_reviewed=1;". Is it correct that this gives me a list of all articles with flagged revs,
Don't the XML dumps contain the flag for flagged revs?
// Rolf Lampa
Rolf Lampa wrote:
Marco Schuster wrote:
I want to crawl around 800,000 flagged revisions from the German Wikipedia, in order to make a dump containing only flagged revisions.
[...]
flaggedpages where fp_reviewed=1;". Is it correct that this gives me a list of all articles with flagged revs,
Don't the XML dumps contain the flag for flagged revs?
They don't. And that's very sad.
-- daniel
On Wed, Jan 28, 2009 at 12:49 AM, Rolf Lampa wrote:
Marco Schuster wrote:
I want to crawl around 800,000 flagged revisions from the German Wikipedia, in order to make a dump containing only flagged revisions.
[...]
flaggedpages where fp_reviewed=1;". Is it correct that this gives me a list of all articles with flagged revs,
Don't the XML dumps contain the flag for flagged revs?
The XML dumps are no use for me: way too much overhead (especially since they are old, and I want to use single files; it's easier to process those than one huge XML file). And they don't contain flagged-revision flags :(
Marco
Marco Schuster wrote:
Rolf Lampa wrote:
Don't the XML dumps contain the flag for flagged revs?
The XML dumps are no use for me: way too much overhead (especially since they are old, and I want to use single files; it's easier to process those than one huge XML file). And they don't contain flagged-revision flags :(
I traverse the last enwiki dump (last revision only) in 15 minutes (or the Swedish svwiki in under 3 minutes) with my stream tool (written in Delphi Pascal).
On the go I can copy the whole thing (it takes no longer), and while at it I can create the "big three" SQL tables (page, revision & text) out of the XML dump as well, in less than 20 minutes.
I like Xml dumps. :)
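For those without a stream parser at hand, the same pull-style traversal can be sketched in Python with xml.etree's iterparse. This is a toy fragment of my own making; real dumps are multi-gigabyte and also put an XML namespace on every tag:

```python
import io
import xml.etree.ElementTree as ET

# Toy dump fragment; real dumps are huge, which is why streaming
# (rather than building a full DOM) matters.
SAMPLE = """<mediawiki>
  <page>
    <title>Example</title>
    <revision><id>42</id><text>Hello</text></revision>
  </page>
</mediawiki>"""

def stream_revisions(fileobj):
    """Yield (page title, revision id) pairs one at a time."""
    title = None
    for _event, elem in ET.iterparse(fileobj, events=("end",)):
        if elem.tag == "title":
            title = elem.text
        elif elem.tag == "revision":
            yield title, int(elem.find("id").text)
            elem.clear()  # drop the revision text to keep memory flat

rows = list(stream_revisions(io.StringIO(SAMPLE)))
```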
I'd love, however, to see the flagged rev status as an attribute in one of the tags, for example <revision flagged_rev="true">
Regards,
// Rolf Lampa
Rolf Lampa wrote:
I'd love, however, to see the flagged rev status as an attribute in one of the tags, for example <revision flagged_rev="true">
Regards,
Naw, it's more complex than that. You can have any number of different flags. It would probably have to be <revision><flag>foo</flag><flag>bar</flag>...</revision>.
-- daniel
Daniel Kinzler wrote:
Rolf Lampa wrote:
I'd love, however, to see the flagged rev status as an attribute in one of the tags, for example <revision flagged_rev="true">
Regards,
Naw, it's more complex than that. You can have any number of different flags. It would probably have to be <revision><flag>foo</flag><flag>bar</flag>...</revision>.
-- daniel
It would be "<flagged/>", a child of <revision>, just like <minor/>.
2009/1/28 Platonides Platonides@gmail.com:
Daniel Kinzler wrote:
Rolf Lampa wrote:
I'd love, however, to see the flagged rev status as an attribute in one of the tags, for example <revision flagged_rev="true">
Regards,
Naw, it's more complex than that. You can have any number of different flags. It would probably have to be <revision><flag>foo</flag><flag>bar</flag>...</revision>.
-- daniel
It would be "<flagged/>", a child of <revision>, just like <minor/>.
But, as Daniel said, "flagged" isn't enough; you need to know which flag.
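None of these layouts exists in any dump yet, but the repeated-<flag>-element variant would be trivial to consume. A sketch against that purely hypothetical markup:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup per the <flag> proposal above; no current
# dump actually emits these elements.
SNIPPET = """<revision>
  <id>1234</id>
  <flag>sighted</flag>
  <flag>quality</flag>
</revision>"""

rev = ET.fromstring(SNIPPET)
flags = [f.text for f in rev.findall("flag")]  # every flag, in order
is_flagged = bool(flags)  # "flagged at all" is just non-emptiness
```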
Marco Schuster wrote:
Hi all,
I want to crawl around 800,000 flagged revisions from the German Wikipedia, in order to make a dump containing only flagged revisions. For this, I obviously need to spider Wikipedia. What are the limits (rate!) here, what User-Agent should I use, and what caveats do I have to take care of?
Thanks, Marco
PS: I already have a revisions list, created with the Toolserver. I used the following query: "select fp_stable,fp_page_id from flaggedpages where fp_reviewed=1;". Is it correct that this gives me a list of all articles with flagged revs, with fp_stable being the revid of the most recent flagged revision for each article?
Fetch them from the Toolserver (there's a tool by Duesentrieb for that). It will fetch almost all of them from the Toolserver cluster, and make a request to Wikipedia only if needed.
On Wed, Jan 28, 2009 at 12:53 AM, Platonides wrote:
Marco Schuster wrote:
Hi all,
I want to crawl around 800,000 flagged revisions from the German Wikipedia, in order to make a dump containing only flagged revisions. For this, I obviously need to spider Wikipedia. What are the limits (rate!) here, what User-Agent should I use, and what caveats do I have to take care of?
Thanks, Marco
PS: I already have a revisions list, created with the Toolserver. I used the following query: "select fp_stable,fp_page_id from flaggedpages where fp_reviewed=1;". Is it correct that this gives me a list of all articles with flagged revs, with fp_stable being the revid of the most recent flagged revision for each article?
Fetch them from the Toolserver (there's a tool by Duesentrieb for that). It will fetch almost all of them from the Toolserver cluster, and make a request to Wikipedia only if needed.
I highly doubt this is "legal" use of the Toolserver, and I pretty much guess that fetching 800k revisions would be a huge resource load.
Thanks, Marco
PS: CC-ing toolserver list.
Marco Schuster wrote:
Fetch them from the Toolserver (there's a tool by Duesentrieb for that). It will fetch almost all of them from the Toolserver cluster, and make a request to Wikipedia only if needed.
I highly doubt this is "legal" use of the Toolserver, and I pretty much guess that fetching 800k revisions would be a huge resource load.
Thanks, Marco
PS: CC-ing toolserver list.
It's a legal use; the only problem is that the tool I wrote for it is quite slow. You shouldn't hit it at full speed. So it might actually be better to query the main server cluster; they can distribute the load more nicely.
One day I'll rewrite WikiProxy and everything will be better :)
But by then, I do hope we have revision flags in the dumps, because that would be The Right Thing to use.
-- daniel
On Wed, Jan 28, 2009 at 1:13 AM, Daniel Kinzler wrote:
Marco Schuster wrote:
Fetch them from the Toolserver (there's a tool by Duesentrieb for that). It will fetch almost all of them from the Toolserver cluster, and make a request to Wikipedia only if needed.
I highly doubt this is "legal" use of the Toolserver, and I pretty much guess that fetching 800k revisions would be a huge resource load.
Thanks, Marco
PS: CC-ing toolserver list.
It's a legal use; the only problem is that the tool I wrote for it is quite slow. You shouldn't hit it at full speed. So it might actually be better to query the main server cluster; they can distribute the load more nicely.
What is the best speed, actually? 2 requests per second? Or can I go up to 4?
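Whatever the acceptable rate turns out to be, I can at least keep the client polite: space the requests out and send the API's maxlag parameter, so overloaded servers reject the request instead of queueing it. A rough sketch; the 0.5 s spacing (i.e. 2 requests/second) is just my guess:

```python
import time
import urllib.parse

def throttled(urls, delay=0.5, maxlag=5):
    """Yield request URLs no faster than one per `delay` seconds,
    each tagged with maxlag so the servers can shed load under
    replication lag instead of queueing the request."""
    for url in urls:
        # append with "&" if the URL already has a query string
        sep = "&" if urllib.parse.urlparse(url).query else "?"
        yield f"{url}{sep}maxlag={maxlag}"
        time.sleep(delay)

out = list(throttled(["https://de.wikipedia.org/w/api.php?action=query"],
                     delay=0))
```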
One day i'll rewrite WikiProxy and everything will be better :)
:)
But by then, I do hope we have revision flags in the dumps, because that would be The Right Thing to use.
Still, using the dumps would require me to get the full-history dump, because I only want flagged revisions, not current revisions without the flag.
Marco
Marco Schuster wrote: ...
But by then, I do hope we have revision flags in the dumps, because that would be The Right Thing to use.
Still, using the dumps would require me to get the full-history dump, because I only want flagged revisions, not current revisions without the flag.
Including the latest revision that is flagged "good" would be an obvious feature to implement along with the revision flags. So the "current" dump would have 1-3 revisions per page.
-- daniel
2009/1/28 Daniel Kinzler daniel@brightbyte.de:
Marco Schuster wrote: ...
But by then, I do hope we have revision flags in the dumps, because that would be The Right Thing to use.
Still, using the dumps would require me to get the full-history dump, because I only want flagged revisions, not current revisions without the flag.
Including the latest revision that is flagged "good" would be an obvious feature to implement along with the revision flags. So the "current" dump would have 1-3 revisions per page.
The extension is highly customisable, so different projects will have different flags available. Would you include the latest revision with each flag? The latest revision with any flag? The latest revision with a particular flag chosen for each project?
wikitech-l@lists.wikimedia.org