On Wed, Jan 28, 2009 at 12:53 AM, Platonides wrote:
Marco Schuster wrote:
Hi all,
I want to crawl around 800,000 flagged revisions from the German Wikipedia, in order to make a dump containing only flagged revisions. For this, I obviously need to spider Wikipedia. What are the limits (rate!) here, what User-Agent should I use, and what caveats do I have to take care of?
Thanks, Marco
PS: I already have a revisions list, created with the Toolserver. I used the following query: "select fp_stable,fp_page_id from flaggedpages where fp_reviewed=1;". Is it correct that this gives me a list of all articles with flagged revs, with fp_stable being the revid of the most current flagged rev for each article?
Fetch them from the toolserver (there's a tool by duesentrieb for that). It will catch almost all of them from the toolserver cluster, and make a request to wikipedia only if needed.
I highly doubt this is "legal" use of the toolserver, and I pretty much guess that fetching 800k revisions would be a huge resource load.
Thanks, Marco
PS: CC-ing toolserver list.
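[Purely for illustration, a minimal sketch of what the crawl described above could look like against the regular API, assuming the revid list from the Toolserver query is saved one per line in revids.txt. The User-Agent string, the file name, and the one-second pause are placeholders, not sanctioned values; the acceptable rate is exactly the open question in this thread.]

import json
import time
import urllib.parse
import urllib.request

API = "https://de.wikipedia.org/w/api.php"
# Placeholder User-Agent: a descriptive name plus a contact address.
USER_AGENT = "FlaggedRevsCrawler/0.1 (contact: your-address@example.org)"

def fetch_revision(revid):
    """Fetch one revision (ids, timestamp, wikitext) by its revid."""
    params = urllib.parse.urlencode({
        "action": "query",
        "prop": "revisions",
        "revids": revid,
        "rvprop": "ids|timestamp|content",
        "format": "json",
    })
    req = urllib.request.Request(API + "?" + params,
                                 headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

with open("revids.txt") as f:            # one fp_stable revid per line
    for line in f:
        data = fetch_revision(line.strip())
        # ... append the revision to the local flagged-revisions dump here ...
        time.sleep(1.0)                  # placeholder pause between requests

[Batching several revids per request (the revids parameter accepts multiple values separated by "|") would cut the number of requests considerably.]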
Marco Schuster wrote:
Fetch them from the toolserver (there's a tool by duesentrieb for that). It will catch almost all of them from the toolserver cluster, and make a request to wikipedia only if needed.
I highly doubt this is "legal" use of the toolserver, and I pretty much guess that fetching 800k revisions would be a huge resource load.
Thanks, Marco
PS: CC-ing toolserver list.
It's a legal use; the only problem is that the tool I wrote for it is quite slow. You shouldn't hit it at full speed. So it might actually be better to query the main server cluster, since they can distribute the load more nicely.
One day I'll rewrite WikiProxy and everything will be better :)
But by then, I do hope we have revision flags in the dumps, because that would be The Right Thing to use.
-- daniel
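[As a side note on "querying the main server cluster": the API's maxlag parameter lets a client back off automatically while database replication lag is high, which is one way to avoid hammering the servers. A rough sketch, with the threshold, retry delay, and User-Agent chosen arbitrarily:]

import json
import time
import urllib.parse
import urllib.request

API = "https://de.wikipedia.org/w/api.php"
USER_AGENT = "FlaggedRevsCrawler/0.1 (contact: your-address@example.org)"  # placeholder

def api_get(params, retries=5):
    """Call the API with maxlag set; wait and retry if the servers report lag."""
    query = urllib.parse.urlencode(dict(params, format="json", maxlag=5))
    req = urllib.request.Request(API + "?" + query,
                                 headers={"User-Agent": USER_AGENT})
    for _ in range(retries):
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        if data.get("error", {}).get("code") == "maxlag":
            time.sleep(5)                # arbitrary back-off while replication catches up
            continue
        return data
    raise RuntimeError("gave up: servers kept reporting lag")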
On Wed, Jan 28, 2009 at 1:13 AM, Daniel Kinzler wrote:
Marco Schuster wrote:
Fetch them from the toolserver (there's a tool by duesentrieb for that). It will catch almost all of them from the toolserver cluster, and make a request to wikipedia only if needed.
I highly doubt this is "legal" use of the toolserver, and I pretty much guess that fetching 800k revisions would be a huge resource load.
Thanks, Marco
PS: CC-ing toolserver list.
It's a legal use; the only problem is that the tool I wrote for it is quite slow. You shouldn't hit it at full speed. So it might actually be better to query the main server cluster, since they can distribute the load more nicely.
What is the best speed, actually? 2 requests per second? Or can I go up to 4?
One day I'll rewrite WikiProxy and everything will be better :)
:)
But by then, I do hope we have revision flags in the dumps, because that would be The Right Thing to use.
Still, using the dumps would require me to get the full history dump because I only want flagged revisions and not current revisions without the flag.
Marco
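[For what the filtering step mentioned above might look like: a rough sketch that streams a (decompressed) full-history dump and keeps only the revisions whose id appears in the flagged list. The file names are placeholders, and writing the kept revisions back out is left as a stub.]

import xml.etree.ElementTree as ET

with open("flagged_revids.txt") as f:    # one fp_stable revid per line
    flagged = {line.strip() for line in f}

kept = 0
for _, elem in ET.iterparse("dewiki-pages-meta-history.xml"):
    tag = elem.tag.rsplit("}", 1)[-1]    # drop the export-schema namespace
    if tag == "revision":
        revid = elem.findtext("{*}id")   # the revision's own <id> child (Python 3.8+)
        if revid in flagged:
            kept += 1
            # ... serialize this <revision> element into the filtered dump here ...
    if tag in ("revision", "page"):
        elem.clear()                     # free memory; the full-history dump is huge
print(kept, "flagged revisions found")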
Marco Schuster wrote: ...
But by then, I do hope we have revision flags in the dumps, because that would be The Right Thing to use.
Still, using the dumps would require me to get the full history dump because I only want flagged revisions and not current revisions without the flag.
Including the latest revision which is flagged "good" would be an obvious feature that should be implemented along with including the revision flags. So the "current" dump would have 1-3 revisions per page.
-- daniel
Hi,
On 28.01.2009 at 09:43, Daniel Kinzler daniel@brightbyte.de wrote:
Including the latest revision which is flagged "good" would be an obvious feature that should be implemented along with including the revision flags. So the "current" dump would have 1-3 revisions per page.
Going off-topic: this would really impact a lot of tools that crawl the dump. So please do not change the current dumps, but maybe add another one with only the flagged versions.
Sincerely, APPER