On Wed, Jan 28, 2009 at 12:53 AM, Platonides wrote:
Marco Schuster wrote:
Hi all,
I want to crawl around 800,000 flagged revisions from the German Wikipedia in order to build a dump containing only flagged revisions. For this, I obviously need to spider Wikipedia. What are the rate limits here, what User-Agent should I use, and what caveats do I have to take care of?
Thanks, Marco
PS: I already have a revision list, created on the Toolserver with the following query: "select fp_stable,fp_page_id from flaggedpages where fp_reviewed=1;". Is it correct that this gives me a list of all articles with flagged revisions, with fp_stable being the revision ID of the most recent flagged revision for each article?
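(For illustration only, a minimal sketch of how revision IDs from such a query might be fetched through the public api.php with a conservative request rate and a descriptive User-Agent; the file name, batch size, and one-second delay are assumptions, not official limits or recommendations.)

# Sketch: fetch revisions listed in revids.txt (one revision ID per line)
# from the German Wikipedia API, throttled to roughly one request per second.
# File name, delay, batch size, and User-Agent string are illustrative.
import time
import urllib.parse
import urllib.request

API = "https://de.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "FlaggedRevsDumpBot/0.1 (contact: your-email@example.org)"}

def fetch_revisions(revids):
    """Fetch content for a batch of revision IDs via api.php (max 50 per request)."""
    params = urllib.parse.urlencode({
        "action": "query",
        "prop": "revisions",
        "rvprop": "ids|timestamp|content",
        "revids": "|".join(str(r) for r in revids),
        "format": "json",
    })
    req = urllib.request.Request(f"{API}?{params}", headers=HEADERS)
    with urllib.request.urlopen(req) as resp:
        return resp.read()

with open("revids.txt") as f:
    ids = [int(line) for line in f if line.strip()]

# Request revisions in small batches and sleep between requests,
# so the crawl stays well below an abusive request rate.
for i in range(0, len(ids), 50):
    data = fetch_revisions(ids[i:i + 50])
    # ... write `data` to the dump here ...
    time.sleep(1.0)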
Fetch them from the toolserver (there's a tool by duesentrieb for that). It will fetch almost all of them from the toolserver cluster and make a request to Wikipedia only when needed.
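(A rough sketch of the fallback idea described above: try a local copy on the toolserver cluster first and hit Wikipedia only on a miss. The function names here are placeholders for illustration, not the interface of duesentrieb's actual tool.)

# Sketch of "toolserver first, Wikipedia only if needed".
# cache_lookup() stands in for whatever local store the toolserver tool uses;
# fetch_from_api() stands in for a live request to Wikipedia.
def get_revision_text(revid, cache_lookup, fetch_from_api):
    text = cache_lookup(revid)       # cheap: local toolserver copy
    if text is not None:
        return text
    return fetch_from_api(revid)     # expensive: live request to Wikipedia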
I highly doubt this counts as "legal" use of the toolserver, and I suspect that fetching 800k revisions would be a huge resource load.
Thanks, Marco
PS: CC-ing toolserver list.