[Toolserver-l] [Wikitech-l] Crawling deWP

Marco Schuster marco at harddisk.is-a-geek.org
Tue Jan 27 23:59:29 UTC 2009



On Wed, Jan 28, 2009 at 12:53 AM, Platonides wrote:
> Marco Schuster wrote:
>> Hi all,
>>
>> I want to crawl around 800,000 flagged revisions from the German
>> Wikipedia, in order to build a dump containing only flagged revisions.
>> For this, I obviously need to spider Wikipedia.
>> What are the rate limits here, which User-Agent should I use, and
>> what other caveats do I need to take care of?
>>
>> Thanks,
>> Marco
>>
>> PS: I already have a revision list, created on the Toolserver. I
>> used the following query: "select fp_stable,fp_page_id from
>> flaggedpages where fp_reviewed=1;". Is it correct that this gives me
>> a list of all articles with flagged revisions, fp_stable being the
>> revid of the most recent flagged revision of each article?
>
> Fetch them from the toolserver (there's a tool by duesentrieb for that).
> It will serve almost all of them from the toolserver cluster, and make a
> request to Wikipedia only when needed.
I highly doubt this is a "legal" use of the toolserver, and I suspect
that fetching 800k revisions through it would be a huge resource load.
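
For reference, something like the following is what I have in mind.
This is just an untested sketch: the batch size of 50 revisions per
request and the one-second delay are my own guesses, not official
limits, and the User-Agent string is only an example.

import time
import requests

API = "https://de.wikipedia.org/w/api.php"
HEADERS = {
    # Identify the crawler and give a contact address.
    "User-Agent": "deWP-flagged-dump/0.1 (marco at harddisk.is-a-geek.org)",
}

def fetch_revisions(revids, batch_size=50, delay=1.0):
    """Yield revision objects, batch_size revids per API request."""
    for i in range(0, len(revids), batch_size):
        batch = revids[i:i + batch_size]
        params = {
            "action": "query",
            "prop": "revisions",
            "rvprop": "ids|content",
            "revids": "|".join(str(r) for r in batch),
            "format": "json",
        }
        resp = requests.get(API, params=params, headers=HEADERS, timeout=30)
        resp.raise_for_status()
        pages = resp.json().get("query", {}).get("pages", {})
        for page in pages.values():
            for rev in page.get("revisions", []):
                yield rev
        time.sleep(delay)  # be polite: one request per second

At 50 revisions per request and one request per second, 800,000
revisions come out to 16,000 requests, i.e. roughly four and a half
hours of crawling.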

Thanks, Marco

PS: CC-ing toolserver list.