On Wed, Mar 11, 2009 at 2:40 AM, River Tarnell <river@loreley.flyingparchment.org.uk> wrote:
> Brian:
>> Sure - creating a Lucene index of the entire revision history of all Wikipedias for a WikiBlame extension.
>> A natural language parse of the current revision of the English Wikipedia.
> Can you estimate how many resources (disk/CPU/etc.) would be needed to create and maintain either of these?
A useful baseline to think about might be the database backup dumper. Proposals that require processing the content of every revision are structurally similar to what is required to build and compress a full-history dump. Dump generation is obviously a months-long process for enwiki right now, but if a text service is going to be added to the toolserver, there may be ways to do it that cut down on those bottlenecks.
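To make that concrete, here is a rough sketch (mine, not the dumper's actual code) of the "touch every revision once" pattern: stream a pages-meta-history XML dump and hand each revision's text to whatever indexer gets built on top. The dump filename, the export schema version in the namespace, and the index_revision() hook are all placeholders, not anything that exists on the toolserver today.

    import bz2
    import xml.etree.ElementTree as ET

    # MediaWiki export namespace; adjust the version to whatever the dump declares.
    NS = "{http://www.mediawiki.org/xml/export-0.10/}"

    def index_revision(title, rev_id, timestamp, text):
        """Placeholder for the eventual backend (a Lucene writer, a parser, etc.)."""
        pass

    def stream_dump(path):
        # Stream-parse the compressed dump; a full-history dump will not fit in RAM.
        with bz2.open(path, "rb") as f:
            context = ET.iterparse(f, events=("start", "end"))
            _, root = next(context)          # the <mediawiki> root element
            title = None
            for event, elem in context:
                if event != "end":
                    continue
                if elem.tag == NS + "title":
                    title = elem.text
                elif elem.tag == NS + "revision":
                    rev_id = elem.findtext(NS + "id")
                    ts = elem.findtext(NS + "timestamp")
                    text = elem.findtext(NS + "text") or ""
                    index_revision(title, rev_id, ts, text)
                elif elem.tag == NS + "page":
                    root.clear()             # drop finished pages so memory stays flat

    if __name__ == "__main__":
        stream_dump("enwiki-pages-meta-history.xml.bz2")  # hypothetical filename

Whether the text comes out of the live database or out of the dumper's own output stream, the shape of the work is the same sequential pass over every revision, which is why sharing that pass is where any savings would come from.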
-Robert Rohde