[Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow)

Anthony wikimail at inbox.org
Thu May 17 05:52:59 UTC 2012


On Thu, May 17, 2012 at 1:22 AM, John <phoenixoverride at gmail.com> wrote:
> Anthony, the process is linear: you have a PHP script inserting X rows
> per Y time frame.

Amazing.  I need to switch all my databases to MySQL.  It can insert X
rows per Y time frame, regardless of whether the database is 20
gigabytes or 20 terabytes in size, regardless of whether the average
row is 3K or 1.5K, regardless of whether I'm using a thumb drive or a
RAID array or a cluster of servers, etc.
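
To spell out why that claim can't hold: every insert also has to
maintain the table's indexes, and B-tree maintenance gets slower (and
less cache-friendly) as the table grows.  A quick sketch of the
effect, using SQLite for convenience rather than MySQL, though the
principle is the same:

    import os
    import sqlite3
    import time

    # Insert throughput into an indexed table, measured as it grows.
    # SQLite stands in for MySQL here; B-tree index maintenance drags
    # the rate down in both.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE revs (id INTEGER PRIMARY KEY, sha1 TEXT)")
    conn.execute("CREATE INDEX revs_sha1 ON revs (sha1)")

    BATCH = 100_000
    for i in range(10):
        rows = [(os.urandom(16).hex(),) for _ in range(BATCH)]
        start = time.time()
        conn.executemany("INSERT INTO revs (sha1) VALUES (?)", rows)
        conn.commit()
        print("batch %d: %.0f rows/s" % (i, BATCH / (time.time() - start)))

On a table where the indexes no longer fit in RAM, the rate doesn't
just dip, it falls off a cliff.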

> Yes, rebuilding the externallinks, links, and langlinks tables
> will take some additional time and won't scale.

And this is part of the process too, right?

> However, I have been working
> with the Toolserver since 2007, and I've lost count of the number of
> times that the TS has needed to re-import a cluster (s1-s7); even
> enwiki can be done in a semi-reasonable timeframe.

Re-importing how?  From the compressed XML full history dumps?

> The WMF actually compresses all text
> blobs, not just old versions.

Is http://www.mediawiki.org/wiki/Manual:Text_table still accurate?  Is
WMF using the "gzip" flag or the "object" flag?
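
For anyone following along, the difference matters if you want to read
the text table from outside MediaWiki.  My understanding (an
assumption based on that manual page, not on inspecting WMF's servers)
is that old_flags="gzip" means the blob went through PHP's
gzdeflate(), i.e. raw DEFLATE, while "object" means a serialized PHP
HistoryBlob that packs many revisions together.  Reading a "gzip" blob
from Python would look roughly like this:

    import zlib

    # Sketch of the "gzip" text flag (assumption: raw DEFLATE with no
    # zlib/gzip header, matching PHP's gzdeflate()/gzinflate()).
    wikitext = "'''Example''' wikitext, repeated to compress well. " * 50

    # What MediaWiki does before INSERTing into the text table:
    comp = zlib.compressobj(wbits=-zlib.MAX_WBITS)
    blob = comp.compress(wikitext.encode("utf-8")) + comp.flush()

    # What a reader does when old_flags contains "gzip":
    restored = zlib.decompress(blob, -zlib.MAX_WBITS).decode("utf-8")
    assert restored == wikitext
    print("%d bytes -> %d compressed" % (len(wikitext), len(blob)))

"object" blobs are worse for archivers: you need PHP, or a
reimplementation of the HistoryBlob unserialization, to get the text
back out.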

> A complete download and decompression of simple
> only took 20 minutes on my two-year-old consumer-grade laptop with a
> standard home cable internet connection; the same download on the
> toolserver (minus decompression) was 88s. Yeah, importing will take a
> little longer, but it shouldn't be that big of a deal.

For the full history English Wikipedia it *is* a big deal.

If you think it isn't, stop playing with simple.wikipedia, and tell us
how long it takes to get a mirror up and running of en.wikipedia.
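
Even granting the linear assumption, the back-of-envelope isn't
pretty.  With made-up but plausible revision counts (assumptions for
illustration, not measurements):

    # Naive linear scaling from simple.wikipedia to en.wikipedia.
    # Both revision counts are assumed, circa-2012 orders of magnitude.
    SIMPLE_REVISIONS = 2_000_000
    ENWIKI_REVISIONS = 500_000_000
    SIMPLE_MINUTES = 20  # figure quoted upthread (download + decompress)

    days = SIMPLE_MINUTES * (ENWIKI_REVISIONS / SIMPLE_REVISIONS) / (60 * 24)
    print("naive linear estimate: ~%.1f days" % days)  # ~3.5 days

And that ~3.5 days scales up only the download-and-decompress step
that took 20 minutes on simple, assumes the insert rate never degrades
as the tables and indexes grow, and ignores rebuilding the link tables
entirely.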

Do you plan to run compressOld.php?  Are you going to import
everything in plain text first, and *then* start compressing?  Seems
like an awful lot of wasted hard drive space.
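
The disk math is easy to sketch (both figures below are assumptions
for illustration, not measurements of enwiki):

    # Disk needed if all revision text lands uncompressed first and
    # compressOld.php only runs afterwards.
    UNCOMPRESSED_TB = 10.0  # assumed size of raw full-history text
    RATIO = 10.0            # assumed ratio achieved by compressOld.php

    peak = UNCOMPRESSED_TB
    final = UNCOMPRESSED_TB / RATIO
    print("peak: ~%.0f TB, final: ~%.0f TB, transient waste: ~%.0f TB"
          % (peak, final, peak - final))

You have to provision storage for the peak, not the final size.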

> There will also be some needed cleanup tasks.
> However, the main point is that archiving and restoring WMF wikis
> isn't an issue, and with moderately recent hardware it's no big deal.
> I'm putting my money where my mouth is and getting actual valid stats
> and figures. Yes, it may not be an exact 1:1 ratio when scaling up,
> but given the basics of how importing a dump functions it should
> remain close to the same ratio.

If you want to put your money where your mouth is, import
en.wikipedia.  It'll only take 5 days, right?
