[Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )
Anthony
wikimail at inbox.org
Thu May 17 05:52:59 UTC 2012
On Thu, May 17, 2012 at 1:22 AM, John <phoenixoverride at gmail.com> wrote:
> Anthony, the process is linear: you have a PHP script inserting X number of rows per
> Y time frame.
Amazing. I need to switch all my databases to MySQL. It can insert X
rows per Y time frame, regardless of whether the database is 20
gigabytes or 20 terabytes in size, regardless of whether the average
row is 3K or 1.5K, regardless of whether I'm using a thumb drive or a
RAID array or a cluster of servers, etc.
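The point is easy to demonstrate with a toy micro-benchmark (SQLite here, purely for illustration; the table and index names are made up): on a B-tree index, per-row insert cost grows as the table grows, so the insert rate is not a flat X rows per Y time frame.

```python
import sqlite3, time, random, string

# Hypothetical micro-benchmark, NOT the Wikimedia import process: insert
# equal-sized batches into an indexed table and time each batch.  As the
# index grows, maintaining it costs more per row, so later batches are
# slower even though they are the same size.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revision (id INTEGER PRIMARY KEY, sha1 TEXT)")
conn.execute("CREATE INDEX rev_sha1 ON revision (sha1)")

def random_sha1():
    return "".join(random.choices(string.hexdigits, k=40))

for batch in range(5):
    rows = [(random_sha1(),) for _ in range(50_000)]
    start = time.perf_counter()
    conn.executemany("INSERT INTO revision (sha1) VALUES (?)", rows)
    conn.commit()
    print(f"batch {batch}: {time.perf_counter() - start:.2f}s for 50,000 rows")

total = conn.execute("SELECT COUNT(*) FROM revision").fetchone()[0]
print(total)  # 250000
```

The exact timings depend on hardware, but the trend is what matters: same batch size, rising cost.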
> Yes, rebuilding the externallinks, links, and langlinks tables
> will take some additional time and won't scale.
And this is part of the process too, right?
> However, I have been working
> with the toolserver since 2007 and I've lost count of the number of times
> that the TS has needed to re-import a cluster (s1-s7), and even enwiki can
> be done in a semi-reasonable timeframe.
Re-importing how? From the compressed XML full history dumps?
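If it is from the XML dumps, the reading side at least can be done as a stream, along these lines (the tiny "dump" below is synthetic, not a real WMF dump; this sketches only the parse, not the database insert):

```python
import bz2
import io
import xml.etree.ElementTree as ET

# Sketch: re-importing from a bz2-compressed XML history dump by
# stream-decompressing and parsing incrementally, so the whole dump
# never has to sit in memory at once.
fake_dump = b"<mediawiki>" + b"".join(
    b"<page><title>Page%d</title>"
    b"<revision><id>%d</id><text>some wikitext</text></revision>"
    b"</page>" % (i, i) for i in range(3)
) + b"</mediawiki>"
compressed = bz2.compress(fake_dump)

revisions = 0
with bz2.open(io.BytesIO(compressed)) as stream:
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "revision":
            revisions += 1
        elem.clear()  # discard parsed subtrees as we go

print(revisions)  # 3
```

Parsing is rarely the bottleneck, though; the insert side is where the scaling question lives.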
> The WMF actually compresses all text
> blobs not just old versions.
Is http://www.mediawiki.org/wiki/Manual:Text_table still accurate? Is
WMF using gzip or object?
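For reference, assuming the Manual:Text_table description still applies, a "gzip"-flagged text row is just the revision text run through PHP's gzdeflate(), i.e. raw DEFLATE with no header. A rough Python equivalent:

```python
import zlib

# Rough equivalent of a "gzip"-flagged text blob, assuming the
# Manual:Text_table description is still accurate: PHP gzdeflate() is raw
# DEFLATE, which in Python's zlib means negative window bits (no header).
text = ("Some revision wikitext. " * 100).encode("utf-8")

compressor = zlib.compressobj(wbits=-15)
blob = compressor.compress(text) + compressor.flush()

restored = zlib.decompress(blob, wbits=-15)
assert restored == text
print(len(text), len(blob))  # repetitive wikitext compresses well
```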
> complete download and decompression of simple
> only took 20 minutes on my 2-year-old consumer-grade laptop with a standard
> home cable internet connection; the same download on the toolserver (minus
> decompression) was 88s. Yes, importing will take a little longer, but it
> shouldn't be that big of a deal.
For the full history English Wikipedia it *is* a big deal.
If you think it isn't, stop playing with simple.wikipedia, and tell us
how long it takes to get a mirror up and running of en.wikipedia.
Do you plan to run compressOld.php? Are you going to import
everything in plain text first, and *then* start compressing? Seems
like an awful lot of wasted hard drive space.
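For a sense of how much space is at stake: batch compression of old revisions gains most of its savings by compressing runs of adjacent revisions together, because consecutive revisions are near-duplicates of each other. A rough illustration with synthetic revisions (not real dump data):

```python
import zlib

# Build a synthetic page history: each revision is the previous text
# plus one small edit, mimicking how real revisions differ only slightly.
revisions = []
text = "The quick brown fox jumps over the lazy dog. " * 50
for i in range(100):
    text += f"Edit number {i} appends this sentence. "
    revisions.append(text.encode("utf-8"))

plain = sum(len(r) for r in revisions)                       # uncompressed
individually = sum(len(zlib.compress(r)) for r in revisions) # one blob per rev
together = len(zlib.compress(b"\x00".join(revisions)))       # one blob per run

print(plain, individually, together)
assert together < individually < plain
```

Which is exactly why importing everything uncompressed first, then compressing afterwards, temporarily needs the full plain-text footprint on disk.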
> There will also be some needed cleanup tasks.
> However, the main issue, archiving and restoring WMF wikis, isn't an issue, and
> with moderately recent hardware is no big deal. I'm putting my money where my
> mouth is, and getting actual valid stats and figures. Yes, it may not be an
> exact 1:1 ratio when scaling up, but given the basics of how importing
> a dump functions it should remain close to the same ratio.
If you want to put your money where your mouth is, import
en.wikipedia. It'll only take 5 days, right?