On Thu, May 17, 2012 at 1:22 AM, John <phoenixoverride(a)gmail.com> wrote:
Anthony, the process is linear: you have a PHP script inserting X
number of rows per Y time frame.
Amazing. I need to switch all my databases to MySQL. It can insert X
rows per Y time frame, regardless of whether the database is 20
gigabytes or 20 terabytes in size, regardless of whether the average
row is 3K or 1.5K, regardless of whether I'm using a thumb drive or a
RAID array or a cluster of servers, etc.
Yes, rebuilding the externallinks, links, and langlinks tables will
take some additional time and won't scale.
And this is part of the process too, right?
However, I have been working with the toolserver since 2007 and I've
lost count of the number of times that the TS has needed to re-import
a cluster (s1-s7), and even enwiki can be done in a semi-reasonable
timeframe.
Re-importing how? From the compressed XML full history dumps?
The WMF actually compresses all text
blobs not just old versions.
Is http://www.mediawiki.org/wiki/Manual:Text_table still accurate?
Is WMF using gzip or object?
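
For reference, here is a rough sketch of what those two flags mean when
reading a row back, going by that manual page. It assumes the flags
column is a comma-separated list and that "gzip" means a raw DEFLATE
stream of the kind PHP's gzdeflate() produces; the function name
decode_text_blob is made up for illustration and is not part of
MediaWiki. "object" and "external" rows are deliberately not handled.

```python
import zlib


def decode_text_blob(blob: bytes, flags: str) -> str:
    """Decode one text-table blob according to its flags (illustrative only)."""
    flag_set = {f.strip() for f in flags.split(",") if f.strip()}

    if "external" in flag_set:
        # The blob is only a pointer to external storage; not handled here.
        raise NotImplementedError("external storage is not handled in this sketch")

    if "gzip" in flag_set:
        # Assumption: raw DEFLATE with no zlib/gzip header, hence negative wbits.
        blob = zlib.decompress(blob, -zlib.MAX_WBITS)

    if "object" in flag_set:
        # The blob is a serialized PHP object; unserializing it is out of scope.
        raise NotImplementedError("serialized PHP objects are not handled in this sketch")

    # Assumption: the text is UTF-8; legacy encodings are ignored.
    return blob.decode("utf-8")
```

Which of the two is in use matters for the import, because "object" rows
cannot be read with a plain inflate call.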
complete download and decompression of simple only took 20 minutes on
my 2-year-old consumer-grade laptop with a standard home cable
internet connection; the same download on the toolserver (minus
decompression) was 88s. Yeah, importing will take a little longer but
shouldn't be that big of a deal.
For the full history English Wikipedia it *is* a big deal.
If you think it isn't, stop playing with simple.wikipedia, and tell us
how long it takes to get a mirror up and running of en.wikipedia.
Do you plan to run compressOld.php? Are you going to import
everything in plain text first, and *then* start compressing? Seems
like an awful lot of wasted hard drive space.
There will also be some needed cleanup tasks. However, the main issue,
archiving and restoring WMF wikis, isn't an issue, and with moderately
recent hardware is no big deal. I'm putting my money where my mouth
is, and getting actual valid stats and figures. Yes, it may not be an
exactly 1:1 ratio when scaling up; however, given the basics of how
importing a dump functions, it should remain close to the same ratio.
If you want to put your money where your mouth is, import
en.wikipedia. It'll only take 5 days, right?
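
To be concrete about what "close to the same ratio" means, the
extrapolation is a constant-rate calculation like the sketch below. The
function name and its inputs are placeholders, not measurements from
this thread, and the result holds only if the insert rate really stays
flat as the database grows, which is exactly the point in dispute.

```python
def linear_import_estimate(sample_rows: int, sample_seconds: float,
                           target_rows: int) -> float:
    """Estimate import time in seconds, assuming a constant rows-per-second rate."""
    if sample_rows <= 0 or sample_seconds <= 0:
        raise ValueError("sample_rows and sample_seconds must be positive")
    # Rate observed on the small sample import.
    rows_per_second = sample_rows / sample_seconds
    # Pure linear scaling: ignores index growth, row size, and hardware limits.
    return target_rows / rows_per_second
```

Feed it the row count and wall-clock time from the simple import and
the row count for en.wikipedia, then compare against an actual run.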