[Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow)

Anthony wikimail at inbox.org
Thu May 17 11:23:01 UTC 2012


On Thu, May 17, 2012 at 2:06 AM, John <phoenixoverride at gmail.com> wrote:
> On Thu, May 17, 2012 at 1:52 AM, Anthony <wikimail at inbox.org> wrote:
>> On Thu, May 17, 2012 at 1:22 AM, John <phoenixoverride at gmail.com> wrote:
>> > Anthony the process is linear, you have a php inserting X number of rows
>> > per
>> > Y time frame.
>>
>> Amazing.  I need to switch all my databases to MySQL.  It can insert X
>> rows per Y time frame, regardless of whether the database is 20
>> gigabytes or 20 terabytes in size, regardless of whether the average
>> row is 3K or 1.5K, regardless of whether I'm using a thumb drive or a
>> RAID array or a cluster of servers, etc.
>
> When referring to X over Y time, it's an average of, say, 1000 revisions
> per 1 minute; any X over Y period must be considered with averages in
> mind, or getting a count wouldn't be possible.

The *average* en.wikipedia revision is more than twice the size of the
*average* simple.wikipedia revision.  The *average* performance of a
20 gig database is faster than the *average* performance of a 20
terabyte database.  The *average* performance of your laptop's thumb
drive is different from the *average* performance of a(n array of)
drive(s) which can handle 20 terabytes of data.
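
For what it's worth, this is testable.  Here's a minimal sketch (Python,
with sqlite3 standing in for MySQL; the ~3K fake revision size and the
batch counts are made-up illustrations, not anyone's actual import code)
of how one could measure whether insert throughput really stays constant
as the table grows:

    import sqlite3
    import time

    # Sketch only, not MediaWiki code: measure whether insert throughput
    # stays constant as a table grows.  sqlite3 stands in for MySQL; the
    # ~3K fake revision and the batch size are illustrative assumptions.
    db = sqlite3.connect("bench.db")
    db.execute("CREATE TABLE IF NOT EXISTS text"
               " (old_id INTEGER PRIMARY KEY, old_text BLOB)")

    fake_revision = b"x" * 3072  # roughly an average en.wikipedia revision
    batch = [(fake_revision,) for _ in range(10000)]

    for round_no in range(10):
        start = time.time()
        db.executemany("INSERT INTO text (old_text) VALUES (?)", batch)
        db.commit()
        rate = len(batch) / (time.time() - start)
        print("round %d: %.0f rows/sec" % (round_no, rate))
    db.close()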

> If you set up your server/hardware correctly it will compress the text
> information during insertion into the database

Is this how you set up your simple.wikipedia test?  How long does it
take to import the data if you're using the same compression mechanism
as the WMF (which you didn't answer, but I assume is concatenation and
compression)?  How exactly does this work "during insertion" anyway?
Does it intelligently group sets of revisions together to avoid
decompressing and recompressing the same revision several times?  I
suppose it's possible, but that would introduce quite a lot of
complication into the import script, slowing things down dramatically.
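
To make "concatenate and compress during insertion" concrete, here's a
rough sketch in Python (rather than MediaWiki's PHP) of the batching it
would require.  The group size, delimiter, and store_blob() stand-in
are my assumptions, not the WMF's actual import code:

    import zlib

    GROUP_SIZE = 100     # assumed number of revisions per compressed blob
    SEPARATOR = b"\x00"  # assumed delimiter between concatenated revisions

    def store_blob(blob):
        # Stand-in for the actual INSERT into the text storage table.
        print("stored %d compressed bytes" % len(blob))

    def import_revisions(revisions):
        group = []
        for rev_text in revisions:
            group.append(rev_text)
            if len(group) == GROUP_SIZE:
                store_blob(zlib.compress(SEPARATOR.join(group)))
                group = []
        if group:  # flush the final partial group
            store_blob(zlib.compress(SEPARATOR.join(group)))

    import_revisions([b"revision %d text" % i for i in range(250)])

Done this way, each revision is compressed exactly once; re-grouping
revisions after the fact would mean decompressing and recompressing
them, which is exactly the complication in question.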

What about the answers to my other questions?

>> If you want to put your money where your mouth is, import
>> en.wikipedia.  It'll only take 5 days, right?
>
> If I actually had a server or the disk space to do it I would, just to
> prove your smartass comments are as stupid as they actually are. However,
> given my current resource limitations (fairly crappy internet connection,
> older laptops, and lack of HDD space) I tried to select something that
> could give reliable benchmarks. If you're willing to foot the bill for
> the new hardware I'll gladly prove my point.

What you seem to be saying is that you're *not* putting your money
where your mouth is.

Anyway, if you want, I'll make a deal with you.  A neutral third party
rents the hardware at Amazon Web Services (AWS).  We import the
simple.wikipedia full history (concatenating and compressing during
import).  We take the ratio of the number of revisions in en.wikipedia
to the number of revisions in simple.wikipedia.  We import the
en.wikipedia full history (concatenating and compressing during
import).  If the ratio of the time it takes to import en.wikipedia vs.
simple.wikipedia is greater than or equal to twice the ratio of
revisions, then you reimburse the third party.  If the ratio of import
times is less than twice the ratio of revisions (you claim the process
is linear, so the time ratio should equal the revision ratio), then I
reimburse the third party.
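
In other words (Python, with placeholder numbers rather than the real
2012 dump figures), the settlement rule is just:

    # Placeholder counts and times, not real 2012 dump figures.
    revs_en, revs_simple = 500000000, 2500000  # assumed revision counts
    time_en, time_simple = 40.0, 0.2           # assumed import times (days)

    revision_ratio = revs_en / revs_simple     # 200x with these numbers
    time_ratio = time_en / time_simple         # 200x with these numbers

    if time_ratio >= 2 * revision_ratio:
        print("John reimburses the third party")     # clearly not linear
    else:
        print("Anthony reimburses the third party")  # close enough to linear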

Either way, we save the new dump, with the processing already done,
and send it to archive.org (and WMF if they're willing to host it).
So we actually get a useful result out of this.  It's not just for the
purpose of settling an argument.

Either of us can concede defeat at any point and stop the experiment.
At that point, if the neutral third party wishes to pay to continue
the job, s/he would be responsible for the additional costs.

Shouldn't be too expensive.  If you concede defeat after 5 days, then
your CPU-time costs are $54 (assuming a High-Memory Extra Large
instance).  Including 4 terabytes of EBS (which should be enough if
you compress on the fly), the total for 5 days should be less than
$100.
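
The arithmetic behind those figures, with the instance rate backed out
of the $54 number itself and the EBS rate left as a parameter (2012
pricing isn't stated in this thread):

    # Arithmetic implied by the figures above; the EBS price is left as
    # a parameter rather than assumed.
    hours = 5 * 24                    # 5 days
    cpu_cost = 54.0                   # quoted CPU-time cost
    implied_rate = cpu_cost / hours   # $0.45/hour for the instance

    def five_day_total(ebs_rate_per_gb_month, ebs_gb=4 * 1024):
        # CPU cost plus EBS prorated to 5 of ~30 days in a month.
        return cpu_cost + ebs_gb * ebs_rate_per_gb_month * (5 / 30.0)

    print("implied instance rate: $%.2f/hour" % implied_rate)
    print("5-day total at $0.05/GB-month EBS: $%.0f" % five_day_total(0.05))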

I'm tempted to do it even if you don't take the bet.


