[Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

Thu May 17 11:27:07 UTC 2012

I'd like to point out that the increasingly technical nature of this
conversation probably belongs either on wikitech-l, or off-list, and that
the strident nature of the comments is fast approaching inappropriate.

Alex
Wikimedia-l list administrator

2012/5/17 Anthony <wikimail at inbox.org>

> On Thu, May 17, 2012 at 2:06 AM, John <phoenixoverride at gmail.com> wrote:
> > On Thu, May 17, 2012 at 1:52 AM, Anthony <wikimail at inbox.org> wrote:
> >> On Thu, May 17, 2012 at 1:22 AM, John <phoenixoverride at gmail.com>
> wrote:
> >> > Anthony the process is linear, you have a php inserting X number of
> rows
> >> > per
> >> > Y time frame.
> >>
> >> Amazing.  I need to switch all my databases to MySQL.  It can insert X
> >> rows per Y time frame, regardless of whether the database is 20
> >> gigabytes or 20 terabytes in size, regardless of whether the average
> >> row is 3K or 1.5K, regardless of whether I'm using a thumb drive or a
> >> RAID array or a cluster of servers, etc.
> >
> > When refering to X over Y time, its an average of a of say 1000 revisions
> > per 1 minute, any X over Y period must be considered with averages in
> mind,
> > or getting a count wouldnt be possible.
>
> The *average* en.wikipedia revision is more than twice the size of the
> *average* simple.wikipedia revision.  The *average* performance of a
> 20 gig database is faster than the *average* performance of a 20
> terabyte database.  The *average* performance of your laptop's thumb
> drive is different from the *average* performance of a(n array of)
> drive(s) which can handle 20 terabytes of data.
>
> > If you setup your sever/hardware correctly it will compress the text
> > information during insertion into the database
>
> Is this how you set up your simple.wikipedia test?  How long does it
> take import the data if you're using the same compression mechanism as
> WMF (which, you didn't answer, but I assume is concatenation and
> compression).  How exactly does this work "during insertion" anyway?
> Does it intelligently group sets of revisions together to avoid
> decompressing and recompressing the same revision several times?  I
> suppose it's possible, but that would introduce quite a lot of
> complication into the import script, slowing things down dramatically.
>
> What about the answers to my other questions?
>
> >> If you want to put your money where your mouth is, import
> >> en.wikipedia.  It'll only take 5 days, right?
> >
> > If I actually had a server or the disc space to do it I would, just to
> prove
> > your smartass comments as stupid as they actually are. However given my
> > current resource limitations (fairly crappy internet connection, older
> > laptops, and lack of HDD) I tried to select something that could give
> > reliable benchmarks. If your willing to foot the bill for the new
> hardware
> > Ill gladly prove my point
>
> What you seem to be saying is that you're *not* putting your money
> where your mouth is.
>
> Anyway, if you want, I'll make a deal with you.  A neutral third party
> rents the hardware at Amazon Web Services (AWS).  We import
> simple.wikipedia full history (concatenating and compressing during
> import).  We take the ratio of revisions in simple.wikipedia to the
> ratio of revisions in en.wikipedia.  We import en.wikipedia full
> history (concatenating and compressing during import).  If the ratio
> of time it takes to import en.wikipedia vs simple.wikipedia is greater
> than or equal to twice the ratio of revisions, then you reimburse the
> third party.  If the ratio of import time is less than twice the ratio
> of revisions (you claim it is linear, therefore it'll be the same
> ratio), then I reimburse the third party.
>
> Either way, we save the new dump, with the processing already done,
> and send it to archive.org (and WMF if they're willing to host it).
> So we actually get a useful result out of this.  It's not just for the
> purpose of settling an argument.
>
> Either of us can concede defeat at any point, and stop the experiment.
>  At that point if the neutral third party wishes to pay to continue
> the job, s/he would be responsible for the additional costs.
>
> Shouldn't be too expensive.  If you concede defeat after 5 days, then
> your CPU-time costs are $54 (assuming Extra Large High Memory
> Instance).  Including 4 terabytes of EBS (which should be enough if
> you compress on the fly) for 5 days should be less than $100.
>
> I'm tempted to do it even if you don't take the bet.
>
> _______________________________________________
> Wikimedia-l mailing list
> Wikimedia-l at lists.wikimedia.org
> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l
>