[Labs-l] [Xmldatadumps-l] Updates and a question

Bryan White bgwhite at gmail.com
Mon Jun 8 21:11:21 UTC 2015


I'm not understanding everything, so be patient with an old dog.  Also,
this is more of a problem with WMFLabs.

WMFLabs currently doesn't copy over a dump until everything is completely
finished.  For smaller dumps, the dump was finished and copied over to
WMFLabs relatively quickly.  For larger dumps (enwiki, dewiki, frwiki) it
would take 1-3 weeks, and so the copy became useless for my purposes.
Therefore, for the larger languages, I'd manually copy the specific file in
the dump I use.

Now, with the new changes, in theory:
1) No language will be copied over until more than a week after its dump
has started.
2) The majority of languages (i.e. small ones) will be finished within the
same short window.
3) For me, the majority of dumps are now rendered useless by the time they
are copied over to WMFLabs.

I say "in theory" because I noticed that some of the really small languages
have finished within 4 days.

Bryan



On Mon, Jun 8, 2015 at 2:07 AM, Ariel T. Glenn <aglenn at wikimedia.org> wrote:

> To catch everyone who would stop reading right after the updates, let me
> put the question first.
>
> Who uses the abstract dumps?  Anyone here?  Anyone you know? Please
> forward this to other lists where there might be users of these dumps.
> We're trying to figure out if we need to keep generating them or not.
>
> Now the updates.
>
> We got more space for the dumps server, which means we don't need to
> reduce the number of dumps kept for some time.  You'll also see other
> items showing up there soon-ish, not part of the xml dumps.
>
> We've long had a request to run stubs early on in the dumps process so
> that stats can be produced right away, and we finally have that going.
> As of this month all dump runs will be done in stages, stubs first, then
> tables, then page logs, and then the rest.  I'm open to negotiation
> about the order of jobs after the stubs, if folks have other
> preferences.
>
> We've worked around the eternal php memory leak(s), which lets us now
> run 7 workers for small wikis at once.  This means we'll get through
> those dumps quicker.
>
> Nemo_bis did some testing with an option to 7zip that gives much faster
> compression with a relatively small increase in size. I've adopted that
> everywhere and we should see the difference, primarily in the big wikis,
> from this month on.
>
> New code brings new bugs.  This month's stub and page log runs for
> smaller wikis may have a duplicate entry at the end, the last item
> appearing twice.  This has been fixed for all future runs.  It shouldn't
> have a real impact on stats but folks importing from these dumps should
> be aware.
>
>
> Happy June,
>
> Ariel