On Mon, 08-06-2015, at 15:11 -0600, Bryan White wrote:
I'm not understanding everything, so be patient
with an old dog.
Also, this is more of a problem with WMFLabs.
WMFLabs currently doesn't copy over a dump until everything is
completely finished. Smaller dumps finished and were copied over to
WMFLabs relatively quickly. Larger dumps (enwiki, dewiki, frwiki)
would take 1-3 weeks, so the copies became useless for my purposes.
For those larger languages, I'd therefore manually copy over the
specific file in the dump that I use.
You are quite right that the dumps don't get copied over to labs until
they are complete. This is a function of having limited space: we
don't want to delete an old dump to make room for a new one until we
know the new one finished successfully.
That was fine for the way we used to run dumps; it's not as good a fit
now, though the first phases we run, stubs and tables, finish up in a
couple of days.
So your question really is about the bigger wikis. So that I have a
better understanding of your use case: which files do you use, and
what makes them out of date for your purposes if they are, say, a week
old?
Thanks,
Ariel
Now, with the new changes, in theory:
1) No language will be copied over until more than a week after the
dump has started.
2) The majority of languages (i.e. the small ones) will be finished
within the same short window.
3) For me, the majority of dumps are now rendered useless by the time
they are copied over to WMFLabs.
I say "in theory" because I noticed some of the really small languages
have finished within 4 days.
Bryan
On Mon, Jun 8, 2015 at 2:07 AM, Ariel T. Glenn <aglenn(a)wikimedia.org>
wrote:
To catch everyone who would stop reading right after the updates, let
me put the question first.

Who uses the abstract dumps? Anyone here? Anyone you know? Please
forward this to other lists where there might be users of these dumps.
We're trying to figure out if we need to keep generating them or not.
Now the updates.

We got more space for the dumps server, which means we don't need to
reduce the number of dumps kept for some time. You'll also see other
items showing up there soon-ish, not part of the xml dumps.
We've long had a request to run stubs early on in the dumps process so
that stats can be produced right away, and we finally have that going.
As of this month all dump runs will be done in stages: stubs first,
then tables, then page logs, and then the rest. I'm open to
negotiation about the order of jobs after the stubs, if folks have
other preferences.
We've worked around the eternal php memory leak(s), which lets us now
run 7 workers for small wikis at once. This means we'll get through
those dumps quicker.
Nemo_bis did some testing with an option to 7zip that gives much
faster compression with a relatively small increase in size. I've
adopted it everywhere, and we should see the difference, primarily in
the big wikis, from this month on.
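For context, the trade-off being described is the standard compression-level dial: lower levels run much faster for a modest size penalty. The message doesn't name the exact 7zip option, so the sketch below illustrates the principle with Python's lzma module, which implements the same LZMA algorithm that 7zip uses; the preset values and sample data are illustrative only.

```python
# Illustrative only: lower LZMA presets compress faster at a small cost
# in output size, the same trade-off as 7zip's compression-level switch.
import lzma
import time

# Hypothetical stand-in for repetitive XML dump content.
data = b"<page><title>Example</title><text>stub</text></page>\n" * 50_000

for preset in (1, 6, 9):  # 9 is lzma's maximum; 6 is its default
    start = time.perf_counter()
    compressed = lzma.compress(data, preset=preset)
    elapsed = time.perf_counter() - start
    print(f"preset={preset}: {len(compressed):>8} bytes in {elapsed:.3f}s")
```

On large inputs the time difference between the low and high presets is typically far larger than the size difference, which matches the behavior reported above.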
New code brings new bugs. This month's stub and page log runs for
smaller wikis may have a duplicate entry at the end, the last item
appearing twice. This has been fixed for all future runs. It shouldn't
have a real impact on stats, but folks importing from these dumps
should be aware.
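Importers can guard against the duplicated last entry with a simple check while reading: skip any item whose identifier matches the one just seen. A minimal sketch (the page-dict structure and key function here are hypothetical, not the dump schema):

```python
# Minimal guard for the duplicate-last-entry bug described above:
# skip any item whose key equals the previous item's key.
# The item structure below is hypothetical.
def dedupe_consecutive(items, key=lambda item: item):
    """Yield items, dropping any item whose key repeats the previous one."""
    prev = object()  # sentinel that never equals a real key
    for item in items:
        k = key(item)
        if k != prev:
            yield item
        prev = k

# Example: a dump stream whose last entry appears twice.
pages = [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 3}]
print(list(dedupe_consecutive(pages, key=lambda p: p["id"])))
# → [{'id': 1}, {'id': 2}, {'id': 3}]
```

Because the check only compares adjacent items, it works on a streamed dump without buffering the whole file.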
Happy June,
Ariel
_______________________________________________
Xmldatadumps-l mailing list
Xmldatadumps-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l