Felipe Ortega <glimmer_phoenix@...> writes:
enwiki doesn't have a valid pages-meta-history.xml.7z file now.
Current dump (20070402) is still in progress (and BTW 1 item already
failed, so I don't expect it to be a valid one).
I don't _think_ that's necessarily true; the complete dump doesn't use the
same format as the pagelinks.sql dump, so it shouldn't depend on it directly.
If the current pages-meta-current.xml.bz2 is valid, there's no reason that
I'm aware of (big disclaimer!) that the -history one won't be. (I hope my
mouth isn't writing cheques that'd have to be cashed by someone else's, eh,
fingers to an infeasible extent here.) Fingers crossed all 'round, anyway.
Congratulations on the RSS syndication service, this
is a *good* improvement
Seconded. The new format of the backup-index page is also much more
convenient, and obviously automating the runs is a splendid idea.
Dan Vanderkam <danvdk@...> writes:
I've been monitoring the bz2 dump with
curl -I
http://download.wikimedia.org/enwiki/20070402/enwiki-20070402-pages-meta-hi…
Dan, that's genius! I didn't think to test whether there was a file "in
place" before the dumper creates an HTML link to it, and am currently
kicking myself (hard to do when typing sitting down, too).
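For anyone wanting to script that check rather than eyeball the `curl -I`
output, here's a minimal sketch (the helper name and sample header values are
mine, not anything from the actual server) that pulls the Content-Length out
of raw HTTP response headers, which is the same figure `curl -I` prints:

```python
def content_length(raw_headers):
    """Extract Content-Length (bytes) from raw HTTP response headers,
    as printed by `curl -I`. Returns None if the header is absent."""
    for line in raw_headers.splitlines():
        name, _, value = line.partition(":")
        if name.strip().lower() == "content-length":
            return int(value.strip())
    return None

# Illustrative headers, in the shape curl -I would show (values invented):
sample = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: application/octet-stream\r\n"
    "Content-Length: 32212254720\r\n"
)
print(content_length(sample))  # current size of the partial file, in bytes
```

Poll that periodically and you get the growth numbers Dan is quoting without
watching the terminal.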
It's grown by 5GB in the past day, which would
imply it'd reach ~90GB in
15 days, well before the listed ETA of 5-19. Here's hoping...
Hrm, hoping indeed, but the ETA is more in line with the length of time it
took last time (which did come down somewhat during the run, though).
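For what it's worth, that extrapolation is trivial to redo as the run
progresses; a back-of-envelope sketch (the current size and the ~90GB target
are illustrative guesses consistent with Dan's figures, not measurements):

```python
def days_to_target(current_gb, growth_gb_per_day, target_gb):
    """Linear extrapolation: days remaining until the file hits target size."""
    remaining_gb = target_gb - current_gb
    return remaining_gb / growth_gb_per_day

# Dan's figures: ~5 GB/day growth, final size guessed at ~90 GB.
# If the file currently sits at, say, 15 GB (an assumed number):
print(days_to_target(15, 5, 90))  # -> 15.0 days
```

Of course that assumes the growth rate stays linear, which last time's run
suggests it may well not.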
Which brings me to: as great as these improvements are, it does seem to be
the case that we'll be lucky if we see an enwiki dump any more often than
once every 6-7 weeks. (About a week for en's "turn" to come along; one
week for the "small" en: dumps to complete; plus however long for the
page-history dump to finish.) For people who need the full dumps, doubtless
that's about as good as can be hoped, but my perception is that there's
more people who just use various smaller dumps. (I may be utterly biased,
too, since I'm in the latter group.) Additionally, I'm assuming that the
people who use the "full" dumps don't _also_ need the "small" ones, since
it's just some of the same information in a different format, right?
Accordingly, it occurs to me to wonder if it'd be feasible for the
smallest and the largest dumps for en: to be "unsynched", and done
separately? That way the smaller ones would hopefully roll around every
2-3 weeks (keeping the mirrors and the people looking for "toolserver-like"
data happy) but the larger ones would still happen "in the fullness of
time", for those people that do need that data.
I realize this might not be a wildly popular suggestion in the dev camp,
requiring as it does further changes to a just-revised system to introduce
a doubtless-annoying special case, so apologies for that...
Cheers,
Alai.