Felipe Ortega <glimmer_phoenix@...> writes:
- enwiki doesn't have a valid pages-meta-history.xml.7z file now.
Current dump (20070402) is still in progress (and BTW 1 item already failed, so I don't expect it to be a valid one).
I don't _think_ that's necessarily true; the complete dump doesn't use the same format as the pagelinks.sql dump, so shouldn't depend on it directly. If the current pages-meta-current.xml.bz2 is valid, there's no reason that I'm aware of (big disclaimer!) that the -history one won't be. (I hope my mouth isn't writing cheques that'd have to be cashed by someone else's, eh, fingers to an infeasible extent here.) Fingers crossed all 'round, anyway.
Congratulations on the RSS syndication service, this is a *good* improvement
Seconded. The new format of the backup-index page is also much more convenient, and obviously automating the runs is a splendid idea.
Dan Vanderkam <danvdk@...> writes:
I've been monitoring the bz2 dump with curl -I
http://download.wikimedia.org/enwiki/20070402/enwiki-20070402-pages-meta-his...
Dan, that's genius! I didn't think to test if there was a file "in place" before the dumper creates an html link to it, and am currently kicking myself (hard to do when typing sitting down, too).
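For anyone who wants to try the same trick, here's a minimal sketch of the idea: `curl -I` issues an HTTP HEAD request, so you see the file's current `Content-Length` without downloading any of the (eventually tens-of-GB) dump itself. The helper below just parses that header; the URL is the truncated one from the thread, so substitute the real path. (This is my own illustration, not anything from the dump scripts.)

```shell
# Extract the Content-Length value from HTTP headers read on stdin.
# Strips the CR that HTTP lines end with, then matches the header name
# case-insensitively.
content_length() {
  tr -d '\r' | awk 'tolower($1)=="content-length:" {print $2}'
}

# Example usage (network call, so commented out; fill in the full filename):
# size=$(curl -sI "http://download.wikimedia.org/enwiki/20070402/..." \
#          | content_length)
# echo "current size: $size bytes"
```

Run it twice a day apart and the difference gives you the growth rate directly.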
It's grown by 5GB in the past day, which would imply it'll reach ~90GB in 15 days, well before the listed ETA of 5-19. Here's hoping...
Hrm, hoping indeed, but the ETA is more in line with the length of time it took last time (which did come down somewhat during the run, though).
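For what it's worth, the back-of-the-envelope arithmetic behind Dan's figure works out like this (the ~15GB current size is my inference from his numbers, not something stated in the thread):

```shell
# Assumed figures from the thread: ~5 GB/day growth, ~90 GB projected final
# size, which implies roughly 15 GB written so far.
current=15   # GB already dumped (inferred)
target=90    # GB projected final size
rate=5       # GB/day observed growth

echo $(( (target - current) / rate ))   # days remaining -> 15
```

Of course the growth rate won't stay constant (later pages have longer histories), which is presumably why the official ETA is so much more pessimistic.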
Which brings me to: as great as these improvements are, it does seem to be the case that we'll be lucky if we see an enwiki dump any more often than once every 6-7 weeks. (About a week for en's "turn" to come along; one week for the "small" en: dumps to complete; plus however long for the page-history dump to finish.) For people who need the full dumps, doubtless that's about as good as can be hoped, but my perception is that there are more people who just use the various smaller dumps. (I may be utterly biased, too, since I'm in the latter group.) Additionally, I'm assuming that the people who use the "full" dumps don't _also_ need the "small" ones, since it's just some of the same information in a different format, right?
Accordingly, it occurs to me to wonder if it'd be feasible for the smallest and the largest dumps for en: to be "unsynched", and done separately? That way the smaller ones would hopefully roll around every 2-3 weeks (keeping the mirrors and the people looking for "toolserver-like" data happy) but the larger ones would still happen "in the fullness of time", for those people that do need that data.
I realize this might not be a wildly popular suggestion in the dev camp, requiring as it does further changes to a just-revised system to introduce a doubtless-annoying special case, so apologies for that...
Cheers, Alai.