Hello.
I've tried to get the latest completed dump (pages-meta-history) of dewiki, and I've found only two prior dumps available, neither of them valid.
On every page at http://download.wikimedia.org/your_language_here/ I've found only two links to prior versions, valid for some languages and damaged for others (for example dewiki and, of course, enwiki).
Is this a temporary failure? On the latest pages I've found new links for RSS syndication (please excuse me if I missed an announcement about this). What is that all about?
Thank you.
Regards,
Felipe Ortega.
On 08/04/07, Felipe Ortega glimmer_phoenix@yahoo.es wrote:
I've tried to get the latest completed dump (pages-meta-history) of dewiki, and I've found only two prior dumps available, neither of them valid.
What's the status of dumps?
Dumps are really important - otherwise the database is basically not forkable. People are stopped from just crawling the pages, after all.
- d.
Felipe Ortega wrote:
I've tried to get the latest completed dump (pages-meta-history) of dewiki, and I've found only two prior dumps available, neither of them valid.
They should be valid. Can you be more specific? (There are a couple individual file dumps which were explicitly canceled during testing.)
Older versions are now automatically deleted instead of manually and intermittently; this prevents numerous new failures due to disk space fillage.
On the latest pages I've found new links for RSS syndication (please excuse me if I missed an announcement about this). What is that all about?
It's an RSS feed for updates; pretty self-explanatory?
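If you want to consume it from a script, it's a couple of lines of standard RSS reading (a minimal sketch; the feed URL below is a placeholder, take the real link from the download pages):

# Minimal reader for the dump-update feeds. FEED_URL is a placeholder;
# use the actual feed link from the download pages. A standard RSS 2.0
# layout (<item> with <title> and <link>) is assumed.
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = 'http://download.wikimedia.org/...'  # placeholder

with urllib.request.urlopen(FEED_URL) as resp:
    root = ET.parse(resp).getroot()

for item in root.iter('item'):
    print(item.findtext('title'), '->', item.findtext('link'))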
-- brion vibber (brion @ wikimedia.org)
Brion VIBBER wrote:
Felipe Ortega wrote:
I've tried to get the latest completed dump (pages-meta-history) of dewiki, and I've found only two prior dumps available, neither of them valid.
They should be valid. Can you be more specific? (There are a couple individual file dumps which were explicitly canceled during testing.)
Older versions are now automatically deleted instead of manually and intermittently; this prevents numerous new failures due to disk space fillage.
I hope they're automatically deleted only if the ones that follow are successful.
I'll be more specific, Brion:
- enwiki doesn't have a valid pages-meta-history.xml.7z file now. The current dump (20070402) is still in progress (and BTW one item has already failed, so I don't expect it to be valid). The previous dump (20070401) has 7 failed items and is not valid either.
In short, we can't currently access a valid complete dump for enwiki.
The rest of the languages do currently have a correct complete dump (all revisions with complete text and metadata, pages-meta-history.xml.7z).
I think the status of the current dump should be checked before previous ones are deleted. For example, I need the whole dump to compute the length of every revision for the next WikiXRay graphics.
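To make the use case concrete, this is roughly the computation I run (a minimal sketch, assuming the standard <page>/<revision>/<text> layout of the export XML; feed it the decompressed stream, e.g. 7za e -so dump.7z | python revlen.py):

# Print (revision id, text length) for every revision in a
# pages-meta-history XML stream read from stdin. Tags are matched by
# local name since the export schema's namespace URI varies by version.
import sys
import xml.etree.ElementTree as ET

def local(tag):
    return tag.rsplit('}', 1)[-1]  # strip the '{namespace}' prefix

for event, elem in ET.iterparse(sys.stdin.buffer, events=('end',)):
    name = local(elem.tag)
    if name == 'revision':
        rev_id, length = None, 0
        for child in elem:
            if local(child.tag) == 'id':
                rev_id = child.text
            elif local(child.tag) == 'text':
                length = len(child.text or '')
        print(rev_id, length)
        elem.clear()  # keep memory bounded on a multi-gigabyte stream
    elif name == 'page':
        elem.clear()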
Congratulations on the RSS syndication service; this is a *good* improvement (and you're right, it was self-explanatory, but I wanted to ask before making wrong assumptions ;-) ). Now my research job will be a lot easier.
Thanks. Saludos.
Felipe.
Platonides Platonides@gmail.com wrote: Brion VIBBER wrote:
Felipe Ortega wrote:
I've tried to get the latest completed dump (pages-meta-history) of dewiki, and I've found only two prior dumps available, neither of them valid.
They should be valid. Can you be more specific? (There are a couple individual file dumps which were explicitly canceled during testing.)
Older versions are now automatically deleted instead of manually and intermittently; this prevents numerous new failures due to disk space fillage.
I hope they're automatically deleted only if the ones that follow are successful.
Felipe Ortega wrote:
I'll be more specific, Brion:
- enwiki doesn't have a valid pages-meta-history.xml.7z file now.
The current dump (20070402) is still in progress (and BTW one item has already failed, so I don't expect it to be valid). The previous dump (20070401) has 7 failed items and is not valid either.
In short, we can't currently access a valid complete dump for enwiki.
The rest of the languages do currently have a correct complete dump (all revisions with complete text and metadata, pages-meta-history.xml.7z).
I think the status of the current dump should be checked before previous ones are deleted. For example, I need the whole dump to compute the length of every revision for the next WikiXRay graphics.
That would be an improvement, yes.
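Roughly along these lines, perhaps (just a sketch to pin the idea down; the 'done' marker is hypothetical, and the real runner would consult its own status records):

# Illustrative only: delete older dump directories just when the newest
# retained run finished cleanly. The 'done' marker file is hypothetical;
# the actual dump runner would check its own status instead.
import os
import shutil

def run_succeeded(dump_dir):
    return os.path.exists(os.path.join(dump_dir, 'done'))  # hypothetical marker

def prune_old_dumps(root, keep=1):
    runs = sorted(d for d in os.listdir(root)
                  if os.path.isdir(os.path.join(root, d)))
    newest = runs[-keep:]
    # Only prune if every retained run completed successfully.
    if newest and all(run_succeeded(os.path.join(root, d)) for d in newest):
        for d in runs[:-keep]:
            shutil.rmtree(os.path.join(root, d))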
-- brion
Sorry, Brion, one more question.
Have we definitely lost the last good dump of enwiki (I think from Feb 2007)?
It's just a matter of scheduling for my research (to know when I could retrieve it and start the script process).
Thank you. BTW, I agree that the automatic update method is much better (and more 'maintainable') than the previous one. Great job, as always ;-).
Felipe.
Brion Vibber brion@wikimedia.org wrote: Felipe Ortega wrote:
I'll be more specific, Brion:
- enwiki doesn't have a valid pages-meta-history.xml.7z file now.
The current dump (20070402) is still in progress (and BTW one item has already failed, so I don't expect it to be valid). The previous dump (20070401) has 7 failed items and is not valid either.
In short, we can't currently access a valid complete dump for enwiki.
The rest of the languages do currently have a correct complete dump (all revisions with complete text and metadata, pages-meta-history.xml.7z).
I think the status of the current dump should be checked before previous ones are deleted. For example, I need the whole dump to compute the length of every revision for the next WikiXRay graphics.
That would be an improvement, yes.
-- brion
On 10/04/07, Felipe Ortega glimmer_phoenix@yahoo.es wrote:
Sorry, Brion, one more question. Have we definitely lost the last good dump of enwiki (I think from Feb 2007)? It's just a matter of scheduling for my research (to know when I could retrieve it and start the script process).
Does Jeff Merkey still have a good (-enough) en:wp dump set up on Wikigadugi?
- d.
Felipe Ortega <glimmer_phoenix@...> writes:
- enwiki doesn't have a valid pages-meta-history.xml.7z file now.
The current dump (20070402) is still in progress (and BTW one item has already failed, so I don't expect it to be valid).
I don't _think_ that's necessarily true; the complete dump doesn't use the same format as the pagelinks.sql dump, so shouldn't depend on it directly. If the current pages-meta-current.xml.bz2 is valid, there's no reason that I'm aware of (big disclaimer!) that the -history one won't be. (I hope my mouth isn't writing cheques that'd have to be cashed by someone else's, eh, fingers to an infeasible extent here.) Fingers crossed all 'round, anyway.
Congratulations on the RSS syndication service; this is a *good* improvement
Seconded. The new format of the backup-index page is also much more convenient, and obviously automating the runs is a splendid idea.
Dan Vanderkam <danvdk@...> writes:
I've been monitoring the bz2 dump with curl -I
http://download.wikimedia.org/enwiki/20070402/enwiki-20070402-pages-meta-his...
Dan, that's genius! I didn't think to test if there was a file "in place" before the dumper creates an html link to it, and am currently kicking myself (hard to do when typing sitting down, too).
It's grown by 5GB in the past day, which would imply it reaches ~90GB in 15 days, well before the listed ETA of 5-19. Here's hoping...
Hrm, hoping indeed, but the ETA is more in line with the length of time it took last time (which did come down somewhat during the run, though).
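For anyone else who wants to watch it, here's Dan's trick wrapped in a small poller (a sketch only; the filename is my guess at the URL abbreviated above, so check it against the real link):

# Poll the in-progress dump with HEAD requests (same idea as curl -I)
# and report the growth rate. The filename is an assumption based on
# the abbreviated URL quoted above; verify it before relying on it.
import time
import urllib.request

URL = ('http://download.wikimedia.org/enwiki/20070402/'
       'enwiki-20070402-pages-meta-history.xml.bz2')  # assumed filename

def size_bytes(url):
    req = urllib.request.Request(url, method='HEAD')
    with urllib.request.urlopen(req) as resp:
        return int(resp.headers['Content-Length'])

prev = size_bytes(URL)
while True:
    time.sleep(3600)  # poll hourly
    cur = size_bytes(URL)
    print('size %.1f GB, growing %.1f GB/day'
          % (cur / 2**30, (cur - prev) * 24 / 2**30))
    prev = cur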
Which brings me to: as great as these improvements are, it does seem to be the case that we'll be lucky if we see an enwiki dump any more often than once every 6-7 weeks. (About a week for en's "turn" to come along; one week for the "small" en: dumps to complete; plus however long for the page-history dump to finish.) For people who need the full dumps, doubtless that's about as good as can be hoped, but my perception is that there are more people who just use the various smaller dumps. (I may be utterly biased, too, since I'm in the latter group.) Additionally, I'm assuming that the people who use the "full" dumps don't _also_ need the "small" ones, since it's just some of the same information in a different format, right?
Accordingly, it occurs to me to wonder if it'd be feasible for the smallest and the largest dumps for en: to be "unsynched", and done separately? That way the smaller ones would hopefully roll around every 2-3 weeks (keeping the mirrors and the people looking for "toolserver-like" data happy) but the larger ones would still happen "in the fullness of time", for those people that do need that data.
I realize this might not be a wildly popular suggestion in the dev camp, requiring as it does further changes to a just-revised system to introduce a doubtless-annoying special case, so apologies for that...
Cheers, Alai.