Please can somebody verify that http://download.wikimedia.org/dewiki/ is the right place to download dumps from Wikipedia? The last dump which I can see is from Feb 7.
Is there another place or another technique for Wikipedia mirror operators to get recent dumps?
Thank you
Jochen Magnus
Jochen Magnus wrote:
Please can somebody verify that http://download.wikimedia.org/dewiki/ is the right place to download dumps from Wikipedia? The last dump which I can see is from Feb 7.
Yep.
Note that:
- There isn't a regular schedule.
- Not all wikis run at the same time.
- The very large wikis run less frequently.
- The last few days' dumps are on hold as I'm fixing some longstanding problems in the dump generator. New dumps will commence within a few days.
Is there another place or another technique for Wikipedia mirror operators to get recent dumps?
Nope.
-- brion vibber (brion @ pobox.com / brion @ wikimedia.org)
Brion Vibber wrote:
Jochen Magnus wrote:
Is there another place or another technique for Wikipedia mirror operators to get recent dumps?
Nope.
(Note that we do offer an update feed service on a paid basis. If you're making money republishing the wiki and want close-to-live pages, you could do worse than to call in to the office and talk about that. But we only offer the dumps on a general, as-is basis.)
-- brion vibber (brion @ pobox.com / brion @ wikimedia.org)
Jochen Magnus wrote:
Please can somebody verify that http://download.wikimedia.org/dewiki/ is the right place to download dumps from Wikipedia? The last dump which I can see is from Feb 7.
Yep.
Note that:
- There isn't a regular schedule
- Not all wikis run at the same time.
- The very large wikis run less frequently
- The last few days dumps are on hold as I'm fixing some longstanding
problems in the dump generator. New dumps will commence within a few days.
Potentially crazy idea, but given the above, what if there was a per-backed-up-wiki RSS / ATOM feed that would include a notification when new dumps were generated? That way the content (say WP:EN for example) could be carried on any Planets (or in any feed-reading software), and instead of being a pull-type system (with people having to check back regularly for dumps, which are generated on an irregular schedule), it becomes more of a push-type system (you get notified when dumps are complete, and if you want them, then you go get them).
I'm not sure if that helps anyone though (i.e. there is no point in doing it if nobody is interested in the notifications).
All the best, Nick.
Nick Jenkins wrote:
Note that:
- There isn't a regular schedule
- Not all wikis run at the same time.
- The very large wikis run less frequently
- The last few days dumps are on hold as I'm fixing some longstanding
problems in the dump generator. New dumps will commence within a few days.
Potentially crazy idea, but given the above, what if there was a per-backed-up-wiki RSS / ATOM feed that would include a notification when new dumps were generated?
Crazy, but good crazy. :)
I'll slip that into my updates; updating an RSS .xml file would not be difficult at all.
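Something like this would do it -- just a rough sketch, so the file names and layout below are placeholders rather than whatever the dump script ends up using:

#!/usr/bin/env python
# Rough sketch: write a tiny one-item RSS 2.0 feed announcing a finished dump.
# File names and URLs here are placeholders, not the real dump-script layout.
import time
from xml.sax.saxutils import escape

def write_dump_feed(path, wiki, dump_url, dump_date):
    pubdate = time.strftime("%a, %d %b %Y %H:%M:%S +0000", time.gmtime())
    rss = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>{0} dump status</title>
    <link>{1}</link>
    <description>Latest database dump for {0}</description>
    <item>
      <title>{0} dump {2} complete</title>
      <link>{1}</link>
      <pubDate>{3}</pubDate>
    </item>
  </channel>
</rss>
""".format(escape(wiki), escape(dump_url), escape(dump_date), pubdate)
    with open(path, "w") as f:
        f.write(rss)

write_dump_feed("dewiki-rss.xml", "dewiki",
                "http://download.wikimedia.org/dewiki/latest/", "20070327")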
-- brion vibber (brion @ pobox.com / brion @ wikimedia.org)
Nick Jenkins wrote:
Potentially crazy idea, but given the above, what if there was a per-backed-up-wiki RSS / ATOM feed that would include a notification when new dumps were generated?
Brion Vibber <brion@...> writes:
I'll slip that into my updates; updating an RSS .xml file would not be difficult at all.
Very cool, thanks. Will this be broken down on a per-dump basis, as well as per wiki? Personally I use only some of the "small" dumps, whose completion dates are currently on the order of a month earlier than the full history dumps...
Cheers, Alai.
Alai wrote:
Brion Vibber <brion@...> writes:
I'll slip that into my updates; updating an RSS .xml file would not be difficult at all.
Very cool, thanks. Will this be broken down on a per-dump basis, as well as per wiki? Personally I use only some of the "small" dumps, whose completion dates are currently on the order of a month earlier than the full history dumps...
Could do that...
-- brion vibber (brion @ wikimedia.org)
Brion Vibber wrote:
Nick Jenkins wrote:
Potentially crazy idea, but given the above, what if there was a per-backed-up-wiki RSS / ATOM feed that would include a notification when new dumps were generated?
Crazy, but good crazy. :)
I'll slip that into my updates; updating an RSS .xml file would not be difficult at all.
*-rss.xml files are now slipped into the 'latest' subdirectory with the new dumps being generated now.
I'll add links to these from the summary pages later, but you can get em manually for the moment. (Note that the symlinks in latest still point to the latest _complete_ dump, while the feeds point to the latest version of _that file_. May change the symlinks too, since people seem to prefer that behavior.)
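If you want to script against these, something like the following works. The exact feed file name below is just an example (use whichever *-rss.xml file you see in the latest/ directory), and the little state file is only one assumed way of remembering what you saw last time:

# Rough sketch: poll one per-file dump feed and notice when a new item appears.
# The feed file name is an example only; the "last seen" state file is just an
# assumption about how a mirror operator might keep track.
import urllib.request
from xml.dom import minidom

FEED = ("http://download.wikimedia.org/dawiktionary/latest/"
        "dawiktionary-latest-pages-articles.xml.bz2-rss.xml")
STATE = "last-seen.txt"

with urllib.request.urlopen(FEED) as resp:
    doc = minidom.parseString(resp.read())

items = doc.getElementsByTagName("item")
if items:
    latest = items[0].getElementsByTagName("pubDate")[0].firstChild.data.strip()
    try:
        with open(STATE) as f:
            seen = f.read().strip()
    except IOError:
        seen = ""
    if latest != seen:
        link = items[0].getElementsByTagName("link")[0].firstChild.data.strip()
        print("New dump announced:", link)
        with open(STATE, "w") as f:
            f.write(latest)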
-- brion vibber (brion @ wikimedia.org)
I'll slip that into my updates; updating an RSS .xml file would not be difficult at all.
*-rss.xml files are now slipped into the 'latest' subdirectory with the new dumps being generated now.
Cool, there are RSS feeds now up at http://download.wikimedia.org/dawiktionary/latest/ (for example).
Is there any chance maybe of a summary feed ( e.g. http://download.wikimedia.org/wiki-name/index-rss.xml ), which includes the results for the individual files, including failures? (The current per-file feeds only update on success I think, but knowing that some of the dumps failed is useful information if you need 2 or 3 files from the same backup set).
For example, I count around 24 RSS feeds for dawiktionary, one for each of the 24 data files, plus there's the one md5sum file. That's a lot of feeds, and it could be nice to have "one feed to rule them all", that showed at a glance all the backup files generated so far.
The trouble with a combination feed from a reader's perspective is that they probably do not want a new entry to be added for every file type (that would result in 24 new entries, effectively spamming the feed-reader). Maybe something more like http://download.wikimedia.org/dawiktionary/20070327/ (i.e. a big table, with each row having the name of the file as a link to download the file, the description, whether it succeeded or failed, the file size, the time it was created, and possibly the MD5sum).
The trouble with that is that some wikis take a long time to back up, and some are really quick. For example, http://download.wikimedia.org/enwiki/20070206/ took around 5 weeks from start to finish, whereas dawiktionary took 53 seconds to back up. So for enwiki, an intermediate report of which files are done would be useful, whereas for dawiktionary, there's probably no point. As a very rough possibility, since the index pages already include an ETA: maybe if, after finishing a file, the ETA for the next file is more than 12 hours into the future, then the index feed gets updated with a new entry showing the current backup status? (Plus there would have to be an entry added after the last file finishes.) That would give one entry for dawiktionary, and for enwiki's backup from 20070206, it would give:
* one entry after generating oldimage.sql.gz
* one entry 2 days later after generating all-titles-in-ns0.gz
* one entry another 2 days later after generating pages-articles.xml.bz2
* one entry 16 hours later after generating pages-meta-current.xml.bz2
* one entry 3 weeks later after generating pages-meta-history.xml.bz2
* and then one entry a week later after generating pages-meta-history.xml.7z
So that's 6 entries for enwiki, spread out over roughly 5 weeks. And the first one could maybe be dropped, and then the 5 entries correspond to the files that I suspect are used most: all-titles, pages-articles, pages-meta-current, and the pages-meta-history files.
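In code terms, the rule I have in mind is roughly this (the SummaryFeed stand-in, the function names, and the 12-hour threshold are all just placeholders to show the idea, not anything in the real dump code):

# Sketch of the proposed rule: after each file finishes, add a summary-feed
# entry only when the ETA to the next file is a long way off, or when the
# whole run is done. Everything here is a placeholder.
from datetime import timedelta

SUMMARY_THRESHOLD = timedelta(hours=12)

class SummaryFeed:
    """Stand-in for whatever would actually write the per-wiki summary feed."""
    def __init__(self):
        self.entries = []
    def add_entry(self, text):
        self.entries.append(text)

def maybe_update_summary(feed, finished_file, next_eta, run_complete):
    if run_complete:
        feed.add_entry("dump run complete; last file: " + finished_file)
    elif next_eta is not None and next_eta > SUMMARY_THRESHOLD:
        feed.add_entry(finished_file + " done; next file expected in " + str(next_eta))
    # Otherwise stay quiet: the next file will finish soon, so a later
    # combined entry avoids spamming feed readers.

# A quick run like dawiktionary would emit only the final "complete" entry;
# a slow run like enwiki also gets an entry after each long step.
feed = SummaryFeed()
maybe_update_summary(feed, "pages-articles.xml.bz2", timedelta(days=2), False)
maybe_update_summary(feed, "pages-meta-current.xml.bz2", timedelta(minutes=5), False)
maybe_update_summary(feed, "pages-meta-history.xml.7z", None, True)
print(feed.entries)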
Then again, maybe just watching the 5 RSS feeds for those files, and ignoring everything else, is the simplest solution :-/
I don't know: maybe a summary feed is practical, maybe it's not. It's certainly more complicated to describe than I would like :-( Essentially, it's trying to walk a fine line of giving people concise and complete and current notifications as quickly as possible, whilst trying not to overwhelm them with messages.
All the best, Nick.
Brion wrote:
Note that:
- There isn't a regular schedule
- Not all wikis run at the same time.
- The very large wikis run less frequently
- The last few days dumps are on hold as I'm fixing some longstanding
problems in the dump generator. New dumps will commence within a few days.
Brion, I know all that and I do understand. But unfortunately the download page now reports, 8 weeks after the latest successful dump for dewiki:
No prior dumps of this database stored. Dump in progress, 9 items failed
I can understand that dumps for WP mirrors have a low priority in the project, but - forgive me - 8 weeks is still too long!
One or two dumps per month are OK, IMHO. A commercial offer for regular or live updates is OK too. We have a client, a major German media outlet, which is potentially willing to pay for such a service. They are willing to pay, or to support the WP project in some other way. But at the moment they are a bit irritated by the erratic service for mirrors.
Jochen
Jochen Magnus wrote:
Brion wrote:
Note that:
- There isn't a regular schedule
- Not all wikis run at the same time.
- The very large wikis run less frequently
- The last few days dumps are on hold as I'm fixing some longstanding
problems in the dump generator. New dumps will commence within a few days.
Brion, I know all that and I do understand. But unfortunately the download page now reports, 8 weeks after the latest successful dump for dewiki:
The dump run's been running for several days already. We have hundreds of databases, and they don't all run at once, hence not all are done yet. Have some patience!
The particular dewiki run there was halted because I was testing (and the missing link to the previous one is due to a bug which I fixed during that testing). It'll come back around in a few days.
The new script runs continuously, with automated cleanup of old versions. No additional manual intervention will be required. Simply wait until it reaches dewiki again.
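Just to give a picture of the shape of it -- this is an illustration only, not the actual script, and the root path, wiki list, and retention count are made up:

# Illustration only: the general shape of a continuous dump loop that cycles
# through every wiki and prunes old runs as it goes. Not the real script.
import shutil
import time
from pathlib import Path

DUMP_ROOT = Path("/dumps")          # assumed layout: /dumps/<wiki>/<YYYYMMDD>/
KEEP_RUNS = 3                       # how many past runs to keep per wiki
ALL_WIKIS = ["dawiktionary", "dewiki", "enwiki"]   # placeholder list

def dump_wiki(wiki):
    # Placeholder for the real per-wiki dump work.
    print("dumping", wiki)

def prune_old_runs(wiki):
    wiki_dir = DUMP_ROOT / wiki
    if not wiki_dir.is_dir():
        return
    runs = sorted(p for p in wiki_dir.iterdir() if p.is_dir())
    for old in runs[:-KEEP_RUNS]:
        shutil.rmtree(old)

while True:
    for wiki in ALL_WIKIS:
        dump_wiki(wiki)
        prune_old_runs(wiki)
    time.sleep(60)   # brief pause, then start the next pass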
-- brion vibber (brion @ wikimedia.org)
Brion,
If you need any help with the dumps, I am happy to provide whatever assistance I can.
Jeff