Hi,
There hasn't been a successful pages-meta-history.xml.bz2 or pages-meta-history.xml.7z dump from the http://download.wikimedia.org/enwiki/ site in the last five dump runs. How is the new dump system coming along for these large wiki files? I am a bit concerned that these files haven't been available for at least ~4 months. Maybe publicize the problems to get more feedback on how to fix them, instead of just telling us: "The current dump system is not sustainable on very large wikis and is being replaced. You'll hear about it when we have the new one in place. :)"
-- brion
Sorry for complaining, but this has been broken for a long time now. What are the details of the problem?
I hope you guys are planning on adding some way to download the Wikimedia Commons images too at some point. I was thinking a multi-file torrent could work, with the images used by enwiki in one folder and those of the other wikis in other folders. The enwiki images could also be split into subfolders by popularity, based on the access logs, so a space-restricted user could download only the folder labelled "top10percent" and get the ten percent most frequently requested images. That would still make a pretty complete encyclopedia for most offline users while saving 90% of the disk space. Creating a multi-file torrent like this is standard practice; if you have downloaded one from The Pirate Bay you know what I mean. The only drawback with torrents is the lack of geographical awareness in the data transfer, as someone mentioned before, but I think the decentralized nature of BitTorrent, with many possible uploaders, makes that irrelevant: Wikimedia won't be paying for the bandwidth if other people choose to help seed the torrents anyway.
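A rough sketch of how that "top10percent" bucket could be built from aggregated access-log counts follows; the hit-count file name and the threshold are placeholders, not anything the dump infrastructure actually produces.

#!/usr/bin/env python
"""Rough sketch: bucket images by popularity so a torrent can ship a
"top10percent" folder. Assumes a pre-aggregated TSV of "<filename>\t<hits>"
lines derived from the access logs; the paths here are placeholders."""

import csv

HITS_FILE = "image_hits.tsv"   # hypothetical aggregated access-log counts
TOP_FRACTION = 0.10            # the "top10percent" bucket from the proposal

def load_hits(path):
    """Return a list of (filename, hit_count) pairs, most popular first."""
    with open(path, newline="") as f:
        rows = [(name, int(hits)) for name, hits in csv.reader(f, delimiter="\t")]
    rows.sort(key=lambda r: r[1], reverse=True)
    return rows

def bucket(rows, fraction):
    """Split rows into (top, rest) by the given fraction of the file count."""
    cutoff = max(1, int(len(rows) * fraction))
    return rows[:cutoff], rows[cutoff:]

if __name__ == "__main__":
    rows = load_hits(HITS_FILE)
    top, rest = bucket(rows, TOP_FRACTION)
    # "top" would become the top10percent/ folder in the multi-file torrent;
    # "rest" would go into the bulk folder(s).
    print("top bucket: %d files, bulk: %d files" % (len(top), len(rest)))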
What about www.wikipirate.org or wikitorrent.org for a list of Wikimedia torrents? :) Both are available!
cheers, Jamie
We were actually a week away from having a finished snapshot last month when we had an unscheduled change.
The long and short of it is that Tim's recompression of ES has made huge progress in improving the speed of the work, and we simply need to wait for that to finish to reassess how much more we need or want to change.
Feel free to find me on IRC if you need any more detail.
As for the images, download speed is not the issue. We can generally push out content faster than most people can download it. It's more of a packaging issue, as doing separate en/pl/de/etc. torrents will lead to tons of duplicated space.
Going down the route of image sets makes a bit more sense as we have a public data sets server on order which can potentially house these.
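To make the duplication point concrete, here is a small sketch that measures how many bytes would be shipped more than once if per-wiki image sets were packaged as separate torrents; the directory names are made up for the example and don't correspond to any real layout.

#!/usr/bin/env python
"""Sketch of the duplication problem with per-wiki image torrents: count how
many bytes would be shipped more than once if the en/pl/de sets were packaged
separately. Directory names are placeholders for the example."""

import hashlib
import os
from collections import defaultdict

WIKI_DIRS = ["enwiki_images", "plwiki_images", "dewiki_images"]  # placeholders

def sha1_of(path, chunk=1 << 20):
    """Content hash of one file, read in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                return h.hexdigest()
            h.update(block)

def duplicated_bytes(dirs):
    """Bytes shipped more than once because the same content (by hash)
    appears in several per-wiki sets."""
    sets_with = defaultdict(set)   # hash -> wiki dirs containing it
    size_of = {}                   # hash -> file size in bytes
    for d in dirs:
        for root, _, files in os.walk(d):
            for name in files:
                p = os.path.join(root, name)
                digest = sha1_of(p)
                sets_with[digest].add(d)
                size_of[digest] = os.path.getsize(p)
    return sum(size_of[h] * (len(s) - 1) for h, s in sets_with.items() if len(s) > 1)

if __name__ == "__main__":
    print("redundant bytes across per-wiki torrents:", duplicated_bytes(WIKI_DIRS))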
I know Ariel has been looking at different ways of mirroring/distributing media content, so perhaps he can chime in about any progress.
--tomasz
On 19 February 2010 14:54, Jamie Morken jmorken@shaw.ca wrote:
I hope you guys are planning on adding some way to download the Wikimedia Commons images too at some point
Something that could be fun is git.
Plus something like a "ticket system", where you ask for permission to download a tree inside git and are given a temporary password.
But who knows: maybe server-side git is too heavy on the CPU, or maybe the client side is too complex for normal users... (?)
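For what it's worth, the "ticket" idea could be as small as a random, time-limited token that a download front end checks before letting a clone through. This is a toy sketch only, not how any Wikimedia service actually works; the lifetime and tree names are invented.

#!/usr/bin/env python
"""Toy sketch of the "ticket system" idea: issue a random, time-limited token
that a download front end could check before allowing access to a requested
tree. Purely illustrative; all names and values are assumptions."""

import secrets
import time

TICKET_LIFETIME = 24 * 3600    # assumed 24-hour validity
_tickets = {}                  # token -> (tree, expiry timestamp)

def issue_ticket(tree):
    """Hand out a temporary password for downloading one tree."""
    token = secrets.token_urlsafe(16)
    _tickets[token] = (tree, time.time() + TICKET_LIFETIME)
    return token

def check_ticket(token, tree):
    """True if the token is known, unexpired, and matches the requested tree."""
    entry = _tickets.get(token)
    if entry is None:
        return False
    granted_tree, expiry = entry
    return granted_tree == tree and time.time() < expiry

if __name__ == "__main__":
    t = issue_ticket("commons/top10percent")
    print(check_ticket(t, "commons/top10percent"))        # True
    print(check_ticket("bogus", "commons/top10percent"))  # False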
What about www.wikipirate.org or wikitorrent.org for a list of Wikimedia torrents? :) Both are available!
Nah, that's not cool enough... what about pirate@HOME? A screensaver that seeds the torrent, with a tiny animation of pirates, and the sea, and the wiki logo :-) ... but you don't really want to be associated with that!
The idea of using torrents to download snapshots of {insert large file here} has been discussed before on this mailing list. It seems the agreement is that it's a bad idea for files that change often (you don't want to have 10 different versions to download, which means fewer seeders). There are other problems (like ISPs f*cking with the torrent protocol, network overhead, capped downloads, etc.), but that was the main gripe against torrents.
On Fri, Feb 19, 2010 at 1:24 PM, Tei oscar.vives@gmail.com wrote:
On 19 February 2010 14:54, Jamie Morken jmorken@shaw.ca wrote:
I hope you guys are planning on adding some way to download the Wikimedia Commons images too at some point
Something that could be fun is git.
Git is not intended to handle large repositories, or anything that includes a lot of large binary files. There is no way it would work acceptably on even a moderate subset of Commons images. It bogs down on even very large source-code repositories, last I checked -- you're advised to split repositories up once they get too large.