Hoi, Not having it cross the ocean is profitable. Thanks, Gerard
On Mon, Apr 14, 2008 at 2:35 PM, David Gerard dgerard@gmail.com wrote:
http://files.bigpond.com/library/index.php?go=details&id=35105
Good Lord.
- d.
On 14/04/2008, Snowolf snowolf@snowolf.eu wrote:
"File downloads only for BigPond BroadBand Members" :(
Out of curiosity, have any of our dumps made it out to BitTorrent yet?
Checking on Mininova.org didn't prove very helpful. A few copies of the DVDs are floating around, as well as a Ptwiki dump (not sure how complete). Seeding for all of them is practically non-existent.
Results on ThePirateBay and a few other sites are about the same, and typically worse.
-Chad
On Mon, Apr 14, 2008 at 9:20 AM, Earle Martin earle@downlode.org wrote:
On 14/04/2008, Snowolf snowolf@snowolf.eu wrote:
"File downloads only for BigPond BroadBand Members" :(
Out of curiosity, have any of our dumps made it out to BitTorrent yet?
-- Earle Martin http://downlode.org/ http://purl.org/net/earlemartin/
On 14/04/2008, Chad innocentkiller@gmail.com wrote:
Results on ThePirateBay and a few other sites are about the same, and typically worse.
Sounds like it's time to start seeding again. 3.7 gigs? I can help with that, no problem. Anyone with access to the file (I'm not in .au) want to make a torrent?
On Mon, Apr 14, 2008 at 9:59 AM, Earle Martin earle@downlode.org wrote:
On 14/04/2008, Chad innocentkiller@gmail.com wrote:
Results on ThePirateBay and a few other sites are about the same, and typically worse.
Sounds like it's time to start seeding again. 3.7 gigs? I can help with that, no problem. Anyone with access to the file (I'm not in .au) want to make a torrent?
Why bother? It will almost certainly take a lot longer to download it via BT, even in Australia at the far end of the internet. ;)
(not to mention that offsite copies will usually end up stale...)
You bring up an interesting point about it going stale. What if a dedicated host were set up (not necessarily at WMF, but maybe by an outside party) that says "We will have the dumps of the largest wikis, up to date and with a live tracker"? Would that be something people would be interested in pursuing?
-Chad
On Mon, Apr 14, 2008 at 10:06 AM, Gregory Maxwell gmaxwell@gmail.com wrote:
On Mon, Apr 14, 2008 at 9:59 AM, Earle Martin earle@downlode.org wrote:
On 14/04/2008, Chad innocentkiller@gmail.com wrote:
Results on ThePirateBay and a few other sites are about the same, and typically worse.
Sounds like it's time to start seeding again. 3.7 gigs? I can help with that, no problem. Anyone with access to the file (I'm not in .au) want to make a torrent?
Why bother? It will almost certainly take a lot longer to download it via BT, even in Australia at the far end of the internet. ;)
(not to mention that offsite copies will usually end up stale...)
On 14/04/2008, Chad innocentkiller@gmail.com wrote:
You bring up an interesting point about it going stale. What if a dedicated host were set up (not necessarily at WMF, but maybe by an outside party) that says "We will have the dumps of the largest wikis, up to date and with a live tracker"? Would that be something people would be interested in pursuing?
I'm not sure it would actually be faster on BitTorrent than just downloading from Wikimedia. The WMF has rather a lot of bandwidth, and full en:wp dumps aren't in a huge amount of demand.
- d.
David Gerard wrote:
I'm not sure it would actually be faster on BitTorrent than just downloading from Wikimedia. The WMF has rather a lot of bandwidth, and full en:wp dumps aren't in a huge amount of demand.
The image dumps were available via BitTorrent for a while, but are gone again last I checked. Was there a particular reason why BitTorrent was used in that case? Image dumps have been unavailable for many years now, so I assume there's some barrier to providing them in the same way as other parts of Wikipedia.
Bryan Derksen wrote:
The image dumps were available via BitTorrent for a while, but are gone again last I checked. Was there a particular reason why BitTorrent was used in that case? Image dumps have been unavailable for many years now, so I assume there's some barrier to providing them in the same way as other parts of Wikipedia.
The image dumps are unofficial and so large that very few people can serve them in their entirety, at least not to more than a few other people. In this case BitTorrent was useful because it distributed the burden of hosting massive files that potentially many people would want. It would be nice for Wikimedia to provide these in an official way, at least a seed and a tracker of some sort, but there are still some technical problems to get around before image dumps can be made by the servers themselves (the current image dumps are made by downloading every single image file from the wiki being harvested).
It should also be noted that at the moment there isn't anyone (to my knowledge) making the image dumps, so they are going to be stale anyway.
MinuteElectron.
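For reference, the harvesting MinuteElectron describes amounts to walking the wiki's image list through the MediaWiki API and fetching each original file over HTTP. A minimal sketch in Python; the endpoint, the batch size, and the saving step are only illustrative, and a real harvester would follow the API's continuation parameters and throttle itself:

import json
import urllib.parse
import urllib.request

# Any MediaWiki API endpoint will do; Commons is used here purely as an example.
API = "https://commons.wikimedia.org/w/api.php"

params = urllib.parse.urlencode({
    "action": "query",
    "list": "allimages",   # enumerate files on the wiki
    "aiprop": "url",       # include the direct URL of each original file
    "ailimit": "50",
    "format": "json",
})

request = urllib.request.Request(
    API + "?" + params,
    headers={"User-Agent": "image-dump-sketch/0.1 (example)"},  # identify the client
)
with urllib.request.urlopen(request) as response:
    data = json.load(response)

for image in data["query"]["allimages"]:
    # A real harvester would download image["url"] to disk here,
    # e.g. with urllib.request.urlretrieve(image["url"], image["name"]).
    print(image["name"], image["url"])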
On Mon, 14 Apr 2008, MinuteElectron wrote:
The image dumps were available via BitTorrent for a while, but are gone again last I checked. Was there a particular reason why BitTorrent was used in that case? Image dumps have been unavailable for many years now, so I assume there's some barrier to providing them in the same way as other parts of Wikipedia.
The image dumps are unofficial and so large that very few people can serve them in their entirety, at least not to more than a few other people. In this case BitTorrent was useful because it distributed the burden of hosting massive files that potentially many people would want.
Some weeks ago I downloaded the enwiki dump and set up a torrent and started seeding after being asked to do that on IRC.
The person requesting it said he had trouble downloading large files due to an unstable internet connection. Using BitTorrent was better, as it has built-in support for partial downloads.
Today I stopped seeding (as a new dump was available). I guess I ended up seeding about 30 gigs, so some 30 people downloaded the dump using BitTorrent.
Bård Dahlmo wrote:
Some weeks ago I downloaded the enwiki dump and set up a torrent and started seeding after being asked to do that on IRC.
The person requesting it said he had trouble downloading large files due to an unstable internet connection. Using BitTorrent was better, as it has built-in support for partial downloads.
This can be solved by installing a download manager; many of them can resume downloads even after losing your internet connection or restarting your computer.
MinuteElectron.
On Mon, Apr 14, 2008 at 2:22 PM, Bård Dahlmo baard@dahlmo.no wrote:
The person requesting it said he had trouble downloading large files due to an unstable internet connection. Using BitTorrent was better, as it has built-in support for partial downloads.
So does HTTP 1.1. You just need a decent browser. (Which, yes, excludes Firefox 2 without extensions, although Firefox 3 supports partial downloads out of the box.)
On Tue, Apr 15, 2008 at 5:05 AM, Bryan Tong Minh bryan.tongminh@gmail.com wrote:
On Mon, Apr 14, 2008 at 4:06 PM, Gregory Maxwell gmaxwell@gmail.com wrote:
I think that the advantage of BT is not the speed, but the fact that it handles data corruption far better (but slower) than HTTP.
In principle, TCP should ensure reliable byte-for-byte delivery, but that's in principle. :) I'd be interested to know if BitTorrent is significantly more reliable than HTTP in practice.
2008/4/15, Simetrical Simetrical+wikilist@gmail.com:
On Mon, Apr 14, 2008 at 2:22 PM, Bård Dahlmo baard@dahlmo.no wrote:
The person requesting it said he had trouble downloading large files due to an unstable internet connection. Using BitTorrent was better, as it has built-in support for partial downloads.
So does HTTP 1.1. You just need a decent browser. (Which, yes, excludes Firefox 2 without extensions, although Firefox 3 supports partial downloads out of the box.)
Or, if using Unix, the simple but efficient wget will do.
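Both suggestions come down to HTTP/1.1 Range requests (wget -c does exactly this). A minimal sketch in Python of resuming an interrupted dump download; the URL and file name are just placeholders:

import os
import urllib.request

url = "https://dumps.example.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"
dest = "enwiki-latest-pages-articles.xml.bz2"

# Ask the server to start where the partial file left off.
offset = os.path.getsize(dest) if os.path.exists(dest) else 0
request = urllib.request.Request(url, headers={"Range": "bytes=%d-" % offset})

with urllib.request.urlopen(request) as response, open(dest, "ab") as out:
    # 206 Partial Content means the server honoured the Range header;
    # a 200 would mean it is sending the whole file from the beginning again.
    if offset and response.status != 206:
        raise RuntimeError("server ignored the Range header; restart from scratch")
    while True:
        chunk = response.read(1 << 20)
        if not chunk:
            break
        out.write(chunk)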
On Tue, Apr 15, 2008 at 5:05 AM, Bryan Tong Minh bryan.tongminh@gmail.com wrote:
On Mon, Apr 14, 2008 at 4:06 PM, Gregory Maxwell gmaxwell@gmail.com wrote:
I think that the advantage of BT is not the speed, but the fact that it handles data corruption far better (but slower) than HTTP.
In principle, TCP should ensure reliable byte-for-byte delivery, but that's in principle. :) I'd be interested to know if BitTorrent is significantly more reliable than HTTP in practice.
Surely BT will check that the file you get is the same as the seeded one. But it is not "more reliable" than a direct download: you don't know whether the seeded file matches the original file. The user sharing it likely downloaded it over HTTP, in which case BT and a direct HTTP download give you the same chance of getting corrupted data.
2008/4/15, Simetrical wrote:
In principle, TCP should ensure reliable byte-for-byte delivery, but that's in principle. :) I'd be interested to know if BitTorrent is significantly more reliable than HTTP in practice.
It's more reliable in the sense that when BitTorrent tells you it has finished, you have the original, complete file. If something got corrupted along the way, it automatically re-downloads the corrupted chunks.
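Concretely, that guarantee comes from the per-piece hashes stored in the .torrent metainfo. A rough sketch of the check a client performs; the piece length and the expected digests are stand-ins for what the metainfo would actually supply:

import hashlib

PIECE_LENGTH = 256 * 1024  # piece size declared in the .torrent

def corrupt_pieces(path, expected_digests):
    """Return the indices of pieces whose SHA-1 does not match the metainfo."""
    bad = []
    with open(path, "rb") as f:
        for index, expected in enumerate(expected_digests):
            piece = f.read(PIECE_LENGTH)  # the last piece may be shorter
            if hashlib.sha1(piece).digest() != expected:
                bad.append(index)  # this piece has to be fetched again
    return bad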
Nicolas Dumazet wrote:
Surely BT will check that the file you get is the same as the seeded one. But it is not "more reliable" than a direct download: you don't know whether the seeded file matches the original file. The user sharing it likely downloaded it over HTTP, in which case BT and a direct HTTP download give you the same chance of getting corrupted data.
Well, if the person creating the torrent didn't check that what they downloaded matched the hashes, something is quite wrong.
And even if BitTorrent assured you about it, it's not bad practice to check the file against the original hashes.
On Wed, Apr 16, 2008 at 7:03 AM, Platonides Platonides@gmail.com wrote:
2008/4/15, Simetrical wrote:
In principle, TCP should ensure reliable byte-for-byte delivery, but that's in principle. :) I'd be interested to know if BitTorrent is significantly more reliable than HTTP in practice.
It's more reliable in the sense that when BitTorrent tells you it has finished, you have the original, complete file. If something got corrupted along the way, it automatically re-downloads the corrupted chunks.
Yes, I'm aware of the *theory*: BitTorrent hashes each chunk and automatically verifies them. But I'm curious to know if that makes a difference in *practice*, since someone mentioned it as a significant advantage of BitTorrent over HTTP.
On Wed, Apr 16, 2008 at 9:35 AM, Simetrical Simetrical+wikilist@gmail.com wrote:
Yes, I'm aware of the *theory*: BitTorrent hashes each chunk and automatically verifies them. But I'm curious to know if that makes a difference in *practice*, since someone mentioned it as a significant advantage of BitTorrent over HTTP.
TCP checksums each packet, so I'm not even sure BitTorrent is more reliable in theory.
On Wed, Apr 16, 2008 at 9:38 AM, Anthony wikimail@inbox.org wrote:
On Wed, Apr 16, 2008 at 9:35 AM, Simetrical Simetrical+wikilist@gmail.com wrote:
Yes, I'm aware of the *theory*: BitTorrent hashes each chunk and automatically verifies them. But I'm curious to know if that makes a difference in *practice*, since someone mentioned it as a significant advantage of BitTorrent over HTTP.
TCP checksums each packet, so I'm not even sure BitTorrent is more reliable in theory.
In theory:
1) BitTorrent uses SHA-1; TCP uses only a 16-bit checksum (IIRC), which is far weaker.
2) TCP only ensures that the packet was transmitted correctly, not that the packet was correct in the first place. If, for instance, the torrent is published, some server or other stores a chunk to disk, and the chunk acquires an error on disk, BitTorrent will catch the error when someone requests the chunk. TCP will not, because the error was not on the network (an end-to-end check of the finished file, as sketched below, does).
3) ???
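In other words, only an end-to-end check of the finished file catches corruption at rest. The dump server publishes checksum files alongside the dumps (an md5sums file, if memory serves); a short sketch of that comparison, with a placeholder expected value:

import hashlib

def md5_of(path, chunk_size=1 << 20):
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Placeholder: the real value comes from the checksum file published with the dump.
expected = "0" * 32
if md5_of("enwiki-latest-pages-articles.xml.bz2") != expected:
    print("checksum mismatch: the downloaded dump is corrupt or incomplete")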
On 16/04/2008, Simetrical Simetrical+wikilist@gmail.com wrote:
In theory:
[snip interesting details]
I'm not worried about reliability, but I thought the upside of distributing Big Files through BitTorrent was that it took the stress off the originating server, which otherwise has to handle everything (and certainly multiple times if the download breaks in the middle for whatever reason). I guess this isn't a problem if you have a hefty enough pipe, though.
On Wed, Apr 16, 2008 at 11:01 AM, Earle Martin earle@downlode.org wrote:
I'm not worried about reliability, but I thought the upside of distributing Big Files through BitTorrent was that it took the stress off the originating server, which otherwise has to handle everything (and certainly multiple times if the download breaks in the middle for whatever reason). I guess this isn't a problem if you have a hefty enough pipe, though.
Or few enough downloaders. Wikimedia's bandwidth use is what, a few Gbps? Is the number of people downloading these dumps actually going to be enough to make a noticeable difference? I haven't heard that stated as a concern by Brion or anyone, so I assume not.
On Wed, Apr 16, 2008 at 11:06 AM, Simetrical Simetrical+wikilist@gmail.com wrote:
On Wed, Apr 16, 2008 at 11:01 AM, Earle Martin earle@downlode.org wrote:
I'm not worried about reliability, but I thought the upside of distributing Big Files through BitTorrent was that it took the stress off the originating server, which otherwise has to handle everything (and certainly multiple times if the download breaks in the middle for whatever reason). I guess this isn't a problem if you have a hefty enough pipe, though.
Or few enough downloaders. Wikimedia's bandwidth use is what, a few Gbps? Is the number of people downloading these dumps actually going to be enough to make a noticeable difference? I haven't heard that stated as a concern by Brion or anyone, so I assume not.
The obvious solution to get the best of both worlds would be BitTorrent with HTTP seeding. IOW, if there are other seeds or downloaders available, you use them, and if not you fall back to HTTP range requests. I have no idea why that idea never caught on. I guess because the use of BitTorrent to transfer legal files is such a small percentage of total BitTorrent use.
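That hybrid does exist as HTTP "web seeding": a url-list key in the .torrent metainfo that web-seed-capable clients fall back to when no peers are around. A sketch in Python of building such a metainfo by hand; the tracker and mirror URLs are placeholders, and in practice an existing torrent-creation tool would be used instead:

import hashlib
import os

PIECE_LENGTH = 512 * 1024  # bytes per piece

def bencode(value):
    # Minimal bencoder covering only the types used below.
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, str):
        return bencode(value.encode("utf-8"))
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        items = sorted((k.encode("utf-8"), v) for k, v in value.items())
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError("cannot bencode %r" % type(value))

def make_torrent(path, name):
    # Hash the file piece by piece, exactly as clients will verify it later.
    pieces = []
    with open(path, "rb") as f:
        while True:
            piece = f.read(PIECE_LENGTH)
            if not piece:
                break
            pieces.append(hashlib.sha1(piece).digest())
    metainfo = {
        "announce": "http://tracker.example.org/announce",  # placeholder tracker
        "url-list": ["http://mirror.example.org/" + name],  # plain-HTTP fallback seed
        "info": {
            "name": name,
            "length": os.path.getsize(path),
            "piece length": PIECE_LENGTH,
            "pieces": b"".join(pieces),
        },
    }
    return bencode(metainfo)

# Usage: open("enwiki.torrent", "wb").write(make_torrent("enwiki-dump.xml.bz2", "enwiki-dump.xml.bz2"))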
Simetrical wrote:
On Wed, Apr 16, 2008 at 11:01 AM, Earle Martin earle@downlode.org wrote:
I'm not worried about reliability, but I thought the upside of distributing Big Files through BitTorrent was that it took the stress off the originating server, which otherwise has to handle everything (and certainly multiple times if the download breaks in the middle for whatever reason). I guess this isn't a problem if you have a hefty enough pipe, though.
Or few enough downloaders. Wikimedia's bandwidth use is what, a few Gbps? Is the number of people downloading these dumps actually going to be enough to make a noticeable difference? I haven't heard that stated as a concern by Brion or anyone, so I assume not.
Currently there's no reason to believe there's any problem.
If we ever do decide the bandwidth usage is too high and if we think a BT tracker will help, we'll set up a BT tracker.
Until that day, there's not much reason to mess about with it.
-- brion vibber (brion @ wikimedia.org)
Simetrical wrote:
On Wed, Apr 16, 2008 at 9:38 AM, Anthony wrote:
TCP checksums each packet, so I'm not even sure BitTorrent is more reliable in theory.
In theory:
BitTorrent uses SHA-1; TCP uses only a 16-bit checksum (IIRC), which is far weaker.
TCP only ensures that the packet was transmitted correctly, not that the packet was correct in the first place. If, for instance, the torrent is published, some server or other stores a chunk to disk, and the chunk acquires an error on disk, BitTorrent will catch the error when someone requests the chunk. TCP will not, because the error was not on the network.
Fine in theory, but downloads do get corrupted, and not because the file is corrupt on the source server (if it were, re-downloading wouldn't fix it).
I'd like to have something like this in the HTTP spec:
GET /file HTTP/1.1
Range: bytes=1-1024
Unless-sha1: AABBCCDDEE
That functionality could already be provided if a HEAD request with a Range header returned a Content-MD5 header, which lighttpd isn't doing. Although perhaps Apache with ContentDigest does.
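For what it's worth, here is roughly what that check would look like from the client side, assuming the server is one of the few configured to send Content-MD5 (whose value is the base64-encoded MD5 digest); the URL and file name are placeholders:

import base64
import hashlib
import urllib.request

url = "http://mirror.example.org/enwiki-latest-pages-articles.xml.bz2"

request = urllib.request.Request(url, method="HEAD")
with urllib.request.urlopen(request) as response:
    advertised = response.headers.get("Content-MD5")  # base64-encoded MD5, if sent at all

if advertised is None:
    print("server does not advertise Content-MD5; fall back to a published checksum file")
else:
    local = hashlib.md5()
    with open("enwiki-latest-pages-articles.xml.bz2", "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            local.update(chunk)
    if base64.b64encode(local.digest()).decode("ascii") == advertised:
        print("local copy matches the server's Content-MD5")
    else:
        print("mismatch: the local copy is corrupt or stale")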
On Tue, Apr 15, 2008 at 4:15 PM, Simetrical Simetrical+wikilist@gmail.com wrote:
On Mon, Apr 14, 2008 at 4:06 PM, Gregory Maxwell gmaxwell@gmail.com wrote:
I think that the advantage of BT is not the speed, but the fact that it handles data corruption far better (but slower) than HTTP.
In principle, TCP should ensure reliable byte-for-byte delivery, but that's in principle. :) I'd be interested to know if BitTorrent is significantly more reliable than HTTP in practice.
I have not yet had data corruption with large HTTP or BT downloads (not counting broken memory), so I don't know. BT divides the download into chunks and computes the SHA-1 hash of each piece, so even if part of the download is broken, you know which piece and thus only have to re-download the broken part.
Bryan
On Mon, Apr 14, 2008 at 12:23 PM, Chad innocentkiller@gmail.com wrote:
You bring up an interesting point about it going stale. What if a dedicated host were set up (not necessarily at WMF, but maybe by an outside party) that says "We will have the dumps of the largest wikis, up to date and with a live tracker"? Would that be something people would be interested in pursuing?
I'd be interested in a third-party server which keeps an up-to-date copy of the database and allows on-demand dumps (so you could request, say, everything modified since December 21, 2007) and other specialized queries. Kind of like toolserver, except account allocation would be based on willingness to pay rather than whatever the process is for getting access to toolserver.
On Mon, Apr 14, 2008 at 4:06 PM, Gregory Maxwell gmaxwell@gmail.com wrote:
On Mon, Apr 14, 2008 at 9:59 AM, Earle Martin earle@downlode.org wrote:
On 14/04/2008, Chad innocentkiller@gmail.com wrote:
Results on ThePirateBay and a few other sites are about the same, and typically worse.
Sounds like it's time to start seeding again. 3.7 gigs? I can help with that, no problem. Anyone with access to the file (I'm not in .au) want to make a torrent?
Why bother? It will almost certainly take a lot longer to download it via BT, even in Australia at the far end of the internet. ;)
I think that the advantage of BT is not the speed, but the fact that it handles data corruption far better (but slower) than HTTP.
Bryan
I can grab the latest enwiki dump and go at it. I'm not sure how to set up a tracker though. Decent bandwidth here, and I can talk to the folks at work about potentially setting up a dedicated tracker with the full dump on it, so we always have at least one copy floating around.
Maybe have an answer on that this afternoon.
-Chad
On Mon, Apr 14, 2008 at 9:59 AM, Earle Martin earle@downlode.org wrote:
On 14/04/2008, Chad innocentkiller@gmail.com wrote:
Results on ThePirateBay and a few other sites are about the same, and typically worse.
Sounds like it's time to start seeding again. 3.7 gigs? I can help with that, no problem. Anyone with access to the file (I'm not in .au) want to make a torrent?
--
Earle Martin http://downlode.org/ http://purl.org/net/earlemartin/