I'm sorry if my ignorance is showing here, but I went to both sourceforge and gnutella.com and I still don't have any idea what gnutella is, other than it's a peer to peer networking protocol for file sharing. In other words, I can't figure out why Gnutella is different from Ares, the old Napster, or Bittorrent from an architectural point of view.
I am more familiar with BitTorrent, which is very useful for distributing copies of very large files (which a PDF version of Wikipedia would certainly qualify as) while using hardly any bandwidth on the main server. It's very cool that way. Perhaps Gnutella is similarly cool, but I can't find a "what is Gnutella" web page.
Quoting: "The practical implication is that the BitTorrent system makes it easy to distribute very large files to large numbers of people while placing minimal bandwidth requirements on the original "seeder." That is because everyone who wants the file is sharing with one another, rather than downloading from a central source. A separate file-sharing network known as eDonkey uses a similar system."
Would someone who is familiar with both Gnutella and Bittorrent tell me why using Bittorrent for such a project would be stupid? It would certainly use less of Wikipedia's already strained bandwidth.
-Kelly
At 04:10 PM 3/9/2004, you wrote:
"IF" == Itai Fiat itai@mail.portland.co.uk writes:
IF> Hello, This is my first post to wikitech-l (yay!), and a presumptuous one at that. I recently had an idea, and have written a short (the frame of reference being the Magna Carta) article describing it at http://meta.wikipedia.org/wiki/Gnutella (for confusion's sake, the content of the article will not be recounted here). While I have no problem developing this myself, I was wondering if anybody has any objections - or, better yet, suggestions or cash - they'd like to contribute in advance.

So, I added comments on-site, but here's my feeling:
I think a p2p distribution mechanism for Wikipedia is an EXCELLENT idea. It would/could do a lot to lower the load on the main servers.
I also think that if I had a fancy new-generation p2p client, I'd make a go of re-distributing Wikipedia content. It would be a great way to promote my network, especially if it has support for in-network Websites (like Freenet does).
But I don't think this is core functionality for Wikimedia, and I doubt that it'll get a lot of support around here. I think probably the best idea is to download the database dumps, and maybe experiment from there.
You may want to shop this idea around to one of the open source P2P network groups, to see what they think about it. Frankly, there's more benefit for a P2P network than there is for Wikipedia, so they'll probably be more interested.
Anyways, good idea. You're gonna need to stick with it to see it happen, though. The best ideas are like that.
~ESP
-- Evan Prodromou evan@wikitravel.org
Wikitravel - http://www.wikitravel.org/
The free, complete, up-to-date and reliable world-wide travel guide
On Mar 17, 2004, at 15:58, Kelly Anderson wrote:
I'm sorry if my ignorance is showing here, but I went to both sourceforge and gnutella.com and I still don't have any idea what gnutella is, other than it's a peer to peer networking protocol for file sharing. In other words, I can't figure out why Gnutella is different from Ares, the old Napster, or Bittorrent from an architectural point of view.
Gnutella is decentralized; it was designed in response to the classic Napster, which could be and was shut down by suing the server runners into oblivion. Search requests are passed from node to node to node in a very inefficient fashion, but once a file is out there, the original seeder need not remain on the network. Nodes generally provide many files for peering (often all they have ever downloaded). The system was designed for smallish files (up to several megabytes).
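[For illustration, a minimal Python sketch of the query flooding Brion describes: each node rebroadcasts a search to its neighbors until a hop limit runs out, which is exactly why it is so inefficient. The node names, TTL value, message format, and file lists are made up for the example; this is not the real Gnutella wire protocol.]

# Toy Gnutella-style query flooding (illustrative assumptions only).
import uuid

class Node:
    def __init__(self, name, files):
        self.name = name
        self.files = set(files)
        self.neighbors = []        # directly-known peers, no central server anywhere
        self.seen = set()          # message ids already forwarded

    def query(self, keyword, ttl=4, msg_id=None, hits=None):
        """Flood a search to neighbors until TTL runs out; collect hits."""
        msg_id = msg_id or uuid.uuid4().hex
        hits = hits if hits is not None else []
        if msg_id in self.seen or ttl <= 0:
            return hits
        self.seen.add(msg_id)
        hits.extend((self.name, f) for f in self.files if keyword in f)
        for peer in self.neighbors:     # every hop re-broadcasts: up to fanout**ttl messages
            peer.query(keyword, ttl - 1, msg_id, hits)
        return hits

# Tiny three-node overlay: a -> b -> c.
a, b, c = Node("a", []), Node("b", ["wikipedia-2004.zip"]), Node("c", ["holiday.mp3"])
a.neighbors, b.neighbors = [b], [c]
print(a.query("wikipedia"))   # [('b', 'wikipedia-2004.zip')]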
BitTorrent is very centralized; it was designed for legitimate distribution of large files to many simultaneous downloaders, using peer-to-peer transfers simply as a way to save bandwidth for the central server. There is no searching mechanism; you must connect to the particular tracker server managing the torrent for the file you want and ask for it specifically. If the tracker goes offline, everything fails. Nodes only make available for peered access the files they are in the process of downloading or have very recently downloaded (and not yet closed the window on). The system was designed for large files (tens, hundreds, or thousands of megabytes), and fetches pieces of a file from multiple different peers simultaneously if possible.
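[For illustration, a toy Python sketch of the flow just described: one central tracker hands out the peer list, and a downloader pulls different pieces from different peers, checking each against the hashes published in the .torrent file. The Tracker and Peer classes and the tiny piece size are simplified stand-ins, not the real protocol.]

import hashlib

PIECE_SIZE = 4                      # absurdly small, just for the demo

data = b"WIKIPEDIA-DUMP-2004!"      # the file being distributed
pieces = [data[i:i + PIECE_SIZE] for i in range(0, len(data), PIECE_SIZE)]
piece_hashes = [hashlib.sha1(p).digest() for p in pieces]   # these live in the .torrent file

class Peer:
    def __init__(self, have):                 # indices of pieces this peer holds
        self.have = {i: pieces[i] for i in have}

class Tracker:
    def __init__(self):
        self.swarm = []
    def announce(self, peer):                 # single point of failure: no tracker, no swarm
        self.swarm.append(peer)
        return list(self.swarm)

tracker = Tracker()
seeder = Peer(range(len(pieces)))             # has everything
partial = Peer([0, 1])                        # still downloading
for p in (seeder, partial):
    tracker.announce(p)

# A new downloader fetches each piece from whichever peer has it,
# verifying against the published hashes.
got = []
for i, expected in enumerate(piece_hashes):
    source = next(p for p in tracker.swarm if i in p.have)
    chunk = source.have[i]
    assert hashlib.sha1(chunk).digest() == expected
    got.append(chunk)
print(b"".join(got) == data)        # True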
Don't know anything about Ares.
Would someone who is familiar with both Gnutella and Bittorrent tell me why using Bittorrent for such a project would be stupid? It would certainly use less of Wikipedia's already strained bandwidth.
The overhead of torrenting individual articles or PDF booklets on particular subjects from Wikipedia would likely far outweigh simple HTTP, particularly since it's generally unlikely that many people would be downloading the same (small) file (out of many thousands available) simultaneously.
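[For illustration, a back-of-envelope comparison in Python. Every byte count below is a rough assumption chosen only to show the shape of the argument: for a single small article the fixed per-torrent costs swamp plain HTTP, and with nobody else downloading that exact file, every byte still comes from the seed, i.e. from Wikipedia's own servers.]

# All figures are rough assumptions, not measurements.
article_size     = 30 * 1024       # a ~30 KB rendered article
http_overhead    = 1 * 1024        # request + response headers, roughly
torrent_metadata = 5 * 1024        # the .torrent file, fetched over HTTP anyway
tracker_announce = 1 * 1024        # announce request/response
peer_handshakes  = 3 * 2 * 1024    # handshakes + bitfields with a few peers

print(article_size + http_overhead)                                          # ~31 KB via HTTP
print(article_size + torrent_metadata + tracker_announce + peer_handshakes)  # ~42 KB via BitTorrent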
Hypothetically it might be useful for distributing large bulk dumps (such as the current database dumps), if and only if more than one person at a time is likely to be downloading them.
Bandwidth really isn't a problem for Wikipedia; we use a fair amount of it (compared to Joe Bob's homepage, not compared to Yahoo) but it's not "strained".
-- brion vibber (brion @ pobox.com)
At 08:25 PM 3/17/2004, you wrote:
Gnutella is decentralized; it was designed in response to the classic Napster, which could be and was shut down by suing the server runners into oblivion. Search requests are passed from node to node to node in a very inefficient fashion, but once a file is out there, the original seeder need not remain on the network. Nodes generally provide many files for peering (often all they have ever downloaded). The system was designed for smallish files (up to several megabytes).
Thank you for that clear explanation, Brion. I think I understand the model now; too bad the search couldn't be somewhat more efficient.
BitTorrent is very centralized; it was designed for legitimate distribution of large files to many simultaneous downloaders, using peer-to-peer transfers simply as a way to save bandwidth for the central server. There is no searching mechanism; you must connect to the particular tracker server managing the torrent for the file you want and ask for it specifically. If the tracker goes offline, everything fails. Nodes only make available for peered access the files they are in the process of downloading or have very recently downloaded (and not yet closed the window on). The system was designed for large files (tens, hundreds, or thousands of megabytes), and fetches pieces of a file from multiple different peers simultaneously if possible.
While it is true that BitTorrent does not have a search facility, I believe it is relatively easy to duplicate the .torrent file, which makes the distribution somewhat less centralized. I also believe that once you have a complete copy of the file in question you become a new seeder, although I'm unclear whether that automatically gives you a new distributable version of the .torrent file; I suspect it doesn't, as you say.
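[For illustration, a small Python sketch of why simply copying the .torrent file around works: the .torrent is only metadata (a tracker URL plus piece hashes), every copy hashes to the same info_hash and therefore points at the same swarm, and becoming a seeder -- holding a complete copy of the payload -- does not change it. The bencode() helper is a minimal hand-rolled stand-in, and the file name and tracker URL are made up.]

import hashlib

def bencode(x):
    # Minimal bencoder, just enough for this demo.
    if isinstance(x, bytes):
        return str(len(x)).encode() + b":" + x
    if isinstance(x, str):
        return bencode(x.encode())
    if isinstance(x, int):
        return b"i" + str(x).encode() + b"e"
    if isinstance(x, dict):
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in sorted(x.items())) + b"e"
    raise TypeError(x)

info = {
    "name": "wikipedia-images-2004.zip",       # hypothetical bulk file
    "piece length": 262144,
    "pieces": hashlib.sha1(b"demo").digest(),  # normally one 20-byte hash per piece
}
torrent = {"announce": "http://tracker.example.org/announce", "info": info}

info_hash = hashlib.sha1(bencode(info)).hexdigest()
print(info_hash)   # identical for every copy of the metadata; holding the complete
                   # file lets you seed, but the .torrent itself never changes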
Don't know anything about Ares.
It's a small commercial program, closer to Gnutella than to BitTorrent in terms of the architecture discussed here.
Would someone who is familiar with both Gnutella and Bittorrent tell me why using Bittorrent for such a project would be stupid? It would certainly use less of Wikipedia's already strained bandwidth.
The overhead of torrenting individual articles or PDF booklets on particular subjects from Wikipedia would likely far outweigh simple HTTP, particularly since it's generally unlikely that many people would be downloading the same (small) file (out of many thousands available) simultaneously.
Agreed, though we were talking about distributing the entire database, or large portions of it (like all the images). For that, BitTorrent would be perfect. The centralized nature isn't a problem, since there is really only one source of the information.
Hypothetically it might be useful for distributing large bulk dumps (such as the current database dumps), if and only if more than one person at a time is likely to be downloading them.
Precisely, although typically with BitTorrent people leave the connection open until they have uploaded at least as much as they downloaded, even if that takes much longer; at least some BitTorrent clients encourage this. All in all, BitTorrent is IMHO a very cool file-distribution mechanism.
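[For illustration, a tiny Python sketch of the client behaviour described here: keep the connection open until you have uploaded at least as much as you downloaded. Treating a 1:1 share ratio as the stopping rule is an assumption about how some clients behave, not part of the protocol.]

downloaded = 700 * 1024**2     # bytes received for the file
uploaded   = 0

def keep_seeding(uploaded, downloaded, target_ratio=1.0):
    # Keep the connection open until the share ratio reaches the target.
    return downloaded == 0 or uploaded / downloaded < target_ratio

while keep_seeding(uploaded, downloaded):
    uploaded += 50 * 1024**2   # pretend another 50 MB went out to peers
print(uploaded / downloaded)   # 1.0 -- the client stops once the ratio reaches 1:1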
Bandwidth really isn't a problem for Wikipedia; we use a fair amount of it (compared to Joe Bob's homepage, not compared to Yahoo) but it's not "strained".
I guess I was using "bandwidth" in a more generic sense, including CPU time and so forth. From the response times I normally get from Wikipedia, something is generally straining... (I have a broadband connection, and it happens fairly consistently, so I'm reasonably sure the issue is mostly on the server side.) This is not to disrespect Wikipedia or Wikimedia; it's just that in my personal experience the site is not as responsive as, say, eBay or Amazon. I must say I'm somewhat surprised at this, given all the hardware that has obviously been thrown at the problem of late. Perhaps the issue is just PHP... I don't know. Wikis that I've set up (which admittedly typically have only a dozen users) seem to have good response times, so I don't think it's the code per se, but rather the load. In any case, my thought was that BitTorrent would be a good way to distribute large Wikipedia files without substantially impacting any of the existing servers.
-Kelly
"KA" == Kelly Anderson kelly@acoin.com writes:
KA> I'm sorry if my ignorance is showing here, but I went to both sourceforge and gnutella.com and I still don't have any idea what gnutella is, other than it's a peer to peer networking protocol for file sharing. In other words, I can't figure out why Gnutella is different from Ares, the old Napster, or Bittorrent from an architectural point of view.
The main reason it's different -- and one that has nothing to do with Wikipedia, btw -- is that you don't need to connect to a central server the way you do with Napster or BitTorrent. You just need to know other people running Gnutella and get their IP addresses, and then you "discover" new addresses over time.
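[For illustration, a minimal Python sketch of that discovery step: seed the client with a few known addresses and widen the host list by asking each peer which peers it knows, which is roughly what Gnutella's ping/pong messages accomplish. The addresses and the in-memory "network" dict are invented for the example.]

from collections import deque

# Hypothetical view of the network: who each host would tell us about.
network = {
    "10.0.0.1": ["10.0.0.2", "10.0.0.3"],
    "10.0.0.2": ["10.0.0.4"],
    "10.0.0.3": [],
    "10.0.0.4": ["10.0.0.1"],
}

def discover(bootstrap, max_hosts=50):
    # Breadth-first growth of the host list from a few seed addresses.
    known, todo = set(bootstrap), deque(bootstrap)
    while todo and len(known) < max_hosts:
        host = todo.popleft()
        for addr in network.get(host, []):   # "pong" replies from that host
            if addr not in known:
                known.add(addr)
                todo.append(addr)
    return known

print(sorted(discover(["10.0.0.1"])))   # grows from one seed address to four hosts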
The main reason it'd be appropriate is that it's probably still the biggest P2P network. Because the protocol is open, there are lots of clients that use the network; LimeWire and BearShare, for example, both run on the Gnutella network.
One thing it's _not_ good at is publishing Web sites over P2P, which is probably what we'd need for Wikipedia.
KA> I am more familiar with BitTorrent, which is very useful for distributing copies of very large files (which a PDF version of Wikipedia would certainly qualify as) while using hardly any bandwidth on the main server.
I, for one, would _never_ download the entire Wikipedia PDF file. Not only would it be humongous, but several hundred or thousand articles would be edited between the time I started the download and the time it finished.
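[For illustration, the back-of-envelope arithmetic, with the dump size, line speed, and edit rate all picked as plausible-sounding assumptions rather than real figures.]

# Rough numbers only -- the dump size, line speed, and edit rate are assumptions.
dump_size_bytes  = 4 * 1024**3      # suppose a full PDF/dump weighs in around 4 GB
downstream_Bps   = 1e6 / 8          # a ~1 Mbit/s broadband line, in bytes per second
edits_per_second = 0.1              # suppose an edit lands every ten seconds or so

download_seconds = dump_size_bytes / downstream_Bps
print(download_seconds / 3600)                    # roughly 9.5 hours to fetch it
print(int(download_seconds * edits_per_second))   # ~3,400 articles change meanwhile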
It makes more sense to distribute a page at a time rather than the whole thing at once.
~ESP
Kelly Anderson wrote:
Would someone who is familiar with both Gnutella and Bittorrent tell me why using Bittorrent for such a project would be stupid? It would certainly use less of Wikipedia's already strained bandwidth.
Well, Wikipedia's bandwidth is not strained at all. And I don't think that large downloads of any kind are really going to be an issue.
There have been proposals, which I enthusiastically support, that we should in some organized and sensible fashion try to share our articles on Gnutella networks. The reason to use gnutella rather than bittorrent for such a thing is just that lots more people use gnutella than bittorrent, although I should add that once we decide on ways to share things, then sharing by either system should be simple enough.
But, perhaps I'm behind the times in some ways, so maybe I've got this all wrong.
--Jimbo
Perhaps I'm misunderstanding, but I don't see *any* practical benefits to this whatsoever. What are we trying to accomplish? Anyone who uses gnutella can fire up a web browser and browse the site; our articles are tiny, and don't even begin to approach the size where there is tangible benefit from distributing them peer-to-peer. Finally - do we really want to pollute the p2p network(s) with horrendously outdated versions of articles (since this is inevitably what would happen)?
Not to mention what the evil users can do: pack the trendy variant of Bagle/[insert virus name here] as "Wikipedia - <Legit article name>.pdf.scr", and start distributing all sorts of crap bearing the project name. Of course, one can argue that these people can already do so now: but remember, right now, we're not endorsing p2p as an official distribution channel of any type. The moment we start to is the moment when this type of flagrant abuse becomes unstoppable, because a user (especially if clueless) cannot tell whether an article he's getting is John Doe's POV wrapped in a Wikipedia header, a virus, or a legit article.
What's, ideally, the benefit? And how does it surpass normal browsing?
Cheers, -IK
Jimmy Wales wrote:
There have been proposals, which I enthusiastically support, that we should in some organized and sensible fashion try to share our articles on Gnutella networks. [...]
For 99.9% of the users 99.9% of the time you are exactly correct. The discussion of Gnutella and P2P came up in the context of the major "bandwidth stealers" who were downloading the entire site one page at a time and fetching images on an as-needed basis. Using Gnutella or BitTorrent to distribute one big zip file with all the images in it would solve this very specific issue.
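[For illustration, a Python sketch of the "big zip file with all the images" idea: bundle the image tree into one archive, publish a checksum next to the torrent so downloaders can verify what they got, and seed the archive with any ordinary BitTorrent client. The directory path and archive name are hypothetical.]

import hashlib
import zipfile
from pathlib import Path

IMAGE_DIR = Path("/var/wikipedia/images")          # hypothetical source tree
ARCHIVE   = Path("wikipedia-images-2004.zip")      # hypothetical bundle name

# Images are already compressed, so store them rather than re-compressing.
with zipfile.ZipFile(ARCHIVE, "w", compression=zipfile.ZIP_STORED) as zf:
    for path in sorted(IMAGE_DIR.rglob("*")):
        if path.is_file():
            zf.write(path, arcname=path.relative_to(IMAGE_DIR))

# Publish a checksum next to the torrent so downloaders can verify the archive.
sha1 = hashlib.sha1()
with ARCHIVE.open("rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        sha1.update(chunk)
print(ARCHIVE, sha1.hexdigest())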
If, as others have said, bandwidth isn't a scarce resource, then this is probably not an idea that should be implemented any time soon.
-Kelly