Google dug this up: http://www.nabble.com/BitTorrent-Downloads-Posted-for-enwiki-20070402-images...
When I fire up rTorrent, it says "can't resolve host" -- is there an updated version of this floating around? Did anyone ever successfully download it?
Thanks!
Download: http://meta.wikimedia.org/wiki/Wikix
You can get the images faster with wikix. You will need to download the latest XML dump. You will also need the image tags from the dump if you post the images anywhere, since some of them may be fair use.
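Roughly, the whole job is: pull the image references out of the dump, map each name to its upload URL, and fetch the files with a long pause between requests. A minimal sketch of that idea in Python follows -- it is not wikix itself, and the md5-prefix path layout on upload.wikimedia.org is an assumption to verify (many images actually live under the Commons path rather than the per-wiki one):

import hashlib
import re
import time
from urllib.parse import quote
from urllib.request import urlretrieve

# Not wikix -- a rough sketch only. Assumes the hashed directory layout
# on upload.wikimedia.org; check the real layout before relying on it.
IMAGE_RE = re.compile(r"\[\[(?:Image|File):([^|\]]+)")

def image_url(title):
    name = title.strip().replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return ("http://upload.wikimedia.org/wikipedia/en/"
            + digest[0] + "/" + digest[:2] + "/" + quote(name))

with open("enwiki-latest-pages-articles.xml", encoding="utf-8") as dump:
    for line in dump:
        for title in IMAGE_RE.findall(line):
            try:
                urlretrieve(image_url(title), title.strip().replace("/", "_"))
            except OSError:
                pass          # missing or oddly named file -- skip it
            time.sleep(2)     # deliberately slow, to keep the load negligible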
Jeff
Jeff -- thanks, that is exactly the sort of thing I'm looking for.
What I don't understand is how this fits in with the "no-crawler" policy?
Also I appreciate you sharing the tool with the community.
Thanks, Yousef
The number of people with 2.5 TB lying around to host Wikipedia's images, plus workspace for thumbnails and rendering, does not seem to be that large. Wikix is not very intrusive in any event.
Better have a lot of space. As of the 20071018 dump, the total for all images is:
[root@gadugi ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdb3             473G  9.1G  440G   3% /
/dev/sdb1             388M   14M  354M   4% /boot
tmpfs                 4.0G     0  4.0G   0% /dev/shm
/dev/sda1             1.1T  330G  716G  32% /wikidump
/dev/hda3             112G   36G   71G  34% /w
/dev/sdb4             616G  406G  179G  70% /image
[root@gadugi ~]#
406 GB is the total image payload for the English Wikipedia.
Jeff
On 10/26/07, Yousef Ourabi yourabi@zero-analog.com wrote:
What I don't understand is how this fits in with the "no-crawler" policy?
What "no-crawler" policy?
Anthony wrote:
What "no-crawler" policy?
[[Wikipedia:Database download#Please do not use a web crawler]]
--Darkwind
Wikix downloads images in a non-intrusive manner; in fact, it's no more intrusive than your average workstation browsing the site. This was by design. I intentionally made the tool slow in order to avoid any impact on the site.
Given Google's high rankings of Wikipedia pages, and the relationship with ask.com and others, it's obvious that massive web crawling of the site by these search engines is permitted and in fact encouraged.
Needless to say, wikix is nowhere near as intense as these other applications. Provided it is not being used in a malicious manner, it does not appear to impinge on this policy, especially since Wikimedia states at download.wikimedia.org that image tarballs will be supported at some point.
Jeff
Well, I've already started the scripts.
I would be interested in mirroring these images so people from the community who need access to them have multiple choices for download -- and thus remove some of the burden from Wikipedia.
What is the process for this (other than sending out an email)? How receptive would Wikipedia be to something like that?
Thanks.
On 10/26/07, RLS evendell@gmail.com wrote:
[[Wikipedia:Database download#Please do not use a web crawler]]
Have Google and Yahoo been informed of this policy?
BTW, that talks about articles, not images. And it contradicts robots.txt, especially "## we're disabling this experimentally 11-09-2006\n#Crawl-delay: 1"
It seems to stem from something said on the Village Pump back in 2003. I for one am going to go with robots.txt, not something someone said on some Wikipedia page.
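For what it's worth, it's easy to check what robots.txt actually says before pointing anything at the site. A minimal sketch using Python's standard robotparser module (the user-agent string and the paths queried are just placeholders):

from urllib import robotparser

# Fetch and parse the live robots.txt, then ask it about specific paths.
rp = robotparser.RobotFileParser("http://en.wikipedia.org/robots.txt")
rp.read()
print(rp.can_fetch("ExampleImageFetcher/0.1", "/wiki/Main_Page"))
print(rp.can_fetch("ExampleImageFetcher/0.1", "/w/index.php?title=Special:Export"))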
On 10/26/07, Anthony wikimail@inbox.org wrote:
It seems to stem from something said on the Village Pump back in 2003.
Here's the diff: http://en.wikipedia.org/w/index.php?title=Wikipedia_talk:Village_pump&di...
Some other fun stuff from the village pump circa 2003:
* "I suggest that all articles about movies and tv shows be scrapped, and instead have the links point to the apropriate page on the Internet Movie Database." - Vroman
* "There is never a good reason to delete perfectly good material from the Wikipedia. Wikipedia isn't paper." - Zoe
* "A wiki devoted just to movies and TV shows would not be a bad thing. We're probably not there yet, though." - Wapcaplet (inventor of Wikia?)
* "I fear we can't ban this range. Banning this is banning all of AOL." - JeLuF
* "As nobody else had edited this article, its arbitrary deletion was uncontroversial. However, deleting articles that someone else has edited (beyond blanking/reverts) is more controversial, with strong opinions on both sides." - Martin
On 10/26/07, Anthony wikimail@inbox.org wrote:
Have Google and Yahoo been informed of this policy?
No, since they're our number-one referrers.
BTW, that talks about articles, not images. And it contradicts robots.txt, especially "## we're disabling this experimentally 11-09-2006\n#Crawl-delay: 1"
It seems to stem from something said on the Village Pump back in 2003. I for one am going to go with robots.txt, not something someone said on some Wikipedia page.
I believe a more accurate story would be as follows:
1) Live mirrors of the site, however big or small, are discouraged without prior agreement. You're supposed to use the dumps for this. If you want to provide some kind of useful value-added "gateway" or framing or whatever, that for instance marks up the pages in some useful way or whatever, *and* you very clearly acknowledge the source and give a link, *and* you don't run ads or similar, *and* you don't use too much bandwidth, that's probably fine (although best to ask first). If you don't meet the preceding conditions, you may be asked to pay a fee for the mirroring service, or face blocking.
2) Anything that uses enough server resources to slow down the site will probably be blocked or killed if it's noticed. In the old days this was a concern, but nowadays it's probably not.
There was a page I once saw where someone had put up the statement that bots should only request pages once every ten seconds or something. When I looked in the histories, I saw that Brion had added it in like 2003, along with a description of the hardware Wikipedia was being run on: a single server with one Pentium CPU. Later someone removed the part of that edit with the grossly-outdated server description, but neglected to remove the by then ludicrous blanket restriction on crawlers.
Anyway, it comes down to this: it's always courteous to ask, but if you don't cause any actual damage, probably nobody will notice or care. Don't take that as any official party line; I'm not a sysadmin, but that seems to hold as far as I can tell.
On 10/27/07, Anthony wikimail@inbox.org wrote:
[[Wikipedia:Database download#Please do not use a web crawler]]
Have Google and Yahoo been informed of this policy?
Context: "Please do not use a web crawler to download large numbers of articles."
As in "Don't use a web crawler to get big amounts of data for your own personal use" (i.e. for mirroring). And it's quite valid, if lots of people downloaded the entire site one article at a time, we'd end up with big problems - especially seeing as the load would be evenly distributed across many articles, and hence there'd be a lot of extra parsing happening.
Google and Yahoo have nothing to do with this, as search engines would represent a tiny portion of our requests (whereas many users doing a lot of requesting would not), and use the data obtained for the public benefit.
On 10/28/07, Andrew Garrett andrew@epstone.net wrote:
Google and Yahoo have nothing to do with this, as search engines would represent a tiny portion of our requests (whereas many users doing a lot of requesting would not), and use the data obtained for the public benefit.
The same could be said about Yousef Ourabi, though. He's only one person, and he's "interested in mirroring these images so people from the community who need access to them have multiple choices for download".
I think Simetrical got the right de facto policy. Don't run a live mirror, and don't slow down or break anything, and no one's going to care or even notice.
On 10/25/07, Yousef Ourabi yourabi@zero-analog.com wrote:
Google dug this up: http://www.nabble.com/BitTorrent-Downloads-Posted-for-enwiki-20070402-images...
When I fire up rTorrent, it says "can't resolve host" -- is there an updated version of this floating around? Did anyone ever successfully download it?
Thanks!
Anyone who wants copies of the image collection and seriously has the storage to take one, please contact me. I've been doing one-off rsync image feeds off one of my own systems.
Transferring the files via HTTP is miserably slow, especially if you want the full 1.6 TB-ish collection. ;)
On 10/26/07, Gregory Maxwell gmaxwell@gmail.com wrote:
Anyone who wants copies of the image collection and seriously has the storage to take one, please contact me. I've been doing one off rsync image feeds off one of my own systems.
Hmm, depends. How would you be transferring them to me?
Transferring the files via HTTP is miserably slow, especially if you want the full 1.6TBish collection. ;)
I'd imagine it's bandwidth-limited no matter how you do it. 15-16 days or so at 10 Mbps. Add 4-5 days for HTTP/1.1 handshaking and say 20 days. Assuming you're using some sort of pipelining, of course.
http://www.google.com/search?q=%281.6+terabytes+divided+by+10+megabits%29+di...
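Spelled out, assuming decimal terabytes and an idealized, fully saturated 10 Mbit/s link:

# Back-of-the-envelope check of the estimate above.
size_bits = 1.6e12 * 8      # 1.6 TB expressed in bits
rate_bps = 10e6             # 10 Mbit/s
days = size_bits / rate_bps / 86400
print(round(days, 1))       # ~14.8 days before any protocol overhead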