Hi,
I have a suggestion for Wikipedia! I think the database dumps, including the image files, should be made available through a Wikipedia BitTorrent tracker, so that people could download Wikipedia backups with the images (which they currently can't do) and so that Wikipedia's bandwidth costs would be reduced. I think it is important that Wikipedia can be downloaded for offline use, now and in the future.
best regards, Jamie Morken
Jamie Morken wrote:
Hi,
I have a suggestion for Wikipedia! I think the database dumps, including the image files, should be made available through a Wikipedia BitTorrent tracker, so that people could download Wikipedia backups with the images (which they currently can't do) and so that Wikipedia's bandwidth costs would be reduced. I think it is important that Wikipedia can be downloaded for offline use, now and in the future.
best regards, Jamie Morken
This has been tried before (when the dumps were smaller). How many people do you think will have the necessary space and be willing to download it?
I have been using the dumps for a few months and I think this kind of dump is much better than a torrent. Yes, bandwidth can be saved, but I do not think the cost of bandwidth is higher than the cost of maintaining the torrents.
If people are not hosting (seeding) the files, the value of torrents is limited.
I think regular mirroring is much better but it all depends on the willingness of people to host the files.
bilal -- Verily, with hardship comes ease.
On Thu, Jan 7, 2010 at 11:30 AM, Platonides Platonides@gmail.com wrote:
Jamie Morken wrote:
Hi,
I have a suggestion for Wikipedia! I think the database dumps, including the image files, should be made available through a Wikipedia BitTorrent tracker, so that people could download Wikipedia backups with the images (which they currently can't do) and so that Wikipedia's bandwidth costs would be reduced. I think it is important that Wikipedia can be downloaded for offline use, now and in the future.
best regards, Jamie Morken
This has been tried before (when the dumps were smaller). How many people do you think will have the necessary space and be willing to download it?
On 01/07/2010 01:40 AM, Jamie Morken wrote:
I have a suggestion for Wikipedia! I think the database dumps, including the image files, should be made available through a Wikipedia BitTorrent tracker, so that people could download Wikipedia backups with the images (which they currently can't do) and so that Wikipedia's bandwidth costs would be reduced. [...]
Is the bandwidth used really a big problem? Bandwidth is pretty cheap these days, and given Wikipedia's total draw, I suspect the occasional dump download isn't much of a problem.
Bittorrent's real strength is when a lot of people want to download the same thing at once. E.g., when a new Ubuntu release comes out. Since Bittorrent requires all downloaders to be uploaders, it turns the flood of users into a benefit. But unless somebody has stats otherwise, I'd guess that isn't the problem here.
William
It is possible to download our Zeno (articles or images) dump with http://okawix.com
----- Original Message -----
From: "William Pietri" <william@scissor.com>
To: "Wikimedia developers" <wikitech-l@lists.wikimedia.org>
Sent: Thursday, January 07, 2010 5:52 PM
Subject: Re: [Wikitech-l] downloading wikipedia database dumps
On 01/07/2010 01:40 AM, Jamie Morken wrote:
I have a suggestion for Wikipedia! I think the database dumps, including the image files, should be made available through a Wikipedia BitTorrent tracker, so that people could download Wikipedia backups with the images (which they currently can't do) and so that Wikipedia's bandwidth costs would be reduced. [...]
Is the bandwidth used really a big problem? Bandwidth is pretty cheap these days, and given Wikipedia's total draw, I suspect the occasional dump download isn't much of a problem.
Bittorrent's real strength is when a lot of people want to download the same thing at once. E.g., when a new Ubuntu release comes out. Since Bittorrent requires all downloaders to be uploaders, it turns the flood of users into a benefit. But unless somebody has stats otherwise, I'd guess that isn't the problem here.
William
Hello,
Is the bandwidth used really a big problem? Bandwidth is pretty cheap these days, and given Wikipedia's total draw, I suspect the occasional dump download isn't much of a problem.
I am not sure about the cost of the bandwidth, but the Wikipedia image dumps are no longer available on the Wikipedia dump site anyway. I am guessing they were removed partly because of the bandwidth cost, or perhaps because of image licensing issues.
from: http://en.wikipedia.org/wiki/Wikipedia_database#Images_and_uploaded_files
"Currently Wikipedia does not allow or provide facilities to download all images. As of 17 May 2007, Wikipedia disabled or neglected all viable bulk downloads of images, including torrent trackers. Therefore, there is no way to download image dumps other than scraping Wikipedia pages or using Wikix, which converts a database dump into a series of scripts to fetch the images.
Unlike most article text, images are not necessarily licensed under the GFDL & CC-BY-SA-3.0. They may be under one of many free licenses, in the public domain, believed to be fair use, or even copyright infringements (which should be deleted). In particular, use of fair use images outside the context of Wikipedia or similar works may be illegal. Images under most licenses require a credit, and possibly other attached copyright information. This information is included in image description pages, which are part of the text dumps available from download.wikimedia.org. In conclusion, download these images at your own risk (Legal)"
Bittorrent's real strength is when a lot of people want to download the same thing at once. E.g., when a new Ubuntu release comes out. Since Bittorrent requires all downloaders to be uploaders, it turns the flood of users into a benefit. But unless somebody has stats otherwise, I'd guess that isn't the problem here.
Bittorrent is simply a more efficient method to distribute files, especially if the much larger Wikipedia image files were made available again. The last dump from English Wikipedia including images is over 200 GB, but is understandably not available for download. Even if there are only 10 people per month who download these large files, bittorrent should be able to reduce the bandwidth cost to Wikipedia significantly. Also, I think that setting up bittorrent for this would cost Wikipedia a small amount, may save money in the long run, and would encourage people to experiment with offline encyclopedia usage. Making people crawl Wikipedia with Wikix if they want to download the images is a bad solution, as it means the images are downloaded inefficiently. Also, one Wikix user reported that his download connection was cut off by a Wikipedia admin for "remote downloading".
Unless there are legal reasons for not allowing images to be downloaded, I think the Wikipedia image files should be made available for efficient download again. However, since Wikix can theoretically be used to download the images, I think it would also be legal to allow the image dump to be downloaded as well. Thoughts?
cheers, Jamie
William
On Fri, Jan 8, 2010 at 4:31 PM, Jamie Morken jmorken@shaw.ca wrote:
Bittorrent is simply a more efficient method to distribute files, especially if the much larger Wikipedia image files were made available again. The last dump from English Wikipedia including images is over 200 GB, but is understandably not available for download. Even if there are only 10 people per month who download these large files, bittorrent should be able to reduce the bandwidth cost to Wikipedia significantly. Also, I think that setting up bittorrent for this would cost Wikipedia a small amount, may save money in the long run, and would encourage people to experiment with offline encyclopedia usage. Making people crawl Wikipedia with Wikix if they want to download the images is a bad solution, as it means the images are downloaded inefficiently. Also, one Wikix user reported that his download connection was cut off by a Wikipedia admin for "remote downloading".
The problem with BitTorrent is that it is unsuitable for rapidly changing data sets, such as images. If you want to add a single file to the torrent, the entire torrent hash changes, meaning that you end up with separate peer pools for every different data set, although they mostly contain the same files.
That said, it could of course be beneficial for an initial dump download, and it is better than the current situation where there is nothing available at all.
Bryan
On Fri, Jan 8, 2010 at 10:31 AM, Jamie Morken jmorken@shaw.ca wrote:
I am not sure about the cost of the bandwidth, but the Wikipedia image dumps are no longer available on the Wikipedia dump site anyway. I am guessing they were removed partly because of the bandwidth cost, or perhaps because of image licensing issues.
I think we just don't have infrastructure set up to dump images. I'm very sure bandwidth is not an issue -- the number of people with a terabyte (or is it more?) of space handy to download a Wikipedia image dump will be vanishingly small compared to normal users. Licensing wouldn't be an issue for Commons, at least, as long as it's easy to link the images up to their license pages. (I imagine it would technically violate some licenses, but no one would probably worry about it.)
Bittorrent is simply a more efficient method to distribute files, especially if the much larger Wikipedia image files were made available again. The last dump from English Wikipedia including images is over 200 GB, but is understandably not available for download. Even if there are only 10 people per month who download these large files, bittorrent should be able to reduce the bandwidth cost to Wikipedia significantly.
Wikipedia uses an average of multiple gigabits per second of bandwidth, as I recall. One gigabit per second adds up to about 10.5 terabytes per day, so say 300 terabytes per month. I'm pretty sure the average figure is more like five or ten Gbps than one, so let's say a petabyte a month at least. Ten people per month downloading an extra terabyte is not a big issue. And I really doubt we'd see that many people downloading a full image dump every month.
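A quick back-of-the-envelope check of those figures, assuming decimal units (1 Gb = 10^9 bits):

    echo '10^9 / 8 * 86400 / 10^12' | bc -l           # 1 Gbit/s sustained is ~10.8 TB per day
    echo '10^9 / 8 * 86400 * 30 * 5 / 10^15' | bc -l  # 5 Gbit/s sustained is ~1.6 PB per month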
The sensible bandwidth-saving way to do it would be to set up an rsync daemon on the image servers, and let people use that. Then you could get an old copy of the files from anywhere (including Bittorrent, if you like) and only have to download the changes. Plus, you could get up-to-the-minute copies if you like, although probably some throttling should be put into place to stop dozens of people from all running rsync in a loop to make sure they have the absolute latest version. I believe rsync 2 doesn't handle such huge numbers of files acceptably, but I heard rsync 3 is supposed to be much better. That sounds like a better direction to look in than Bittorrent -- nobody's going to want to redownload the same files constantly to get an up-to-date set.
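For what it's worth, from the mirror operator's side that could be as simple as the sketch below; the host and module names are hypothetical, since no such public rsync endpoint exists:

    # First run copies everything; later runs only transfer new or changed files.
    rsync -av --partial rsync://dumps.example.org/commons-media/ /srv/commons-mirror/
    # Subsequent syncs, also removing files deleted upstream:
    rsync -av --delete rsync://dumps.example.org/commons-media/ /srv/commons-mirror/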
Unless there are legal reasons for not allowing images to be downloaded, I think the Wikipedia image files should be made available for efficient download again.
I'm pretty sure the reason there's no image dump is purely because not enough resources have been devoted to getting it working acceptably.
On Fri, Jan 8, 2010 at 10:56 AM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
On Fri, Jan 8, 2010 at 10:31 AM, Jamie Morken jmorken@shaw.ca wrote:
I am not sure about the cost of the bandwidth, but the Wikipedia image dumps are no longer available on the Wikipedia dump site anyway. I am guessing they were removed partly because of the bandwidth cost, or perhaps because of image licensing issues.
I think we just don't have infrastructure set up to dump images. I'm very sure bandwidth is not an issue -- the number of people with a
Correct. The space wasn't available for the required intermediate cop(y|ies).
terabyte (or is it more?) of space handy to download a Wikipedia image dump will be vanishingly small compared to normal users.
s/terabyte/several terabytes/ My copy is not up to date, but it's not smaller than 4.
Licensing wouldn't be an issue for Commons, at least, as long as it's easy to link the images up to their license pages. (I imagine it would technically violate some licenses, but no one would probably worry about it.)
We also dump the licensing information. If we can lawfully put the images on the website then we can also distribute them in dump form. There is and can be no licensing problem.
Wikipedia uses an average of multiple gigabits per second of bandwidth, as I recall.
http://www.nedworks.org/~mark/reqstats/trafficstats-daily.png
Though only this part is paid for: http://www.nedworks.org/~mark/reqstats/transitstats-daily.png
The rest is peering, etc. which is only paid for in the form of equipment, port fees, and operational costs.
The sensible bandwidth-saving way to do it would be to set up an rsync daemon on the image servers, and let people use that.
This was how I maintained a running mirror for a considerable time.
Unfortunately the process broke when WMF ran out of space and needed to switch servers.
On Fri, Jan 8, 2010 at 10:31 AM, Jamie Morken jmorken@shaw.ca wrote:
Bittorrent is simply a more efficient method to distribute files,
No. In a very real absolute sense bittorrent is considerably less efficient than other means.
Bittorrent moves more of the outbound traffic to the edges of the network where the real cost per gbit/sec is much greater than at major datacenters, because a megabit on a low speed link is more costly than a megabit on a high speed link and a megabit on 1 mile of fiber is more expensive than a megabit on 10 feet of fiber.
Moreover, bittorrent is topology unaware, so the path length tends to approach the Internet's mean path length. Datacenters tend to be more centrally located topology-wise, and topology-aware distribution is easily applied to centralized stores. (E.g. WMF satisfies requests from Europe in Europe, though not for the dump downloads, as there simply isn't enough traffic to justify it.)
Bittorrent also is a more complicated, higher overhead service which requires more memory and more disk IO than traditional transfer mechanisms.
There are certainly cases where bittorrent is valuable, such as the flash mob case of a new OS release. This really isn't one of those cases.
On Thu, Jan 7, 2010 at 11:52 AM, William Pietri william@scissor.com wrote:
On 01/07/2010 01:40 AM, Jamie Morken wrote:
I have a suggestion for Wikipedia! I think the database dumps, including the image files, should be made available through a Wikipedia BitTorrent tracker, so that people could download Wikipedia backups with the images (which they currently can't do) and so that Wikipedia's bandwidth costs would be reduced. [...]
Is the bandwidth used really a big problem? Bandwidth is pretty cheap these days, and given Wikipedia's total draw, I suspect the occasional dump download isn't much of a problem.
Bittorrent's real strength is when a lot of people want to download the same thing at once. E.g., when a new Ubuntu release comes out. Since Bittorrent requires all downloaders to be uploaders, it turns the flood of users into a benefit. But unless somebody has stats otherwise, I'd guess that isn't the problem here.
We tried BitTorrent for the Commons Picture of the Year archive once while I was watching, and we never had a downloader stay connected long enough to help another downloader... and that was only 500 MB, much easier to seed.
BT also makes the server costs a lot higher: it has more cpu/memory overhead, and creates a lot of random disk IO. For low volume large files it's often not much of a win.
I haven't seen the numbers for a long time, but when I last looked download.wikimedia.org was producing fairly little traffic... and much of what it was producing was outside of the peak busy hour for the sites. Since the transit is paid for at the 95th percentile and the WMF still has a decent day/night swing, out-of-peak traffic is effectively free. The bandwidth is nothing to worry about.
On Fri, Jan 8, 2010 at 8:24 AM, Gregory Maxwell gmaxwell@gmail.com wrote:
s/terabyte/several terabytes/ My copy is not up to date, but it's not smaller than 4.
The topmost (current) versions of Commons files total about 4.9 TB; files on enwiki but not Commons add another 200 GB or so.
-Robert Rohde
On Fri, Jan 8, 2010 at 10:56 AM, Aryeh Gregor <Simetrical+wikilist@gmail.com> wrote:
The sensible bandwidth-saving way to do it would be to set up an rsync daemon on the image servers, and let people use that.
The bandwidth-saving way to do things would be to just allow mirrors to use hotlinking. Requiring a middle man to temporarily store images (many, and possibly even most of which will never even be downloaded by end users) just wastes bandwidth.
William Pietri wrote:
On 01/07/2010 01:40 AM, Jamie Morken wrote:
I have a suggestion for Wikipedia! I think the database dumps, including the image files, should be made available through a Wikipedia BitTorrent tracker, so that people could download Wikipedia backups with the images (which they currently can't do) and so that Wikipedia's bandwidth costs would be reduced. [...]
Is the bandwidth used really a big problem? Bandwidth is pretty cheap these days, and given Wikipedia's total draw, I suspect the occasional dump download isn't much of a problem.
No, bandwidth is not really the problem here. I think the core issue is to have bulk access to images.
There have been a number of these requests in the past and after talking back and forth, it has usually been the case that a smaller subset of the data works just as well.
A good example of this was the Deutsche Fotothek archive made late last year.
http://download.wikipedia.org/images/Deutsche_Fotothek.tar ( 11GB )
This provided an easily retrievable high quality subset of our image data which researchers could use.
Now if we were to snapshot image data and store them for a particular project the amount of duplicate image data would become significant. That's because we re-use a ton of image data between projects and rightfully so.
If instead we package all of Commons into a tarball then we get roughly 6 TB of image data, which after numerous conversations has been a bit more than most people want to process.
So what does everyone think of going down the collections route?
If we provide enough different and up to date ones then we could easily give people a large but manageable amount of data to work with.
If there is a page already for this then please feel free to point me to it otherwise I'll create one.
--tomasz
I think having access to them on the Commons repository is much easier to handle. A subset should be good enough.
Handling 11 TB of images requires huge research capabilities in order to work with all of them.
Maybe a special API or advanced API functions would allow people enough access, and at the same time save the bandwidth and the hassle of handling this behemoth collection.
bilal -- Verily, with hardship comes ease.
On Fri, Jan 8, 2010 at 1:57 PM, Tomasz Finc tfinc@wikimedia.org wrote:
William Pietri wrote:
On 01/07/2010 01:40 AM, Jamie Morken wrote:
I have a suggestion for Wikipedia! I think the database dumps, including the image files, should be made available through a Wikipedia BitTorrent tracker, so that people could download Wikipedia backups with the images (which they currently can't do) and so that Wikipedia's bandwidth costs would be reduced. [...]
Is the bandwidth used really a big problem? Bandwidth is pretty cheap these days, and given Wikipedia's total draw, I suspect the occasional dump download isn't much of a problem.
No, bandwidth is not really the problem here. I think the core issue is to have bulk access to images.
There have been a number of these requests in the past and after talking back and forth, it has usually been the case that a smaller subset of the data works just as well.
A good example of this was the Deutsche Fotothek archive made late last year.
http://download.wikipedia.org/images/Deutsche_Fotothek.tar ( 11GB )
This provided an easily retrievable high quality subset of our image data which researchers could use.
Now if we were to snapshot image data and store them for a particular project the amount of duplicate image data would become significant. That's because we re-use a ton of image data between projects and rightfully so.
If instead we package all of Commons into a tarball then we get roughly 6 TB of image data, which after numerous conversations has been a bit more than most people want to process.
So what does everyone think of going down the collections route?
If we provide enough different and up to date ones then we could easily give people a large but manageable amount of data to work with.
If there is a page already for this then please feel free to point me to it otherwise I'll create one.
--tomasz
On Fri, Jan 8, 2010 at 3:28 PM, Bilal Abdul Kader bilalak@gmail.com wrote:
I think having access to them on the Commons repository is much easier to handle. A subset should be good enough.
Handling 11 TB of images requires huge research capabilities in order to work with all of them.
Maybe a special API or advanced API functions would allow people enough access, and at the same time save the bandwidth and the hassle of handling this behemoth collection.
Well, if there were an rsyncd you could just fetch the ones you wanted arbitrarily.
Can someone articulate what the use case is?
Is there someone out there who could use a 5 TB image archive but is disappointed it doesn't exist? Seems rather implausible.
If not, then I assume that everyone is really after only some subset of the files. If that's the case we should try to figure out what kinds of subsets and the best way to handle them.
-Robert Rohde
On Fri, Jan 8, 2010 at 3:55 PM, Robert Rohde rarohde@gmail.com wrote:
Can someone articulate what the use case is?
Is there someone out there who could use a 5 TB image archive but is disappointed it doesn't exist? Seems rather implausible.
If not, then I assume that everyone is really after only some subset of the files. If that's the case we should try to figure out what kinds of subsets and the best way to handle them.
Er. I've maintained a non-WMF disaster recovery archive for a long time, though it's no longer completely current since the rsync went away and web fetching is lossy.
It saved our rear a number of times, saving thousands of images from irreparable loss. Moreover it allowed things like image hashing before we had that in the database, and it would allow perceptual lossy hash matching if I ever got around to implementing tools to access the output.
There really are use cases. Moreover, making complete copies of the public data available as dumps to the public is a WMF board supported initiative.
Gregory Maxwell wrote:
Er. I've maintained a non-WMF disaster recovery archive for a long time, though it's no longer completely current since the rsync went away and web fetching is lossy.
And the box ran out of disk space. We could try until it fills again, though.
A sysadmin fixing images with wrong hashes would also be nice https://bugzilla.wikimedia.org/show_bug.cgi?id=17057#c3
It saved our rear a number of times, saving thousands of images from irreparable loss. Moreover it allowed things like image hashing before we had that in the database, and it would allow perceptual lossy hash matching if I ever got around to implementing tools to access the output.
IMHO the problem is not accessing it, but hashing those terabytes of images.
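For scale, the bulk hashing in question is essentially just a recursive checksum pass over a local mirror. A rough sketch (the mirror path is hypothetical); on terabytes of images this is I/O-bound and can take days, which is the real cost being discussed:

    find /srv/commons-mirror -type f -print0 \
      | xargs -0 -n 64 -P 4 sha1sum >> commons-sha1.txt   # one SHA-1 line per file, 4 workers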
There really are use cases. Moreover, making complete copies of the public data available as dumps to the public is a WMF board supported initiative.
On Fri, Jan 8, 2010 at 2:37 PM, Gregory Maxwell gmaxwell@gmail.com wrote:
Er. I've maintained a non-WMF disaster recovery archive for a long time, though it's no longer completely current since the rsync went away and web fetching is lossy.
It saved our rear a number of times, saving thousands of images from irreparable loss.
While I certainly can't fault your good will, I do find it disturbing that it was necessary. Ideally, Wikimedia should have internal backups of sufficient quality that we don't have to depend on what third parties happen to have saved for any circumstance short of meteors falling from the heavens.
Moreover it allowed things like image hashing before we had that in the database, and it would allow perceptual lossy hash matching if I ever got around to implementing tools to access the output.
If the goal is some version of "do something useful for Wikimedia", then it actually seems rather bizarre to have the first step be "copy X TB of gradually changing data to privately owned and managed servers". For Wikimedia applications, it would seem much more natural to make tools and technology available to do such things within Wikimedia. That way developers could work on such problems without having to worry about how much disk space they can personally afford. Again, there is nothing wrong with you generously doing such things with your own resources, but ideally running duplicate repositories for the benefit of Wikimedia should be unnecessary.
There really are use cases. Moreover, making complete copies of the public data available as dumps to the public is a WMF board supported initiative.
I agree with the goal of making WMF content available, but given that we don't offer any image dump right now and a comprehensive dump as such would be usable to almost no one, then I don't think a classic dump is where we should start. Even you don't seem to want that. If I understand correctly, you'd like to have an easier way to reliably download individual image files. You wouldn't actually want to be presented with some form of monolithic multi-terabyte tarball each month.
Hence, I would say it makes more sense to discuss ways to make individual images and user-specified subsets of images more easily available. The same gateways that could allow you to keep synchronized could also help other people download individual files. Other goals could see functions like export pages expanded to include options to download all associated image files at the same time one downloads a set of wikitext.
The general point I am trying to make is that if we think about what people really want, and how the files are likely to be used, then there may be better delivery approaches than trying to create huge image dumps.
-Robert Rohde
2010/1/9 Robert Rohde rarohde@gmail.com:
The general point I am trying to make is that if we think about what people really want, and how the files are likely to be used, then there may be better delivery approaches than trying to create huge image dumps.
Whilst, I'd hope, not letting the quest for the perfect solution hold up in any way the quest for a better-than-nothing solution. Because "nothing" is what we have right now, and that's really really bad.
- d.
On Fri, Jan 8, 2010 at 8:25 PM, Robert Rohde rarohde@gmail.com wrote:
While I certainly can't fault your good will, I do find it disturbing that it was necessary. Ideally, Wikimedia should have internal backups of sufficient quality that we don't have to depend on what third parties happen to have saved for any circumstance short of meteors falling from the heavens.
Yea, well, you can't easily eliminate all the internal points of failure. "someone with root loses control of their access and someone nasty wipes everything" is really hard to protect against with online systems.
Avoiding the case where some failure is reliably replicated among all of WMF's copies (which was the case in the deletions I recovered: they were redundant copies, which were deleted too) can best be accomplished with an air-gap.
And meteors *do* fall, if rarely. WMF can be robust against that -- for only the price of making all the data available, which is something worth doing for other principled and practical reasons.
"Within Wikimedia" means that Wikimedia remains a single point of failure. This is too easy to avoid: disk space is cheap, and not your problem. At least a few third parties will create and maintain full copies, and this is a good thing.
Moreover it allowed things like image hashing before we had that in the database, and it would allow perceptual lossy hash matching if I ever got around to implementing tools to access the output.
If the goal is some version of "do something useful for Wikimedia", then it actually seems rather bizarre to have the first step be "copy X TB of gradually changing data to privately owned and managed servers". For Wikimedia applications, it would seem much more natural to make tools and technology available to do such things within Wikimedia. That way developers could work on such problems without having to worry about how much disk space they can personally afford. Again, there is nothing wrong with you generously doing such things with your own resources, but ideally running duplicate repositories for the benefit of Wikimedia should be unnecessary.
"Within Wikimedia" means within Wikimedia's means, priorities, and politics. Having it locally means that if I decide I want to saturate a dozen cores computing perceptual hashes for a week, I don't have to convince anyone else that it's a good use of resources. I don't have to convince Wikimedia to fund a project, I don't have to take up resources which might be better used by someone else, and I don't have to set any expectations that I might not live up to.
Of course, it's great to have public resources 'locally' (which is what the toolserver is for), but it doesn't cover all cases.
There really are use cases. Moreover, making complete copies of the public data available as dumps to the public is a WMF board supported initiative.
I agree with the goal of making WMF content available, but given that we don't offer any image dump right now and a comprehensive dump as such would be usable to almost no one, then I don't think a classic dump is where we should start. Even you don't seem to want that. If I understand correctly, you'd like to have an easier way to reliably download individual image files. You wouldn't actually want to be presented with some form of monolithic multi-terabyte tarball each month.
No one wants the monolithic tarball. The way I got updates previously was via a rsync push.
No one sane would suggest a monolithic tarball: it's too much of a pain to produce!
Image dump != monolithic tarball.
But I think producing subsets is pretty much worthless. I can't think of a valid use for any reasonably sized subset. ("All media used on big wiki X" is a useful subset I've produced for people before, but it's not small enough to be a big win vs a full copy)
[snip]
The general point I am trying to make is that if we think about what people really want, and how the files are likely to be used, then there may be better delivery approaches than trying to create huge image dumps.
If all is made available then everyone's wants can be satisfied. No subset is going to get us there. Of course, there are a lot of possibilities for the means of transmission, but I think it would be most useful to assume that at least a few people are going to want to grab everything.
On Fri, Jan 8, 2010 at 9:06 PM, Gregory Maxwell gmaxwell@gmail.com wrote:
Yea, well, you can't easily eliminate all the internal points of failure. "someone with root loses control of their access and someone nasty wipes everything" is really hard to protect against with online systems.
Isn't that what the system immutable flag is for?
It's easy, as long as you're willing to put up with a bit of whining from the person with root access.
On Fri, Jan 8, 2010 at 9:40 PM, Anthony wikimail@inbox.org wrote:
Isn't that what the system immutable flag is for?
No, that's for confusing the real roots while providing only a speed bump to an actual hacker. Anyone with root access can always just unset the flag. Or, failing that, dd if=/dev/zero of=/dev/sda works pretty well.
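To illustrate: on Linux the immutable flag really does block writes, but anyone who already has root can clear it again in one command, so it is only a speed bump:

    touch /backup/dump.tar
    chattr +i /backup/dump.tar   # set the immutable flag (ext2/3/4, needs root)
    rm -f /backup/dump.tar       # fails with "Operation not permitted"
    chattr -i /backup/dump.tar   # root simply clears the flag again
    rm -f /backup/dump.tar       # now succeeds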
On Sat, Jan 9, 2010 at 11:09 PM, Aryeh Gregor <Simetrical+wikilist@gmail.com> wrote:
On Fri, Jan 8, 2010 at 9:40 PM, Anthony wikimail@inbox.org wrote:
Isn't that what the system immutable flag is for?
No, that's for confusing the real roots while providing only a speed bump to an actual hacker. Anyone with root access can always just unset the flag. Or, failing that, dd if=/dev/zero of=/dev/sda works pretty well.
Depends on the machine's securelevel.
On Sat, Jan 9, 2010 at 11:26 PM, Anthony wikimail@inbox.org wrote:
Depends on the machine's securelevel.
Google informs me that securelevel is a BSD feature. Wikimedia uses Linux and Solaris. It might make sense to have backups be sent to a server that no one has remote access to, say, but the point is that there's always a possibility of some kind of catastrophic failure. Maybe the servers will all catch on fire and melt, maybe someone will get past datacenter security and steal them, who knows. Remember the story of the company that carefully kept off-site backups -- on a different floor of the World Trade Center, where its servers were located.
It doesn't hurt to have extra copies out there, and it certainly fits with Wikimedia's mission. Any researcher who wants to analyze the whole data set should be able to without having to ask anyone's permission -- that's what free information is about.
On Sat, Jan 9, 2010 at 11:40 PM, Aryeh Gregor <Simetrical+wikilist@gmail.com> wrote:
On Sat, Jan 9, 2010 at 11:26 PM, Anthony wikimail@inbox.org wrote:
Depends on the machine's securelevel.
Google informs me that securelevel is a BSD feature. Wikimedia uses Linux and Solaris.
Well, Greg's comment wasn't specific to Linux or Solaris. In any case, I don't know about Solaris, but Linux seems to have some sort of CAP_LINUX_IMMUTABLE and CAP_SYS_RAWIO. I'm sure Solaris has something similar.
It doesn't hurt to have extra copies out there
Certainly not.
On Fri, Jan 8, 2010 at 6:06 PM, Gregory Maxwell gmaxwell@gmail.com wrote: <snip>
No one wants the monolithic tarball. The way I got updates previously was via a rsync push.
No one sane would suggest a monolithic tarball: it's too much of a pain to produce!
I know that you didn't want or use a tarball, but requests for an "image dump" are not that uncommon, and often the requester is envisioning something like a tarball. Arguably that is what the originator of this thread seems to have been asking for. I think you and I are probably mostly on the same page about the virtue of ensuring that images can be distributed and that monolithic approaches are bad.
<snip>
But I think producing subsets is pretty much worthless. I can't think of a valid use for any reasonably sized subset. ("All media used on big wiki X" is a useful subset I've produced for people before, but it's not small enough to be a big win vs a full copy)
Wikipedia itself has gotten so large that increasingly people are mirroring subsets rather than allocating the space for a full mirror (e.g. 10,000 pages on cooking, or medicine, or whatever). Grabbing the images needed for such an application would be useful. I can also see virtues in having a way to grab all images in a category (or set of categories). For example, grab all images of dogs, or all images of Barack Obama. In case you think this is all hypothetical, I've actually downloaded tens of thousands of images on more than one occasion to support topical projects.
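Category-based subsets like that can already be approximated against the public API. A rough sketch only -- the category is just an example, and a real job would need to handle API continuation, rate limiting and a proper User-Agent:

    curl -s 'http://commons.wikimedia.org/w/api.php?action=query&generator=categorymembers&gcmtitle=Category:Barack_Obama&gcmtype=file&gcmlimit=50&prop=imageinfo&iiprop=url&format=json' \
      | grep -o '"url":"[^"]*"' | cut -d'"' -f4 | sed 's|\\/|/|g' \
      | while read -r url; do curl -s -O "$url"; done   # fetch each original file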
<snip>
If all is made available then everyone's wants can be satisfied. No subset is going to get us there. Of course, there are a lot of possibilities for the means of transmission, but I think it would be most useful to assume that at least a few people are going to want to grab everything.
Of course, strictly speaking we already provide HTTP access to everything. So the real question is how can we make access easier, more reliable, and less burdensome. You or someone else suggested an API for grabbing files and that seems like a good idea. Ultimately the best answer may well be to take multiple approaches to accommodate both people like you who want everything as well as people that want only more modest collections.
-Robert Rohde
Robert Rohde wrote:
Of course, strictly speaking we already provide HTTP access to everything. So the real question is how can we make access easier, more reliable, and less burdensome. You or someone else suggested an API for grabbing files and that seems like a good idea. Ultimately the best answer may well be to take multiple approaches to accommodate both people like you who want everything as well as people that want only more modest collections.
-Robert Rohde
Anthony wrote:
The bandwidth-saving way to do things would be to just allow mirrors to use hotlinking. Requiring a middle man to temporarily store images (many, and possibly even most of which will never even be downloaded by end users) just wastes bandwidth.
There is already a way to instruct a wiki to use images from a foreign wiki as they are needed. With proper caching.
On 1.16 it will even be much easier, as you will only need to set $wgUseInstantCommons = true; to use Wikimedia Commons images. http://www.mediawiki.org/wiki/Manual:$wgUseInstantCommons
On Sat, Jan 9, 2010 at 7:44 AM, Platonides Platonides@gmail.com wrote:
Robert Rohde wrote:
Of course, strictly speaking we already provide HTTP access to everything. So the real question is how can we make access easier, more reliable, and less burdensome. You or someone else suggested an API for grabbing files and that seems like a good idea. Ultimately the best answer may well be to take multiple approaches to accommodate both people like you who want everything as well as people that want only more modest collections.
-Robert Rohde
Anthony wrote:
The bandwidth-saving way to do things would be to just allow mirrors to use hotlinking. Requiring a middle man to temporarily store images (many, and possibly even most of which will never even be downloaded by end users) just wastes bandwidth.
There is already a way to instruct a wiki to use images from a foreign wiki as they are needed. With proper caching.
On 1.16 it will even be much easier, as you will only need to set $wgUseInstantCommons = true; to use Wikimedia Commons images. http://www.mediawiki.org/wiki/Manual:$wgUseInstantCommons
I'd really like to underline this last piece, as it's something I feel we're not promoting as heavily as we should be -- with 1.16 making it a one-line switch to turn on, perhaps we should publicize it. Thanks to work Brion did in 1.13, which I picked up later on, we have the ability to use files from Wikimedia Commons (or potentially any MediaWiki installation). As pointed out above, this has configurable caching that can be set as aggressively as you'd like.
To mirror Wikipedia these days, all you'd need is the article and template dumps; point the ForeignAPIRepos at Commons and enwiki, and you've got yourself a working mirror. No need to dump the images and reimport them somewhere. Cache the thumbnails aggressively enough and you'll be hosting the images locally, in effect.
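For reference, a minimal LocalSettings.php sketch of the setup described above; the explicit repo entry and the cache-expiry values are illustrative rather than recommended settings:

    // MediaWiki 1.16+: one-line switch to use files straight from Wikimedia Commons.
    $wgUseInstantCommons = true;

    // Roughly equivalent explicit form, which also exposes the caching knobs:
    $wgForeignFileRepos[] = array(
        'class'                  => 'ForeignAPIRepo',
        'name'                   => 'commonswiki',
        'apibase'                => 'http://commons.wikimedia.org/w/api.php',
        'fetchDescription'       => true,   // pull file description text as well
        'descriptionCacheExpiry' => 43200,  // cache description text for 12 hours
        'apiThumbCacheExpiry'    => 86400,  // cache fetched thumbnails locally for a day
    );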
-Chad
On Sat, Jan 9, 2010 at 5:37 AM, Robert Rohde rarohde@gmail.com wrote:
I know that you didn't want or use a tarball, but requests for an "image dump" are not that uncommon, and often the requester is envisioning something like a tarball. Arguably that is what the originator of this thread seems to have been asking for. I think you and I are probably mostly on the same page about the virtue of ensuring that images can be distributed and that monolithic approaches are bad.
Monolithic approaches may be bad, but they're better than nothing, which is what we have now.
Tar everything up into 250 GB tarballs and upload them to the Internet Archive. Then we're not dependent on the WMF to take the next step and convert those tarballs into something useful.
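One plausible way to produce pieces of roughly that size (the paths and chunk size are illustrative):

    tar -cf - /srv/commons-mirror | split -b 250G - commons-media.tar.part-
    # On the receiving end, recombine and unpack:
    cat commons-media.tar.part-* | tar -xf -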
On Sat, Jan 9, 2010 at 7:44 AM, Platonides Platonides@gmail.com wrote:
Anthony wrote:
The bandwidth-saving way to do things would be to just allow mirrors to use hotlinking. Requiring a middle man to temporarily store images (many, and possibly even most of which will never even be downloaded by end users) just wastes bandwidth.
There is already a way to instruct a wiki to use images from a foreign wiki as they are needed. With proper caching.
Umm, the "with proper caching" part is exactly the part I was talking about wasting bandwidth. Sending the images to a middle-man wastes bandwidth. In a "proper caching" scenario, the middle-man is in a position where the cached material passes through anyway. That saves bandwidth. But that isn't how Instant Commons works.
The original version of Instant Commons had it right. The files were sent straight from the WMF to the client. That version still worked last I checked, but my understanding is that it was deprecated in favor of the bandwidth-wasting "store files in a caching middle-man".
On 1.16 it will even be much easier, as you will only need to set
$wgUseInstantCommons = true; to use Wikimedia Commons images. http://www.mediawiki.org/wiki/Manual:$wgUseInstantCommons
That assumes you're using MediaWiki.
On Sat, Jan 9, 2010 at 8:50 AM, Anthony wikimail@inbox.org wrote:
The original version of Instant Commons had it right. The files were sent straight from the WMF to the client. That version still worked last I checked, but my understanding is that it was deprecated in favor of the bandwidth-wasting "store files in a caching middle-man".
If I were a site admin using InstantCommons, I would want to keep a copy of all the images used anyway, in case they were deleted on commons but I still wanted to use them on my wiki.
- Carl
On Sat, Jan 9, 2010 at 9:27 AM, Carl (CBM) cbm.wikipedia@gmail.com wrote:
On Sat, Jan 9, 2010 at 8:50 AM, Anthony wikimail@inbox.org wrote:
The original version of Instant Commons had it right. The files were sent straight from the WMF to the client. That version still worked last I checked, but my understanding is that it was deprecated in favor of the bandwidth-wasting "store files in a caching middle-man".
If I were a site admin using InstantCommons, I would want to keep a copy of all the images used anyway, in case they were deleted on commons but I still wanted to use them on my wiki.
- Carl
A valid suggestion, but I think it should be configurable either way. Some sites will want to use Wikimedia Commons but don't necessarily have the space to store thumbnails (much less the original sources).
However, a "copy source file too" option could be added in, for sites that would also like to fetch the original source file and import it locally. None of this is out of the realm of possibility.
The main reason we went for the "render there, show thumbnail here" idea was to increase compatibility. Not everyone has their wikis set up to render things like SVGs. By rendering remotely, you're assuming the source repo like Commons was set up to render it (a valid assumption). By importing the image locally, you're then possibly requesting remote files that you can't render.
Again, more configuration options for the different use cases are possible.
-Chad
* Gregory Maxwell gmaxwell@gmail.com [Fri, 8 Jan 2010 21:06:11 -0500]:
No one wants the monolithic tarball. The way I got updates previously was via a rsync push.
No one sane would suggest a monolithic tarball: it's too much of a pain to produce!
Image dump != monolithic tarball.
Why not extend the filerepo to make rsync or similar (maybe more efficient) incremental backups easy? An incremental, distributed filerepo. Dmitriy
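One low-tech way to get that incremental behaviour with plain rsync is hard-linked snapshots; the host, module and paths below are hypothetical:

    TODAY=$(date +%F)
    rsync -a --link-dest=/srv/media-snapshots/latest \
        rsync://dumps.example.org/commons-media/ \
        /srv/media-snapshots/"$TODAY"/               # unchanged files become hard links
    ln -sfn /srv/media-snapshots/"$TODAY" /srv/media-snapshots/latest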