(subject was: Cyn Skyberg joins Wikimedia as CTCO!) On Thu, Sep 16, 2010 at 4:03 AM, Bod Notbod bodnotbod@gmail.com wrote:
On Tue, Sep 14, 2010 at 12:06 PM, Liam Wyatt liamwyatt@gmail.com wrote:
I've always thought that if for some reason all of the Wikimedia projects suddenly disappeared (and no one had any backups) we would be upset about it for a couple of days but then we would just start again ...
O RLY!?
This was my first thought as well. But as Liam said after that...
... and we would do it better!
If this scenario ever did happen, I expect that the first step would be to ensure it never happened again.
We, the people, should start planning for this worst-case scenario now.
A mirror system would be great; that is how Project Gutenberg, Linux, and SourceForge do disaster recovery. It is simple, cheap, and effective.
http://sourceforge.net/apps/trac/sourceforge/wiki/Mirrors http://www.gutenberg.org/MIRRORS.ALL
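To make the comparison with Gutenberg/SourceForge concrete, here is a minimal Python sketch of the sync step an rsync-style dump mirror performs. The file names and the {name: size} listing format are made-up illustrations, not the real dump inventory:

```python
# Hypothetical sketch: compute a sync plan for a dump mirror, the way
# rsync-style mirroring works for Gutenberg/SourceForge mirrors.

def sync_plan(remote, local):
    """Given {filename: size} listings for the master and a mirror,
    return (to_fetch, to_delete) so the mirror matches the master."""
    to_fetch = sorted(f for f, size in remote.items()
                      if local.get(f) != size)             # new or changed
    to_delete = sorted(f for f in local if f not in remote)  # gone upstream
    return to_fetch, to_delete

remote = {"enwiki-pages-meta-history.xml.7z": 31_000_000_000,
          "dewiki-pages-meta-history.xml.7z": 8_000_000_000}
local  = {"enwiki-pages-meta-history.xml.7z": 30_000_000_000,  # stale copy
          "old-frwiki.xml.7z": 5_000_000_000}                  # removed upstream
print(sync_plan(remote, local))
```

A real mirror would of course compare checksums rather than sizes, but the shape of the job is exactly this: fetch what changed, drop what disappeared.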
English, French, German, Italian, Polish, Portuguese, Swedish and Chinese Wikipedia all appear to have some mirrors, but are any of them reliable enough to be used for disaster recovery?
http://en.wikipedia.org/wiki/Wikipedia:MIRROR http://de.wikipedia.org/wiki/Wikipedia:Weiternutzung http://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Sites_miroirs http://it.wikipedia.org/wiki/Wikipedia:Cloni
It would be nice to have an agreement with these mirrors that they will make the most recent dump available if WMF is unable to provide it.
I don't see any mirrors listed on the Spanish page about mirrors.
Are there mirrors of other Wikimedia projects?
The smaller projects are easier to back up precisely because they are small. I am sure that with a little effort and coordination, chapters, universities and similar organisations would be willing to routinely back up a subset of projects, and combined we would have multiple current backups of all projects.
-- John Vandenberg
Only wimps use tape backup: real men just upload their important stuff on ftp, and let the rest of the world mirror it ;) - Torvalds, Linus (1996-07-20).
On Thu, Sep 16, 2010 at 11:00:38AM +1000, John Vandenberg wrote:
A mirror system would be great; that is how Project Gutenberg, Linux, and SourceForge do disaster recovery. It is simple, cheap, and effective.
What about history data and other logs? I'm afraid that mirrors may not be the best way to copy all of them.
On Thu, Sep 16, 2010 at 11:36 AM, Osama Khalid osamak@gnu.org wrote:
On Thu, Sep 16, 2010 at 11:00:38AM +1000, John Vandenberg wrote:
A mirror system would be great; that is how Project Gutenberg, Linux, and SourceForge do disaster recovery. It is simple, cheap, and effective.
What about history data and other logs? I'm afraid that mirrors may not be the best way to copy all of them.
We would need to mirror the database dumps available from
http://dumps.wikimedia.org/backup-index.html
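The freshness question ("is the latest dump recent enough to recover from?") is easy to mechanise. A hedged sketch follows; the wiki names, dates, and the 30-day threshold are all assumptions, and a real tool would fetch and parse the index rather than hard-code it:

```python
from datetime import date

# Illustrative sketch: flag which dumps listed on dumps.wikimedia.org are
# too stale to serve as a disaster-recovery copy.

def stale_dumps(last_dump, today, max_age_days=30):
    """Return wikis whose most recent complete dump is older than max_age_days."""
    return sorted(wiki for wiki, d in last_dump.items()
                  if (today - d).days > max_age_days)

last_dump = {"enwiki": date(2010, 7, 1),   # months old
             "dewiki": date(2010, 9, 10),  # fresh
             "frwiki": date(2010, 8, 1)}
print(stale_dumps(last_dump, today=date(2010, 9, 16)))
```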
The toolserver (run by the German chapter) has a 'live' copy of the databases, so that would likely be the first recovery option.
http://meta.wikimedia.org/wiki/Toolserver
Is the toolserver database missing any data which would be needed to reconstruct a wikimedia project, other than the images?
The main potential problem with a replicated copy, like the toolserver, is that it can replicate the problem that occurred on the primary database.
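The standard guard against that failure mode is to keep dated point-in-time snapshots alongside the live replica, so a corruption that replicates instantly can still be rolled back. A toy sketch of a grandfather-father-son retention rule (the tier lengths here are arbitrary assumptions, not anyone's actual policy):

```python
from datetime import date, timedelta

# Minimal sketch: which dated snapshots to keep -- daily for a week,
# weekly (Sundays) for a month, monthly (the 1st) beyond that.

def keep_snapshot(snap_date, today):
    age = (today - snap_date).days
    if age <= 7:
        return True                      # daily tier
    if age <= 35:
        return snap_date.weekday() == 6  # weekly tier: Sundays only
    return snap_date.day == 1            # monthly tier: the 1st only

today = date(2010, 9, 16)
kept = [today - timedelta(days=n) for n in range(0, 60)
        if keep_snapshot(today - timedelta(days=n), today)]
print(len(kept))
```

The point is that a bad write replicated to the toolserver destroys both live copies, but cannot reach back into last week's snapshot.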
-- John Vandenberg
On Thu, Sep 16, 2010 at 11:52:25AM +1000, John Vandenberg wrote:
We would need to mirror the database dumps available from
The main issue is that they are not being updated as frequently as they should be.
The toolserver (run by the German chapter) has a 'live' copy of the databases, so that would likely be the first recovery option.
That's interesting, because we also need a 'private backup' of all hidden data, including deleted revisions. Storing them at chapters' facilities seems to be a good idea.
John Vandenberg, 16/09/2010 03:00:
English, French, German, Italian, Polish, Portuguese, Swedish and Chinese Wikipedia all appear to have some mirrors, but are any of them reliable enough to be used for disaster recovery?
Obviously not, at least not the Italian ones.
The smaller projects are easier to back up precisely because they are small. I am sure that with a little effort and coordination, chapters, universities and similar organisations would be willing to routinely back up a subset of projects, and combined we would have multiple current backups of all projects.
I agree. Right now we have only this: http://www.balkaninsight.com/en/main/news/21606/ How many TB are needed? I don't know what the average is, but e.g. right now my university should have about 50 TB of free disk space (which is not so much, after all).
Nemo
I suggested a similar idea in another thread on this mailing list. Seriously, I don't know why, after 10 years (since Wikipedia's creation), we haven't adopted a mirror system like the one used for GNU/Linux ISOs.
Some weeks ago, I wrote a script (I can share it with interested people) to download every 7z pages-meta-history file from download.wikimedia.org, and it took only about 100 GB. In only 100 GB, we have the whole text and history of all languages of all projects. Very cheap. In the future, we can talk about image backups, etc., but I don't really understand why Wiki[mp]edia texts are not replicated to *every* country in the world, to avoid being vulnerable to disasters, human errors, censorship, etc.
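For readers curious what such a script involves, here is a hypothetical sketch of the URL-building part. Emijrp's actual script may differ; the URL pattern matches how dumps were served circa 2010, and the wiki list and dump dates are illustrative:

```python
# Hypothetical sketch of a bulk pages-meta-history downloader's core:
# construct the per-wiki dump URLs, which a fetch loop would then retrieve.

BASE = "http://download.wikimedia.org"

def dump_url(wiki, dumpdate):
    """Build the URL of one wiki's full-history 7z dump."""
    return (f"{BASE}/{wiki}/{dumpdate}/"
            f"{wiki}-{dumpdate}-pages-meta-history.xml.7z")

wikis = {"enwiki": "20100904", "eswiki": "20100901"}  # example dump dates
urls = [dump_url(w, d) for w, d in sorted(wikis.items())]
print(urls[0])
```

The actual fetching is then just a loop over `urls` with any HTTP client plus resume support; the ~100 GB figure is for the compressed 7z files.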
2010/9/16 Federico Leva (Nemo) nemowiki@gmail.com
John Vandenberg, 16/09/2010 03:00:
English, French, German, Italian, Polish, Portuguese, Swedish and Chinese Wikipedia all appear to have some mirrors, but are any of them reliable enough to be used for disaster recovery?
Obviously not, at least not the Italian ones.
The smaller projects are easier to back up precisely because they are small. I am sure that with a little effort and coordination, chapters, universities and similar organisations would be willing to routinely back up a subset of projects, and combined we would have multiple current backups of all projects.
I agree. Right now we have only this: http://www.balkaninsight.com/en/main/news/21606/ How many TB are needed? I don't know what the average is, but e.g. right now my university should have about 50 TB of free disk space (which is not so much, after all).
Nemo
foundation-l mailing list foundation-l@lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
On Thu, Sep 16, 2010 at 5:40 PM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
John Vandenberg, 16/09/2010 03:00:
English, French, German, Italian, Polish, Portuguese, Swedish and Chinese Wikipedia all appear to have some mirrors, but are any of them reliable enough to be used for disaster recovery?
Obviously not, at least not the Italian ones.
The smaller projects are easier to back up precisely because they are small. I am sure that with a little effort and coordination, chapters, universities and similar organisations would be willing to routinely back up a subset of projects, and combined we would have multiple current backups of all projects.
I agree. Now we have only this: http://www.balkaninsight.com/en/main/news/21606/
Kudos to Milos & Wikimedia Serbia!!
How many TB are needed? I don't know what the average is, but e.g. right now my university should have about 50 TB of free disk space (which is not so much, after all).
The key would be to allow the mirrors to delete their mirror when they need to use their excess storage capability. If they let us know in advance that they are reclaiming the space, another organisation with excess storage capability can take over.
-- John Vandenberg
On Thu, Sep 16, 2010 at 12:58 AM, John Vandenberg jayvdb@gmail.com wrote:
On Thu, Sep 16, 2010 at 5:40 PM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
John Vandenberg, 16/09/2010 03:00:
English, French, German, Italian, Polish, Portuguese, Swedish and Chinese Wikipedia all appear to have some mirrors, but are any of them reliable enough to be used for disaster recovery?
Obviously not, at least not the Italian ones.
The smaller projects are easier to back up precisely because they are small. I am sure that with a little effort and coordination, chapters, universities and similar organisations would be willing to routinely back up a subset of projects, and combined we would have multiple current backups of all projects.
I agree. Now we have only this: http://www.balkaninsight.com/en/main/news/21606/
Kudos to Milos & Wikimedia Serbia!!
How many TB are needed? I don't know what the average is, but e.g. right now my university should have about 50 TB of free disk space (which is not so much, after all).
The key would be to allow the mirrors to delete their mirror when they need to use their excess storage capability. If they let us know in advance that they are reclaiming the space, another organisation with excess storage capability can take over.
I appreciate all the enthusiasm in thread, but (speaking for myself as an individual, and IT consultant who does things like business continuity and disaster recovery planning consulting among other infrastructure work) this is a core operational competency role that the Foundation needs to ensure is handled in house as part of the routine IT operations. And, as I understand it now, it is, though I have only had high level discussions with some of the Foundation staff about this and not seen the server configs myself so I can't personally attest to the status.
Database and file backups need to be in (at least) 2 locations, and my understanding is that there are complete redundant copies at the Amsterdam datacenter now, and that the new main datacenter in Virginia will continue this.
If a third location is needed, the current HQ in San Francisco is plenty far enough away from the other 2 locations to provide excellent DR capability. If there's need for a datacenter / fast net access redundant copy in SF or the Bay Area, a rack or few U of a shared rack would be enough for a fileserver, and that's available at multiple excellently connected locations in the Bay Area.
Disaster Recovery is not something the Foundation should attempt to crowdsource. I recommend it be left to professionals whose job it is and who have prior experience in the field. If you haven't watched major services drop, datacenters burn down, software environments melt down, and spent years working to ensure that those don't happen again, you really don't have a good feel for the type and magnitude of the risks and the sorts of tools to employ to try and mitigate them.
If there's interest in an offline discussion on IT disasters and disaster recovery and reliability engineering, I can do that, but it should be offline from Foundation-L...
On Sep 16, 2010, at 4:16 AM, George Herbert george.herbert@gmail.com wrote:
On Thu, Sep 16, 2010 at 12:58 AM, John Vandenberg jayvdb@gmail.com wrote:
On Thu, Sep 16, 2010 at 5:40 PM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
John Vandenberg, 16/09/2010 03:00:
English, French, German, Italian, Polish, Portuguese, Swedish and Chinese Wikipedia all appear to have some mirrors, but are any of them reliable enough to be used for disaster recovery?
Obviously not, at least not the Italian ones.
The smaller projects are easier to back up precisely because they are small. I am sure that with a little effort and coordination, chapters, universities and similar organisations would be willing to routinely back up a subset of projects, and combined we would have multiple current backups of all projects.
I agree. Now we have only this: http://www.balkaninsight.com/en/main/news/21606/
Kudos to Milos & Wikimedia Serbia!!
How many TB are needed? I don't know what the average is, but e.g. right now my university should have about 50 TB of free disk space (which is not so much, after all).
The key would be to allow the mirrors to delete their mirror when they need to use their excess storage capability. If they let us know in advance that they are reclaiming the space, another organisation with excess storage capability can take over.
I appreciate all the enthusiasm in thread, but (speaking for myself as an individual, and IT consultant who does things like business continuity and disaster recovery planning consulting among other infrastructure work) this is a core operational competency role that the Foundation needs to ensure is handled in house as part of the routine IT operations. And, as I understand it now, it is, though I have only had high level discussions with some of the Foundation staff about this and not seen the server configs myself so I can't personally attest to the status.
Database and file backups need to be in (at least) 2 locations, and my understanding is that there are complete redundant copies at the Amsterdam datacenter now, and that the new main datacenter in Virginia will continue this.
If a third location is needed, the current HQ in San Francisco is plenty far enough away from the other 2 locations to provide excellent DR capability. If there's need for a datacenter / fast net access redundant copy in SF or the Bay Area, a rack or few U of a shared rack would be enough for a fileserver, and that's available at multiple excellently connected locations in the Bay Area.
Having multiple backups (w/ private user data, deleted content data tables) within WMF at various data centers is no doubt extremely crucial, & depending on third parties would be a terrible mistake.
But up-to-date distributed copies (sans private data, but w/ full history & images) outside WMF are also very important. Why can't we do both? I highly doubt anything bad will happen to WMF, but despite best intentions & efforts, you never know (zombies take over? rogue sysadmin?). Distributed backups beyond WMF help ensure Wikipedia goes on w/o reliance on WMF.
Disaster Recovery is not something the Foundation should attempt to crowdsource. I recommend it be left to professionals whose job it is and who have prior experience in the field. If you haven't watched major services drop, datacenters burn down, software environments melt down, and spent years working to ensure that those don't happen again, you really don't have a good feel for the type and magnitude of the risks and the sorts of tools to employ to try and mitigate them.
Surely there are third parties with such experience and an interest in this. The Internet Archive? The Bibliotheca Alexandrina? The Library of Congress? Surely Google has, or should have, a copy? What about a public dataset on Amazon's cloud services (I thought there was something)? Universities are also good, some with super data centers (e.g. San Diego State University), etc.
If there's interest in an offline discussion on IT disasters and disaster recovery and reliability engineering, I can do that, but it should be offline from Foundation-L...
Maybe not foundation-l :) but I am cool with some degree of transparency & open discussion on a list or some communications channel dedicated to the topic.
I'm not involved in creating dumps, but couldn't it be possible to offer daily or weekly diffs of enwiki and other wikis, along with utilities to apply those diffs to the last full dump? Having regular dumps + regular diffs (weekly, daily, and even minutely) + Swiss-army-knife utilities for handling diffs and dumps is something OpenStreetMap has managed to excel at, and it makes me very happy :) to know people have up-to-date copies distributed in various places. I feel sad to know this is not the case with Wikipedia :(
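The dump-plus-diffs model can be sketched in a few lines. This toy version treats a dump as a {title: text} dict and a diff as a list of change records; the record format is an invented illustration, not OpenStreetMap's or MediaWiki's actual format:

```python
# Toy sketch of the "full dump + incremental diffs" replication model:
# a mirror downloads one full dump, then stays current by applying
# small periodic diffs instead of re-downloading everything.

def apply_diff(pages, diff):
    """Apply one incremental diff (a list of change records) to a dump."""
    for op, title, text in diff:
        if op == "edit":            # new page, or changed page text
            pages[title] = text
        elif op == "delete":        # page removed since the last diff
            pages.pop(title, None)
    return pages

pages = {"Foo": "old text", "Bar": "bar text"}
weekly = [("edit", "Foo", "new text"),
          ("delete", "Bar", None),
          ("edit", "Baz", "baz text")]
print(apply_diff(pages, weekly))
```

The attraction for mirrors is bandwidth: a minutely or daily diff is tiny compared to a multi-gigabyte full dump, so copies can stay current cheaply.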
@aude
-- -george william herbert george.herbert@gmail.com
On Thu, Sep 16, 2010 at 11:14 AM, Aude aude.wiki@gmail.com wrote:
Surely there are third parties with such experience and interested in this. [...] Surely google has or should have copy?
It would be interesting to know what Google has. I recently began a new article and was stunned to see that, within the day, Google had indexed it, given it a high ranking, and (IIRC) cached it.
I'm not technical, so I speak from ignorance, but I imagine they wouldn't have article histories.
The notion that Wikipedia was currently vulnerable to data loss had honestly never occurred to me; I thought that the reference sites that use our content meant that back-ups are ubiquitous. You've all given me the fear.
John Vandenberg wrote:
The key would be to allow the mirrors to delete their mirror when they need to use their excess storage capability. If they let us know in advance that they are reclaiming the space, another organisation with excess storage capability can take over.
Surely I don't need to be the one to point out that another huge issue with mirrors is that they often replicate bad information ("John Doe is a rapist", etc.). The mirrors you all are talking about sound like they'd update fairly regularly. Some of the current (unofficial) mirrors, however, have a horrible tendency to import once and then linger forever.
MZMcBride
On Thu, Sep 16, 2010 at 6:16 PM, George Herbert george.herbert@gmail.com wrote:
Disaster Recovery is not something the Foundation should attempt to crowdsource.
IIRC, it was Greg Maxwell who had (some of?) the images that the Foundation lost when a bug was rolled into production.
It is lovely that the Foundation is improving its disaster preparedness; however, the community should not depend on the Foundation for this. For all we know, it could be the Foundation itself that becomes the disaster we never planned for.
I recommend it be left to professionals whose job it is and who have prior experience in the field. If you ...
That's you offering to help the community, yeah? ;-)
On Thu, Sep 16, 2010 at 6:17 PM, MZMcBride z@mzmcbride.com wrote:
John Vandenberg wrote:
The key would be to allow the mirrors to delete their mirror when they need to use their excess storage capability. If they let us know in advance that they are reclaiming the space, another organisation with excess storage capability can take over.
Surely I don't need to be the one to point out that another huge issue with mirrors is that they often replicate bad information ("John Doe is a rapist", etc.). The mirrors you all are talking about sound like they'd update fairly regularly. Some of the current (unofficial) mirrors, however, have a horrible tendency to import once and then linger forever.
I'm not so interested in these mirrors putting the data into a database and publishing it onto the web. At present I would be more interested in the dumps (.7z) being systematically mirrored, with a commitment to make them available if the shit ever hit the fan.
-- John Vandenberg
I want to paste a paragraph by Richard Stallman from his *The Free Universal Encyclopedia and Learning Resource*[1], for curious people and to add more useful ideas to this thread. I want you to see this 'back up everything!' movement as simply a wish to protect the huge wiki treasure that we have been writing since 2001.
*Permit mirror sites.*
When information is available on the web only at one site, its availability is vulnerable. A local problem—a computer crash, an earthquake or flood, a budget cut, a change in policy of the school administration—could cut off access for everyone forever. To guard against loss of the encyclopedia's material, we should make sure that every piece of the encyclopedia is available from many sites on the Internet, and that new copies can be put up if some disappear.
There is no need to set up an organization or a bureaucracy to do this, because Internet users like to set up “mirror sites” which hold duplicate copies of interesting web pages. What we must do in advance is ensure that this is legally permitted.
Therefore, each encyclopedia article and each course should explicitly grant irrevocable permission for anyone to make verbatim copies available on mirror sites. This permission should be one of the basic stated principles of the free encyclopedia.
Some day there may be systematic efforts to ensure that each article and course is replicated in many copies—perhaps at least once on each of the six inhabited continents. This would be a natural extension of the mission of archiving that libraries undertake today. But it would be premature to make formal plans for this now. It is sufficient for now to resolve to make sure people have permission to do this mirroring when they get around to it.
Regards, emijrp
[1] http://www.gnu.org/encyclopedia/free-encyclopedia.html
2010/9/16 John Vandenberg jayvdb@gmail.com
On Thu, Sep 16, 2010 at 6:16 PM, George Herbert george.herbert@gmail.com wrote:
Disaster Recovery is not something the Foundation should attempt to crowdsource.
IIRC, it was Greg Maxwell who had (some of?) the images that the Foundation lost when a bug was rolled into production.
It is lovely that the Foundation is improving its disaster preparedness; however, the community should not depend on the Foundation for this. For all we know, it could be the Foundation itself that becomes the disaster we never planned for.
I recommend it be left to professionals whose job it is and who have prior experience in the field. If you ...
That's you offering to help the community, yeah? ;-)
On Thu, Sep 16, 2010 at 6:17 PM, MZMcBride z@mzmcbride.com wrote:
John Vandenberg wrote:
The key would be to allow the mirrors to delete their mirror when they need to use their excess storage capability. If they let us know in advance that they are reclaiming the space, another organisation with excess storage capability can take over.
Surely I don't need to be the one to point out that another huge issue with mirrors is that they often replicate bad information ("John Doe is a rapist", etc.). The mirrors you all are talking about sound like they'd update fairly regularly. Some of the current (unofficial) mirrors, however, have a horrible tendency to import once and then linger forever.
I'm not so interested in these mirrors putting the data into a database and publishing it onto the web. At present I would be more interested in the dumps (.7z) being systematically mirrored, with a commitment to make them available if the shit ever hit the fan.
-- John Vandenberg
MZMcBride wrote:
John Vandenberg wrote:
The key would be to allow the mirrors to delete their mirror when they need to use their excess storage capability. If they let us know in advance that they are reclaiming the space, another organisation with excess storage capability can take over.
Surely I don't need to be the one to point out that another huge issue with mirrors is that they often replicate bad information ("John Doe is a rapist", etc.). The mirrors you all are talking about sound like they'd update fairly regularly. Some of the current (unofficial) mirrors, however, have a horrible tendency to import once and then linger forever.
MZMcBride
Unless they are live mirrors, which will go down when they can't connect to Wikipedia to scrape their data on the fly (so they aren't really mirroring anything).
John wrote:
IIRC, it was Greg Maxwell who had (some of?) the images that the Foundation lost when a bug was rolled into production.
Yes. He has a partial copy of the images.
George wrote:
If there's interest in an offline discussion on IT disasters and disaster recovery and reliability engineering, I can do that, but it should be offline from Foundation-L...
This thread should move to wikitech-l or xmldatadumps-l
2010/9/17 Platonides Platonides@gmail.com
MZMcBride wrote:
John Vandenberg wrote:
The key would be to allow the mirrors to delete their mirror when they need to use their excess storage capability. If they let us know in advance that they are reclaiming the space, another organisation with excess storage capability can take over.
Surely I don't need to be the one to point out that another huge issue with mirrors is that they often replicate bad information ("John Doe is a rapist", etc.). The mirrors you all are talking about sound like they'd update fairly regularly. Some of the current (unofficial) mirrors, however, have a horrible tendency to import once and then linger forever.
MZMcBride
Unless they are live mirrors, which will go down when they can't connect to Wikipedia to scrape their data on the fly (so they aren't really mirroring anything).
John wrote:
IIRC, it was Greg Maxwell who had (some of?) the images that the Foundation lost when a bug was rolled into production.
Yes. He has a partial copy of the images.
George wrote:
If there's interest in an offline discussion on IT disasters and disaster recovery and reliability engineering, I can do that, but it should be offline from Foundation-L...
This thread should move to wikitech-l or xmldatadumps-l
Those mailing lists already have plenty of threads about text/image/mirror dumps, without great advances.
I think this mailing list is perfect for this topic. It is not only a tech topic; it is a matter of public awareness.