Dude, I need that strong stuff you're having.
Let me sum this up. The basic optimization is this: you don't need to transfer every revision of that new article to all users at all times.
There's not much difference between transferring every revision and just some 'good' revisions.
The central server could just say: this is the last revision that has been released by the editors responsible for it, there are 100 edits in process, and you can get involved by going to this page here (hosted on a server someplace else).
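(For illustration only: a rough Python sketch of the kind of answer such a central server might give. All of the field names and the URL below are hypothetical and correspond to nothing in MediaWiki.)

```python
# Hypothetical sketch of the proposed "central server" answer for one article:
# serve only the last released revision, plus a pointer to wherever the
# in-progress edits live. None of these field names exist in MediaWiki; they
# only illustrate the idea.

from dataclasses import dataclass

@dataclass
class ReleasedView:
    title: str
    released_revision_id: int   # last revision signed off by the responsible editors
    released_text: str          # the only payload most readers ever need
    pending_edit_count: int     # e.g. "there are 100 edits in process"
    workbench_url: str          # remote server where the editing actually happens

def describe(page: ReleasedView) -> str:
    return (f"'{page.title}' rev {page.released_revision_id} is the released version; "
            f"{page.pending_edit_count} edits are in progress at {page.workbench_url}")

if __name__ == "__main__":
    page = ReleasedView(
        title="Example article",
        released_revision_id=1234,
        released_text="...released wikitext...",
        pending_edit_count=100,
        workbench_url="https://edit.example.org/Example_article",  # hypothetical host
    )
    print(describe(page))
```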
Editing is a minuscule part of our workload.
There is no need to transfer those 100 edits to all the users on the web, and they are not interesting to everyone.
Well, with flagged revisions we may not transfer them; with a pure wiki we do. The point is, someone has to transfer them.
Let's take a look at what the engine does. It allows editing of text.
That includes conflict resolution, cross-indexing, history tracking, abuse filtering, full text indexing, etc.
It renders the text.
That means building the output out of many individual assets (templates, anyone?), embedding media, transforming based on user options, etc.
It serves the text.
And not only text - it serves complex aggregate views like 'last related changes', 'watchlist', 'contributions by new users', etc.
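To make that concrete, here is a toy sketch (Python, entirely hypothetical, nothing like MediaWiki's real parser) of why "rendering the text" already means assembling many assets and applying per-user transforms:

```python
# Toy illustration of why "rendering the text" is not just echoing stored
# wikitext: the output is assembled from many assets (templates) and then
# transformed per user options. This is nothing like MediaWiki's real parser;
# it only shows the shape of the work.

import re

TEMPLATES = {  # hypothetical template store
    "Infobox": "<table class='infobox' width=300>{{Stats}}</table>",
    "Stats": "<tr><td>population: 1,000,000</td></tr>",
}

def expand(wikitext, depth=0):
    """Recursively replace {{Name}} with the template body."""
    if depth > 10:  # guard against template loops
        return wikitext
    def repl(match):
        return expand(TEMPLATES.get(match.group(1), ""), depth + 1)
    return re.sub(r"\{\{(\w+)\}\}", repl, wikitext)

def render(wikitext, user_options):
    html = expand(wikitext)
    if user_options.get("small_tables"):
        html = html.replace("width=300", "width=120")  # per-user transform
    return html

print(render("Intro text {{Infobox}} more text", {"small_tables": True}))
```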
The wiki from Ward Cunningham is a Perl script of the most basic form.
That is probably one of the reasons why we're not using Ward Cunningham's wiki anymore, and have something else, called MediaWiki.
There is not much magic involved.
Not much use at a multi-million-article wiki with hundreds of millions of revisions.
Of course you need search tools, version histories and such. There are places for optimizing all of those processes.
And we've done that with MediaWiki ;-)
It is not lunacy; it is a fact that such work can be done, and is done, without a central server in many places.
Name me a single website with a distributed-over-the-internet backend.
Just look, for example, at how people edit code in an open source software project using Git. It is distributed, and it works.
Git is limited and expensive for way too many of our operations. Also, you have to have a whole copy of the repository; it doesn't have on-demand remote pulls or any caching layer attached to that. I appreciate your willingness to clone Wikipedia.
It works if you want expensive accesses, of course. We're talking about serving a website here, not a case which is very nicely depicted at http://xkcd.com/303/
There are already wikis based on git available.
Has anyone tried putting Wikipedia content on them and simulating our workload? :) I understand that Git's semantics are usable for Wikipedia's basic revision storage, but its data would still have to be replicated to other types of storage that allow various cross-indexing and cross-reporting.
How well does Git handle parallelism internally? How can it be parallelized over multiple machines? Etc. ;-) It lacks engineering. The basic stuff is nice, but it isn't what we need.
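If anyone does want to try that simulation, a crude starting point might look like the sketch below. It assumes the git command-line tool is installed; the repository layout, file name and numbers are all made up, and it only measures one thing: random reads of old revisions, which is roughly what serving readers looks like.

```python
# Crude sketch of "simulate our workload on a git-backed page store": write N
# revisions of one page as commits, then time random reads of old revisions,
# since serving readers mostly means reading arbitrary revisions. Assumes the
# `git` CLI is installed; repository layout, file name and numbers are made up.

import random
import subprocess
import tempfile
import time
from pathlib import Path

def git(repo, *args):
    out = subprocess.run(["git", "-C", str(repo), *args],
                         check=True, capture_output=True, text=True)
    return out.stdout.strip()

def build_history(repo, revisions):
    git(repo, "init", "-q")
    git(repo, "config", "user.email", "sim@example.org")  # throwaway identity
    git(repo, "config", "user.name", "workload-sim")
    page = repo / "Article.wiki"
    hashes = []
    for i in range(revisions):
        page.write_text(f"Revision {i} of the article.\n" * 50)
        git(repo, "add", "Article.wiki")
        git(repo, "commit", "-q", "-m", f"edit {i}")
        hashes.append(git(repo, "rev-parse", "HEAD"))
    return hashes

def time_random_reads(repo, hashes, reads):
    start = time.perf_counter()
    for _ in range(reads):
        git(repo, "show", f"{random.choice(hashes)}:Article.wiki")
    return time.perf_counter() - start

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        hashes = build_history(repo, revisions=200)
        elapsed = time_random_reads(repo, hashes, reads=500)
        print(f"500 random revision reads took {elapsed:.2f}s")
```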
There are other peer-to-peer networks, such as Tor or Freenet, that would be possible to use.
How? These are just transports.
If you were to split up the editing of Wikipedia articles into a network of Git servers across the globe, the rendering and distribution of the resulting data would be the job of the WMF.
And how would that save any money? By adding much more complexity to most of our processes, while leaving the major cost item untouched?
Now, resolving conflicts is pretty simple in the case of Git: everyone has a copy and can do what they want with it. If you like the version from someone else, you pull it.
Whose revision does Wikimedia merge?
In terms of Wikipedia having only one viewpoint, the NPOV reflected by the current revision at any one point in time, that version would be the one pushed from its editors' repositories. It is imaginable that you would have one senior editor for each topic, with their own repository of pages, who pulls in versions from many people.
Go to Citizendium, k, thx.
Please, let's be serious here! I am talking about the fact that not all people need all the centralised services at all times.
You have an absolute misunderstanding of what our technology platform is doing. You're wasting your time, you're wasting my time, and you're wasting the time of everyone who has to read your or my emails.
A tracker to manage which server is used by which group of editors can be pretty efficient. Essentially it is a form of DNS: a tracker need only show you the current repositories that are registered for a certain topic.
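(A minimal sketch of what such a tracker could be. The class, topics and URLs below are hypothetical; the whole thing is nothing more than a topic-to-repository lookup.)

```python
# Minimal sketch of the proposed tracker: nothing but a lookup from topic to
# the repositories currently registered for it, much like DNS maps names to
# hosts. The class, topics and URLs are all hypothetical.

from collections import defaultdict

class TopicTracker:
    def __init__(self):
        self._registry = defaultdict(set)   # topic -> set of repository URLs

    def register(self, topic, repo_url):
        """A repository announces that it hosts editing for a topic."""
        self._registry[topic].add(repo_url)

    def unregister(self, topic, repo_url):
        self._registry[topic].discard(repo_url)

    def lookup(self, topic):
        """Editors ask: where is topic X being worked on right now?"""
        return sorted(self._registry[topic])

tracker = TopicTracker()
tracker.register("Physics", "git://physics.example.org/wiki.git")
tracker.register("Physics", "git://mirror.example.net/physics.git")
print(tracker.lookup("Physics"))
```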
Seriously, need that stuff you're on. Have you ever been involved in building anything remotely similar?
The entire community does not get involved in all the conflicts. There are only a certain number of people who are deeply involved in any one section of Wikipedia at any given time.
Have you ever edited Wikipedia? :) Do you understand the editorial process there?
Imagine that you had, let's say, 1000 conference rooms for discussion and working together, spread around the world, and the results of those rooms were fed back into Wikipedia. These rooms, or servers, would be for processing the edits and conflicts of any given set of pages.
How is that more efficient?
My idea is that you don't need a huge server to resolve conflicts. Many pages don't have many conflicts; there are certain areas which need constant arbitration, of course. You could even split the groups up by viewpoint, so that the arbitration team only deals with the output of two teams (pro and contra).
NEED YOUR STUFFFFFF.
In retrospect you would be able to identify which groups of editors are collaborating (enhancing each other) and which are conflicting (overwriting each other). If you split them up into different rooms when they should be collaborating, and reduce the conflicts, then you will win a lot.
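(One rough way such a retrospective could be computed. The heuristic below, treating an edit that restores an earlier text as a revert, is only a crude proxy for conflict and is purely illustrative.)

```python
# Rough sketch of the "retrospective" idea: treat an edit that restores an
# earlier text as a revert, and count how often each pair of editors reverts
# one another. Pairs with high counts are "conflicting". Heavily simplified
# and hypothetical; real histories need partial reverts, vandalism, bots, etc.

from collections import Counter

def conflict_pairs(history):
    """history: list of (editor, full_text) tuples in chronological order."""
    reverts = Counter()
    first_seen = {}  # text -> index of the revision that introduced it
    for i, (editor, text) in enumerate(history):
        if text in first_seen and i - first_seen[text] >= 2:
            overwritten = history[i - 1][0]   # the editor whose work was undone
            if overwritten != editor:
                reverts[frozenset((editor, overwritten))] += 1
        first_seen.setdefault(text, i)
    return reverts

history = [
    ("alice", "v1"),
    ("bob",   "v2"),
    ("alice", "v1"),   # alice reverts bob -> a conflict signal
    ("carol", "v3"),
    ("alice", "v4"),   # carol and alice build on each other -> no signal
]
print(conflict_pairs(history))   # Counter({frozenset({'bob', 'alice'}): 1})
```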
You'll get the Nobel Prize in Literature if you continue like this! Infinite monkeys, when managed properly... ;-)
Even on the German Wikipedia, most edits do not show up immediately. They have someone check the commits. That would also mean that those edits, before they are committed, do not need to go to a single data center.
Again, you don't win efficiency. You win 'something', like bragging rights in your local p2p-wanking-circle. This part of the editorial process is minuscule in terms of workload.
You should be able to just pull the versions you want, at the depth that you want. That selection of versions and depth would be a large optimization in itself.
Except that that is not where our cost is.
So there are different ways to reduce the load on a single server and create pockets of processing for different topics. The only really important thing is that people who are working on the same topic are working on the same server or have a path of communication.
YOU SHOULD MENTION JABBER!!!111oneoneeleven
To sum it up: if conflicts are the major problem in Wikipedia, the major cost in terms of review and coordination, then you should rethink the workflow to push the processing time back to the editor causing the conflict.
Semi-atomic resolution of conflicts is what allows fast collaboration to happen. You fail to understand that.
Right now the revisions are stored in whole, not in part. If you only store the new information, then you need less storage. It would be one big optimization for Wikipedia to transfer only the changes across the net, and not full revisions.
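(A small sketch of that delta idea: compute a compact delta with difflib and rebuild the new revision from the old text plus that delta. Purely illustrative; it says nothing about what MediaWiki's storage layer actually does.)

```python
# Illustrative sketch of "store/transfer only the changes": compute a compact
# delta between two revisions with difflib and rebuild the new revision from
# the old text plus that delta. This says nothing about what MediaWiki's
# storage layer actually does.

import difflib

def make_delta(old, new):
    """Return only the pieces needed to rebuild `new` from `old`."""
    delta = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, old, new).get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))        # reference into the old text
        else:
            delta.append(("insert", new[j1:j2]))  # ship only the new bytes
    return delta

def apply_delta(old, delta):
    parts = []
    for op in delta:
        parts.append(old[op[1]:op[2]] if op[0] == "copy" else op[1])
    return "".join(parts)

old_rev = "The quick brown fox jumps over the lazy dog."
new_rev = "The quick brown fox jumped over the very lazy dog."
delta = make_delta(old_rev, new_rev)
assert apply_delta(old_rev, delta) == new_rev
print(delta)  # much smaller than resending a whole article for large pages
```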
??????
OK, well I think this is enough for now. I do ask you to remain serious, and we can have a serious discussion on the topic of optimisation.
I am serious. You fail at everything.
You fail to understand online operation implications (privacy, security, etc.).
You fail to understand our content.
You fail to understand our costs.
You fail to understand our archival and cross-indexing needs.
You fail to understand our editorial process efficiency.
You fail to understand that distribution increases overall costs.
You fail to understand pretty much everything.
I admire your enthusiasm for 'scaling a basic wiki'. We're not running a basic wiki; we're way beyond that. I have no idea how I can have a serious discussion with someone who is so out of touch with reality. You suggest a high-complexity engineering project that would bring nearly no wins over anything. At this point you should erase your email client; that would be much more efficient.
I deliberately keep this topic on foundation-l, because I'm sure it is not worth the time of people on wikitech-l@ ;-)
Domas