Let me sum this up. The basic optimization is this: you don't need to transfer every new revision of an article to all users at all times. The central server could just say: this is the last revision released by the editors responsible for it, there are 100 edits in progress, and you can get involved by going to this page here (hosted on a server somewhere else). There is no need to transfer those 100 in-progress edits to every user on the web; they are not interesting to everyone.
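To make that concrete, the per-article record the central server hands out could be as small as something like this (a rough Python sketch; all the field names, numbers and URLs are made up):

    # Rough sketch: what the central server could serve per article instead of
    # pushing every in-progress edit to everyone. All names here are invented.
    released_record = {
        "title": "Example article",
        "released_revision": 123456,   # last revision signed off by its editors
        "edits_in_progress": 100,      # a count only, not the edits themselves
        "workgroup_url": "git://workgroup.example.org/example-article.git",
    }

    def describe(record):
        """Summarize where the current work on an article is happening."""
        return ("Latest released revision: %(released_revision)s. "
                "%(edits_in_progress)s edits in progress at %(workgroup_url)s."
                % record)

    print(describe(released_record))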
On Sun, Dec 13, 2009 at 12:10 PM, Domas Mituzas midom.lists@gmail.com wrote:
>> The other question is: does it make sense to have such a centralized client-server architecture? We have been talking about using a distributed VCS for MediaWiki.
> Lunatics without any idea of stuff being done inside the engine talk about distribution. Let them!
I hope you are serious here. Let's take a look at what the engine does: it allows editing of text, it renders the text, and it serves the text. The original wiki from Ward Cunningham is a Perl script of the most basic form. There is not much magic involved. Of course you need search tools, version histories and such, and there are places for optimizing all of those processes.
It is not lunacy; it is a fact that such work can be done, and is done, without a central server in many places.
Just look, for example, at how people edit code in an open-source software project using git. It is distributed, and it works.
There are already git-based wikis available. There are also peer-to-peer networks such as Tor or Freenet that would be possible to use.
You could split up the editing of Wikipedia articles across a network of git servers around the globe, while the rendering and distribution of the resulting data would remain the job of the WMF.
Now, the issue of resolving conflicts is pretty simple in the case of git: everyone has a copy and can do what they want with it. If you like the version from someone else, you pull it.
In terms of Wikipedia having only one viewpoint, the NPOV reflected by the current revision at any point in time, that version would be the one pushed from its editors' repositories. It is imaginable that you would have one senior editor for each topic, with their own repository of pages, who pulls in versions from many people.
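Just to illustrate the pull model I mean, here is a rough sketch of what a topic's senior editor might run inside their own repository; the contributor URLs, remote names and branch name are all invented:

    # Rough sketch of the pull model: a topic's senior editor fetches from the
    # contributors they trust and merges the version they want to release.
    # Run inside the editor's own repository; all names and URLs are invented.
    import subprocess

    CONTRIBUTORS = {
        "alice": "git://example.org/alice/physics-articles.git",
        "bob": "git://example.org/bob/physics-articles.git",
    }

    def git(*args):
        subprocess.run(["git"] + list(args), check=True)

    for name, url in CONTRIBUTORS.items():
        git("remote", "add", name, url)   # errors if the remote already exists
        git("fetch", name)

    # After reviewing what came in, merge the version the editor accepts:
    git("merge", "alice/expand-relativity-section")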
>> Now, back to the optimization. Let's say you were able to optimize the program. We would identify the major CPU burners and optimize them out. That does not solve the problem, because I would think the PHP program is only a small part of the entire issue. The fact that the data is flowing in a wasteful way is the cause of the waste, not the program itself. Even if the program were much more efficient at moving around data that is not needed, the data is still not needed.
> We can have new kind of Wikipedia. The one where we serve blank pages, and people imagine content in it. We've done that with moderate success quite often.
Please, let's be serious here! I am talking about the fact that not all people need all the centralised services at all times.
So if you have 10 people collaborating on a topic, only the results of that work would be checked into the central server. The decentralized communication would be between fewer parties and would reduce the resources used.
> Except that you still need tracker to handle all that, and resolve conflicts, as still, there're no good methods of resolving conflicts with small number of untrusted entities.
A tracker to manage which server is used by which group of editors can be pretty efficient. Essentially it is a form of DNS: the tracker need only show you the repositories currently registered for a certain topic.
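The tracker really can be that thin; here is a rough sketch (topics and URLs made up):

    # Minimal sketch of the "DNS-like" tracker: it only maps a topic to the
    # repositories currently registered for it. Topics and URLs are invented.
    TRACKER = {}

    def register(topic, repo_url):
        TRACKER.setdefault(topic, set()).add(repo_url)

    def lookup(topic):
        return sorted(TRACKER.get(topic, set()))

    register("Quantum mechanics", "git://room42.example.org/quantum.git")
    register("Quantum mechanics", "git://room17.example.net/quantum.git")
    print(lookup("Quantum mechanics"))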
Resolving conflicts is important, but you only need so many people for that.
The entire community does not get involved in every conflict. There are only a certain number of people deeply involved in any one section of Wikipedia at any given time.
Imagine that you had, let's say, 1,000 conference rooms available for discussion and collaboration, spread around the world, and the results of those rooms were fed back into Wikipedia. These rooms, or servers, would handle the edits and conflicts for any given set of pages.
My idea is that you don't need a huge server to resolve conflicts. Many pages don't have many conflicts; there are certain areas which need constant arbitration, of course. You could even split the groups up by viewpoint, so that the arbitration team only deals with the output of two teams (pro and contra).
Even if you look at the number of editors on a highly contested page, it is not unlimited.
In retrospect, you would be able to identify which groups of editors are collaborating (enhancing each other's work) and which are conflicting (overwriting each other). If you split them up into different rooms so that those who should be collaborating can do so, and reduce the conflicts, then you win a lot.
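That grouping could be computed from the existing page histories fairly cheaply. A rough sketch of the idea, with an invented history format:

    # Rough sketch: from a page's revision history, count how often one editor's
    # change is immediately overwritten by another. Pairs with high counts are
    # conflicting. The history format and the data are invented.
    from collections import Counter

    # (editor, hash of the resulting page text), oldest first.
    history = [
        ("alice", "aaa"), ("bob", "bbb"), ("alice", "aaa"),  # alice overwrites bob
        ("carol", "ccc"), ("alice", "ddd"),
    ]

    overwrites = Counter()
    for i in range(2, len(history)):
        editor, text_hash = history[i]
        prev_editor, _ = history[i - 1]
        # A revision that restores the text from two steps back has overwritten
        # whatever the editor in between did.
        if text_hash == history[i - 2][1] and editor != prev_editor:
            overwrites[(editor, prev_editor)] += 1

    print(overwrites.most_common())   # high counts = conflicting pairs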
Even on the German Wikipedia, most edits do not show up immediately; someone has to review the changes before they are shown. That would also mean that those edits, before they are approved, do not need to go to a single data center.
People interested in getting all the available versions would need to be able to find them, but for that kind of thing people would be prepared to wait a bit longer and collect the data from many servers if needed. You should be able to pull just the versions you want, at the depth you want. That selection of versions and depth would be a large optimization in itself.
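git already gives you that selection with shallow clones; roughly like this (the repository URL is invented):

    # Rough sketch: pull only the depth of history you actually want.
    # The repository URL is invented; shallow clones are standard git.
    import subprocess

    subprocess.run(
        ["git", "clone", "--depth=5",
         "git://room42.example.org/quantum.git", "quantum"],
        check=True,
    )

    # If you later want more history, deepen the clone on demand:
    subprocess.run(["git", "fetch", "--depth=50"], cwd="quantum", check=True)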
So there are different ways to reduce the load on a single server and create pockets of processing for different topics. The only really important thing is that people who are working on the same topic are working on the same server or have a path of communication.
To sum it up: if conflicts are the major problem in Wikipedia, the major cost in terms of review and coordination, then you should rethink the workflow so that the processing time is pushed back to the editor causing the conflict.
Right now revisions are stored whole, not as deltas. If you only store the new information, you need less storage. Transferring only the changes across the net, rather than full revisions, would be one big optimization for Wikipedia.
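To show the size difference, here is a rough sketch using plain unified diffs (the texts are made up); git's packfiles do delta compression along the same lines:

    # Rough sketch: transfer a unified diff instead of the whole new revision.
    # The example texts are made up; real revisions would be much larger.
    import difflib

    old = ["Line %d of the article.\n" % i for i in range(100)]
    new = list(old)
    new[40] = "Line 40, but reworded.\n"   # a single line changed

    patch = list(difflib.unified_diff(old, new, "rev_100", "rev_101"))
    print("full revision: %d lines, diff: %d lines" % (len(new), len(patch)))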
Of course, even a new section could be a conflict if the new text is garbage or in need of editing. If you want to replace a single word or a sentence, then, let's say, that would create a conflict branch in one of the external conference rooms, which would host the page until the work is finished there. The main server would just have a pointer to the workgroup, and the load would be pushed away. That also means that any local server would be able to process the data and host the branch until it is pushed back to the main server.
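The hand-off itself would be cheap; a rough sketch, with the remote and branch names invented:

    # Rough sketch of the hand-off: the disputed work becomes a branch pushed to
    # a workgroup server, and the main server only keeps a pointer (the workgroup
    # URL and the branch name) until the result comes back. Names are invented.
    import subprocess

    def git(*args):
        subprocess.run(["git"] + list(args), check=True)

    # Run inside the article's repository on the main server.
    git("checkout", "-b", "conflict/example-article-lead")
    git("remote", "add", "workgroup", "git://room42.example.org/example-article.git")
    git("push", "workgroup", "conflict/example-article-lead")
    # When the room is done, the branch is pushed back, merged, and the pointer
    # is dropped.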
OK, well, I think this is enough for now. I do ask you to remain serious, and then we can have a serious discussion on the topic of optimisation.
thanks, mike