Let me sum this up. The basic optimization is this: you don't need to transfer every new revision of an article to all users at all times. The central server could just say: this is the last revision released by the editors responsible for it, there are 100 edits in progress, and you can get involved by going to this page here (hosted on a server somewhere else). There is no need to transfer those 100 in-progress edits to every user on the web; they are not interesting to everyone.
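To make that concrete, the per-article record the central server hands out could be as small as something like this (a rough Python sketch; all the field names, numbers and URLs are made up):

    # Rough sketch: what the central server could serve per article instead of
    # pushing every in-progress edit to everyone. All names here are invented.
    released_record = {
        "title": "Example article",
        "released_revision": 123456,   # last revision signed off by its editors
        "edits_in_progress": 100,      # a count only, not the edits themselves
        "workgroup_url": "git://workgroup.example.org/example-article.git",
    }

    def describe(record):
        """Summarize where the current work on an article is happening."""
        return ("Latest released revision: %(released_revision)s. "
                "%(edits_in_progress)s edits in progress at %(workgroup_url)s."
                % record)

    print(describe(released_record))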
On Sun, Dec 13, 2009 at 12:10 PM, Domas Mituzas midom.lists@gmail.com wrote:
>> The other question is: does it make sense to have such a centralized client-server architecture? We have been talking about using a distributed VCS for MediaWiki.
> Lunatics without any idea of stuff being done inside the engine talk about distribution. Let them!
I hope you are serious here. Let's take a look at what the engine does: it allows editing of text, it renders the text, and it serves the text. The original wiki from Ward Cunningham is a Perl script of the most basic form. There is not much magic involved. Of course you need search tools, version histories and such, and there are places for optimizing all of those processes.
It is not lunacy; it is a fact that such work can be done, and is done, without a central server in many places.
Just look, for example, at how people edit code in an open-source software project using git. It is distributed, and it works.
There are already git-based wikis available. There are also peer-to-peer networks such as Tor or Freenet that would be possible to use.
You could split up the editing of Wikipedia articles across a network of git servers around the globe, while the rendering and distribution of the resulting data would remain the job of the WMF.
Now, the issue of resolving conflicts is pretty simple in the case of git: everyone has a copy and can do what they want with it. If you like the version from someone else, you pull it.
In terms of Wikipedia having only one viewpoint, the NPOV reflected by the current revision at any point in time, that version would be the one pushed from its editors' repositories. It is imaginable that you would have one senior editor for each topic, with their own repository of pages, who pulls in versions from many people.
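Just to illustrate the pull model I mean, here is a rough sketch of what a topic's senior editor might run inside their own repository; the contributor URLs, remote names and branch name are all invented:

    # Rough sketch of the pull model: a topic's senior editor fetches from the
    # contributors they trust and merges the version they want to release.
    # Run inside the editor's own repository; all names and URLs are invented.
    import subprocess

    CONTRIBUTORS = {
        "alice": "git://example.org/alice/physics-articles.git",
        "bob": "git://example.org/bob/physics-articles.git",
    }

    def git(*args):
        subprocess.run(["git"] + list(args), check=True)

    for name, url in CONTRIBUTORS.items():
        git("remote", "add", name, url)   # errors if the remote already exists
        git("fetch", name)

    # After reviewing what came in, merge the version the editor accepts:
    git("merge", "alice/expand-relativity-section")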
>> Now, back to the optimization. Let's say you were able to optimize the program. We would identify the major CPU burners and optimize them out. That does not solve the problem, because I would think the PHP program is only a small part of the entire issue. The fact that the data is flowing in a wasteful way is the cause of the waste, not the program itself. Even if the program were much more efficient at moving around data that is not needed, the data is still not needed.
> We can have new kind of Wikipedia. The one where we serve blank pages, and people imagine content in it. We've done that with moderate success quite often.
Please, let's be serious here! I am talking about the fact that not all people need all the centralised services at all times.
So if you have 10 people collaborating on a topic, only the results of that work would be checked into the central server. The decentralized communication would be between fewer parties and would reduce the resources used.
> Except that you still need tracker to handle all that, and resolve conflicts, as still, there're no good methods of resolving conflicts with small number of untrusted entities.
A tracker to manage which server is used by which group of editors can be pretty efficient. Essentially it is a form of DNS: the tracker need only show you the repositories currently registered for a certain topic.
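The tracker really can be that thin; here is a rough sketch (topics and URLs made up):

    # Minimal sketch of the "DNS-like" tracker: it only maps a topic to the
    # repositories currently registered for it. Topics and URLs are invented.
    TRACKER = {}

    def register(topic, repo_url):
        TRACKER.setdefault(topic, set()).add(repo_url)

    def lookup(topic):
        return sorted(TRACKER.get(topic, set()))

    register("Quantum mechanics", "git://room42.example.org/quantum.git")
    register("Quantum mechanics", "git://room17.example.net/quantum.git")
    print(lookup("Quantum mechanics"))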
Resolving conflicts is important, but you only need so many people for that.
The entire community does not get involved in every conflict. There are only a certain number of people deeply involved in any one section of Wikipedia at any given time.
Imagine that you had, let's say, 1,000 conference rooms available for discussion and collaboration, spread around the world, and the results of those rooms were fed back into Wikipedia. These rooms, or servers, would handle the edits and conflicts for any given set of pages.
My idea is that you don't need a huge server to resolve conflicts. Many pages don't have many conflicts; there are certain areas which need constant arbitration, of course. You could even split the groups up by viewpoint, so that the arbitration team only deals with the output of two teams (pro and contra).
Even if you look at the number of editors on a highly contested page, it is not unlimited.
In retrospect, you would be able to identify which groups of editors are collaborating (enhancing each other's work) and which are conflicting (overwriting each other). If you split them up into different rooms so that those who should be collaborating can do so, and reduce the conflicts, then you win a lot.
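That grouping could be computed from the existing page histories fairly cheaply. A rough sketch of the idea, with an invented history format:

    # Rough sketch: from a page's revision history, count how often one editor's
    # change is immediately overwritten by another. Pairs with high counts are
    # conflicting. The history format and the data are invented.
    from collections import Counter

    # (editor, hash of the resulting page text), oldest first.
    history = [
        ("alice", "aaa"), ("bob", "bbb"), ("alice", "aaa"),  # alice overwrites bob
        ("carol", "ccc"), ("alice", "ddd"),
    ]

    overwrites = Counter()
    for i in range(2, len(history)):
        editor, text_hash = history[i]
        prev_editor, _ = history[i - 1]
        # A revision that restores the text from two steps back has overwritten
        # whatever the editor in between did.
        if text_hash == history[i - 2][1] and editor != prev_editor:
            overwrites[(editor, prev_editor)] += 1

    print(overwrites.most_common())   # high counts = conflicting pairs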
Even on the German Wikipedia, most edits do not show up immediately; someone has to review the changes before they are shown. That would also mean that those edits, before they are approved, do not need to go to a single data center.
People interested in getting all the available versions would need to be able to find them, but for that kind of thing people would be prepared to wait a bit longer and collect the data from many servers if needed. You should be able to pull just the versions you want, at the depth you want. That selection of versions and depth would be a large optimization in itself.
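git already gives you that selection with shallow clones; roughly like this (the repository URL is invented):

    # Rough sketch: pull only the depth of history you actually want.
    # The repository URL is invented; shallow clones are standard git.
    import subprocess

    subprocess.run(
        ["git", "clone", "--depth=5",
         "git://room42.example.org/quantum.git", "quantum"],
        check=True,
    )

    # If you later want more history, deepen the clone on demand:
    subprocess.run(["git", "fetch", "--depth=50"], cwd="quantum", check=True)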
So there are different ways to reduce the load on a single server and create pockets of processing for different topics. The only really important thing is that people who are working on the same topic are working on the same server or have a path of communication.
To sum it up: if conflicts are the major problem in Wikipedia, the major cost in terms of review and coordination, then you should rethink the workflow so that the processing time is pushed back to the editor causing the conflict.
Right now revisions are stored whole, not as deltas. If you only store the new information, you need less storage. Transferring only the changes across the net, rather than full revisions, would be one big optimization for Wikipedia.
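To show the size difference, here is a rough sketch using plain unified diffs (the texts are made up); git's packfiles do delta compression along the same lines:

    # Rough sketch: transfer a unified diff instead of the whole new revision.
    # The example texts are made up; real revisions would be much larger.
    import difflib

    old = ["Line %d of the article.\n" % i for i in range(100)]
    new = list(old)
    new[40] = "Line 40, but reworded.\n"   # a single line changed

    patch = list(difflib.unified_diff(old, new, "rev_100", "rev_101"))
    print("full revision: %d lines, diff: %d lines" % (len(new), len(patch)))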
Of course, even a new section could be a conflict if the new text is garbage or in need of editing. If you want to replace a single word or a sentence, then, let's say, that would create a conflict branch in one of the external conference rooms, which would host the page until the work is finished there. The main server would just have a pointer to the workgroup, and the load would be pushed away. That also means that any local server would be able to process the data and host the branch until it is pushed back to the main server.
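The hand-off itself would be cheap; a rough sketch, with the remote and branch names invented:

    # Rough sketch of the hand-off: the disputed work becomes a branch pushed to
    # a workgroup server, and the main server only keeps a pointer (the workgroup
    # URL and the branch name) until the result comes back. Names are invented.
    import subprocess

    def git(*args):
        subprocess.run(["git"] + list(args), check=True)

    # Run inside the article's repository on the main server.
    git("checkout", "-b", "conflict/example-article-lead")
    git("remote", "add", "workgroup", "git://room42.example.org/example-article.git")
    git("push", "workgroup", "conflict/example-article-lead")
    # When the room is done, the branch is pushed back, merged, and the pointer
    # is dropped.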
OK, well, I think this is enough for now. I do ask you to remain serious, and then we can have a serious discussion on the topic of optimisation.
thanks, mike