Dude, I need that strong stuff you're having.
Let me sum this up. The basic optimization is this: you don't need to transfer every new revision of an article to all users at all times.
There's not much difference between transferring every revision and just some
'good' revisions.
The central server could just say: this is the last revision that has been released by the editors responsible for it; there are 100 edits in process, and you can get involved by going to this page here (hosted on a server someplace else).
Editing is a minuscule part of our workload.
There is no need to transfer those 100 edits to all the users on the web, and they are not interesting to everyone.
Well, we may not transfer them in the case of flagged revisions, and we may transfer them in the case of a pure wiki. The point is, someone has to transfer them.
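For concreteness, the response being described might look something like this. It is a toy sketch; the field names and URL are invented:

# Toy sketch of the proposed "stable pointer" response; everything here is invented.
article_status = {
    "title": "Vilnius",
    "released_revision": 123456,             # last revision signed off by its editors
    "pending_edits": 100,                     # edits still in process
    "edit_here": "https://drafts.example.org/Vilnius",   # hosted someplace else
}

def response_for_reader(status):
    # Readers get the released revision; would-be editors get pointed elsewhere.
    return {"revision": status["released_revision"], "edit_at": status["edit_here"]}

print(response_for_reader(article_status))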
Let's take a look at what the engine does. It allows editing of text.
That includes conflict resolution, cross-indexing, history tracking, abuse filtering, full
text indexing, etc.
It renders the text.
It means building the output out of many individual assets (templates, anyone?), embedding media, transforming based on user options, etc.
It serves the text.
And not only text - it serves complex aggregate views like 'last related changes',
'watchlist', 'contributions by new users', etc.
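Since "serves the text" keeps getting glossed over: here is a minimal sketch, in Python with SQLite, of the kind of indexed metadata query a watchlist-style view needs. The schema and data are invented for illustration and are nothing like MediaWiki's real tables, but the point stands: these views are answered from indexes kept next to the revisions, not by replaying a pile of revision blobs.

import sqlite3

# Toy schema; MediaWiki's real tables differ, this only illustrates the shape.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE revision (
    rev_id INTEGER PRIMARY KEY,
    page_title TEXT,
    rev_user TEXT,
    rev_timestamp TEXT)""")
db.executemany(
    "INSERT INTO revision VALUES (?, ?, ?, ?)",
    [(1, "Vilnius", "Alice", "2009-08-01"),
     (2, "Vilnius", "Bob",   "2009-08-02"),
     (3, "Kaunas",  "Carol", "2009-08-02")])

watchlist = ("Vilnius", "Kaunas")
# "Latest change to each watched page since yesterday": an aggregate view
# over metadata, answered from indexes, not from the revision text itself.
rows = db.execute("""
    SELECT page_title, MAX(rev_timestamp), rev_user
    FROM revision
    WHERE page_title IN (?, ?) AND rev_timestamp >= '2009-08-02'
    GROUP BY page_title""", watchlist).fetchall()
print(rows)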
The wiki from Ward Cunningham is a Perl script of the most basic form.
That is probably one of the reasons why we're not using the wiki from Ward Cunningham anymore and have something else, called MediaWiki.
There is not much magic
involved.
Not much use at a multi-million-article wiki with hundreds of millions of revisions.
Of course you need search tools, version histories and
such.
There are places for optimizing all of those processes.
And we've done that with MediaWiki ;-)
It is not lunacy; it is a fact that such work can be done, and is done, without a central server in many places.
Name me a single website with a distributed-over-the-internet backend.
Just look, for example, at how people edit code in an open source software project using Git. It is distributed, and it works.
Git is limited and expensive for way too many of our operations. Also, you have to have a whole copy of the repository; it doesn't have on-demand remote pulls, nor any caching layer attached to that.
I appreciate your desire to clone Wikipedia.
It works if you want expensive accesses, of course. We're talking about serving a website here, not the case which is very nicely depicted at:
http://xkcd.com/303/
There are already wikis based on git available.
Has anyone tried putting Wikipedia content on them and simulating our workload? :)
I understand that Git's semantics are usable for Wikipedia's basic revision storage, but its data would still have to be replicated to other types of storage that allow various kinds of cross-indexing and cross-reporting.
How well does Git handle parallelism internally? How can it be parallelized over multiple
machines? etc ;-) It lacks engineering. Basic stuff is nice, but it isn't what we
need.
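To make the cross-indexing point concrete, a deliberately simplified sketch (every name here is invented, and nothing reflects our real schema): even if revision text sat in a Git-like content-addressed store, a query like "contributions by user X" still needs separate indexes that have to be maintained on every single write. That maintenance, not blob storage, is where the engineering lives.

import hashlib
from collections import defaultdict

# Content-addressed blob store, roughly what Git gives you for free.
blobs = {}

# The part Git does not give you: secondary indexes kept in step with writes.
revisions_by_user = defaultdict(list)   # user -> [revision metadata]
history_by_title = defaultdict(list)    # page -> [revision metadata]

def save_revision(title, user, timestamp, text):
    key = hashlib.sha1(text.encode()).hexdigest()
    blobs[key] = text
    meta = {"page": title, "user": user, "ts": timestamp, "blob": key}
    revisions_by_user[user].append(meta)
    history_by_title[title].append(meta)
    return key

save_revision("Vilnius", "Alice", "2009-08-01", "Vilnius is the capital...")
save_revision("Vilnius", "Bob", "2009-08-02", "Vilnius is the capital of Lithuania...")

# 'User contributions' is an index lookup, not a walk over all blobs.
print([m["page"] for m in revisions_by_user["Bob"]])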
There are other peer-to-peer networks, such as Tor or Freenet, that would be possible to use.
How? These are just transports.
If you were to split the editing of Wikipedia articles across a network of Git servers around the globe, the rendering and distribution of the resulting data would be the job of the WMF.
And how would that save any money? By adding much more complexity to most of the processes, and by leaving the major cost item untouched?
Now, the issue of resolving conflicts is pretty simple in the case of Git: everyone has a copy and can do what they want with it. If you like the version from someone else, you pull it.
Whose revision does Wikimedia merge?
In terms of Wikipedia having only one viewpoint, the NPOV reflected by the current revision at any one point in time, that version would be the one pushed from its editors' repositories. It is imaginable that you would have one senior editor for each topic, with their own repository of pages, who pulls in versions from many people.
Go to Citizendium, k, thx.
Please, let's be serious here!
I am talking about the fact that not all people need all the
centralised services at all times.
You have an absolute misunderstanding of what our technology platform is doing. You're wasting your time, you're wasting my time, and you're wasting the time of everyone who has to read your or my emails.
A tracker to manage which server is used by which group of editors can be pretty efficient. Essentially it is a form of DNS. A tracker need only show you the current repositories that are registered for a certain topic.
Seriously, I need that stuff you're on. Have you ever been involved in building anything remotely similar?
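For the record, the "tracker" being proposed is, at its simplest, a topic-to-repository lookup table. A toy sketch (every name and URL invented) shows how little of the actual problem such a lookup solves:

# A toy "tracker": topic -> list of repository URLs. All names are invented.
TRACKER = {
    "physics": ["https://repo-eu.example.org/physics.git"],
    "history": ["https://repo-us.example.org/history.git"],
}

def repositories_for(topic):
    # The lookup itself is trivial; merging, indexing, rendering, serving
    # and abuse handling are untouched by it.
    return TRACKER.get(topic, [])

print(repositories_for("physics"))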
The entire community does not get involved in all the conflicts. There are only a certain number of people who are deeply involved in any one section of Wikipedia at any given time.
Have you ever edited Wikipedia? :) Do you understand the editorial process there?
Imagine that you had, let's say, 1,000 conference rooms available for discussion and working together, spread around the world, and the results of those rooms would be fed back into Wikipedia. These rooms or servers would be for processing the edits and conflicts of any given set of pages.
How is that more efficient?
My idea is that you don't need to have a huge server to resolve conflicts. Many pages don't have many conflicts; there are certain areas which need constant arbitration, of course. You could even split the groups up by viewpoint, so that the arbitration team only deals with the output of two teams (pro and contra).
NEED YOUR STUFFFFFF.
In retrospect, you would be able to identify which groups of editors are collaborating (enhancing each other) and which are conflicting (overwriting each other). If you split them up into different rooms, putting those who should be collaborating together, and reduce the conflicts, then you will win a lot.
You'll get the Nobel Prize in Literature if you continue like this!
Infinite monkeys, when managed properly, ... ;-)
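Incidentally, the retrospective analysis is the only easy part of that proposal. A rough sketch with invented data and an invented signal, classifying editor pairs as collaborating or conflicting from who follows whom and who reverts whom:

from collections import Counter

# Invented toy history: (page, editor, reverted_previous_edit?)
history = [
    ("Vilnius", "Alice", False),
    ("Vilnius", "Bob",   False),   # Bob builds on Alice
    ("Kosovo",  "Carol", False),
    ("Kosovo",  "Dave",  True),    # Dave reverts Carol
    ("Kosovo",  "Carol", True),    # Carol reverts Dave
]

collaborating, conflicting = Counter(), Counter()
last_editor = {}
for page, editor, reverted in history:
    prev = last_editor.get(page)
    if prev and prev != editor:
        pair = tuple(sorted((prev, editor)))
        (conflicting if reverted else collaborating)[pair] += 1
    last_editor[page] = editor

print("collaborating:", collaborating)
print("conflicting:", conflicting)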
Even in Germany, most edits do not show up immediately. They have someone check the commits. Now, that would also mean that those edits, before they are committed, do not need to go to a single data center.
Again, you don't win efficiency. You win 'something', like, bragging rights in
your local p2p-wanking-circle.
This part of the editorial process is minuscule in terms of workload.
You should be able to just pull the versions you want, at the depth that you want. That selection of versions and depth would be a large optimization in itself.
Except that it is not the cost for us.
So there are different ways to reduce the load on a
single server and
create pockets of processing for different topics. The only really
important thing is that people who are working on the same topic are
working on the same server or have a path of communication.
YOU SHOULD MENTION JABBER!!!111oneoneeleven
To sum it up: if conflicts are the major problem in Wikipedia, the major cost in terms of review and coordination, then you should rethink the workflow to push the processing time back to the editor causing the conflict.
Semi-atomic resolution of conflicts is what allows fast collaboration to happen.
You fail to understand that.
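To spell out what "semi-atomic resolution" buys us, a toy sketch (not MediaWiki's actual merge code): when two people edit the same page concurrently, edits touching different sections merge automatically at save time, and only genuinely overlapping edits bounce back to an editor.

def merge_edits(base, mine, theirs):
    """Toy 'semi-atomic' merge: a page is a dict of section -> text.
    Concurrent edits merge automatically unless both changed the same
    section to different text; only then does the editor get a conflict.
    (Illustration only, not MediaWiki's actual merge code.)"""
    merged, conflicts = dict(base), []
    for section in set(base) | set(mine) | set(theirs):
        b, m, t = base.get(section), mine.get(section), theirs.get(section)
        if m == b:                  # I didn't touch it: take their version
            new = t
        elif t == b or t == m:      # they didn't touch it, or we agree
            new = m
        else:
            conflicts.append(section)
            continue
        if new is None:
            merged.pop(section, None)   # the section was deleted
        else:
            merged[section] = new
    return merged, conflicts

base   = {"Intro": "Old intro.", "History": "Old history."}
mine   = {"Intro": "New intro.", "History": "Old history."}
theirs = {"Intro": "Old intro.", "History": "New history."}
print(merge_edits(base, mine, theirs))   # merges cleanly, nobody has to arbitrate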
Right now the revisions are stored in whole, but not in part. If you only add in new information, then you need less storage. It would be one big optimization for Wikipedia to transfer only the changes across the net and not full revisions.
??????
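Since the "only store and transfer the changes" idea keeps coming back, here is what it amounts to, as a toy sketch using Python's difflib (this is not how MediaWiki stores or transfers text). The catch is visible in reconstruct(): reading revision n now means replaying every delta before it, so you trade storage for exactly the kind of read cost we care about.

import difflib

full = []     # "stored in whole": every revision kept verbatim
deltas = []   # "stored in part": only opcodes against the previous revision

def store(new_text):
    prev = full[-1] if full else ""
    ops = difflib.SequenceMatcher(a=prev, b=new_text).get_opcodes()
    # Keep only what is needed to rebuild new_text from prev.
    deltas.append([(tag, i1, i2, new_text[j1:j2]) for tag, i1, i2, j1, j2 in ops])
    full.append(new_text)

def reconstruct(n):
    """Rebuild revision n from deltas alone: every earlier delta must be replayed."""
    text = ""
    for ops in deltas[: n + 1]:
        text = "".join(
            text[i1:i2] if tag == "equal" else repl
            for tag, i1, i2, repl in ops
        )
    return text

store("Vilnius is a city.")
store("Vilnius is the capital of Lithuania.")
store("Vilnius is the capital and largest city of Lithuania.")
assert reconstruct(2) == full[2]   # correct, but the cost grows with chain length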
OK, well I think this is enough for now. I do ask you
to remain
serious, and we can have a serious discussion on the topic of
optimisation.
I am serious. You fail at everything.
You fail to understand online operation implications (privacy, security, etc.).
You fail to understand our content.
You fail to understand our costs.
You fail to understand our archival and cross-indexing needs.
You fail to understand our editorial process efficiency.
You fail to understand that distribution increases overall costs.
You fail to understand pretty much everything.
I admire your enthusiasm for 'scaling a basic wiki'. We're not running a basic wiki; we're way beyond that. I have no idea how I can have a serious discussion with someone who is so out of touch with reality.
You suggest a high-complexity engineering project that would bring nearly no wins over anything. At this point you should erase your email client; that would be much more efficient.
I deliberately keep this topic on foundation-l, because I'm sure it is not worth the
time of people on wikitech-l@ ;-)
Domas