This is a very good question. In general, we would be delighted to find a
small wiki (i.e., smaller than the En/De/Fr ones) that wants to try this
out, and we think that this is realistic. We think it would be essential to
experiment first on smaller wikis, to gain more experience with the load
and how to distribute it across the hardware infrastructure, before jumping
to the largest wikis. This is fairly obvious, of course.
Note that in any case you can throttle down the load of the extension
without affecting the main wiki, as it is all implemented in an
asynchronous fashion (the main wiki never needs to wait for the extension
when serving requests).
Let me now give a somewhat detailed answer on the load.
- When there is an edit, an asynchronous job is started to analyze it. The
key word is asynchronous: it does not slow down the HTTP processing of the
edit on the main server. Right now, the analysis job runs on the same
machine as MediaWiki; this can change if desired. The analysis takes a
fraction of a second, but requires reading 5-10 revisions from the
database, plus some other minor DB access. So edits clearly become more
expensive, but even on the English Wikipedia (at most 5 edits / second?) a
single CPU would suffice. As long as edits are a small percentage of the
reads (which is true of most wikis), we do not think that the analysis of
edits is a significant load.
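The decoupling described above can be sketched roughly as follows. This is a minimal illustration, not the extension's actual code: the function and queue names are invented, and the analysis itself is a stub standing in for the revision reads and trust computation.

```python
import queue
import threading
import time

edit_queue = queue.Queue()
analyzed = []

def handle_edit(page, rev_id):
    """On the HTTP request path: only enqueue the work and return.
    The wiki never waits for the trust analysis."""
    edit_queue.put((page, rev_id))
    return "saved"

def analysis_worker():
    """Background worker: stand-in for reading the last 5-10 revisions
    from the database and running the trust analysis on them."""
    while True:
        page, rev_id = edit_queue.get()
        time.sleep(0.01)  # placeholder for the sub-second analysis
        analyzed.append((page, rev_id))
        edit_queue.task_done()

threading.Thread(target=analysis_worker, daemon=True).start()

print(handle_edit("Main_Page", 42))  # → saved (returns before analysis runs)
edit_queue.join()                    # wait here only to show completion
print(analyzed)                      # → [('Main_Page', 42)]
```

The point of the sketch is the return path: `handle_edit` finishes as soon as the job is queued, so edit latency on the main server is unaffected by however long the analysis takes.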
- When you ask to look at trust information, it is very quick: we just read
an alternative markup rather than the standard one.
- For each revision analyzed for trust, MediaWiki stores the original,
unchanged revision, and we store the same revision, annotated for trust and
text origin, in an additional database table. If you want trust information
for all revisions, this causes the DB size to expand by a factor of roughly
2.5: not an issue for small wikis, but an issue for very large ones. In
practice, very few people are interested in the trust information for very
old article versions, so we could keep the trust information only for the
most recent 50 or so revisions (in fact, there is a better way to determine
the threshold, but let's simplify). This would reduce storage, and make it
proportional to the number of articles rather than the number of revisions.
It would be quite easy for us to add a variable that enables such pruning;
let me add it to our to-do list.
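The pruning idea can be illustrated with a toy table. The schema below (`trust_text` with `page_id`, `rev_id`, `markup` columns) is invented for the example; it is not the extension's real table layout, only a sketch of "delete all but the newest N annotations per page":

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trust_text (page_id INT, rev_id INT, markup TEXT)")
# One page with 100 annotated revisions.
conn.executemany(
    "INSERT INTO trust_text VALUES (?, ?, ?)",
    [(1, r, "...") for r in range(1, 101)],
)

def prune(conn, keep=50):
    """Delete trust annotations older than the newest `keep` revisions
    of each page; storage then grows with the number of pages rather
    than the number of revisions."""
    conn.execute(
        """DELETE FROM trust_text
           WHERE rev_id NOT IN (
               SELECT rev_id FROM trust_text AS t
               WHERE t.page_id = trust_text.page_id
               ORDER BY rev_id DESC LIMIT ?)""",
        (keep,),
    )

prune(conn)
remaining = conn.execute("SELECT COUNT(*) FROM trust_text").fetchone()[0]
print(remaining)  # → 50
```

With the factor-of-2.5 expansion quoted above, keeping only recent annotations caps the overhead at roughly 50 extra annotated revisions per article, however long the article's history.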
- The hard job is when you have a big existing wiki and need to analyze all
of its past history to bring the extension up to date. We are trying to
decide whether to build special tools that facilitate this "catch-up"
starting from XML dumps; if the WMF expressed clear interest in our
extension, we would consider it. The current implementation can "catch up
with the past" at something like 10 revisions / second; to limit DB usage,
one can throttle this down, e.g., to 4 revisions / second. This catch-up
analysis can run in the background, and can use a spare CPU connected to
the DB, or whatever. I believe the main bottleneck would be the DB load
rather than the CPU load (the CPU load can occur on a separate CPU; the DB
is shared with the wiki servers). But if one accepts that the catch-up
takes a bit of time, one can just throttle down the analysis; after all,
this needs to be done only once for each wiki.
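The throttling logic amounts to a rate-capped loop. A minimal sketch, with invented names (`catch_up`, `max_per_second`, a stubbed-out `analyze`) standing in for the real batch processor:

```python
import time

def analyze(rev):
    """Placeholder for the per-revision analysis; the real job reads
    several prior revisions from the shared database."""
    pass

def catch_up(revisions, max_per_second=4.0):
    """Process historical revisions in the background at a capped rate,
    so the database shared with the wiki servers is not overloaded."""
    interval = 1.0 / max_per_second
    done = 0
    for rev in revisions:
        start = time.monotonic()
        analyze(rev)
        done += 1
        # Sleep off whatever remains of this revision's time slot.
        elapsed = time.monotonic() - start
        if elapsed < interval:
            time.sleep(interval - elapsed)
    return done

print(catch_up(range(8), max_per_second=1000))  # → 8
```

At 4 revisions / second, a wiki with a million past revisions would catch up in about three days of background processing, which matches the "a bit of time, but only once per wiki" trade-off above.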
Suggestions and comments are welcome.
Luca
On Tue, Aug 26, 2008 at 5:33 PM, mike.lifeguard <mike.lifeguard(a)gmail.com> wrote:
Do we have any idea what server load is like? Is this something WMF could
potentially deploy at this point?
Mike
------------------------------
*From:* Luca de Alfaro [mailto:luca@dealfaro.org]
*Sent:* August 23, 2008 10:56 PM
*To:* wikiquality-l(a)lists.wikimedia.org
*Subject:* [Wikiquality-l] WikiTrust v2 released: reputation and trust
for your wiki in real-time!
As some of you might remember, we have been working on author
reputation and text trust systems for wikis; some of you may have seen
our demo at WikiMania 2007, or the on-line demo
http://wiki-trust.cse.ucsc.edu/
Since then, we have been busy at work building a system that can be
deployed on any wiki and display the text trust information.
And we finally made it!
We are pleased to announce the release of WikiTrust version 2!
With it, you can compute author reputation and text trust of your
wikis in real-time, as edits to the wiki are made, and you can display
text trust via a new "trust" tab.
The tool can be installed as a MediaWiki extension, and is released
open-source, under the BSD license; the project page is
http://trust.cse.ucsc.edu/WikiTrust
WikiTrust can be deployed on both new and existing wikis.
WikiTrust stores author reputation and text trust in additional
database tables. If deployed on an existing wiki, WikiTrust first
computes the reputation and trust information for the current wiki
content, and then processes new edits as they are made. The
computation is scalable, parallel, and fault-tolerant, in the sense
that WikiTrust adaptively fills in missing trust or reputation
information.
On my MacBook, running under Ubuntu in VMware, WikiTrust can analyze
some 10-20 revisions / second of a wiki; so with a little patience,
unless your wiki is truly huge, you can just deploy it and wait a
bit.
Go to
http://trust.cse.ucsc.edu/WikiTrust for more information and for
the code!
Feedback, comments, etc. are much appreciated!
Luca de Alfaro
(with Ian Pye and Bo Adler)
_______________________________________________
Wikiquality-l mailing list
Wikiquality-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikiquality-l