This is a very good question. In general, we would be delighted to find a small wiki (i.e., smaller than the En/De/Fr ones) that wants to try this out, and we think that this is realistic. We think it would be essential to experiment first on smaller wikis, to gain more experience with the load and with how to distribute it on the hardware infrastructure, before jumping to the largest wikis. This is kind of obvious, of course. Note that in any case you can throttle down the load of the extension without affecting the main wiki, as it is all implemented in an asynchronous fashion (the main wiki never needs to wait for the extension when serving requests).
Let me now give a somewhat detailed answer on the load.
- When there is an edit, an asynchronous job is started to analyze it. The key word is asynchronous: it does not slow down the edit HTTP processing on the main server (see the first sketch after this list). Right now, the analysis job runs on the same machine that runs MediaWiki; this can change if desired. The analysis takes a fraction of a second, but requires reading 5-10 revisions from the database, plus some other minor database access. So edits clearly become more expensive, but even on the English Wikipedia (at most 5 edits/second?) a single CPU would suffice. As long as edits are a small percentage of the reads (which is true of most wikis), we do not think that the analysis of edits is a significant load.
- When you ask to look at trust information, it is very quick: we just read the trust-annotated markup rather than the standard one (second sketch below).
- For each revision that is analyzed for trust, MediaWiki stores the original, unchanged revision, and we store the same revision, annotated for trust and text origin, in an additional database table. If you want trust information for all revisions, this causes the database size to expand by a factor of roughly 2.5: not an issue for small wikis, but an issue for very large ones. In practice, very few people are interested in the trust information for very old article versions, so we could keep the trust information only for the most recent 50 or so revisions (in fact, there is a better way to determine the threshold, but let's simplify; see the third sketch below). This would reduce storage and make it proportional to the number of articles rather than the number of revisions. It would be quite easy for us to add a variable that enables such pruning; let me add it to our todo list.
- The hard job is when you have a big existing wiki and you need to analyze all of its past to bring the extension up to date. We are trying to decide whether to build special tools that facilitate this "catch-up" starting from XML dumps; if the WMF expressed clear interest in our extension, we would consider it. The current implementation can "catch up with the past" at something like 10 revisions/second; to limit database usage, one can throttle this down, e.g. to 4 revisions/second (last sketch below). This catch-up analysis can run in the background, and can use a spare CPU connected to the database. I believe the main bottleneck would be the database load rather than the CPU load (the analysis can run on a separate CPU, but the database is shared with the wiki servers). But if one accepts that the catch-up takes a bit of time, one can simply throttle down the analysis; after all, this needs to be done only once per wiki.
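To make the points above more concrete, here are a few illustrative sketches in Python. They are not WikiTrust's actual code, and all names (the job queue, compute_trust, the db methods) are hypothetical. First, the asynchronous edit analysis: the request handler only enqueues a job, and a separate worker does the expensive work.

```python
# A minimal sketch of the asynchronous edit analysis, assuming a simple
# in-process job queue. All names here are hypothetical, not WikiTrust's API.
import queue

analysis_jobs = queue.Queue()

def on_edit(page_id, rev_id):
    # The edit request handler only enqueues the job and returns immediately;
    # it never waits for the analysis, so edits are not slowed down.
    analysis_jobs.put((page_id, rev_id))

def compute_trust(revisions):
    """Placeholder for the actual trust/reputation computation."""
    raise NotImplementedError

def analysis_worker(db):
    # Runs in a separate thread or process, possibly on another machine.
    while True:
        page_id, rev_id = analysis_jobs.get()
        # Analyzing one edit requires roughly the last 5-10 revisions.
        recent = db.fetch_revisions(page_id, up_to=rev_id, limit=10)
        db.store_annotated_revision(rev_id, compute_trust(recent))
        analysis_jobs.task_done()
```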
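Second, the read path for the trust view: serving trust information is just a lookup of the pre-computed annotated markup instead of the standard wikitext. The table and column names here are made up for the example.

```python
# Sketch of the trust-view read path (hypothetical schema names).
def get_trust_markup(db, rev_id):
    row = db.query_one(
        "SELECT annotated_text FROM wikitrust_markup WHERE rev_id = %s",
        (rev_id,))
    # If the revision has not been analyzed yet, the caller falls back
    # to the standard markup.
    return row["annotated_text"] if row else None
```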
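Third, the pruning idea from the storage point: keep the annotated copy only for the ~50 most recent revisions of each page, so storage grows with the number of articles rather than the number of revisions.

```python
# Sketch of pruning old trust markup (the wikitrust_markup table is
# hypothetical; the real threshold would be chosen more carefully
# than a fixed constant).
KEEP_LAST = 50

def prune_trust_markup(db, page_id):
    rev_ids = db.query_column(
        "SELECT rev_id FROM revision WHERE rev_page = %s "
        "ORDER BY rev_timestamp DESC", (page_id,))
    for old_rev in rev_ids[KEEP_LAST:]:
        db.execute("DELETE FROM wikitrust_markup WHERE rev_id = %s",
                   (old_rev,))
```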
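Finally, the throttled catch-up over the existing history: a simple rate limit caps the database load, and the loop can run in the background for as long as it takes.

```python
# Sketch of a throttled catch-up loop; analyze_revision stands for the
# same hypothetical analysis step as in the first sketch.
import time

def catch_up(db, max_revs_per_sec=4.0):
    budget = 1.0 / max_revs_per_sec
    for page_id, rev_id in db.iter_unanalyzed_revisions():  # oldest first
        start = time.monotonic()
        analyze_revision(db, page_id, rev_id)
        # Sleep off the rest of the per-revision budget to cap db load.
        elapsed = time.monotonic() - start
        if elapsed < budget:
            time.sleep(budget - elapsed)
```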
Suggestions and comments are welcome.
Luca
On Tue, Aug 26, 2008 at 5:33 PM, mike.lifeguard <mike.lifeguard@gmail.com> wrote:
Do we have any idea what server load is like? Is this something WMF could potentially deploy at this point?
Mike
*From:* Luca de Alfaro [mailto:luca@dealfaro.org] *Sent:* August 23, 2008 10:56 PM *To:* wikiquality-l@lists.wikimedia.org *Subject:* [Wikiquality-l] WikiTrust v2 released: reputation and trust for your wiki in real-time!
As some of you might remember, we have been working on author reputation and text trust systems for wikis; some of you may have seen our demo at WikiMania 2007, or the on-line demo http://wiki-trust.cse.ucsc.edu/
Since then, we have been hard at work building a system that can be deployed on any wiki and display the text trust information. And we finally made it!
We are pleased to announce the release of WikiTrust version 2!
With it, you can compute author reputation and text trust for your wiki in real time, as edits are made, and you can display text trust via a new "trust" tab. The tool can be installed as a MediaWiki extension and is released as open source under the BSD license; the project page is http://trust.cse.ucsc.edu/WikiTrust
WikiTrust can be deployed on both new and existing wikis. WikiTrust stores author reputation and text trust in additional database tables. If deployed on an existing wiki, WikiTrust first computes the reputation and trust information for the current wiki content, and then processes new edits as they are made. The computation is scalable, parallel, and fault-tolerant, in the sense that WikiTrust adaptively fills in missing trust or reputation information (a sketch follows).
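As a rough illustration of the adaptive fill-in, here is a Python sketch with hypothetical names (not the actual implementation): when a revision is found without trust information, the system can walk back to the nearest analyzed revision and re-process forward from there, which is what makes the computation tolerant of gaps and interruptions.

```python
# Sketch of adaptive fill-in of missing trust information
# (hypothetical names; the real recovery logic differs in detail).
def ensure_analyzed(db, page_id, rev_id):
    history = db.revision_history(page_id, up_to=rev_id)  # oldest first
    # Find the most recent revision that already has trust markup.
    start = 0
    for i in reversed(range(len(history))):
        if db.has_trust_markup(history[i]):
            start = i + 1
            break
    # Re-derive everything missing, in chronological order.
    for rev in history[start:]:
        analyze_revision(db, page_id, rev)
```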
On my MacBook, running under Ubuntu in VMware, WikiTrust can analyze some 10-20 revisions/second of a wiki; so with a little patience, unless your wiki is truly huge, you can just deploy it and wait a bit. Go to http://trust.cse.ucsc.edu/WikiTrust for more information and for the code!
Feedback, comments, etc. are much appreciated!
Luca de Alfaro (with Ian Pye and Bo Adler)