Brian wrote:
I think what the toolserver guys are saying is that
they've got the
data (e.g., a replica of the master database) and they are willing to
expand operations to include larger-scale computations, and so yes
they are willing to become more "research oriented". They just need
the extra hardware, of course. I think it's difficult to estimate how
much hardware that is, but here are some applications that I would like
to make, or see made, sooner or later:
* WikiBlame - A Lucene index of the history of all projects that can
instantly find the authors of a pasted snippet. I'm not clear on the
memory requirements of hosting an app like this after the index is
created, but the index will be terabyte-scale, at roughly 35% of the size
of the text dump.
Note that WikiTrust can do this too, and will probably go into testing soon. For
now, the database for WikiTrust will be off-site, but if it goes live on
wikipedia, the hardware would be run at the main wmf cluster, and not on the
toolserver.
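To make the snippet-attribution idea concrete, here is a minimal sketch in Python. The revision data is hypothetical, and the linear scan is purely an illustrative stand-in for a Lucene phrase query over a terabyte-scale history index:

```python
def find_author(revisions, snippet):
    """Return the author of the earliest revision containing the snippet.

    `revisions` is a list of (rev_id, author, text) tuples, assumed to be
    ordered oldest-first. A real WikiBlame would replace this linear scan
    with a phrase query against a prebuilt full-history index.
    """
    for rev_id, author, text in revisions:
        if snippet in text:
            return author
    return None

# Hypothetical two-revision history of one article:
history = [
    (1, "alice", "The cat sat on the mat."),
    (2, "bob",   "The cat sat on the mat. Dogs bark loudly."),
]
print(find_author(history, "Dogs bark"))  # "bob" - the revision that introduced the text
```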
* WikiBlame for images - an image similarity algorithm
over all images
in all projects that can find all places a given image is being used.
I believe there is a one-time major cpu cost when first analyzing the
images and then a much lesser realtime comparison cost. Again, the
memory requirements of hosting such an app are unclear.
That would be very nice to have...
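As an illustration of the "much lesser realtime comparison cost": once each image has been reduced to a compact perceptual hash (the one-time cpu cost), comparing two images is just a Hamming distance between short bit vectors. A toy sketch with made-up pixel values and no real image decoding:

```python
def average_hash(pixels):
    """Tiny perceptual hash: one bit per pixel, set when the pixel is
    brighter than the image's mean. `pixels` is a flat list of grayscale
    values; a real system would first downscale the image to, say, 8x8."""
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]

def hamming(h1, h2):
    """Number of differing bits - small distance means similar images."""
    return sum(a != b for a, b in zip(h1, h2))

# Two hypothetical 4x4 grayscale images with the same light/dark pattern:
img_a = [10, 200, 15, 190, 12, 210, 11, 205, 9, 198, 14, 202, 10, 195, 13, 201]
img_b = [12, 198, 14, 192, 10, 208, 13, 203, 11, 200, 12, 204, 9, 197, 15, 199]
print(hamming(average_hash(img_a), average_hash(img_b)))  # 0 - near-duplicates
```

The hashes can be precomputed and stored for every image, so finding reuse of a given image is a cheap nearest-neighbor lookup rather than a pixel-by-pixel comparison.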
* A vandalism classifier bot that uses the entire
history of a wiki in
order to predict whether the current edit is vandalism. Basically, a
major extension of existing published work on automatically detecting
vandalism, which only used several hundred edits. This would require
major cpu resources for training but very little cost for real-time
classification.
Pretty big for a toolserver project. But an excellent research topic!
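For a flavor of what such a classifier does at classification time, here is a toy sketch with a few hand-picked features of the kind used in the published work. The weights are made up for illustration; the point of the proposal is that a real system would learn them from the full edit history:

```python
def vandalism_score(old_text, new_text):
    """Score an edit on a few illustrative surface features.

    The features (blanking, shouting in caps, repeated punctuation) are
    typical of the vandalism-detection literature; the weights here are
    invented, where a trained classifier would learn them from data.
    """
    score = 0.0
    if len(new_text) < 0.2 * len(old_text):
        score += 0.5  # most of the page was blanked
    added = new_text[len(old_text):] if new_text.startswith(old_text) else new_text
    letters = [c for c in added if c.isalpha()]
    if letters and sum(c.isupper() for c in letters) / len(letters) > 0.8:
        score += 0.3  # added text is almost all caps
    if "!!!" in added:
        score += 0.2  # repeated punctuation
    return score

print(vandalism_score("A long established article text.", ""))             # 0.5 (blanking)
print(vandalism_score("Intro. ", "Intro. WIKIPEDIA IS WRONG!!!"))          # 0.5 (caps + punctuation)
```

Extracting features like these is cheap per edit, which is why real-time classification costs so little once the expensive training over the full history is done.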
* Dumps, including extended dump formats such as a natural language
parse of the full text of the current version of a wiki, made readily
available to researchers.
Finally, there are many worthwhile projects that have been presented
at past Wikimanias or published in the literature that deserve to be
kept up to date as the encyclopedia continues to grow. Permanent
hosting for such projects would be a worthwhile goal, as would
reaching out to these researchers. If the foundation can afford such
an endeavor, the hardware cost is actually not that great; the datacenter
fees may be the larger expense.
Please don't forget that the toolserver is NOT run by the wikimedia foundation.
It's run by wikimedia germany, which has maybe a tenth of the foundation's
budget. If the foundation is interested in supporting us further, that's great,
we just need to keep responsibilities clear: is the foundation running a
project, or is the foundation helping us (wikimedia germany) to run a project?...
-- daniel