Brian wrote:
I think what the toolserver guys are saying is that they've got the data (e.g., a replica of the master database) and they are willing to expand operations to include larger-scale computations, so yes, they are willing to become more "research oriented". They just need the extra hardware, of course. I think it's difficult to estimate how much, but here are some applications that I would like to make, or see made, sooner or later:
- WikiBlame - A Lucene index of the history of all projects that can
instantly find the authors of a pasted snippet. I'm not clear on the memory requirements of hosting such an app once the index is built, but the index will be terabyte-sized, at roughly 35% of the size of the full text dump.
Note that WikiTrust can do this too, and will probably go into testing soon. For now, the database for WikiTrust will be off-site, but if it goes live on Wikipedia, the hardware would be run at the main WMF cluster, not on the toolserver.
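To make the snippet-lookup idea concrete, here is a toy sketch of one way it could work: index overlapping word n-grams ("shingles") from every revision, then rank authors by how many shingles of the pasted snippet they match. This is my own illustration of the concept, not Lucene's actual API, and all names here are made up.

```python
from collections import defaultdict

def shingles(text, n=3):
    """Split text into a set of overlapping word n-grams ("shingles")."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

class SnippetIndex:
    """Toy inverted index mapping shingles to (revision, author) pairs.

    A real deployment would use a Lucene index over the full revision
    history; this in-memory version just shows the lookup principle.
    """
    def __init__(self):
        self.index = defaultdict(set)

    def add_revision(self, rev_id, author, text):
        for s in shingles(text):
            self.index[s].add((rev_id, author))

    def find_authors(self, snippet):
        """Return authors ranked by how many of the snippet's shingles match."""
        hits = defaultdict(int)
        for s in shingles(snippet):
            for rev_id, author in self.index.get(s, ()):
                hits[author] += 1
        return sorted(hits, key=hits.get, reverse=True)

idx = SnippetIndex()
idx.add_revision(1, "Alice", "the quick brown fox jumps over the lazy dog")
idx.add_revision(2, "Bob", "a completely unrelated sentence about wikis")
print(idx.find_authors("quick brown fox jumps"))  # → ['Alice']
```

The index itself is what dominates the storage cost; once built, each query only touches the few shingles in the pasted snippet, which is why lookups can be near-instant.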
- WikiBlame for images - an image similarity algorithm over all images
in all projects that can find all places a given image is being used. I believe there is a one-time major CPU cost when first analyzing the images, and then a much smaller real-time comparison cost. Again, the memory requirements of hosting such an app are unclear.
That would be very nice to have...
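One common way to split the cost exactly as described above is a perceptual hash: the expensive pass computes a tiny fingerprint per image once, and the real-time step is just a cheap bit comparison. A minimal sketch of the "average hash" variant, assuming images have already been downscaled to a small grayscale grid (the pixel matrices here are made up for illustration):

```python
def average_hash(pixels):
    """Perceptual hash: one bit per pixel, set when the pixel is
    brighter than the image's mean brightness.
    `pixels` is a 2D list of grayscale values (already downscaled)."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]

def hamming(h1, h2):
    """Number of differing bits; a small distance means similar images."""
    return sum(a != b for a, b in zip(h1, h2))

img_a = [[10, 200], [220, 30]]
img_b = [[12, 198], [215, 35]]   # slightly brightened copy of img_a
img_c = [[200, 10], [30, 220]]   # very different layout

print(hamming(average_hash(img_a), average_hash(img_b)))  # → 0 (near-duplicate)
print(hamming(average_hash(img_a), average_hash(img_c)))  # → 4 (different)
```

Hashing every image is the one-time CPU-heavy step; after that, finding where an image is used reduces to comparing short bit vectors, which is fast enough to do online.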
- A vandalism classifier bot that uses the entire history of a wiki in
order to predict whether the current edit is vandalism. Basically, a major extension of existing published work on automatically detecting vandalism, which only used several hundred edits. This would require major cpu resources for training but very little cost for real-time classification.
Pretty big for a toolserver project. But an excellent research topic!
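The training/classification cost split mentioned above can be sketched as follows: the features and weights here are entirely invented for illustration (a real classifier would use far richer signals and would learn the weights from the full edit history, which is the expensive offline step).

```python
def features(old_text, new_text):
    """Tiny illustrative feature vector for one edit. The specific
    features and thresholds are made up for this sketch."""
    delta = len(new_text) - len(old_text)
    caps = sum(c.isupper() for c in new_text) / max(len(new_text), 1)
    profanity = any(w in new_text.lower() for w in ("stupid", "sucks"))
    return [delta < -100, caps > 0.5, profanity]

def score(feats, weights):
    """Linear score. Learning these weights over a full dump is the
    CPU-heavy training step; scoring one live edit is cheap."""
    return sum(w for f, w in zip(feats, weights) if f)

WEIGHTS = [0.5, 0.3, 0.6]  # hypothetical weights a trainer might produce
THRESHOLD = 0.5

def is_vandalism(old, new):
    return score(features(old, new), WEIGHTS) > THRESHOLD

print(is_vandalism("A long, carefully sourced paragraph. " * 10,
                   "page sucks"))  # → True
print(is_vandalism("Paris is the capital of France.",
                   "Paris is the capital and largest city of France."))  # → False
```

The point of the sketch is the asymmetry: however expensive training over the full history gets, the per-edit decision stays a handful of feature lookups and one dot product, so real-time classification is cheap.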
- Dumps, including extended dump formats such as a natural-language
parse of the full text of the most recent version of a wiki, made readily available to researchers.
Finally, there are many worthwhile projects that have been presented at past Wikimanias or published in the literature that deserve to be kept up to date as the encyclopedia continues to grow. Permanent hosting for such projects would be a worthwhile goal, as would reaching out to these researchers. If the foundation can afford such an endeavor, the hardware cost is actually not that great. Perhaps the datacenter fees are.
Please don't forget that the toolserver is NOT run by the Wikimedia Foundation. It's run by Wikimedia Germany, which has maybe a tenth of the foundation's budget. If the foundation is interested in supporting us further, that's great; we just need to keep responsibilities clear: is the foundation running a project, or is the foundation helping us (Wikimedia Germany) to run a project?
-- daniel