I think what the toolserver guys are saying is that they've got the data (e.g., a replica of the master database) and that they are willing to expand operations to include larger-scale computations, so yes, they are willing to become more "research oriented". They just need the extra hardware, of course. It's difficult to estimate how much, but here are some applications that I would like to make, or see made, sooner or later:
* WikiBlame - a Lucene index of the history of all projects that can instantly find the authors of a pasted snippet. I'm not clear on the memory requirements of hosting an app like this once the index is created, but the index itself will be terabyte-sized, at about 35% of the size of the text dump.
* WikiBlame for images - an image similarity algorithm over all images in all projects that can find every place a given image is being used. I believe there is a major one-time CPU cost when first analyzing the images, followed by a much smaller real-time comparison cost. Again, the memory requirements of hosting such an app are unclear.
* A vandalism classifier bot that uses the entire history of a wiki to predict whether the current edit is vandalism. Basically, a major extension of existing published work on automatically detecting vandalism, which used only several hundred edits. Training would require major CPU resources, but real-time classification would cost very little.
* Dumps, including extended dump formats, such as a natural language parse of the full text of the most recent version of a wiki, made readily available for researchers.
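To make the WikiBlame idea concrete, here is a toy sketch of the lookup it would perform: scan revisions (oldest first) for the first one containing a pasted snippet and report its author. The revision data and names here are made up for illustration; a real deployment would replace the linear scan with a Lucene index over the full revision history, which is exactly where the terabyte-scale index comes in.

```python
# Hypothetical revision history, ordered oldest first: (rev_id, author, text).
revisions = [
    (1, "Alice", "The quick brown fox jumps over the lazy dog."),
    (2, "Bob",   "The quick brown fox jumps over the lazy dog. It barked."),
    (3, "Carol", "The quick red fox jumps over the lazy dog. It barked."),
]

def find_author(snippet, revisions):
    """Return the author of the earliest revision containing the snippet.

    Linear scan for illustration only; a Lucene index would answer the
    same question without touching every revision.
    """
    for rev_id, author, text in revisions:
        if snippet in text:
            return author
    return None

print(find_author("It barked", revisions))  # Bob introduced that phrase
```

The point of the index is to turn this O(history) scan into a single query, at the cost of building and storing the index up front.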
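For the image-similarity idea, one plausible approach (my assumption, not something specified above) is a perceptual hash: analyze each image once into a short fingerprint, then compare fingerprints cheaply at query time. That split mirrors the one-time major CPU cost versus the much smaller real-time comparison cost. A minimal average-hash sketch, operating on already-downscaled grayscale pixel grids (real image decoding and downscaling, e.g. via an imaging library, is omitted):

```python
def average_hash(pixels):
    """Fingerprint a 2D grid of grayscale values: 1 where a pixel is
    above the mean brightness, 0 where it is not. Computed once per image."""
    flat = [p for row in pixels for p in row]
    avg = sum(flat) / len(flat)
    return tuple(1 if p > avg else 0 for p in flat)

def hamming(h1, h2):
    """Cheap real-time comparison: count of differing bits."""
    return sum(a != b for a, b in zip(h1, h2))

img_a = [[10, 200], [30, 220]]   # toy 2x2 grayscale "image"
img_b = [[12, 198], [28, 225]]   # slightly brightness-shifted copy
img_c = [[200, 10], [220, 30]]   # mirrored: very different layout

h_a, h_b, h_c = map(average_hash, (img_a, img_b, img_c))
print(hamming(h_a, h_b))  # 0: near-duplicate detected
print(hamming(h_a, h_c))  # 4: clearly different
```

Hosting the app then mostly means keeping the fingerprint table in memory, which is why its memory footprint, not CPU, is the open question.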
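The vandalism classifier splits the same way: expensive offline training over the whole history, cheap per-edit scoring at runtime. As a sketch of the runtime half only, here are a few toy features and a linear score. The feature names, weights, and word list are entirely illustrative assumptions; a real model would learn its weights from the full edit history rather than have them hand-picked.

```python
def edit_features(old_text, new_text):
    """A few toy features of the kind such a classifier might use."""
    added = new_text[len(old_text):] if new_text.startswith(old_text) else new_text
    letters = [c for c in added if c.isalpha()]
    upper_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    return {
        "size_delta": len(new_text) - len(old_text),
        "upper_ratio": upper_ratio,  # shouting is a classic vandalism signal
        "has_profanity": any(w in added.lower() for w in ("stupid", "sucks")),
    }

def score(features, weights, bias):
    """Linear score; positive means 'likely vandalism'. The weights are
    what the expensive offline training step would actually produce."""
    return bias + sum(weights[k] * float(v) for k, v in features.items())

# Hand-picked illustrative weights -- a real model would learn these.
weights = {"size_delta": -0.001, "upper_ratio": 2.0, "has_profanity": 3.0}

f = edit_features("A fine article.", "A fine article. THIS SUCKS")
print(score(f, weights, bias=-1.0) > 0)  # True: flagged for review
```

Scoring an incoming edit is a handful of arithmetic operations, which is why real-time classification is nearly free once training is done.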
Finally, there are many worthwhile projects presented at past Wikimanias or published in the literature that deserve to be kept up to date as the encyclopedia continues to grow. Permanent hosting for such projects would be a worthwhile goal, as would reaching out to their authors. If the foundation can afford such an endeavor, the hardware cost itself is actually not that great; the datacenter fees may well be the larger expense.
On Fri, Mar 13, 2009 at 3:42 PM, Morten Warncke-Wang morten@cs.umn.edu wrote:
Hi all,
Judging by the replies, we think we failed to communicate some of the ideas we wanted to put forward clearly, and we'd like to take the opportunity to clear that up.
We did not want to narrow this down to being only about a third-party toolserver. Before we initiated contact, we had noticed the need for more resources on the existing cluster. Therefore we also had in mind the idea of augmenting the toolserver, rather than attempting to create a competitor to it. For instance, this could allow the toolserver to also host applications requiring substantial text crunching, which is currently not feasible as far as we can tell.
Additionally, we think there could be two paths to account creation: one for Wikipedians and one for researchers. The research path would come with clearer documentation on the requirements a project needs to meet to fit the toolserver and on what the application should contain; combined with faster feedback, this would make the process easier for researchers.
We hope this clears up some central points in our ideas surrounding a "research oriented toolserver". Currently we are exploring several ideas, and this particular one might not become more than a thought and a thread on a mailing list. Nonetheless, perhaps there are thoughts here that can become more solid somewhere down the line.
Morten Warncke-Wang, Research Assistant
John Riedl, Professor
GroupLens Research
www.grouplens.org
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l