I think what the toolserver guys are saying is that they've got the
data (e.g., a replica of the master database) and they are willing to
expand operations to include larger-scale computations, and so yes
they are willing to become more "research oriented". They just need
the extra hardware, of course. I think it's difficult to estimate how
much hardware is needed, but here are some applications that I would
like to make, or see made, sooner or later:
* WikiBlame - a Lucene index over the full revision history of all
projects that can instantly find the authors of a pasted snippet. I'm
not clear on the memory requirements of hosting an app like this once
the index is built, but the index itself will be terabyte-scale,
roughly 35% of the size of the full text dump.
* WikiBlame for images - an image-similarity algorithm over all images
in all projects that can find every place a given image is being used.
I believe there is a one-time major CPU cost when first analyzing the
images and then a much smaller real-time comparison cost. Again, the
memory requirements of hosting such an app are unclear.
* A vandalism-classifier bot that uses the entire history of a wiki to
predict whether the current edit is vandalism. Basically, a major
extension of existing published work on automatically detecting
vandalism, which used only several hundred edits. Training would
require major CPU resources, but real-time classification would cost
very little.
* Dumps, including extended dump formats such as a natural-language
parse of the full text of the most recent version of a wiki, made
readily available to researchers.
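To make the WikiBlame idea above concrete: at scale this would be a Lucene index, but the core lookup can be sketched as a toy word-shingle index in a few lines of Python. Everything here (the function names, the sample revisions, and the 4-word shingle length) is an illustrative assumption, not part of any existing WikiBlame code:

```python
from collections import Counter, defaultdict

SHINGLE = 4  # words per overlapping shingle; a tunable assumption


def shingles(text, n=SHINGLE):
    """Split text into overlapping n-word shingles, lowercased."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_index(revisions):
    """revisions: iterable of (author, revision_text) pairs.

    Returns a mapping from each shingle to the set of authors whose
    revisions contain it. This is the (expensive, one-time) indexing
    step; Lucene would persist it on disk instead of in memory.
    """
    index = defaultdict(set)
    for author, text in revisions:
        for sh in shingles(text):
            index[sh].add(author)
    return index


def blame(index, snippet):
    """Return the author whose revisions share the most shingles with
    the pasted snippet, or None if nothing matches."""
    votes = Counter()
    for sh in shingles(snippet):
        for author in index.get(sh, ()):
            votes[author] += 1
    return votes.most_common(1)[0][0] if votes else None
```

A real deployment would index (revision id, author) pairs rather than bare authors, so a hit points at the exact edit that introduced the text; the shingle trick is what lets the lookup stay fast even though the history is huge.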
Finally, there are many worthwhile projects that have been presented
at past Wikimanias or published in the literature that deserve to be
kept up to date as the encyclopedia continues to grow. Permanent
hosting for such projects would be a worthwhile goal, as would
reaching out to these researchers. If the foundation can afford such
an endeavor, the hardware cost is actually not that great. Perhaps
datacenter fees are.
On Fri, Mar 13, 2009 at 3:42 PM, Morten Warncke-Wang <morten(a)cs.umn.edu> wrote:
Hi all,
Judging by the replies, we think we failed to communicate clearly some
of the ideas we wanted to put forward, and we'd like to take the
opportunity to clear that up.
We did not want to narrow this down to being only about a third-party
toolserver. Before we initiated contact, we noticed the need to add
more resources to the existing cluster. We therefore also had in mind
the idea of augmenting the toolserver rather than attempting to create
a competitor to it. For instance, this could allow the toolserver to
also host applications that require substantial text crunching, which
is currently not feasible as far as we can tell.
Additionally, we think there could be two paths to account creation,
one for Wikipedians and one for researchers. The research path would
offer clearer documentation of the requirements projects would need
to meet in order to fit the toolserver and of what the application
should contain; combined with faster feedback, this would make the
process easier for researchers.
We hope that this clears up some central points in our ideas
surrounding a "research oriented toolserver". We are currently
exploring several ideas, and this particular one might not become more
than a thought and a thread on a mailing list. Nonetheless, perhaps
there are thoughts here that can become more solid somewhere down the
line.
Morten Warncke-Wang, Research Assistant
John Riedl, Professor
GroupLens Research
www.grouplens.org
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l