On Tue, Mar 10, 2009 at 2:18 PM, Daniel Kinzler <daniel@brightbyte.de> wrote:
> Robert Rohde wrote:
>> The converse of this is that some recognized experts would probably prefer to administer their own server/cluster rather than relying on some random guy with Wikimedia DE (or wherever) to get things done.
> An academic institution may also get a serious research grant for this - that would be more complicated if the money were handled via the German chapter. Though it is, of course, also something we are interested in.
> Basically, if we could all work on making the toolserver THE ONE PLACE for working with Wikipedia's data, that would be perfect. If, for some reason, it makes sense to build a separate cluster, I propose giving it a distinct purpose and profile: let it provide facilities for fulltext research, with low priority on update latency and high priority on having fulltext in various forms, with search indexes, word lists, and all the fun.
Personally I would favor a physically distinct cluster (regardless of who administers it) more or less with the focus you describe. In particular, I think it is useful to separate "tools" from "analysis". A "tool" aims to provide useful information in near realtime based on specific and focused parameters. By contrast, "analysis" often involves running some process systematically through a very large portion of the data with the expectation that it will take a while (for example, I've used dumps to perform large statistical analyses where the processing code might take 24 hours when run against the full edit history of a large wiki). "Tools" need high availability and low lag relative to the live site, but "analysis" doesn't care if it gets out of date and should use scheduling etc. to balance large loads.
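[For readers unfamiliar with the kind of "analysis" pass described above, here is a minimal sketch: streaming a MediaWiki XML export (a pages-meta-history dump) and counting revisions per page without loading the whole file into memory. The tag names follow the MediaWiki export format; the function name and the counting task itself are just illustrative assumptions, not anyone's actual tool.]

```python
import xml.etree.ElementTree as ET

def count_revisions(xml_source):
    """Return {page_title: revision_count} from a MediaWiki export stream."""
    counts = {}
    title = None
    # iterparse streams the file element by element, so even a
    # full-history dump of a large wiki never has to fit in RAM.
    for event, elem in ET.iterparse(xml_source, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # strip the XML namespace if present
        if tag == "title":
            title = elem.text
        elif tag == "revision":
            counts[title] = counts.get(title, 0) + 1
        elif tag == "page":
            elem.clear()  # release the finished <page> subtree to keep memory flat
    return counts
```

A run over a real dump would simply pass an open (decompressed) dump file to this function; the point is that such a job is batch-scheduled and latency-insensitive, exactly the opposite profile from a live "tool".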
-Robert Rohde