On Tue, Mar 10, 2009 at 2:18 PM, Daniel Kinzler <daniel(a)brightbyte.de> wrote:
Robert Rohde wrote:
The converse of this is that some recognized experts would probably
prefer to administer their own server/cluster rather than relying on
some random guy with Wikimedia DE (or wherever) to get things done.
An academic institution may also get a serious research grant for this - that
would be more complicated if the money were handled via the German chapter.
Though it's something we are, of course, also interested in.
Basically, if we could all work on making the toolserver THE ONE PLACE for
working with Wikipedia's data, that would be perfect. If, for some reason, it
makes sense to build a separate cluster, I propose to give it a distinct purpose
and profile: let it provide facilities for fulltext research, with low priority
on update latency, and high priority on having fulltext in various forms,
with search indexes, word lists, and all the fun.
Personally I would favor a physically distinct cluster (regardless of
who administers it) more or less with the focus you describe. In
particular, I think it is useful to separate "tools" from "analysis".
A "tool" aims to provide useful information in near realtime based on
specific and focused parameters. By contrast, "analysis" often
involves running some process systematically through a very large
portion of the data with the expectation that it will take a while
(for example, I've used dumps to perform large statistical analyses
where the processing code might take 24 hours when run against the
full edit history of a large wiki). "Tools" need high availability
and low lag relative to the live site, but "analysis" doesn't care if
it gets out of date and should use scheduling etc. to balance large
loads.
-Robert Rohde