On Tue, 29 Mar 2005 20:16:03 +0100, Minty <mintywalker(a)gmail.com> wrote:
> anyone playing with http://nutch.org/ ?
Actually, I think the more generic Lucene library, which Nutch is built
on, will be more useful. We should be indexing the wikitext, not the
HTML (which is a lower-quality version ;)).
Seriously, we also don't want a crawler. What is left in Nutch's favour?
However, I don't imagine either will be used by Wikimedia, as they are
written in Java, which is slow and takes up too much memory compared
to natively compiled code (i.e. C or C++). It's already bad enough
that we're using PHP! In one extreme case, reported by a developer, a
diff took 45.5 seconds in PHP while the same algorithm took 0.5 seconds
in C (or maybe C++).
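
To make the "index the wikitext, not the HTML" point concrete, here is a
rough sketch of a toy inverted index built straight from raw wikitext. It
is not Lucene's API (Lucene is a Java library) and not anything MediaWiki
actually does; the function names, the two sample pages, and the very
crude markup handling are all made up for illustration.

```python
import re
from collections import defaultdict


def wikitext_tokens(wikitext):
    """Return lowercase word tokens from raw wikitext (hypothetical helper)."""
    # [[target|label]] becomes label, [[target]] becomes target.
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", wikitext)
    # Drop the quote runs used for italic ('') and bold (''').
    text = re.sub(r"'{2,}", "", text)
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each token to the set of page titles whose wikitext contains it."""
    index = defaultdict(set)
    for title, wikitext in pages.items():
        for token in wikitext_tokens(wikitext):
            index[token].add(title)
    return index


# Made-up sample pages, keyed by title.
pages = {
    "Lucene": "'''Lucene''' is a [[search engine|search]] library.",
    "Nutch": "'''Nutch''' is a crawler built on [[Lucene]].",
}
index = build_index(pages)
```

Looking up index["lucene"] gives both page titles, and no crawler or HTML
parsing is involved because the text is read directly from wherever the
wikitext is stored. A real setup would hand the same text to a proper
indexing library instead of a dict.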