On Tue, 29 Mar 2005 20:16:03 +0100, Minty <mintywalker@gmail.com> wrote:
anyone playing with http://nutch.org/ ?
Actually, I think the more generic Lucene library, which Nutch is built on, will be more useful. We should be indexing the wikitext, not the HTML (which is a lower-quality version ;))
Seriously, we also don't want a crawler. What is left in Nutch's favour?
However, I don't imagine either will be used by Wikimedia, as they are written in Java, which is slow and takes up too much memory compared to natively compiled code (i.e. C or C++). It's already bad enough that we're using PHP! In one extreme case reported by a developer, a diff took 45.5 seconds in PHP while the same algorithm took 0.5 seconds in C (or maybe C++), a difference of roughly 91 times.