Actually, I think the more generic Lucene library, which Nutch is built upon, will be more useful. We should be indexing the wikitext, not the HTML (which is a lower-quality version ;))
This is the only open issue if you plan to use Lucene: you need a good parser for the wikitext syntax, and that is very difficult.
Seriously, we also don't want a crawler. What is left in Nutch's favour?
Nothing! Use Lucene, trust me. :-) It will definitely save Wikipedia a great deal of load!