Also, Lucene is just a search library. Internally, I use Lucene for
indexing the crawled contents.

A nice thing about crawling versus indexing the MediaWiki database dump:
the rendered HTML and its link structure carry a lot of useful information
that a database dump won't give you. That extra HTML context can be a
better source for search relevancy.
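To make the point about link structure concrete, here is a minimal Java sketch (the class and method names are my own illustration, not part of my actual crawler) of pulling href targets out of crawled HTML. Links and their anchor text extracted this way can then be indexed as extra fields for relevancy:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrates the kind of signal a crawler can pull out of raw HTML that a
// database dump would not carry: the outgoing links of each page.
public class LinkExtractor {

    // Naive href extraction with a regex -- fine for a sketch, though a real
    // crawler would want an HTML parser to handle unquoted or relative links.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String page = "<p>See <a href=\"http://example.org/a\">A</a> and "
                    + "<a href=\"http://example.org/b\">B</a>.</p>";
        System.out.println(extractLinks(page));
        // prints [http://example.org/a, http://example.org/b]
    }
}
```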
Cheers,
Jian
On 10/11/07, Emufarmers Sangly <emufarmers(a)gmail.com> wrote:
On 10/11/07, jian chen <chenjian1227(a)gmail.com> wrote:
The search engine is built using Java and has three components: Crawler,
Indexer, and Searcher.

Right now I have a question for the community. We have a requirement to
lock down the wiki so that only logged-in users can see the wiki content.
But in order for the crawler to download the content, I need to find a way
to enable access for the crawler based on IP address.

Does MediaWiki support such a feature to turn access on/off based on IP
address?
The Python Wikipediabot (http://meta.wikimedia.org/wiki/Pywikipedia) can
log in as a user; I'm sure you can add its login functionality. But I'm
not sure whether it's such a good idea to crawl a live wiki: the Lucene
search engine (http://www.mediawiki.org/wiki/Lucene) forms an index of a
database dump instead. Is there some reason that you need the crawling
functionality?
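That said, if you do want IP-based access, something along these lines in
LocalSettings.php might work -- an untested sketch, and the address and
whitelisted page are placeholders you'd substitute for your setup:

```php
# LocalSettings.php -- restrict reading to logged-in users.
$wgGroupPermissions['*']['read'] = false;

# Keep the login page reachable for anonymous visitors.
$wgWhitelistRead = array( 'Special:Userlogin' );

# Hypothetical crawler address: re-enable anonymous reading for it only.
$crawlerIp = '192.0.2.10';
if ( isset( $_SERVER['REMOTE_ADDR'] ) && $_SERVER['REMOTE_ADDR'] === $crawlerIp ) {
    $wgGroupPermissions['*']['read'] = true;
}
```

Keep in mind REMOTE_ADDR can be spoof-prone behind proxies, so this is
weaker than having the crawler actually log in.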
--
Arr, ye emus,
http://emufarmers.com
_______________________________________________
MediaWiki-l mailing list
MediaWiki-l(a)lists.wikimedia.org
http://lists.wikimedia.org/mailman/listinfo/mediawiki-l