I have a MediaWiki-based resource that needs a full-text search engine. Google will not work, as it is not yet a public resource. Does anyone have any recommendations? This is intended to be used at an academic institution.
Thanks,
jnowacki
On Dec 5, 2007 10:23 AM, Jonathan Nowacki jnowacki@gmail.com wrote:
I have a MediaWiki-based resource that needs a full-text search engine. Google will not work, as it is not yet a public resource. Does anyone have any recommendations? This is intended to be used at an academic institution.
Lucene (http://www.mediawiki.org/wiki/Extension:LuceneSearch) is what Wikipedia uses. You might also want to take a look at Sphinx (http://www.mediawiki.org/wiki/Extension:SphinxSearch).
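If it helps, hooking either one in follows the usual pattern in LocalSettings.php: include the extension and swap out the search backend. A rough sketch for the Lucene case (the $wgLucene* variable names below are what I recall from the extension page, so treat them as assumptions and double-check there):

# Rough sketch only -- verify the variable names on the LuceneSearch
# extension page above before relying on this.
require_once( "extensions/LuceneSearch/LuceneSearch.php" );

$wgSearchType = 'LuceneSearch'; // replace the default MySQL-based search
$wgLuceneHost = '127.0.0.1';    // host running the Lucene search daemon (assumed name)
$wgLucenePort = 8123;           // port the daemon listens on (assumed name)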
I've been thinking about moving from the default to Lucene, and am NOT an expert, so take the following with lots of NaCl. I'd like to hear what people who know what they're talking about think!
As I understand it, Lucene builds and stores its indexes in a set of index files that are kept in memory or swapped in as needed, and does not use the backend database that's running the wiki. By contrast, Sphinx works via MySQL. I believe this difference can become important as the size and use of the wiki increase, since searches can end up taxing the database and degrading MySQL performance. But if Lucene sucks up all your free memory, you could get performance problems outside MySQL. This is probably not an issue for your setup behind a firewall, but I'm wondering how to think about the tradeoffs for a smallish single-server wiki that sometimes gets swamped by search engine hits. And yes, I know that I need to learn more about robots.txt too...
Google also sells search appliances, in case you really want it to work exactly like Google. ;)
Jim
On Dec 5, 2007, at 1:57 PM, Emufarmers Sangly wrote:
[snip]
Lucene (http://www.mediawiki.org/wiki/Extension:LuceneSearch) is what Wikipedia uses. You might also want to take a look at Sphinx (http://www.mediawiki.org/wiki/Extension:SphinxSearch).
=====================================
Jim Hu
Associate Professor
Dept. of Biochemistry and Biophysics
2128 TAMU
Texas A&M Univ.
College Station, TX 77843-2128
979-862-4054
I would like to include myself as someone who knows what they are talking about but it might be a stretch!
Anyhow, we implemented Lucene on a wiki approaching 350k pages. The performance and the types of search available are impressive, and we are using the existing search tables in MySQL to feed the Lucene index. We perform an incremental synchronization every 15 minutes, and the indexing is fast, with a couple of thousand documents indexed per minute. It is unlikely we will have that many changes in a 15-minute period, so all should be fine.
You get things like language stemming (searching for "fishing" returns "fish" and "fishes"), multiple-language support, prioritization based on various factors (title matches, occurrences of words near each other), and a host of other features.
Memory is indeed a consideration, but even so we are able to run this on a 1 GB dedicated server and still see search response times well under 200 ms. We will be moving to a bigger server, but not as a result of Lucene.
If you look at Lucene, then also look to implement Solr, which provides added functionality such as the standard highlighting of search terms within the teaser results, as well as transaction management among multiple index servers.
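To give a flavour of the highlighting bit: it is just extra parameters on the Solr query URL. A hypothetical sketch (assumes a stock Solr instance on localhost:8983 and a stored field named 'text'; adjust for your schema):

# Hypothetical sketch: fetch Solr results with matched terms highlighted
# in the teaser. The field name 'text' and the host/port are assumptions.
$params = http_build_query( array(
    'q'     => 'fishing', // stemming means this can also match "fish", "fishes"
    'hl'    => 'on',      // turn on highlighting
    'hl.fl' => 'text',    // field(s) to build highlighted snippets from
    'rows'  => 10,
) );
$xml = file_get_contents( "http://localhost:8983/solr/select?" . $params );
// $xml now includes a <lst name="highlighting"> section with the snippets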
If you want to talk to the best in the industry, go to Sematext (www.sematext.com), which is run by Otis, one of the original participants in Lucene. We are VERY happy with our move to Lucene and will be adding Solr in the next couple of weeks.
Hope some of this helps.
Regards, Paul
On 12/5/07 3:35 PM, "Jim Hu" jimhu@tamu.edu wrote:
[snip]
Jim Hu wrote:
As I understand it, Lucene builds and stores its indexes in a set of index files that are kept in memory or swapped in as needed, and does not use the backend database that's running the wiki.
[snip]
But if Lucene sucks up all your free memory, you could get performance problems outside MySQL.
This is not exactly true. Lucene will cache some of the index in memory, but it's only a small amount. You can index a very large wiki (such as the English Wikipedia) using Lucene without running into memory problems.
You will need a reasonable amount of disk space to store the index, of course, and more RAM will allow your OS to cache more of the index files itself, which helps performance.
- river.
Jim Hu wrote:
As I understand it, Lucene builds and stores its indexes in a set of index files that are kept in memory or swapped in as needed, and does not use the backend database that's running the wiki. By contrast, Sphinx works via MySQL.
Regarding indexes, Sphinx can be set up to use either a MySQL backend or its own data format, which is the standard mode. It might be, though, that the SphinxSearch extension ( http://www.mediawiki.org/wiki/Extension:SphinxSearch ) uses the wiki's database to get the article extracts for the search page, since these extracts are not in the indexes.
I have, by the way, been impressed by Sphinx's indexing speed (something like 1000 pages / 6 sec), as well as its set of features and config options (multi-language stemming, etc.), and think it looks very promising.
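For anyone wanting to try it, the LocalSettings.php side of the extension is small; a sketch (the $wgSphinxSearch_* names are my assumption of the extension's variables, so verify them on the extension page; the indexing itself is configured separately in sphinx.conf):

# Illustrative sketch only -- the real variable names come from the
# SphinxSearch extension's documentation, so verify them there.
require_once( "extensions/SphinxSearch/SphinxSearch.php" );

$wgSearchType        = 'SphinxSearch'; // hand search queries to Sphinx's searchd
$wgSphinxSearch_host = '127.0.0.1';    // where searchd runs (assumed name)
$wgSphinxSearch_port = 3312;           // searchd's default port at the time (assumed name)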
Regards, Samuel
I was under that impression about Sphinx from a VERY superficial scan of their website. I didn't know there were non-MySQL options!
Which makes it an even harder choice, I guess. I'm still leaning toward Lucene, largely because others who are using it seem to be happy, and they seem to be telling me that the memory concern is not significant (again, this is from a too-quick scan of the docs on the Apache/Lucene website). I suspect they were talking about the size issue for indexing the whole internet, not just one wiki.
My thinking is partly based on the guess that since more wikis are using Lucene, and since Wikipedia is using it, further development and improvement of the extension is more likely. But for all I know, everyone will switch to Sphinx by the time I educate myself sufficiently and find the free time to actually install Lucene! ; )
Jim
On Dec 6, 2007, at 9:57 AM, Samuel Lampa wrote:
[snip]
=====================================
Jim Hu
Associate Professor
Dept. of Biochemistry and Biophysics
2128 TAMU
Texas A&M Univ.
College Station, TX 77843-2128
979-862-4054
All I know is the author of http://meta.wikimedia.org/wiki/Fulltext_search_engines#Ksana_Search_For_Wiki... is a nice guy.
On Dec 5, 2007 9:23 AM, Jonathan Nowacki jnowacki@gmail.com wrote:
I have a MediaWiki-based resource that needs a full-text search engine. Google will not work, as it is not yet a public resource. Does anyone have any recommendations? This is intended to be used at an academic institution.
I use mnoGo. I am sure others are better, but it lets me: dynamically tune the indexer for my namespaces and separate them (and additionally separate talk pages); crawl doc/PDF files; get any type of report I want on the search terms used (redirections authors never thought of); pull in mailing lists or whatever else externally; follow interwiki links for a single page; and, obviously, search under different user/group restrictions and feed that to the appropriate individual user/group restrictions if you need that sort of headache. Writing stops into templates and whatnot is quite handy as well. But I am sure all the big guys do this. And the search ain't bad either.
On 06/12/2007, Gabriel Millerd gmillerd@gmail.com wrote:
[snip]
Since the thread came up, I would like to ask if any of the current search engines will fix my current search problem: I have a lot of transclusion in my wiki, whereby the bulk of many 'resource' pages is dropped in via specific 'data' templates. For example:
A 'resource' page, http://biodatabase.org/index.php/Ensembl
The corresponding 'data' template, http://biodatabase.org/index.php/Template:NARDatabase:Ensembl
It's somewhat messy, but it is a requirement because of the copyright status of the source 'data'. So far so good; however, my searches turn up hits on the underlying data template, and not the corresponding resource page. For example:
http://biodatabase.org/index.php/Special:Search?search=Wellcome+Trust&fu...
I thought about hacking the underlying index tables to redirect terms from the data templates to the resource pages, but I am not sure that is the best answer. If I go with Lucene, can I fix this problem, or will it require more hacking?
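For reference, the kind of hack I have in mind for the default search would be something like this (untested sketch; the SearchUpdate hook signature may differ between MediaWiki versions): expand templates before the text reaches the searchindex table.

# Untested sketch: expand templates before page text is indexed, so the
# transcluded 'data' template content becomes searchable from the
# 'resource' page. The hook signature may vary by MediaWiki version.
$wgHooks['SearchUpdate'][] = 'efIndexTranscludedText';

function efIndexTranscludedText( $id, $namespace, $title, &$text ) {
    global $wgParser;
    $t = Title::makeTitle( $namespace, $title );
    // preprocess() expands templates without doing a full HTML parse
    $text = $wgParser->preprocess( $text, $t, new ParserOptions() );
    return true;
}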
Thanks for any help, and sorry for the long-winded description,
Dan.
On Dec 6, 2007 4:36 AM, Dan Bolser dan.bolser@gmail.com wrote:
I thought about hacking the underlying index tables to redirect terms from the data templates to the resource pages, but I am not sure that is the best answer. If I go with Lucene, can I fix this problem, or will it require more hacking?
Thanks for any help, and sorry for the long-winded description,
My expectation with any search engine would be results similar to those Google provides on that same page.
The various documents related to the variable $wgNamespacesToBeSearchedDefault might be helpful in narrowing down your results.
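It is just a namespace-index => bool map in LocalSettings.php; for example (real variable, illustrative values):

# Search content and help pages by default, but keep raw Template: pages
# out of the default results (illustrative values).
$wgNamespacesToBeSearchedDefault = array(
    NS_MAIN     => true,
    NS_HELP     => true,
    NS_TEMPLATE => false, // hide the 'data' templates from default search
);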
On 06/12/2007, Gabriel Millerd gmillerd@gmail.com wrote:
[snip]
My expectation with any search engine would be results similar to those Google provides on that same page.
Yes, Google 'sees' the transcluded data on the resource page, but the internal MW search indexes do not.
The various documents related to the variable $wgNamespacesToBeSearchedDefault might be helpful in narrowing down your results.
I already set the value of that variable to ensure that all searches are in all namespaces by default. Personally, I don't see the point of namespace-specific searches. For the vast majority of wikis this adds confusion (and nothing else).
Cheers, Dan.
On Dec 6, 2007 2:43 AM, Gabriel Millerd gmillerd@gmail.com wrote:
[snip]
That sounds extremely promising. How did you integrate mnoGo with your MediaWiki installation? If you would be willing to provide a step-by-step guide on MediaWiki.org, it would be very helpful (for me, anyway, and hopefully for other people!).
On Dec 6, 2007 4:28 PM, Emufarmers Sangly emufarmers@gmail.com wrote:
That sounds extremely promising. How did you integrate mnoGo with your MediaWiki installation? If you would be willing to provide a step-by-step guide on MediaWiki.org, it would be very helpful (for me, anyway, and hopefully for other people!).
I should do this, you're right. I don't work there anymore in the same capacity, actually, and rewriting all this for FOSS would be a good project now that I have the 'free time'. There were a lot of things I didn't like in that system. But here is a closer rundown of what I wrote (a minimal sketch of the special-page shell follows at the end):
SpecialScan
* Created a Special page that handles the 'front end' functionality, akin to the mnoGo scripts
* Implemented functionality in the above to handle the new-page creation workflow ("asdf" is not found, create?)
* This is click-tracked so we know when people use it and what they searched for (like Google, basically)
* Altered the Monobook skin to use the above instead of the original
* I think initially mod_rewrite was used as a cheap hack as well

Backend:
* Use Perl and TT to create an indexer file on demand and launch the indexer, taking a quick peek at the MediaWiki database to grab namespaces and other hints
* The indexer does its normal multiple passes through the wiki under the two major user IDs, public and private; the private one is limited to just a single namespace in this case, though, so it's not a full run
* In mnoGo you can make 'datasets' of various regex URL patterns to search, so adjusting the result weighting or the revisit time of pages like ^/w/Talk: can be useful if you're into that sort of thing
* Special care to follow interwiki links to a limited degree and to peek into the MediaWiki data for 'okaying' them

SpecialScan/Terms
* Special page for admins; since queries are farmable, a special page for 'what did people most search for' and some cheap guesses at common failed searches goes here

SpecialMnogo/Config
* Special page for admin people to tune the indexing process to some degree, purge the indexes, manually kick it off, etc., so I don't have to be involved
* Also handles the external URLs here (mailing lists and interwiki limiters)

SpecialMnogo/Logs
* Special page for admin people to see the indexing logs, for what it's worth
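For the curious, the SpecialScan shell itself is ordinary extension boilerplate; a minimal sketch (class and page names are illustrative, not my actual code):

# Minimal sketch of the search front-end special page; names here are
# illustrative rather than the production code.
$wgSpecialPages['Scan'] = 'SpecialScan';

class SpecialScan extends SpecialPage {
    function __construct() {
        parent::__construct( 'Scan' );
    }

    function execute( $par ) {
        global $wgRequest, $wgOut;
        $this->setHeaders();
        $term = $wgRequest->getText( 'search' );
        // log the query for the Terms report, hand $term to the mnoGo
        // front end, and render the hits (omitted here)
        $wgOut->addWikiText( "Results for ''" . $term . "''" );
    }
}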