So Chad and I feel like we've gotten far enough in our prototype of our new search backend for MediaWiki that we're ready to request comments. So here is our formal RFC: https://www.mediawiki.org/wiki/Requests_for_comment/CirrusSearch
You'll note that the plugin is called CirrusSearch. SolrSearch seems to have been taken by an unrelated project so we had to pick a different name.
Please read and comment in whatever way is normal for these things.
Thanks so much for your attention,
Nik Everett
Everyone,
I'm reviving this old thread to update everyone on the status of the RFC:
We've continued working on implementation and everything seems to be proceeding smoothly. We evaluated Elasticsearch and were super impressed and decided it was very likely to be worth switching from Solr4 to it. The evaluation and the switch did cost some time but in my opinion doing it was time well spent.
Thanks so much for your comments a month ago when I first posted this. If you are interested please give the page another look. Just to be helpful, here is a link to what I changed: http://www.mediawiki.org/w/index.php?title=Requests_for_comment%2FCirrusSear...
Nik Everett
On Fri, Jun 14, 2013 at 4:21 PM, Nikolas Everett neverett@wikimedia.org wrote:
[Quoted text snipped; it repeats the original message at the top of the thread.]
I wonder if there are queries or use cases we can support that *aren't* already better handled by Google. Granted, users of private wikis can't simply use the 'site:' trick to reuse Google search results -- but users of private wikis also probably don't need superduper scalability.
Trying to brainstorm here, not start a flame war. What sorts of useful searches could we excel at? (Maybe these are searches/use cases that will facilitate editor engagement?) --scott
Scott,
I was going to respond to this a while ago but couldn't really do it justice. I'm still pretty sure my explanation won't be great, which is an indication of just how good Google is.
For straight search there is nothing we can do that Google can't. It might cost them more time and money to make searching MediaWiki awesome, but they have lots of both, so we're just not going to beat them there. There are a few things that we can do more easily/cheaply than Google:
1. We can update our search index right when changes are made, including when changes are made to transcluded pages.
2. We can search based on redirects to a page.
3. We can filter (and maybe one day facet) based on categories.
4. We could search based on citations.
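To make items 2 and 3 a bit more concrete, here is a minimal in-memory sketch in Python. It is not how CirrusSearch or Elasticsearch does it; the document layout (`title`, `redirects`, `categories`) and the `search` helper are made up purely for illustration.

```python
# Hypothetical toy documents; field names are assumptions for this sketch only.
docs = [
    {"title": "United States", "redirects": ["USA", "America"], "categories": ["Countries"]},
    {"title": "USA Today", "redirects": [], "categories": ["Newspapers"]},
]


def search(docs, term, category=None):
    """Return titles whose title or any redirect contains term, optionally filtered by category."""
    term = term.lower()
    hits = []
    for doc in docs:
        # A page is findable under its own title and under every redirect pointing at it.
        names = [doc["title"]] + doc["redirects"]
        if not any(term in name.lower() for name in names):
            continue
        # Category filtering narrows the result set without affecting the text match.
        if category is not None and category not in doc["categories"]:
            continue
        hits.append(doc["title"])
    return hits


by_redirect = search(docs, "america")
filtered = search(docs, "usa", category="Countries")
```

Here "america" finds "United States" only through its redirect, and the category filter drops "USA Today" from the "usa" results.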
We will, on the other hand, be better about listening to what the community needs with regard to search. Part of the problem here is that historically we've let search languish, and my first foray into making search nicer isn't going to provide much new stuff for the community. Instead it's a solid platform on which to build things that the community needs, and one which should make search less exciting for operations engineers. That really isn't exciting for the community to hear and for that I am sorry. I can only promise that we'll do more later.
There are some deeper integrations into MediaWiki that I don't see Google doing but that we could work on in the future:
1. We could create a section that allowed users to easily find "similar" pages. I'm a little fuzzy on exactly how we'd calculate similarity.
2. We could automatically dig around in Commons for useful media for an article. We could use this to automatically provide extra media which might be relevant, or as a curation aid. On second thought, the second use (a curation aid) sounds much better.
Actually, some kind of game around tagging media as relevant to an article might be quite a decent way to encourage engagement. By game I mean something like Galaxy Zoo or LinkedIn's endorsements. You could do this without a nice search but it'd help produce much more relevant results.
And then there is the cynic in me that says that it is worth doing just so we aren't reliant on external (corporate) entities. I'm really not sure how I would feel if the only way to find stuff on WMF's wikis was with Google/Bing/Yahoo....
Finally we have the private wikis like you mentioned - they mostly can't use Google. We are trying to make sure CirrusSearch works for them. The idea there is to provide something that is better at finding results than the database-based search because it uses the same analysis that we've optimized for WMF. Elasticsearch isn't some kind of precision-tuned machine - you can actually get quite decent behaviour out of downloading the deb or rpm and installing it. You only really need one instance.
So now that I've created this wall of text I don't feel that I've really answered your question well, but I've answered it. That is the thing about hard questions: they are harder to answer than to ask.
I'd really love more brainstorming. Cross-wiki search was another good idea someone added to the page a while ago.
Nik
On Fri, Jul 19, 2013 at 2:24 PM, C. Scott Ananian cananian@wikimedia.org wrote:
[Quoted text snipped; it repeats Scott's message above.]
--
(http://cscott.net)
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
It seems like there are also a bunch of hacky search-alike features built into the MediaWiki database. For example, "all pages linking to this page", "my contributions", etc. From a code cleanup standpoint, it would also be worthwhile if these were all unified and brought together under a single search engine.
It would be really nice if the search engine allowed me to make these sorts of queries in a query language, so that I could combine features. "All pages which I have contributed to which link to Foo.jpg and have the word Bar in them", for example.
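As a rough sketch of what combining features could look like, here is a toy example in Python where each feature is a small predicate and a query is just their conjunction. The `Page` class and the helper names (`contributed_by`, `links_to`, `has_word`, `all_of`) are hypothetical, not an existing MediaWiki or search-engine API.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    # Hypothetical, simplified view of a wiki page for this sketch.
    title: str
    text: str
    contributors: set = field(default_factory=set)
    links: set = field(default_factory=set)


def contributed_by(user):
    return lambda page: user in page.contributors


def links_to(target):
    return lambda page: target in page.links


def has_word(word):
    # Naive whitespace tokenization; a real engine would use proper text analysis.
    return lambda page: word.lower() in page.text.lower().split()


def all_of(*predicates):
    # Combine any number of features into a single query.
    return lambda page: all(p(page) for p in predicates)


def search(pages, predicate):
    return [page.title for page in pages if predicate(page)]


pages = [
    Page("Alpha", "Foo and Bar together", {"scott"}, {"Foo.jpg"}),
    Page("Beta", "Only Bar here", {"nik"}, {"Foo.jpg"}),
    Page("Gamma", "No match word", {"scott"}, {"Foo.jpg"}),
    Page("Delta", "Bar again", {"scott"}, {"Other.png"}),
]

# "All pages which I have contributed to which link to Foo.jpg and have the word Bar in them"
query = all_of(contributed_by("scott"), links_to("Foo.jpg"), has_word("Bar"))
results = search(pages, query)
```

Only "Alpha" satisfies all three conditions here. A query language would presumably parse user input into something shaped like `query`, with the search engine evaluating it against an index rather than a list.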
This would potentially simplify the codebase as well as provide a capability google.com does not. --scott