On Tue, 22 Apr 2008, Simetrical wrote:
On Tue, Apr 22, 2008 at 10:59 AM, Roan Kattouw roan.kattouw@home.nl wrote:
I missed the explanation of the fulltext implementation. Something like 'Foo With_spaces Bar' and then do a fulltext search for the cats you need? That would be more powerful, and would probably be faster for complex intersections. I'll write an alternative to CategoryIntersections that uses the fulltext schema and run some benchmarks. I expect to have some results by the end of the week.
Aerik Sylvan has already done an implementation of the backend using CLucene. If a front-end could be done in core, with a pluggable backend, that might have the best chance of getting enabled on Wikimedia relatively quickly. MyISAM fulltext is not necessarily going to be fast enough due to the locking.
Yes, I did a fulltext search (which works quite well - I forget the response times... I think it was around a third of a second even for intersections of large groups, like "Living_People") and the way it handles booleans and stuff is quite nice. I think I broke it when I moved servers, but I can put it back up. I think it would probably be a great addition to core, and would be very adequate for small wikis, but too slow for larger ones (performance at a few tenths of a second will really add up with tens or hundreds of hits...) I think doing updates is also an issue on large wikis, due to table locking of the MyISAM table. But, I think it will be fine for small wikis. MySQL doesn't break on underscores, so using the category as it appears in the URL seems to work great for fulltext search, and the built-in fulltext search is *much* faster than doing lookups on the categorylinks table, especially for large sets.
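The underscore point is worth making concrete. The post says MySQL's fulltext search doesn't break on underscores; Python's `\w` happens to behave the same way, so a tiny sketch of why `Living_People` survives as one searchable token:

```python
import re

def tokenize(text):
    # Split into word tokens; like the fulltext tokenizer described
    # above, \w+ keeps underscores inside a token instead of splitting.
    return re.findall(r"\w+", text)

# The URL form of a category stays one token, so exact-token matching works:
print(tokenize("Foo Living_People Bar"))  # ['Foo', 'Living_People', 'Bar']
```

This is only an illustration of the tokenization behavior, not how MySQL is implemented internally.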
So, I'd propose in core we add a MyISAM table with a fulltext index of categories - this will suit small wikis. For big wikis, make this an InnoDB table and use it to build a Lucene index, which you'd search with whatever flavor of Lucene you like. This is a fairly straightforward path that covers both core and large wikis, should have good performance for either application, and is flexible in that it does boolean searches. I don't have suggestions for an interface, but why not just start with a SpecialPage and see what happens? Once the functionality is there, suggestions for how to better use it will come out of the woodwork.
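As a rough sketch of the query shape this proposal implies (the table and column names `category_index`, `ci_categories`, and `ci_page` are made up for illustration, and a real implementation would bind parameters rather than interpolate strings):

```python
def intersection_query(include, exclude=()):
    # Boolean-mode fulltext query: +term requires a category,
    # -term excludes one. Schema names here are hypothetical.
    terms = ["+" + c for c in include] + ["-" + c for c in exclude]
    return (
        "SELECT ci_page FROM category_index "
        "WHERE MATCH(ci_categories) "
        "AGAINST('" + " ".join(terms) + "' IN BOOLEAN MODE)"
    )

print(intersection_query(["Living_people", "English_comedy_writers"],
                         exclude=["Stubs"]))
```

The point is just that boolean mode gives AND/NOT intersections in a single indexed query, instead of self-joins on categorylinks.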
I'm working on a CLucene daemon (calling it clucened, which is on SF - with slightly out-of-date source in subversion - and at clucened.com), which could be used for this, or anything else. I'm planning to make it Solr compatible, but not a direct port of Solr, and the implementation will have some differences. So far I have only the daemon and the search function (takes a raw query, which can be boolean or have multiple fields, and passes it through). I think this is really cool, but if we already have a GCJ Lucene search for En, it may be easier just to extend that to read a categories Lucene index than to use another architecture. Either way, I think a search daemon will find an audience and will be a really cool thing :-)
Aerik
Aerik Sylvan schreef:
Yes, I did a fulltext search (which works quite well - I forget the response times... I think it was around a third of a second even for intersections of large groups, like "Living_People") and the way it handles booleans and stuff is quite nice. I think I broke it when I moved servers, but I can put it back up. I think it would probably be a great addition to core, and would be very adequate for small wikis, but too slow for larger ones (performance at a few tenths of a second will really add up with tens or hundreds of hits...) I think doing updates is also an issue on large wikis, due to table locking of the MyISAM table. But, I think it will be fine for small wikis. MySQL doesn't break on underscores, so using the category as it appears in the URL seems to work great for fulltext search, and the built-in fulltext search is *much* faster than doing lookups on the categorylinks table, especially for large sets.
What I was actually wondering is how fulltext compares to MinuteElectron's categoryintersections table (see posts earlier this week), but I guess fulltext will be faster, especially for complex queries.
So, I'd propose in core we add a MyISAM table with a fulltext index of categories - this will suit small wikis. For big wikis, make this an InnoDB table and use it to build a Lucene index, which you'd search with whatever flavor of Lucene you like. This is a fairly straightforward path that covers both core and large wikis, should have good performance for either application, and is flexible in that it does boolean searches. I don't have suggestions for an interface, but why not just start with a SpecialPage and see what happens? Once the functionality is there, suggestions for how to better use it will come out of the woodwork.
That SpecialPage is present in the CategoryIntersections extension, you'd just need to change the backend code.
Roan Kattouw (Catrope)
Roan Kattouw wrote:
What I was actually wondering is how fulltext compares to MinuteElectron's categoryintersections table (see posts earlier this week), but I guess fulltext will be faster, especially for complex queries.
Just to clarify, I have made no categoryintersection extension of any sort; it is Magnus's.
MinuteElectron.
2008/4/23 Aerik Sylvan aerik@thesylvans.com:
Yes, I did a fulltext search (which works quite well - I forget the response times... I think it was around a third of a second even for intersections of large groups, like "Living_People") and the way it handles booleans and stuff is quite nice. I think I broke it when I moved servers, but I can put it back up. I think it would probably be a great addition to core, and would be very adequate for small wikis, but too slow for larger ones (performance at a few tenths of a second will really add up with tens or hundreds of hits...)
What's the actual rate of searches on en:wp?
- d.
On Wed, Apr 23, 2008 at 3:48 PM, David Gerard dgerard@gmail.com wrote:
What's the actual rate of searches on en:wp?
For what, category intersections? Zero; the feature isn't enabled. :)
David Gerard wrote:
2008/4/23 Aerik Sylvan aerik@thesylvans.com:
Yes, I did a fulltext search (which works quite well - I forget the response times... I think it was around a third of a second even for intersections of large groups, like "Living_People") and the way it handles booleans and stuff is quite nice. I think I broke it when I moved servers, but I can put it back up. I think it would probably be a great addition to core, and would be very adequate for small wikis, but too slow for larger ones (performance at a few tenths of a second will really add up with tens or hundreds of hits...)
What's the actual rate of searches on en:wp?
At this moment, about 184 searches per second on all sites together, spread over 16 backend servers. (Not sure offhand which are just enwiki.)
http://ganglia.wikimedia.org/pmtpa/?m=search_rate&c=Search
-- brion vibber (brion @ wikimedia.org)
Aerik Sylvan wrote:
So, I'd propose in core we add a MyISAM table with a fulltext index of categories - this will suit small wikis.
Probably reasonable...
For big wikis, make this an InnoDB table and use it to build a Lucene index, which you'd search with whatever flavor of Lucene you like.
Probably also reasonable. :)
Should check whether Robert's already hacked some of this stuff into the Lucene server or what changes it would require.
-- brion vibber (brion @ wikimedia.org)
Brion Vibber schreef:
Should check whether Robert's already hacked some of this stuff into the Lucene server or what changes it would require.
If I understand correctly, Lucene shouldn't really care what it stores, as long as it's text and it's searchable. Storing "Living_people Articles_needing_cleanup" would work just fine, right? We do need to think about case sensitivity, though.
Roan Kattouw (Catrope)
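One way to sidestep the case-sensitivity question - a sketch only, and it assumes we are willing to lose case distinctions between otherwise identical category names - is to normalize at both index time and query time:

```python
def normalize(category):
    # Collapse the display form ("Living People") and the URL form
    # ("Living_people") to one canonical lowercase token for the index.
    # The same function would be applied to the user's query terms.
    return category.strip().replace(" ", "_").lower()

print(normalize("Living People"))  # same token as normalize("living_people")
```

MediaWiki only forces the first letter's case, so this is a deliberate simplification rather than how the wiki itself treats titles.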
Roan Kattouw wrote:
Brion Vibber schreef:
Should check whether Robert's already hacked some of this stuff into the Lucene server or what changes it would require.
If I understand correctly, Lucene shouldn't really care what it stores, as long as it's text and it's searchable. Storing "Living_people Articles_needing_cleanup" would work just fine, right? We do need to think about case sensitivity, though.
Let me briefly repeat what I said earlier about my experience with this category intersection thingy. Adding categories to the Lucene index is easy *IF* they are inside the article, e.g. try this:
http://en.wikipedia.org/w/index.php?title=Special%3ASearch&search=%2Binc...
This will give you the category intersection of "Living People" and "English comedy writers" in a fraction of a second.
What I found is that the hard part is keeping the index updated. If we want the fancy category intersection system discussed here before, we need an index that is frequently updated, that is integrated with the job queue, that understands templates, etc.
Lucene is not that good with very frequent updates. The usual setup is to have an indexer, make snapshots of the index at regular intervals, and then rsync them onto the searchers. The whole process takes time, although for a category-only index it will probably be fast. I assume there would be at least a few tens of minutes of lag anyhow. Our current Lucene framework could easily be used for index distribution and such.
What remains unsolved, however, is keeping the index updated with the latest changes on the site. If one changes a template with a category in it, the change goes on the job queue. I assume there would need to be some kind of hook that will either log the change somewhere or send data to Lucene somehow. This is the part of the backend that needs thinking and solving.
r.
Robert Stojnic schreef:
Let me briefly repeat what I said earlier about my experience with this category intersection thingy. Adding categories to the Lucene index is easy *IF* they are inside the article, e.g. try this:
http://en.wikipedia.org/w/index.php?title=Special%3ASearch&search=%2Binc...
This will give you the category intersection of "Living People" and "English comedy writers" in a fraction of a second.
What I found is that the hard part is keeping the index updated. If we want the fancy category intersection system discussed here before, we need an index that is frequently updated, that is integrated with the job queue, that understands templates, etc.
You don't need the article text, just query the categorylinks table.
Lucene is not that good with very frequent updates. The usual setup is to have an indexer, make snapshots of the index at regular intervals, and then rsync them onto the searchers. The whole process takes time, although for a category-only index it will probably be fast. I assume there would be at least a few tens of minutes of lag anyhow. Our current Lucene framework could easily be used for index distribution and such.
Categories don't change that often, so I don't think 10 minutes of lag is that bad.
What remains unsolved, however, is keeping the index updated with the latest changes on the site. If one changes a template with a category in it, the change goes on the job queue. I assume there would need to be some kind of hook that will either log the change somewhere or send data to Lucene somehow. This is the part of the backend that needs thinking and solving.
There's the LinksUpdate hook, which is also used in Magnus's implementation.
Roan Kattouw (Catrope)
On Wed, Apr 23, 2008 at 11:06 PM, Robert Stojnic rainmansr@gmail.com wrote:
What remains unsolved, however, is keeping the index updated with the latest changes on the site. If one changes a template with a category in it, the change goes on the job queue. I assume there would need to be some kind of hook that will either log the change somewhere or send data to Lucene somehow. This is the part of the backend that needs thinking and solving.
LinksUpdateComplete hook?
Bryan
Bryan Tong Minh wrote:
On Wed, Apr 23, 2008 at 11:06 PM, Robert Stojnic rainmansr@gmail.com wrote:
What remains unsolved, however, is keeping the index updated with the latest changes on the site. If one changes a template with a category in it, the change goes on the job queue. I assume there would need to be some kind of hook that will either log the change somewhere or send data to Lucene somehow. This is the part of the backend that needs thinking and solving.
LinksUpdateComplete hook?
Something like that, yes, but the hook probably couldn't just connect to the Lucene indexer and queue updates, since the indexer might be down for this or that reason... It might be a better solution to put updates into a table with a date attached and then let the indexer fetch them.
r.
Robert Stojnic schreef:
Something like that, yes, but the hook probably couldn't just connect to the Lucene indexer and queue updates, since the indexer might be down for this or that reason... It might be a better solution to put updates into a table with a date attached and then let the indexer fetch them.
Maybe use the job queue for this? We could put pending updates in the job queue, and re-add the job if the indexer is down.
Roan Kattouw (Catrope)
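A rough sketch of the re-add idea (a pure Python simulation: `send_to_indexer` is a hypothetical callable standing in for the connection to the Lucene indexer, and the real job queue is MediaWiki's, not a deque):

```python
from collections import deque

def process_updates(queue, send_to_indexer, max_attempts=3):
    # Drain pending index updates; if the indexer is unreachable,
    # put the job back so a later run retries it, giving up after
    # max_attempts so a dead indexer doesn't grow the queue forever.
    retry = []
    while queue:
        job = queue.popleft()
        try:
            send_to_indexer(job["page_id"], job["categories"])
        except ConnectionError:
            job["attempts"] = job.get("attempts", 0) + 1
            if job["attempts"] < max_attempts:
                retry.append(job)
    queue.extend(retry)
    return len(retry)
```

The attempt cap is one extra assumption on top of the thread's suggestion; without it, a permanently down indexer would re-queue jobs forever.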
On Thu, Apr 24, 2008 at 8:33 PM, Robert Stojnic rainmansr@gmail.com wrote:
Bryan Tong Minh wrote:
On Wed, Apr 23, 2008 at 11:06 PM, Robert Stojnic rainmansr@gmail.com wrote:
What remains unsolved, however, is keeping the index updated with the latest changes on the site. If one changes a template with a category in it, the thing goes on the job queue. I assume there would need to be some kind of hook that will either log the change somewhere or send data to lucene somehow. This is the part of the backend that needs thinking and solving.
LinksUpdateComplete hook?
Something like that, yes, but the hook probably couldn't just connect to the Lucene indexer and queue updates, since the indexer might be down for this or that reason... It might be a better solution to put updates into a table with a date attached and then let the indexer fetch them.
Hmm too bad we don't have a recentlinkchanges table. (https://bugzilla.wikimedia.org/show_bug.cgi?id=13588)
Robert Stojnic schreef:
Let me briefly repeat what I said earlier about my experience with this category intersection thingy. Adding categories to the Lucene index is easy *IF* they are inside the article, e.g. try this:
http://en.wikipedia.org/w/index.php?title=Special%3ASearch&search=%2Binc...
This will give you the category intersection of "Living People" and "English comedy writers" in a fraction of a second.
That's the dirty way. I've gone ahead and written an alternative implementation of category intersections using a fulltext search, which means you can run the craziest intersections; in fact, you can search an article's categories as if they were the page's contents. It's part of the AdvancedSearch extension which I'm paid to write, but it'll be easy to split off just the intersection functionality into another extension. The upside is that I also have a special page front end ready to go. I'll commit AdvancedSearch into SVN once I've worked out the bugs (provided there are any; it's close to midnight now so I don't really feel like testing stuff any more) and worked things out with my 'employer', which shouldn't take more than a few days.
On a technical level, the extension adds the categorysearch table (you need to run update.php to actually create it), which is basically a rip-off of the searchindex table. It has a cs_page field referencing page_id, and keeps itself updated using the LinksUpdate and ArticleDeleteComplete hooks. There's also a maintenance script to populate the table from scratch.
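For anyone curious how the hook-driven updates fit together, here is a loose Python simulation of the idea (the real code is a PHP extension; the dict stands in for the categorysearch table, and the function names merely echo the hooks mentioned above):

```python
# {cs_page: "space-separated category tokens"} stands in for the
# categorysearch table keyed on cs_page.
categorysearch = {}

def on_links_update(page_id, categories):
    # Rebuild the page's row from the parser's current category list,
    # underscored so each category is a single fulltext token.
    categorysearch[page_id] = " ".join(
        c.replace(" ", "_") for c in categories)

def on_article_delete_complete(page_id):
    # Drop the row when the page goes away; a no-op if it was never indexed.
    categorysearch.pop(page_id, None)

on_links_update(42, ["Living people", "English comedy writers"])
print(categorysearch[42])  # Living_people English_comedy_writers
on_article_delete_complete(42)
```

Because the whole row is rewritten on every LinksUpdate, the table never drifts from the parser's view of the page, template-derived categories included.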
What I found is that the hard part is keeping the index updated. If we want the fancy category intersection system discussed here before, we need an index that is frequently updated, that is integrated with the job queue, that understands templates, etc.
Understanding templates is no problem here, since the updater uses the parser's notion of which categories the page is in, and the populate script uses the categorylinks table.
Lucene is not that good with very frequent updates. The usual setup is to have an indexer, make snapshots of the index at regular intervals, and then rsync them onto the searchers. The whole process takes time, although for a category-only index it will probably be fast. I assume there would be at least a few tens of minutes of lag anyhow. Our current Lucene framework could easily be used for index distribution and such.
I really don't have the faintest idea how Lucene works or how MediaWiki interfaces with it, but I do know that Lucene can handle the stuff we put into the searchindex table. Since the categorysearch table is no different, I think Lucene *should* be able to handle it pretty easily as well. Could someone who actually has a clue about all this reply?
Roan Kattouw (Catrope)
wikitech-l@lists.wikimedia.org