On Thu, May 22, 2008, Roan Kattouw roan.kattouw@home.nl wrote:
I've gone ahead and written an alternative way of implementing category intersections using a fulltext search, which means you can run the most crazy intersections; in fact, you can search in an article's categories as if they were the page's contents. It's part of the AdvancedSearch extension which I'm paid to write, but it'll be easy to split off just the intersection functionality into another extension. The upside is that I also have a special page front end ready to go. I'll commit AdvancedSearch into SVN once I've worked out the bugs (provided there are any; it's close to midnight now so I don't really feel like testing stuff any more) and worked out stuff with my 'employer', which shouldn't take more than a few days.
Wow, awesome! - you (and your employer) beat the heck out of all my good intentions to acquaint myself with the current version of MediaWiki and write code good enough for production! I can't wait to see it.
On a technical level, the extension adds the categorysearch table (you need to run update.php to actually create the table), which is basically a rip-off from the searchindex table. It has a cs_page field referencing page_id, and keeps itself updated using the LinksUpdate and ArticleDeleteComplete hooks. There's also a maintenance script to populate the table from scratch.
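A minimal sketch of that bookkeeping, in Python with SQLite rather than the extension's actual PHP and MySQL code. Only the categorysearch table name, the cs_page field and the two hook names come from the description above; the cs_categories column, the function names and the one-row-per-page layout are assumptions made for illustration, and a plain text column stands in for the fulltext index.

```python
import sqlite3


def create_schema(db):
    # Hypothetical layout: one row per page, all of its category names
    # stored in a single text field (cs_categories is an assumed name).
    db.execute(
        "CREATE TABLE categorysearch ("
        "cs_page INTEGER PRIMARY KEY, "
        "cs_categories TEXT NOT NULL)"
    )


def on_links_update(db, page_id, categories):
    # Stand-in for a LinksUpdate handler: replace the page's row with
    # the category list the parser reported for the new revision.
    text = " ".join(sorted(c.replace(" ", "_") for c in categories))
    db.execute(
        "INSERT OR REPLACE INTO categorysearch (cs_page, cs_categories) "
        "VALUES (?, ?)",
        (page_id, text),
    )


def on_article_delete(db, page_id):
    # Stand-in for an ArticleDeleteComplete handler: drop the page's row.
    db.execute("DELETE FROM categorysearch WHERE cs_page = ?", (page_id,))


def populate(db, categorylinks):
    # Stand-in for the maintenance script: rebuild the table from
    # (page_id, category_name) pairs, as read from categorylinks.
    db.execute("DELETE FROM categorysearch")
    by_page = {}
    for page_id, category in categorylinks:
        by_page.setdefault(page_id, []).append(category)
    for page_id, categories in by_page.items():
        on_links_update(db, page_id, categories)
```

Keeping one row per page means an edit touches exactly one row, which is the same shape as searchindex.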
What I found is that the hard part is keeping the index updated. If we want the fancy category intersection system discussed here before, we need to have an index that is frequently updated, that is integrated with the job queue, that understands templates, etc.
Understanding templates is no problem here, since the updater uses the parser's notion of which categories the page is in, and the populate script uses the categorylinks table.
Perfect - yes exactly the way to go.
Lucene is not that good with very frequent updates. The usual setup is to have an indexer, make snapshots of the index at regular intervals and then rsync them onto the searchers. The whole process takes time, although for a category-only index it will probably be fast. I assume there would be at least a few tens of minutes of lag anyhow. Our current Lucene framework could easily be used for index distribution and such.
I really don't have the faintest idea how Lucene works or how MediaWiki interfaces with it, but I do know that Lucene can handle the stuff we put into the searchindex table. Since the categorysearch table is no different, I think Lucene *should* be able to handle it pretty easily as well. Could someone who actually has a clue about all this reply?
Lucene doesn't allow edits, it only allows add and delete. Presumably too many deletes make the index inefficient or something. But I think all that is moot - once you've got the categories into their own table, it *should* be simple to set up another index on the same type of schedule, etc. as the base search index, and point it to that table. Then, change the interface to point to Lucene instead of MySQL. I'm not familiar with Wikipedia's Lucene backend, but... It seems reasonable to assume that this is not a major endeavor.
What's your UI for the intersections look like? That was the killer for me; I'm a weak UI guy. I'd imagine an interface (I implemented a rough prototype years ago) that lets you "browse" intersections - i.e., given intersection A it would show you the set of all categories B that have documents that have category A. Ideally the most frequently used categories appear at the top :-) But I never did any performance testing for this setup, and additionally, I'm not sure how to do it in Lucene... Anyway, what's your interface like?
Best Regards, Aerik
Aerik Sylvan schreef:
Wow, awesome! - you (and your employer) beat the heck out of all my good intentions to acquaint myself with the current version of MediaWiki and write code good enough for production! I can't wait to see it.
I guess money helps. As does having more free time to develop it.
Lucene doesn't allow edits, it only allows add and delete. Presumably too many deletes make the index inefficient or something. But I think all that is moot - once you've got the categories into their own table, it *should* be simple to set up another index on the same type of schedule, etc. as the base search index, and point it to that table. Then, change the interface to point to Lucene instead of MySQL. I'm not familiar with Wikipedia's Lucene backend, but... It seems reasonable to assume that this is not a major endeavor.
That's kind of what I was saying: if we have some kind of searchindex <--> Lucene interface, a categorysearch <--> Lucene interface should be easy.
What's your UI for the intersections look like? That was the killer for me; I'm a weak UI guy. I'd imagine an interface (I implemented a rough prototype years ago) that lets you "browse" intersections - i.e., given intersection A it would show you the set of all categories B that have documents that have category A. Ideally the most frequently used categories appear at the top :-) But I never did any performance testing for this setup, and additionally, I'm not sure how to do it in Lucene... Anyway, what's your interface like?
I'm also not much of a UI guy, but the UI for this extension was mostly imposed on me by my 'employer', and after some discussion we settled on a format where the category intersection part (it does more) is basically a text box where you can enter "Living people AND American people OR Presidents of the United States". AND takes precedence over OR, so the example would get all living Americans plus all deceased ex-Presidents. Expressions with parentheses like "Living people AND (American people OR Canadian people)" aren't supported yet, but can be emulated with "Living people AND American people OR Living people AND Canadian people" (more complex expressions will probably be impossible to emulate that way, and of course the extension should really support parentheses, I'm working on that).
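To make the precedence rule concrete, here is a rough Python sketch of how such a parenthesis-free query could be evaluated. It is not the extension's code: the function names, the in-memory dictionary of page sets and the page names are all made up for illustration, and it assumes category names never contain the words AND or OR surrounded by spaces.

```python
import re


def parse_query(query):
    # Split on OR first and on AND second, so that AND binds more
    # tightly: the result is a list of AND-groups joined by OR.
    or_terms = re.split(r"\s+OR\s+", query.strip())
    return [
        [name.strip() for name in re.split(r"\s+AND\s+", term)]
        for term in or_terms
    ]


def evaluate(query, pages_by_category):
    # pages_by_category maps a category name to the set of pages in it
    # (a hypothetical stand-in for the fulltext lookup).
    result = set()
    for and_group in parse_query(query):
        sets = [pages_by_category.get(name, set()) for name in and_group]
        # Intersect within an AND-group, then union the groups.
        result |= set(sets[0]).intersection(*sets[1:])
    return result


# Made-up sample data, only to show the shape of the input.
categories = {
    "Living people": {"Page1", "Page3"},
    "American people": {"Page1", "Page2"},
    "Canadian people": {"Page3"},
    "Presidents of the United States": {"Page2"},
}
first = evaluate(
    "Living people AND American people OR Presidents of the United States",
    categories,
)
emulated = evaluate(
    "Living people AND American people OR Living people AND Canadian people",
    categories,
)
```

With this sample data, first contains Page1 and Page2, and emulated contains Page1 and Page3.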
Anyway, you'll be able to play around with it around the beginning of next week, probably.
Roan Kattouw (Catrope)
2008/5/23 Roan Kattouw roan.kattouw@home.nl:
I'm also not much of a UI guy, but the UI for this extension was mostly imposed on me by my 'employer', and after some discussion we settled on a format where the category intersection part (it does more) is basically a text box where you can enter "Living people AND American people OR Presidents of the United States". AND takes precedence over OR, so the example would get all living Americans plus all deceased ex-Presidents. Expressions with parentheses like "Living people AND (American people OR Canadian people)" aren't supported yet, but can be emulated with "Living people AND American people OR Living people AND Canadian people" (more complex expressions will probably be impossible to emulate that way, and of course the extension should really support parentheses, I'm working on that). Anyway, you'll be able to play around with it around the beginning of next week, probably.
\o/ \o/ \o/
- d.
On Fri, May 23, 2008 at 5:04 AM, Roan Kattouw roan.kattouw@home.nl wrote:
Expressions with parentheses like "Living people AND (American people OR Canadian people)" aren't supported yet, but can be emulated with "Living people AND American people OR Living people AND Canadian people" (more complex expressions will probably be impossible to emulate that way, and of course the extension should really support parentheses, I'm working on that).
Any expression involving only AND and OR should be possible to express without parentheses, if AND binds more tightly than OR or vice versa.
http://en.wikipedia.org/wiki/Canonical_form_(Boolean_algebra)
2008/5/23 Simetrical Simetrical+wikilist@gmail.com:
On Fri, May 23, 2008 at 5:04 AM, Roan Kattouw roan.kattouw@home.nl wrote:
Expressions with parentheses like "Living people AND (American people OR Canadian people)" aren't supported yet, but can be emulated with "Living people AND American people OR Living people AND Canadian people" (more complex expressions will probably be impossible to emulate that way, and of course the extension should really support parentheses, I'm working on that).
Any expression involving only AND and OR should be possible to express without parentheses, if AND binds more tightly than OR or vice versa. http://en.wikipedia.org/wiki/Canonical_form_(Boolean_algebra)
Yes, but you're a geek, and casual users are very unlikely to be ;-)
- d.
Simetrical schreef:
Any expression involving only AND and OR should be possible to express without parentheses, if AND binds more tightly than OR or vice versa.
True. Whether it's a good thing to let users write "a AND c OR a AND d OR b AND c OR b AND d" rather than "(a OR b) AND (c OR d)" is another issue.
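The rewriting itself is mechanical: distribute AND over OR until only an OR of AND-groups is left. A small Python sketch, under the assumption that a parsed expression is a nested tuple of the form ("and", x, y) or ("or", x, y) with plain strings as category names (this representation and the function names are invented for the example):

```python
from itertools import product


def to_dnf(expr):
    # Returns a list of AND-groups; the groups are implicitly ORed.
    if isinstance(expr, str):
        return [[expr]]
    op, left, right = expr
    left_groups, right_groups = to_dnf(left), to_dnf(right)
    if op == "or":
        # OR simply concatenates the alternatives.
        return left_groups + right_groups
    if op == "and":
        # AND pairs every left alternative with every right alternative.
        return [a + b for a, b in product(left_groups, right_groups)]
    raise ValueError("unknown operator: %r" % (op,))


def format_dnf(groups):
    # Render without parentheses, relying on AND binding tighter than OR.
    return " OR ".join(" AND ".join(group) for group in groups)


flat = format_dnf(to_dnf(("and", ("or", "a", "b"), ("or", "c", "d"))))
# flat == "a AND c OR a AND d OR b AND c OR b AND d"
```

The number of AND-groups multiplies at every AND, so a query with several parenthesized ORs grows quickly when flattened.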
Roan Kattouw (Catrope)
wikitech-l@lists.wikimedia.org