Roan Kattouw wrote:
Brion Vibber wrote:
Should check whether Robert's already hacked some of this stuff into the lucene server or what changes it would require.
If I understand correctly, Lucene shouldn't really care what it stores, as long as it's text and it's searchable. Storing "Living_people Articles_needing_cleanup" would work just fine, right? We do need to think about case-sensitivity, though.
Let me briefly repeat what I said earlier about my experience with this category intersection thingy. Adding categories to the Lucene index is easy *IF* they are inside the article itself, e.g. try this:
http://en.wikipedia.org/w/index.php?title=Special%3ASearch&search=%2Binc...
This will give you the category intersection of "Living People" and "English comedy writers" in a fraction of a second.
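To make the idea concrete, here is a minimal in-memory sketch of category intersection over an inverted index. It is not Lucene and not the code behind the search above; the names `category_token` and `CategoryIndex` are made up for illustration. Folding everything to lowercase is just one possible answer to the case-sensitivity question, assumed here for simplicity.

```python
from collections import defaultdict


def category_token(name: str) -> str:
    """Turn a category name into one searchable token.

    Assumption: whitespace becomes underscores and case is folded,
    so "Living people" and "living_people" map to the same token.
    """
    return "_".join(name.split()).lower()


class CategoryIndex:
    """Toy inverted index: category token -> set of page titles."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add_page(self, title, categories):
        # Record the page under every category it belongs to.
        for name in categories:
            self._postings[category_token(name)].add(title)

    def intersect(self, *categories):
        # Pages that appear in the posting list of every given category.
        sets = [self._postings.get(category_token(c), set()) for c in categories]
        return set.intersection(*sets) if sets else set()
```

With pages added via `add_page`, a call such as `index.intersect("Living people", "English comedy writers")` returns only the titles present in both categories.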
What I found is that the hard part is keeping the index updated. If we want the kind of fancy category intersection system discussed here before, we need an index that is frequently updated, is integrated with the job queue, understands templates, etc.
Lucene is not that good with very frequent updates. The usual setup is to have an indexer make snapshots of the index at regular intervals and then rsync them onto the searchers. The whole process takes time, although for a category-only index it will probably be fast. I assume there would be a lag of at least a few tens of minutes anyhow. Our current Lucene framework could easily be used for index distribution and such.
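A rough model of why the snapshot approach implies a lag, as a sketch only. The `Indexer` and `Searcher` classes are hypothetical, and `install` stands in for the rsync step; nothing here is our actual framework.

```python
class Indexer:
    """Holds the live index and hands out point-in-time snapshots."""

    def __init__(self):
        self._live = {}

    def update(self, title, categories):
        # Updates land in the live index immediately.
        self._live[title] = frozenset(categories)

    def snapshot(self):
        # Shallow copy is enough because the values are immutable.
        return dict(self._live)


class Searcher:
    """Answers queries from the last snapshot it received."""

    def __init__(self):
        self._index = {}

    def install(self, snapshot):
        # Stand-in for copying a snapshot onto the search host.
        self._index = snapshot

    def pages_in(self, category):
        return {t for t, cats in self._index.items() if category in cats}
```

Anything passed to `update` after the last `snapshot` stays invisible to the searcher until the next snapshot is installed, which is where the delay comes from.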
What remains unsolved, however, is keeping the index updated with the latest changes on the site. If someone changes a template with a category in it, the change goes on the job queue. I assume there would need to be some kind of hook that will either log the change somewhere or send data to Lucene somehow. This is the part of the backend that needs thinking and solving.
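One possible shape for the "log the change somewhere" option, purely as a sketch. `PendingUpdates` and `on_categories_changed` are invented names, not an existing MediaWiki hook, and a real version would need persistent storage rather than a dict in memory.

```python
class PendingUpdates:
    """Collects category changes until the indexer picks them up."""

    def __init__(self):
        self._pending = {}

    def on_categories_changed(self, title, categories):
        # Hypothetical hook, called once a job has recomputed a page's
        # categories. Only the latest state per page is kept.
        self._pending[title] = sorted(set(categories))

    def drain(self):
        # Hand the whole batch to the indexer and start a fresh log.
        batch, self._pending = self._pending, {}
        return batch
```

The indexer would call `drain()` before each snapshot, so a page touched many times by template edits is reindexed only once per cycle.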
r.