Samuel Wantman wrote:
Very interesting, but what puzzling results. I tried intersections of some
of the largest categories I could think of and got wildly different results. The first time I tried American actors intersected with Living people, the query took 1.5 seconds; the next time I tried it, it took 0.0011 seconds.
I think a lot of the variation is due to server load - this is running on a shared webhost FYI (I'm too cheap for a dedicated server right now).
What does this mean? Is the long result because of server
traffic? Is the short result close to the time that it actually took? Are the results being cached?
They were, but I've added SQL_NO_CACHE so the query runs live all the time.
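For anyone following along, a query of this kind might be assembled roughly as below. This is only a sketch: the table name `page_categories` and the column `cat_list` are hypothetical stand-ins, not the actual test schema. `SQL_NO_CACHE` and `MATCH ... AGAINST (... IN BOOLEAN MODE)` are standard MySQL syntax.

```python
def build_intersection_query(categories, limit=30):
    """Return (sql, params) for a boolean-mode fulltext intersection.

    Table and column names are hypothetical, for illustration only.
    """
    if not categories:
        raise ValueError("need at least one category")
    # A leading '+' marks a term as required in MySQL boolean mode,
    # so "+A +B" matches only rows that contain both A and B.
    terms = " ".join("+" + c.replace(" ", "_") for c in categories)
    sql = (
        "SELECT SQL_NO_CACHE page_id FROM page_categories "
        "WHERE MATCH (cat_list) AGAINST (%s IN BOOLEAN MODE) "
        "LIMIT %s"
    )
    return sql, (terms, limit)


sql, params = build_intersection_query(["American actors", "Living people"])
print(params[0])  # +American_actors +Living_people
```

The values are passed as parameters rather than pasted into the SQL string, so category names typed by users cannot alter the query.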
Also, the intersection of large categories with
numerous articles common to both seems to take longer than categories of similar size with no common articles. This is not what I expected. If you are just looking for the first 30 results, I'd expect the first case to be faster. The second has to look through the entire category for a match.
Hmm... I don't know. When I query "+Living_people +Frogs" it returns no results and takes about 2.5 seconds...
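The expectation in the quoted paragraph can be sketched as a toy model: scan one category, probe the other, and stop once 30 matches are found. This is plain Python for illustration, not a description of what MySQL actually does; the database may choose a different strategy, which could be one reason the measured times do not follow this pattern.

```python
def first_matches(scan_ids, other_ids, limit=30):
    """Scan one category in order, probing the other, and stop at `limit` hits.

    Returns (matches, rows_examined).
    """
    other = set(other_ids)
    matches = []
    examined = 0
    for page_id in scan_ids:
        examined += 1
        if page_id in other:
            matches.append(page_id)
            if len(matches) == limit:
                break
    return matches, examined


big = range(100_000)
# Heavy overlap: the scan stops after 30 rows.
print(first_matches(big, range(100_000))[1])           # 30
# No overlap: every row is examined before giving up.
print(first_matches(big, range(100_000, 200_000))[1])  # 100000
```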
My question to developers is what are the criteria for an acceptable
solution? How much time can an intersection be allowed to take? How often do we think people will be looking for intersections?
I think the consensus (from previous discussions on the mailing list) is that once we have this, it will be very popular. I asked Brion about performance expectations once, and he said something like he'd like to see it run in less than a second. I'm thinking that if it runs in 1 second +/- 0.5 seconds, it might be okay, and having an AJAX interface would make the wait more palatable to the user (somehow waiting for a div to load seems less maddening than waiting for the whole page... plus, it might reduce some of the overhead associated with rendering the MediaWiki page. OTOH... not friendly for mobile users etc... Hmm...)
Can we save the
results of popular intersections (like American people intersected with Actors) so that the intersection is updated less frequently?
Yes, but then we have to add caching logic. You wouldn't want to save it too long, because then it becomes outdated.
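The caching logic could be as small as a time-to-live cache, sketched below. Everything here is hypothetical (the class name, the 300-second default, the `compute` callback that would run the real query); it only illustrates the trade-off between saving work and serving stale results.

```python
import time


class IntersectionCache:
    """Tiny time-to-live cache keyed by the set of categories (hypothetical helper)."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # frozenset of categories -> (expires_at, result)

    def get(self, categories, compute):
        key = frozenset(categories)  # order of categories does not matter
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh: reuse the saved result
        result = compute(categories)  # stale or missing: run the query
        self._store[key] = (now + self.ttl, result)
        return result
```

A shorter `ttl_seconds` keeps results closer to live at the cost of more queries; a longer one does the opposite.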
User:Rick Block, User:Radiant! and I have been discussing ways to optimize
or limit intersections so that server load is less of a problem and intersections are hopefully doable. Please see our discussions at [[en:User talk:Radiant!]]. Also, if anyone hasn't seen it, we've done quite a bit of work on possible user interfaces. These are at [[en:Wikipedia:Category intersection]].
Yes, I've been following the ideas for reducing server load, but I am really putting all my effort into finding the fastest realistic implementation - I want to see what we can do, and hopefully have as few limits as possible.
From the experimenting I'm doing, I'm getting FAR better (10x or better)
results from the fulltext index than from hitting the categorylinks table.
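For comparison, "hitting the categorylinks table" means something like the self-join below. The sketch uses an in-memory SQLite database as a stand-in for MySQL, with made-up rows; `cl_from` (page id) and `cl_to` (category name) are the MediaWiki column names. It shows only the shape of the query and says nothing about relative speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT)")
conn.executemany(
    "INSERT INTO categorylinks VALUES (?, ?)",
    [
        (1, "American_actors"), (1, "Living_people"),
        (2, "American_actors"),
        (3, "Living_people"),
        (4, "Frogs"),
    ],
)


def intersect(conn, cat_a, cat_b, limit=30):
    """Page ids that appear in both categories, via a self-join."""
    rows = conn.execute(
        "SELECT a.cl_from FROM categorylinks AS a "
        "JOIN categorylinks AS b ON a.cl_from = b.cl_from "
        "WHERE a.cl_to = ? AND b.cl_to = ? "
        "ORDER BY a.cl_from LIMIT ?",
        (cat_a, cat_b, limit),
    ).fetchall()
    return [r[0] for r in rows]


print(intersect(conn, "American_actors", "Living_people"))  # [1]
print(intersect(conn, "Living_people", "Frogs"))            # []
```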
There are many ways intersections can be constrained to put less of a load
on the servers. I'm hoping that we can implement this at some level soon. The categorization system is being turned into a database search system. (snipped for brevity) The longer we wait for this to be implemented, the more broken up categories will become (at least on en: German Wikipedia doesn't seem to have this problem).
I couldn't agree with you more! I'm absolutely convinced that this (intersections/math/whatever) is a much more natural way to organize and find information.
Thanks, and keep up the great work.
Thanks! I'm going to (meant to do it by now) add some logging functionality to my test script so we can get some performance statistics. As soon as I do that, I'll point to it from En:Category_intersection to get more interested testers. (If I can find the time, I'll enhance the UI too.)
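Roughly what the logging might look like, as a sketch only: the class and method names are hypothetical, and the 1-second threshold in the summary is taken from the "less than a second" target mentioned earlier in this message.

```python
import statistics
import time


class QueryLog:
    """Collects elapsed times for intersection queries (hypothetical helper)."""

    def __init__(self):
        self.records = []  # (query_label, elapsed_seconds)

    def timed(self, label, run_query):
        """Run `run_query()`, record how long it took, and return its result."""
        start = time.perf_counter()
        result = run_query()
        self.records.append((label, time.perf_counter() - start))
        return result

    def summary(self):
        times = [t for _, t in self.records]
        if not times:
            return {"count": 0}
        return {
            "count": len(times),
            "median": statistics.median(times),
            "max": max(times),
            # Fraction of queries that met the sub-second target.
            "under_1s": sum(t < 1.0 for t in times) / len(times),
        }
```

The median and the fraction under one second are probably more useful than the mean here, since a few slow runs on a loaded shared host would otherwise dominate.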
Once I've got some statistics, we can have a meaningful discussion about the viability of this approach. If this doesn't perform fast enough, I'm thinking I'll do a Zend/Lucene index (never done that before, it would be interesting) - but then as I understand it, the reindex process can be rather lengthy. At any rate, that would probably fit well into existing processes since we're using Lucene for general search, and this would just require maintaining and indexing the extra table field for the list of categories. We'll see... I still want to give MySQL's fulltext index a good try because it's so much simpler to implement and would be easy to roll up into the general MediaWiki app.
Thanks again for the feedback, Aerik