Samuel Wantman wrote:
> Very interesting, but what puzzling results. I tried intersections of some
> of the largest categories I could think of and got wildly different
> results. The first time I tried American actors intersected with Living
> people, the results took 1.5 seconds; the next time I tried it, it was
> 0.0011 seconds.
I think a lot of the variation is due to server load - this is running on a
shared webhost, FYI (I'm too cheap for a dedicated server right now).
> What does this mean? Is the long result because of server traffic? Is the
> short result close to the time that it actually took? Are the results
> being cached?
They were, but I've added SQL_NO_CACHE so the query runs live all the time.
> Also, the intersection of large categories with numerous articles common
> to both seems to take longer than categories of similar size with no
> common articles. This is not what I expected. If you are just looking for
> the first 30 results, I'd expect the first case to be faster. The second
> has to look through the entire category for a match.
Hmm... I don't know. When I query "+Living_people +Frogs" it returns no
results and takes about 2.5 seconds...
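For reference, the kind of query being timed here can be sketched against a toy table. This is only an illustration: SQLite stands in for MySQL, the schema is modeled on MediaWiki's categorylinks table (cl_from = page id, cl_to = category name), and the rows are invented.

```python
# Category intersection as a self-join on a categorylinks-style table,
# limited to the first 30 results, roughly as the test script queries MySQL.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT)")
db.execute("CREATE INDEX cl_to_idx ON categorylinks (cl_to, cl_from)")

rows = [
    (1, "Living_people"), (1, "American_actors"),
    (2, "Living_people"),
    (3, "American_actors"), (3, "Living_people"),
    (4, "Frogs"),
]
db.executemany("INSERT INTO categorylinks VALUES (?, ?)", rows)

hits = db.execute(
    """SELECT a.cl_from FROM categorylinks a
       JOIN categorylinks b ON a.cl_from = b.cl_from
       WHERE a.cl_to = ? AND b.cl_to = ?
       ORDER BY a.cl_from LIMIT 30""",
    ("Living_people", "American_actors"),
).fetchall()
print([r[0] for r in hits])  # pages in both categories -> [1, 3]
```

An empty intersection (e.g. Living_people with Frogs) still has to walk the index entries for both categories before concluding there is no match, which is one plausible reason it is not faster than a query with many hits.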
> My question to developers is: what are the criteria for an acceptable
> solution? How much time can an intersection be allowed to take? How often
> do we think people will be looking for intersections?
I think the consensus is that once we have this, it will be very popular
(from previous discussions on the mailing list). I asked Brion about
performance expectations once, and he said something like he'd like to see
it run in less than a second. I'm thinking that if it runs in 1 second
+/- .5 seconds, it might be okay, and an AJAX interface would make it
palatable from a user-experience standpoint (somehow waiting for a div to
load seems less maddening than waiting for the whole page... plus, it might
reduce some of the overhead associated with rendering the MediaWiki page.
OTOH... not friendly for mobile users, etc... Hmm...)
> Can we save the results of popular intersections (like American people
> intersected with Actors) so that the intersection is updated less
> frequently?
Yes, but then we have to add caching logic. You wouldn't want to save it
too long, because then it becomes outdated.
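The trade-off could be sketched as a simple time-limited cache: serve a saved result while it's fresh, recompute once it's older than some TTL. Everything below (the class, the 15-minute TTL) is illustrative, not anything that exists in the actual test script.

```python
import time

# Hypothetical time-limited cache for popular intersections.
class IntersectionCache:
    def __init__(self, ttl_seconds=900):   # e.g. 15 minutes; tune to taste
        self.ttl = ttl_seconds
        self.store = {}                    # key -> (timestamp, result)

    def get(self, categories, compute):
        key = tuple(sorted(categories))    # order-insensitive cache key
        entry = self.store.get(key)
        if entry is not None and time.time() - entry[0] < self.ttl:
            return entry[1]                # still fresh: no database hit
        result = compute(categories)       # stale or missing: recompute
        self.store[key] = (time.time(), result)
        return result

cache = IntersectionCache(ttl_seconds=900)
calls = []
def fake_compute(cats):
    calls.append(cats)                     # stands in for the slow query
    return ["Some_article"]

cache.get(["American_actors", "Living_people"], fake_compute)
cache.get(["Living_people", "American_actors"], fake_compute)  # cache hit
print(len(calls))  # the expensive query ran only once -> 1
```

The TTL is the "how long before it becomes outdated" knob; popular pairs stay warm, rare pairs expire and cost a fresh query.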
> User:Rick Block, User:Radiant! and I have been discussing ways to optimize
> or limit intersections so that server load will be less of a problem and
> hopefully manageable. Please see our discussions at [[en:User
> talk:Radiant!]]. Also, if anyone hasn't seen it, we've done quite a bit of
> work on possible user interfaces. These are at [[en:Wikipedia:Category
> intersection]].
Yes, I've been following ideas for reducing server load, but I am really
putting all my effort into finding the fastest realistic implementation - I
want to see what we can do, and hopefully have as few limits as possible.
From the experimenting I'm doing, I'm getting FAR better (10x or better)
results from the fulltext index than from hitting the categorylinks table.
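For anyone wondering why the fulltext approach can win: a boolean query like "+Living_people +Frogs" is essentially an inverted-index lookup followed by a set intersection, rather than a row-by-row join. A toy model of that idea (plain Python sets standing in for MySQL's fulltext index; the pages and categories are made up):

```python
# Toy model of boolean fulltext matching: each page has one denormalized
# "blob" of category names, an inverted index maps each category token to
# the set of page ids containing it, and "+A +B" is a set intersection.
pages = {
    1: "Living_people American_actors",
    2: "Living_people",
    3: "American_actors Living_people",
    4: "Frogs",
}

index = {}
for page_id, blob in pages.items():
    for token in blob.split():
        index.setdefault(token, set()).add(page_id)

def boolean_and(*terms):
    """Pages matching all terms, like a '+term1 +term2' fulltext query."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

print(sorted(boolean_and("Living_people", "American_actors")))  # [1, 3]
print(sorted(boolean_and("Living_people", "Frogs")))            # []
```

The cost of maintaining this is keeping the per-page category blob in sync with categorylinks, which is the extra table field mentioned below.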
> There are many ways intersections can be constrained to put less of a load
> on the servers. I'm hoping that we can implement this at some level soon.
> The categorization system is being turned into a database search system.
> (snipped for brevity) The longer we wait for this to be implemented, the
> more broken up categories will become (at least on en:; German Wikipedia
> doesn't seem to have this problem).
I couldn't agree with you more! I'm absolutely convinced that this
(intersections/math/whatever) is a much more natural way to organize and
find information.
> Thanks, and keep up the great work.
Thanks! I'm going to (meant to do it by now) add some logging functionality
to my test script so we can get some performance statistics. As soon as I
do that, I'll point to it from En:Category_intersection to get more
interested testers. (If I can find the time, I'll enhance the UI too.)
Once I've got some statistics, we can have a meaningful discussion about the
viability of this approach. If this doesn't perform fast enough, I'm
thinking I'll do a Zend/Lucene index (never done that before; it would be
interesting) - but then, as I understand it, the reindex process can be
rather lengthy. At any rate, that would probably fit well into existing
processes, since we're already using Lucene for general search, and this
would just require maintaining and indexing the extra table field for the
list of categories. We'll see... I still want to give MySQL's fulltext
index a good try because it's so much simpler to implement and would be
easy to roll up into the general MediaWiki app.
Thanks again for the feedback,
Aerik