Samuel Wantman wrote:
Very interesting, but what puzzling results. I tried intersections of some
of the largest categories I could think of and got wildly different
results. The first time I tried American actors intersected with Living
people, the query took 1.5 seconds; the next time I tried it, it took 0.0011
seconds.
I think a lot of the variation is due to server load - this is running on a
shared webhost FYI (I'm too cheap for a dedicated server right now).
What does this mean? Is the long result because of server traffic? Is the
short result close to the time that it actually took? Are the results being
cached?
They were, but I've added SQL_NO_CACHE so the query runs live all the time.
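
For anyone who wants to reproduce this, here is a minimal sketch of how such an uncached query could be assembled. The table name page_categories and the column category_list are hypothetical stand-ins (the real test script's schema isn't shown here); SQL_NO_CACHE and the boolean-mode "+term" syntax are standard MySQL.

```python
def build_intersection_query(categories, limit=30):
    """Return (sql, params) for a boolean-mode fulltext intersection.

    Assumption: each page has one row whose `category_list` column holds
    its category names with spaces replaced by underscores.
    """
    if not categories:
        raise ValueError("need at least one category")
    # In boolean mode a leading "+" marks a required term, so requiring
    # every category name yields the intersection.
    search = " ".join("+" + name.replace(" ", "_") for name in categories)
    sql = (
        "SELECT SQL_NO_CACHE page_id "      # skip the MySQL query cache
        "FROM page_categories "             # hypothetical table name
        "WHERE MATCH(category_list) AGAINST (%s IN BOOLEAN MODE) "
        "LIMIT " + str(int(limit))
    )
    return sql, (search,)


sql, params = build_intersection_query(["Living people", "Frogs"])
print(sql)
print(params)
```

The search string is passed as a bound parameter rather than pasted into the SQL, so category names with quotes in them don't break the query.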
Also, the intersection of large categories with numerous articles common to
both seems to take longer than categories of similar size with no common
articles. This is not what I expected. If you are just looking for the first
30 results, I'd expect the first case to be faster. The second has to look
through the entire category for a match.
Hmm... I don't know. When I query "+Living_people +Frogs" it returns no
results and takes about 2.5 seconds...
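
The expectation described above (stop as soon as 30 matches are found) can be illustrated with a toy scan in Python. This is only a sketch of that reasoning, using made-up ID ranges; it says nothing about how MySQL's fulltext index actually evaluates the query, which evidently behaves differently.

```python
def first_matches(scan_ids, other_ids, limit=30):
    """Scan one category, keep IDs also found in the other, stop at `limit`.

    Returns (matches, rows_scanned) so the amount of work is visible.
    """
    other = set(other_ids)
    matches, scanned = [], 0
    for page_id in scan_ids:
        scanned += 1
        if page_id in other:
            matches.append(page_id)
            if len(matches) == limit:
                break  # early exit: enough results for the first page
    return matches, scanned


big = range(100_000)                   # made-up category of 100,000 pages
overlapping = range(0, 100_000, 2)     # shares every even ID with `big`
disjoint = range(100_000, 200_000)     # shares nothing with `big`

hits, scanned_overlap = first_matches(big, overlapping)
misses, scanned_disjoint = first_matches(big, disjoint)
print(len(hits), scanned_overlap)      # few rows touched before stopping
print(len(misses), scanned_disjoint)   # whole category scanned, no match
```

With an early exit the overlapping case stops after a handful of rows, while the disjoint case has to walk the whole category. That the measured timings go the other way suggests the index gathers all matches before the limit is applied, though that is a guess.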
My question to developers is what are the criteria for an acceptable
solution? How much time can an intersection be allowed to take? How often
do we think people will be looking for intersections?
I think the consensus is that once we have this, it will be very popular
(from previous discussions on the mailing list). I asked Brion about
performance expectations once, and he said something like he'd like to
see it run in less than a second. I'm thinking that if it runs in 1 second
+/- .5 secs, it might be okay, and having an AJAX interface would make it
more palatable to the user (somehow waiting for a div to load seems
less maddening than waiting for the whole page... plus, it might reduce some
overhead associated with rendering the MediaWiki page. OTOH... not friendly
for mobile users etc... Hmm...)
Can we save the results of popular intersections (like American people
intersected with Actors) so that the intersection is updated less frequently?
Yes, but then we have to add caching logic. You wouldn't want to save it
too long, because then it becomes outdated.
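
The caching logic in question could be as small as a time-to-live cache keyed on the set of categories. The sketch below is an assumption about how one might do it in a test script, not existing MediaWiki code; the class name and the 300-second default are made up for illustration.

```python
import time


class IntersectionCache:
    """Keep intersection results for a limited time (hypothetical helper)."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so expiry can be tested
        self._store = {}        # frozenset of names -> (expires_at, results)

    def get(self, categories):
        key = frozenset(categories)   # order of categories doesn't matter
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, results = entry
        if self.clock() >= expires_at:
            del self._store[key]      # too old: drop it and force a re-query
            return None
        return results

    def put(self, categories, results):
        key = frozenset(categories)
        self._store[key] = (self.clock() + self.ttl, list(results))


cache = IntersectionCache(ttl_seconds=300)
cache.put(["American people", "Actors"], [101, 202, 303])
print(cache.get(["Actors", "American people"]))
```

The TTL is the knob for the trade-off mentioned above: a longer value saves more queries but serves staler results.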
User:Rick Block, User:Radiant! and I have been discussing ways to optimize
or limit intersections so that server load will be less of a problem and
intersections will hopefully be doable. Please see our discussions at
[[en:User talk:Radiant!]]. Also, if anyone hasn't seen it, we've done quite
a bit of work on possible user interfaces. These are at
[[en:Wikipedia:Category intersection]].
Yes, I've been following ideas for reducing server load, but I am really
putting all my effort into finding the fastest realistic implementation - I
want to see what we can do, and hopefully have as few limits as possible.
From the experimenting I'm doing, I'm getting FAR better (10x or better)
results from the fulltext index than from hitting the categorylinks table.
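
For comparison, hitting the categorylinks table for an intersection usually means a self-join along these lines. The exact query used in the experiments isn't shown in this thread, so treat this as an assumed shape; cl_from (page ID) and cl_to (category name) are the MediaWiki column names, and SQLite is used here only so the sketch runs without a MySQL server.

```python
import sqlite3


def intersect_via_categorylinks(conn, cat_a, cat_b, limit=30):
    """Return page IDs that appear in both categories, via a self-join."""
    rows = conn.execute(
        """
        SELECT a.cl_from
        FROM categorylinks AS a
        JOIN categorylinks AS b ON a.cl_from = b.cl_from
        WHERE a.cl_to = ? AND b.cl_to = ?
        ORDER BY a.cl_from
        LIMIT ?
        """,
        (cat_a, cat_b, limit),
    ).fetchall()
    return [row[0] for row in rows]


# Tiny made-up data set: pages 1 and 4 are in both categories.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT)")
conn.executemany(
    "INSERT INTO categorylinks VALUES (?, ?)",
    [
        (1, "Living_people"), (1, "American_actors"),
        (2, "Living_people"),
        (3, "American_actors"),
        (4, "Living_people"), (4, "American_actors"),
    ],
)
print(intersect_via_categorylinks(conn, "Living_people", "American_actors"))
print(intersect_via_categorylinks(conn, "Living_people", "Frogs"))
```

Each extra category adds another join, which is one plausible reason this approach falls behind a single fulltext lookup on large categories.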
There are many ways intersections can be constrained to put less of a load
on the servers. I'm hoping that we can implement this at some level
soon. The categorization system is being turned into a database search
system. (snipped for brevity) The longer we wait for this to be implemented,
the more broken up categories will become (at least on en:; German Wikipedia
doesn't seem to have this problem).
I couldn't agree with you more! I'm absolutely convinced that this
(intersections/math/whatever) is a much more natural way to organize and find
information.
Thanks, and keep up the great work.
Thanks! I'm going to (meant to do it by now) add some logging functionality
to my test script so we can get some performance statistics. As soon as I
do that, I'll point to it from En:Category_intersection to get more
interested testers. (If I can find the time, I'll enhance the UI too.)
Once I've got some statistics, we can have a meaningful discussion about the
viability of this approach. If this doesn't perform fast enough, I'm
thinking I'll do a Zend/Lucene index (never done that before, it would be
interesting) - but then as I understand it, the reindex process can be
rather lengthy. At any rate, that would probably fit well into existing
processes since we're using Lucene for general search, and this would just
require maintaining and indexing the extra table field for the list of
categories. We'll see... I still want to give MySQL's fulltext index a good
try because it's so much simpler to implement and would be easy to roll up
into the general MediaWiki app.
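
The logging mentioned above doesn't need to be elaborate. A rough sketch, assuming the test script can wrap each query call in a function: QueryTimer and its method names are invented for this example, and the lambda stands in for the real database call.

```python
import statistics
import time


class QueryTimer:
    """Collect wall-clock timings per label (hypothetical helper)."""

    def __init__(self):
        self.samples = {}   # label -> list of elapsed seconds

    def run(self, label, func, *args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        self.samples.setdefault(label, []).append(elapsed)
        return result

    def summary(self):
        # min/median/max shows the spread that a single number would hide,
        # e.g. 1.5 s on one run and 0.0011 s on the next.
        return {
            label: {
                "count": len(times),
                "min": min(times),
                "median": statistics.median(times),
                "max": max(times),
            }
            for label, times in self.samples.items()
        }


timer = QueryTimer()
for _ in range(3):
    # Stand-in for the real query call.
    timer.run("+Living_people +Frogs", lambda: sum(range(10_000)))
print(timer.summary())
```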
Thanks again for the feedback,
Aerik