Tim Starling <tstarling@...> writes (edited for brevity):
It's not a union actually, it's a straight
join. An OR operation is a
union, both in set theory and in SQL.
To get an idea of the potential performance problems, I made a list of the
categories with the most members on commons.
How long does it take to calculate the intersection between the two
biggest categories? On lomaria:
1 row in set (1.94 sec)
1 row in set (1.91 sec)
Which is really not that bad, considering it is the worst case.
The figures show that the timing depends on the size of each category not
on the size of the intersection, which is what you'd expect.
I use cache-primed figures (i.e. the second attempt), because this better
reflects the worst case scenario of high request rates. I'm assuming that
the whole index would be held in memory under such circumstances.
A special page would be better, it would reduce the number of unnecessary
requests. And there may be other performance problems with DPL that I'm
not aware of. But on the basis of these results, it looks feasible.
Hi Tim,
I did a lot of the same tests, using a both the join sql, and also a count/group
by version (they performed comparably) on a copy of categorylinks from En. The
worst case scenarios for those was not terribly attractive (I forgot what it
was, but I want to say something like 4 or 5 seconds for intersections with
"Living People"). Then I had this idea that using MySQL's fulltext index
should
take a layer of complexity out - after all, what we're doing is *exactly* a
boolean search on a phrase, so I made a table having the all categories (with
underscores) for each article in a row, and indexed it (so it had to be a MyIsam
table). This performed much better, typically returned worst case scenario
results in just over a second.
I've got some data at
http://aerik.com/wikintslog.txt (query/rows/time/IP - note
that my server didn't seem to allow me to suppress the cache - couldn't figure
that out - got any ideas?). But it seems like this is likely not to be good
enough performance to run on En, for example, sending lots of queries that might
run a second or longer is probably a bad idea.
So, for comparison, I was going to try to create a lucene index (Brion's
suggestion) of categorylinks. I don't have a java setup and don't know java, so
I started using Zend_Search_Lucene, but ran into a bug that has stopped me cold
for the time being. I've been in touch with Alexander Veremyev at Zend, but I
dont' know if there's any progress.
But I was thinking, Tim, aren't you the guy who set up the Java Lucene index?
What do you think of this approach?
Best Regards,
Aerik