Re: [Wikitech-l] What does it take to get DPLs added to a project?

5 May 2007


      Rob Church wrote:
...
The classic intersections use UNION queries which are too expensive;
this is part of the reason we don't have a straightforward
intersecting search, and hence are having to investigate other means.
It's not a union actually, it's a straight join. An OR operation is a
union, both in set theory and in SQL.
To get an idea of the potential performance problems, I made a list of the
categories with the most members on commons.
563713  GFDL
405084  Self-published_work
218472  PD-self
205838  PD_Old
139486  CC-BY-SA-2.5,2.0,1.0
125657  CC-BY-SA-2.5
117172  CC-BY-2.5
38695   PD-user
29312   CC-BY-2.0
22821   Insignia
21902   CC-BY-SA-2.0
20906   User-created_GFDL_images
How long does it take to calculate the intersection between the two
biggest categories? On lomaria:
mysql> select count(*) from categorylinks as c1, categorylinks as c2 where
c1.cl_to='GFDL' and c2.cl_to='Self-published_work' and c1.cl_from=c2.cl_from;
+----------+
| count(*) |
+----------+
|   317870 |
+----------+
1 row in set (1.94 sec)
And again immediately afterwards, with a comment to override the query
cache, but testing the index cache:
mysql> select /**/ count(*) from categorylinks as c1, categorylinks as c2
where c1.cl_to='GFDL' and c2.cl_to='Self-published_work' and
c1.cl_from=c2.cl_from;
+----------+
| count(*) |
+----------+
|   317870 |
+----------+
1 row in set (1.91 sec)
Which is really not that bad, considering it is the worst case. The index
size for commonswiki.categorylinks is 670MB, which our DB servers
shouldn't have too much trouble with at present. Indeed, the query would
be a lot slower if it were hitting the disk. Maybe it is practical, as
long as we put it in a query group so that it doesn't crash the whole s2
cluster if someone decides to put such an intersection on the commons main
page.
Intersecting GFDL with various other categories down the list, here is the
timing after priming the cache:
PD-self: 1.17s
CC-BY-SA-2.5,2.0,1.0: 0.75s
PD-user: 0.22s
Insignia: 0.15s
User-created_GFDL_images: 0.10s
The figures show that the timing depends on the size of each category not
on the size of the intersection, which is what you'd expect. The
intersection for PD-self had 1442 rows, and the intersection for
"CC-BY-SA-2.5,2.0,1.0" had 135107 rows, but the timing was similar.
I use cache-primed figures (i.e. the second attempt), because this better
reflects the worst case scenario of high request rates. I'm assuming that
the whole index would be held in memory under such circumstances.
A special page would be better, it would reduce the number of unnecessary
requests. And there may be other performance problems with DPL that I'm
not aware of. But on the basis of these results, it looks feasible.
-- Tim Starling

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

Re: [Wikitech-l] What does it take to get DPLs added to a project?