Here's something that looks promising: Using MySQL's fulltext search against categories only. What I did was create a table having pagenumbers and category text that looks like this:
http://athlonx2/phpmyadmin/sql.php?db=categorylinks&table=pages&token=71a69e555a82e3d94a8450b4ace4496a&sql_query=SELECT+%2A+FROM+%60pages%60&session_max_rows=30&pos=0&disp_direction=horizontal&repeat_cells=100&goto=tbl_properties_structure.php&dontlimitchars=1 IDhttp://athlonx2/phpmyadmin/sql.php?db=categorylinks&table=pages&token=71a69e555a82e3d94a8450b4ace4496a&pos=0&session_max_rows=30&disp_direction=horizontal&repeat_cells=100&dontlimitchars=0&sql_query=SELECT+%2AFROM+%60pages%60++ORDER+BY+%60ID%60+ASC pagenumhttp://athlonx2/phpmyadmin/sql.php?db=categorylinks&table=pages&token=71a69e555a82e3d94a8450b4ace4496a&pos=0&session_max_rows=30&disp_direction=horizontal&repeat_cells=100&dontlimitchars=0&sql_query=SELECT+%2AFROM+%60pages%60++ORDER+BY+%60pagenum%60+ASC pagenamehttp://athlonx2/phpmyadmin/sql.php?db=categorylinks&table=pages&token=71a69e555a82e3d94a8450b4ace4496a&pos=0&session_max_rows=30&disp_direction=horizontal&repeat_cells=100&dontlimitchars=0&sql_query=SELECT+%2AFROM+%60pages%60++ORDER+BY+%60pagename%60+ASC catcounthttp://athlonx2/phpmyadmin/sql.php?db=categorylinks&table=pages&token=71a69e555a82e3d94a8450b4ace4496a&pos=0&session_max_rows=30&disp_direction=horizontal&repeat_cells=100&dontlimitchars=0&sql_query=SELECT+%2AFROM+%60pages%60++ORDER+BY+%60catcount%60+ASC catlisthttp://athlonx2/phpmyadmin/sql.php?db=categorylinks&table=pages&token=71a69e555a82e3d94a8450b4ace4496a&pos=0&session_max_rows=30&disp_direction=horizontal&repeat_cells=100&dontlimitchars=0&sql_query=SELECT+%2AFROM+%60pages%60++ORDER+BY+%60catlist%60+ASC
7343 0 Kurt000 14 Chinese_Wikipedians Leo_Wikipedians User_cello Us...
7344 1 AaA 3 Redirects_from_CamelCase Redirects_from_other_cap...
7345 5 AlgeriA 2 Redirects_from_CamelCase Unprintworthy_redirects
7346 6 AmericanSamoa 2 Redirects_from_CamelCase Unprintworthy_redirects So the categories have underscores instead of spaces, thus forcing mysql to index the whole category phrase as one word. In a table with about 114,000 rows with categories (it takes a long time to recombine the categories into one row - I'm sure the way I'm doing it is clunky), I get very reasonable query result times. Take our worst case "Living_People" category for example:
Showing rows 0 - 29 (4,105 total, Query took 0.0042 sec) SQL query: SELECT * FROM `pages` WHERE MATCH ( catlist ) AGAINST ( 'Living_People' ) LIMIT 0 , 30
This also give us an immediate solution to category math (Boolean searches):
Showing rows 0 - 29 (61 total, Query took 0.0058 sec) SQL query: SELECT * FROM `pages` WHERE MATCH ( catlist ) AGAINST ( '+Living_People +1969_births' IN BOOLEAN MODE ) LIMIT 0 , 30
So, I know 114,000 records is much smaller than one million +, but this looks quite promising.
I'm using PHP to cycle through records in the categorylinks table to recreate this index. If one of you smart guys will tell me a better way, I'll try this on a larger data set.
Best Regards, Aerik
wikitech-l@lists.wikimedia.org