-----Original Message-----
From: wikitech-l-bounces(a)lists.wikimedia.org [mailto:wikitech-l-bounces@lists.wikimedia.org] On Behalf Of Simetrical
Sent: 07 March 2008 14:48
To: Wikimedia developers
Subject: Re: [Wikitech-l] Category intersection: New extension available
> On Thu, Mar 6, 2008 at 4:16 PM, Magnus Manske
> <magnusmanske(a)googlemail.com> wrote:
>> I tried it on my (mostly empty) MediaWiki test setup, and it works
>> peachy. However, *I NEED HELP* with
>> * testing it on a large-scale installation
>> * integrating it with MediaWiki more tightly (database wrappers,
>>   caching, etc.)
>> * Brionizing the code, so it actually has a chance to be used on
>>   Wikipedia and/or Commons
> I would help out, but I don't think there's any reason to settle for a
> sharply limited number of intersections, which I guess this approach
> requires.
>
>> * More than two intersections are implemented by nesting subqueries
>
> Subqueries only work in MySQL 4.1 and later.  You'll need to rewrite
> those as joins if you want this to run on Wikimedia, or probably to
> perform acceptably on any version of MySQL (MySQL is pretty terrible
> even in 5.0 at optimizing subqueries).  And then we're back to the poor
> join performance that was an issue to start with, just with one join
> less, aren't we?
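For reference, the join-based intersection being discussed can be sketched against SQLite, using a toy stand-in for MediaWiki's categorylinks table (the page IDs and category names below are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Minimal stand-in for MediaWiki's categorylinks table.
conn.execute("CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT)")
conn.executemany("INSERT INTO categorylinks VALUES (?, ?)", [
    (1, "Physics"), (1, "History"),
    (2, "Physics"),
    (3, "Physics"), (3, "History"),
])

# One self-join per additional category: this is the pattern whose
# performance degrades as the number of intersected categories grows.
pages = [r[0] for r in conn.execute("""
    SELECT c1.cl_from
    FROM categorylinks c1
    JOIN categorylinks c2 ON c1.cl_from = c2.cl_from
    WHERE c1.cl_to = 'Physics' AND c2.cl_to = 'History'
""")]
print(sorted(pages))  # pages in both categories -> [1, 3]
```

Each extra category adds another self-join, which is exactly the cost the hash-based approach is trying to avoid.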
Yeah, I did notice that; I think it could be replaced with something
like:

$sql = "SELECT ci_page FROM {$table_categoryintersections}
        WHERE ci_hash IN (" . implode(',', $hashes) . ")
        GROUP BY ci_page
        HAVING COUNT(*) = " . count($hashes) . "
        LIMIT {$this->max_hash_results}";
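The GROUP BY / HAVING trick can be tried out against SQLite; the table layout and hash values below are invented stand-ins for the extension's categoryintersections table:

```python
import sqlite3

# Toy stand-in for the categoryintersections table: one row per
# (category-hash, page) pair.  Hash values here are made-up integers.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE categoryintersections (ci_hash INTEGER, ci_page INTEGER)")
rows = [
    (100, 1), (200, 1),           # page 1 is in categories 100 and 200
    (100, 2),                     # page 2 only in 100
    (100, 3), (200, 3), (300, 3), # page 3 in 100, 200 and 300
]
conn.executemany("INSERT INTO categoryintersections VALUES (?, ?)", rows)

hashes = [100, 200]  # the categories to intersect
placeholders = ",".join("?" for _ in hashes)

# A page is in every requested category exactly when the number of
# matching rows grouped under it equals the number of requested hashes.
query = f"""
    SELECT ci_page FROM categoryintersections
    WHERE ci_hash IN ({placeholders})
    GROUP BY ci_page
    HAVING COUNT(*) = ?
"""
pages = [row[0] for row in conn.execute(query, hashes + [len(hashes)])]
print(sorted(pages))  # pages in both categories -> [1, 3]
```

Note that HAVING COUNT(*) = n assumes no duplicate (hash, page) rows; COUNT(DISTINCT ci_hash) would be the safer form if duplicates can ever occur.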
>> * Hash values are implemented as VARCHAR(32).  Could easily switch to
>>   INTEGER if desirable (less storage, faster lookup, but more false
>>   positives)
> BIGINT would give a trivial number of false positives.  INT would
> probably be a bit faster, especially on 32-bit machines, and while it
> would inevitably give some false positives, those should be rare enough
> to be easily filtered on the application side, if you don't have to run
> extra queries to do the filtering.
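Rough numbers behind that claim follow from the birthday approximation; the one-million figure for distinct hashed values is an assumption for illustration, not anything measured:

```python
# Birthday approximation: among n uniformly random k-bit hashes, the
# expected number of colliding pairs is roughly n * (n - 1) / (2 * 2**k).
def expected_collisions(n, bits):
    return n * (n - 1) / (2.0 * 2 ** bits)

n = 1_000_000  # assumed number of distinct hashed values
print(f"32-bit: ~{expected_collisions(n, 32):.0f} colliding pairs")
print(f"64-bit: ~{expected_collisions(n, 64):.2e} colliding pairs")
```

At that assumed scale, a 32-bit INT yields on the order of a hundred colliding pairs (hence the need for application-side filtering), while 64 bits brings the expectation down to around 10^-8, matching the "trivial number of false positives" for BIGINT.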
>> * The hash values will only give good candidates (pages that *might*
>>   intersect in these categories).  The candidates have then to be
>>   checked in a second run, which will have to be optimized; database
>>   people to the front!
> Why don't they give definite info if you're using the full MD5 hash?
Yeah, I think hash collisions are unlikely; what's far more likely is
someone recategorizing a page after a search, which means the double
check could be removed.
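To make the full-MD5 point concrete: a 128-bit digest stored as VARCHAR(32) hex identifies a hashed value essentially uniquely, while the INTEGER variant discussed above keeps only a truncated prefix. The helper below is a sketch of that idea, not necessarily the exact scheme the extension uses:

```python
import hashlib

def category_hash(name, bits=None):
    """Full MD5 hex digest by default; truncated integer if bits is given."""
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()  # 32 hex chars
    if bits is None:
        return digest                         # full hash -> VARCHAR(32)
    return int(digest[: bits // 4], 16)       # truncated -> fits INT/BIGINT

print(category_hash("Category:Physics"))      # full 128-bit hex digest
print(category_hash("Category:Physics", 32))  # 32-bit integer variant
```

With the full digest, a second verification pass only guards against races like the recategorization case above, not against hash collisions.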
Jared