[Wikitech-l] Tag intersection, crazy idea

29 Feb 2008


      I just had the following thought: For a tag intersection system,
* we limit queries to two intersections (show all pages with categories A and B)
* we assume on average 5 categories per page (can someone check that?)
then we have 5*4=20 intersections per page.
Now, for each intersection, we calculate MD5("A|B") to get an integer
hash, and store that in a new table (page_id INTEGER,intersection_hash
INTERGER).
That table would be 4 times as long as the categorylinks table.
* Memory usage: Acceptable (?)
* Update: Fast, on page edit only
* Works for non-existing categories
On a query, we look for the (indexed) hash in a subquery, then check
those against the actual categorylinks.
Looking up an integer in the subquery should be fast enough ;-)
Given the number of categories and INTEGER >4bn, that would make the
hash unique for all combinations of 65K categories (if the hash were
truely randomly distributed, which it isn't), which should mean that
the number of false positives (to be filtered by the main query)
should be rather low.
If that's fast enough, we could even expand to three intersections (A,
B, and C), querying "A|B", "A|C", and "B|C", and let PHP find the ones
common to all three sets.
Summary: Fixing slow MySQL lookup by throwing memory at it...
Feasible? Or pipe dream?
Magnus

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

[Wikitech-l] Tag intersection, crazy idea