On Thu, Mar 6, 2008 at 4:16 PM, Magnus Manske
<magnusmanske(a)googlemail.com> wrote:
> I tried it on my (mostly empty) MediaWiki test setup, and it works
> peachy. However, *I NEED HELP* with
> * testing it on a large-scale installation
> * integrating it with MediaWiki more tightly (database wrappers, caching, etc.)
> * Brionizing the code, so it actually has a chance to be used on
>   Wikipedia and/or Commons

I would help out, but I don't think there's any reason to settle for a
sharply limited number of intersections, which I guess this approach
requires.

> * More than two intersections are implemented by nesting subqueries

Subqueries only work in MySQL 4.1 and later. You'll need to rewrite those as
joins if you want this to run on Wikimedia, or probably to perform
acceptably on any version of MySQL (MySQL is pretty terrible even in
5.0 at optimizing subqueries). And then we're back to the poor join
performance that was an issue to start with, just with one join less,
aren't we?
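
To make the rewrite concrete, here is a minimal sketch of the same three-way intersection written once with nested subqueries and once with self-joins. It is only an illustration under several assumptions: it runs on an in-memory SQLite database rather than MySQL, it queries a plain `categorylinks`-style table (`cl_from`, `cl_to`) instead of the hash table, and the sample rows and category names are made up. It shows that the two forms return the same pages; it says nothing about how either performs on MySQL.

```python
import sqlite3

# In-memory stand-in for a categorylinks-style table (page id, category name).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT)")
conn.executemany(
    "INSERT INTO categorylinks VALUES (?, ?)",
    [
        (1, "Physicists"), (1, "Germans"), (1, "Nobel_laureates"),
        (2, "Physicists"), (2, "Germans"),
        (3, "Physicists"), (3, "Nobel_laureates"),
    ],
)

# Three-way intersection by nesting subqueries (needs MySQL 4.1 or later).
SUBQUERY_SQL = """
SELECT cl_from FROM categorylinks
WHERE cl_to = 'Physicists'
  AND cl_from IN (
    SELECT cl_from FROM categorylinks
    WHERE cl_to = 'Germans'
      AND cl_from IN (
        SELECT cl_from FROM categorylinks
        WHERE cl_to = 'Nobel_laureates'))
ORDER BY cl_from
"""

# The same intersection as self-joins: one extra join per extra category.
JOIN_SQL = """
SELECT a.cl_from
FROM categorylinks AS a
JOIN categorylinks AS b ON b.cl_from = a.cl_from AND b.cl_to = 'Germans'
JOIN categorylinks AS c ON c.cl_from = a.cl_from AND c.cl_to = 'Nobel_laureates'
WHERE a.cl_to = 'Physicists'
ORDER BY a.cl_from
"""

subquery_pages = [row[0] for row in conn.execute(SUBQUERY_SQL)]
join_pages = [row[0] for row in conn.execute(JOIN_SQL)]
print(subquery_pages, join_pages)
```

The join form only matches the subquery form row for row if each (page, category) pair appears at most once in the table; otherwise it would need a `DISTINCT`.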

> * Hash values are implemented as VARCHAR(32). Could easily switch to
>   INTEGER if desirable (less storage, faster lookup, but more false
>   positives)

BIGINT would give a trivial number of false positives. INT would
probably be a bit faster, especially on 32-bit machines, and while it
would inevitably give some false positives, those should be rare
enough to be easily filtered on the application side, if you don't
have to run extra queries to do the filtering.
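
As a rough illustration of that trade-off, here is a sketch that truncates an MD5 to 32 or 64 bits, estimates how many unrelated rows would share a given hash, and filters candidates on the application side. Everything in it is an assumption made for the example: the scheme of hashing a sorted category pair, the helper names `pair_hash`, `expected_false_positives` and `filter_candidates`, the figure of 100 million stored rows, and the idea that the category names come back in the same result set as the candidate pages. The estimate also assumes the hash values are uniformly distributed.

```python
import hashlib

def pair_hash(cat_a: str, cat_b: str, bits: int) -> int:
    """Hash an unordered category pair, keeping the top `bits` bits of the MD5."""
    key = "|".join(sorted((cat_a, cat_b))).encode("utf-8")
    full = int.from_bytes(hashlib.md5(key).digest(), "big")  # 128-bit integer
    return full >> (128 - bits)

def expected_false_positives(other_rows: int, bits: int) -> float:
    """Expected number of unrelated rows that share one given hash value."""
    return other_rows / 2 ** bits

def filter_candidates(candidates, cat_a, cat_b):
    """Keep only pages whose stored category pair really is (cat_a, cat_b)."""
    wanted = tuple(sorted((cat_a, cat_b)))
    return [page for page, a, b in candidates if tuple(sorted((a, b))) == wanted]

# Hypothetical table size: 100 million rows for other category pairs.
fp_int = expected_false_positives(100_000_000, 32)     # INT-sized hash
fp_bigint = expected_false_positives(100_000_000, 64)  # BIGINT-sized hash
print(f"INT: {fp_int:.4f} per query, BIGINT: {fp_bigint:.2e} per query")

# Rows as they might come back from one query: (page id, category, category).
rows = [(1, "Germans", "Physicists"), (7, "Chemists", "Swedes")]
matches = filter_candidates(rows, "Physicists", "Germans")
print(matches)
```

Under those assumptions an INT hash gives a false positive on roughly one query in forty, and the filter costs no extra query because it only compares values already fetched.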

> * The hash values will only give good candidates (pages that *might*
>   intersect in these categories). The candidates have then to be checked
>   in a second run, which will have to be optimized; database people to
>   the front!

Why don't they give definite info if you're using the full MD5 hash?

> * SQL queries are currently "plain text" and not constructed through
>   the DB wrappers; I wasn't sure how to do that for the subqueries

You can't. Left joins also need to be manually written out, I think. Of
course, for subqueries there's a particularly good reason there's no
way to do it, since MediaWiki code doesn't use anything from later
than MySQL 4.0. :)