(cross-posting wikitech-l and commons-l)
I created a new extension ("CategoryIntersection") that allows for quick lookup of pages (and images) in intersecting categories. That would enable wiki(m|p)edia sites to use categories as tags, eliminating the need for oh-so-specialized categories.
Intersection of two categories works very fast; intersecting more categories is possible and already implemented, and the maximum number can be limited.
I tried it on my (mostly empty) MediaWiki test setup, and it works peachy. However, *I NEED HELP* with
- testing it on a large-scale installation
- integrating it with MediaWiki more tightly (database wrappers, caching, etc.)
- Brionizing the code, so it actually has a chance to be used on Wikipedia and/or Commons
Technical notes:
- This was recently discussed on wikitech-l
- More than two intersections are implemented by nesting subqueries
- Hash values are implemented as VARCHAR(32). Could easily switch to INTEGER if desirable (less storage, faster lookup, but more false positives)
- The hash values will only give good candidates (pages that *might* intersect in these categories). The candidates then have to be checked in a second run, which will have to be optimized; database people to the front!
- Table to store hash values has to be created manually; SQL is in the main file (rough sketch of the general scheme below)
- I didn't implement code to fill the table for an existing installation; however, since hash table updates solely hang on the LinksUpdate hook, this should be easy
- There is no code covering page moves and deletions yet; do those hang on LinksUpdate as well?
- SQL queries are currently "plain text" and not constructed through the DB wrappers; I wasn't sure how to do that for the subqueries
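For the curious, here is the core idea as a rough sketch (illustrative only, not the actual extension code; the table and column names are the ones the extension uses, the helper function is made up):

<?php
// Rough sketch of the scheme, for illustration only.
// One hash per pair of categories; sorting makes the hash independent of order.
function categoryPairHash( $a, $b ) {
    $pair = array( $a, $b );
    sort( $pair );
    return md5( implode( '|', $pair ) ); // stored in ci_hash as VARCHAR(32)
}

// The table (index left out here; see the index discussion below):
$createSql = "CREATE TABLE categoryintersections (
    ci_hash VARCHAR(32) NOT NULL,
    ci_page INT UNSIGNED NOT NULL
)";

// On LinksUpdate, one (ci_hash, ci_page) row would be written per category pair of the page.
// Candidate lookup for pages in Category:Paris AND Category:Museums:
$hash     = categoryPairHash( 'Museums', 'Paris' );
$querySql = "SELECT ci_page FROM categoryintersections WHERE ci_hash = '$hash'";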
Cheers, Magnus
Hi, Quick glance at it, and it seems the index is the wrong way around?
ci_hash should be first, as that's what you're searching by primarily.
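I.e., something like this (sketch; the index name is made up):

<?php
// Sketch: the lookup filters on ci_hash, so it should lead the index.
$sql = "ALTER TABLE categoryintersections ADD INDEX ci_hash_page (ci_hash, ci_page)";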
Jared
On Fri, Mar 7, 2008 at 1:26 AM, Jared Williams jared.williams1@ntlworld.com wrote:
Hi, Quick glance at it, and it seems the index is the wrong way around?
ci_hash should be first, as that's what you're searching by primarily.
Of course!
Thanks, Magnus
On Thu, Mar 6, 2008 at 4:16 PM, Magnus Manske magnusmanske@googlemail.com wrote:
I tried it on my (mostly empty) MediaWiki test setup, and it works peachy. However, *I NEED HELP* with
- testing it on a large-scale installation
- integrating it with MediaWiki more tightly (database wrappers, caching, etc.)
- Brionizing the code, so it actually has a chance to be used on Wikipedia and/or Commons
I would help out, but I don't think there's any reason to settle for a sharply limited number of intersections, which I guess this approach requires.
- More than two intersections are implemented by nesting subqueries
Subqueries only work in MySQL 4.1. You'll need to rewrite those as joins if you want this to run on Wikimedia, or probably to perform acceptably on any version of MySQL (MySQL is pretty terrible even in 5.0 at optimizing subqueries). And then we're back to the poor join performance that was an issue to start with, just with one join less, aren't we?
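For concreteness, the join form would look something like this (untested sketch; table and column names per the extension, pair hashes written as md5() calls just for illustration):

<?php
// Untested sketch of the join rewrite: one self-join on the hash table per
// additional pair hash, instead of a nested subquery.
$hashAB = md5( 'A|B' ); // however the extension actually computes its pair hashes
$hashCD = md5( 'C|D' );
$sql = "SELECT i1.ci_page
    FROM categoryintersections AS i1
    JOIN categoryintersections AS i2 ON i2.ci_page = i1.ci_page
    WHERE i1.ci_hash = '$hashAB'
      AND i2.ci_hash = '$hashCD'";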
- Hash values are implemented as VARCHAR(32). Could easily switch to INTEGER if desirable (less storage, faster lookup, but more false positives)
BIGINT would give a trivial number of false positives. INT would probably be a bit faster, especially on 32-bit machines, and while it would inevitably give some false positives, those should be rare enough to be easily filtered on the application side, if you don't have to run extra queries to do the filtering.
- The hash values will only give good candidates (pages that *might* intersect in these categories). The candidates then have to be checked in a second run, which will have to be optimized; database people to the front!
Why don't they give definite info if you're using the full MD5 hash?
- SQL queries are currently "plain text" and not constructed through the DB wrappers; I wasn't sure how to do that for the subqueries
You can't. Left joins also need to be manually written out, I think. Of course, for subqueries there's a particularly good reason there's no way to do it, since MediaWiki code doesn't use anything from later than MySQL 4.0. :)
-----Original Message----- From: wikitech-l-bounces@lists.wikimedia.org [mailto:wikitech-l-bounces@lists.wikimedia.org] On Behalf Of Simetrical Sent: 07 March 2008 14:48 To: Wikimedia developers Subject: Re: [Wikitech-l] Category intersection: New extension available
On Thu, Mar 6, 2008 at 4:16 PM, Magnus Manske magnusmanske@googlemail.com wrote:
I tried it on my (mostly empty) MediaWiki test setup, and it works peachy. However, *I NEED HELP* with
- testing it on a large-scale installation
- integrating it with MediaWiki more tightly (database wrappers, caching, etc.)
- Brionizing the code, so it actually has a chance to be used on Wikipedia and/or Commons
I would help out, but I don't think there's any reason to settle for a sharply limited number of intersections, which I guess this approach requires.
- More than two intersections are implemented by nesting subqueries
Subqueries only work in MySQL 4.1. You'll need to rewrite those as joins if you want this to run on Wikimedia, or probably to perform acceptably on any version of MySQL (MySQL is pretty terrible even in 5.0 at optimizing subqueries). And then we're back to the poor join performance that was an issue to start with, just with one join less, aren't we?
Yeah, did notice that, think it could be replaced with something like:
SELECT ci_page FROM {$table_categoryintersections} WHERE ci_hash IN (implode(',', $hashes)) GROUP BY ci_page HAVING COUNT(*) = count($hashes) LIMIT $this->max_hash_results
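Spelled out with the PHP bits expanded, just to show the shape (untested sketch):

<?php
// Untested sketch: the IN ... GROUP BY ... HAVING query above with the PHP
// interpolation expanded. $hashes holds the pair hashes for the requested
// intersection; 200 stands in for $this->max_hash_results.
$hashes   = array( md5( 'A|B' ), md5( 'C|D' ), md5( 'E|F' ) );
$hashList = "'" . implode( "','", $hashes ) . "'"; // md5 output is hex, safe to quote directly
$sql = "SELECT ci_page
    FROM categoryintersections
    WHERE ci_hash IN ($hashList)
    GROUP BY ci_page
    HAVING COUNT(*) = " . count( $hashes ) . "
    LIMIT 200";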
- Hash values are implemented as VARCHAR(32). Could easily switch to INTEGER if desirable (less storage, faster lookup, but more false positives)
BIGINT would give a trivial number of false positives. INT would probably be a bit faster, especially on 32-bit machines, and while it would inevitably give some false positives, those should be rare enough to be easily filtered on the application side, if you don't have to run extra queries to do the filtering.
- The hash values will only give good candidates (pages that *might* intersect in these categories). The candidates then have to be checked in a second run, which will have to be optimized; database people to the front!
Why don't they give definite info if you're using the full MD5 hash?
Yeah, I think chances of hash collisions are unlikely; what's far more likely is someone recategorizing a page after a search. Which means the double check could be removed.
Jared
On Fri, Mar 7, 2008 at 2:48 PM, Simetrical Simetrical+wikilist@gmail.com wrote:
On Thu, Mar 6, 2008 at 4:16 PM, Magnus Manske magnusmanske@googlemail.com wrote:
I tried it on my (mostly empty) MediaWiki test setup, and it works peachy. However, *I NEED HELP* with
- testing it on a large-scale installation
- integrating it with MediaWiki more tightly (database wrappers, caching, etc.)
- Brionizing the code, so it actually has a chance to be used on Wikipedia and/or Commons
I would help out, but I don't think there's any reason to settle for a sharply limited number of intersections, which I guess this approach requires.
Well, unless you have a better idea...? Yes, there's fulltext, but is that really the most efficient way? It certainly isn't the most elegant ;-)
- More than two intersections are implemented by nesting subqueries
Subqueries only work in MySQL 4.1. You'll need to rewrite those as joins if you want this to run on Wikimedia, or probably to perform acceptably on any version of MySQL (MySQL is pretty terrible even in 5.0 at optimizing subqueries). And then we're back to the poor join performance that was an issue to start with, just with one join less, aren't we?
Huh. Remind me again, why are we stuck at 4.0.26?
Even so, we can do fast A|B intersections. Also, we can do a 4-category intersection with A|B and C|D, which is "half the joins".
For more intersections, I see three ways out:
- Move to MySQL 4.1 :-)
- Request complete list for every A|B combination, and merge in PHP (rough sketch below). Too much memory usage?
- Store "higher order" hashes as well - A|B|C, A|B|C|D... Waste of diskspace?
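Rough sketch of the second option, with made-up page IDs just to show the merge step:

<?php
// Option 2 sketch: fetch the candidate page IDs for each pair hash with a
// simple "WHERE ci_hash = ..." query, then intersect the lists in PHP.
$pagesAB = array( 1, 5, 9, 42 );  // made-up results for hash(A|B)
$pagesCD = array( 5, 7, 42, 99 ); // made-up results for hash(C|D)
$result  = array_intersect( $pagesAB, $pagesCD ); // => 5, 42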
- Hash values are implemented as VARCHAR(32). Could easily switch to INTEGER if desirable (less storage, faster lookup, but more false positives)
BIGINT would give a trivial number of false positives. INT would probably be a bit faster, especially on 32-bit machines, and while it would inevitably give some false positives, those should be rare enough to be easily filtered on the application side, if you don't have to run extra queries to do the filtering.
I've switched to INT UNSIGNED, based on the first 8 hex chars of the MD5. Also, I've made the second check optional, turned off by default.
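I.e., roughly:

<?php
// Sketch: truncate the MD5 pair hash to an unsigned 32-bit value for the
// INT UNSIGNED column.
$md5  = md5( 'Museums|Paris' );         // full 32-character hash as before
$hash = hexdec( substr( $md5, 0, 8 ) ); // 0 .. 4294967295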
Do you think we can rescue this? Even if not, it has most of the infrastructure needed for other approaches, e.g., fulltext.
Cheers, Magnus
On Fri, Mar 7, 2008 at 5:00 PM, Magnus Manske magnusmanske@googlemail.com wrote:
Well, unless you have a better idea...? Yes, there's fulltext, but is that really the most efficient way? It certainly isn't the most elegant ;-)
Manually calculating all intersections beforehand isn't terribly elegant either. :)
Huh. Remind me again, why are we stuck at 4.0.26?
Later versions mostly gain new features that contribute nothing to performance, which is what we care about. In most cases, MySQL internally rewrites subqueries to joins or worse anyway, so it's not like you gain much.
Try running EXPLAIN on the query. I'd expect to see a temporary table created with intermediate results at each step: want to prove me wrong?
For more intersections, I see three ways out:
- Move to MySQL 4.1 :-)
Almost certainly doesn't help, as I said. I would be surprised if subqueries worked with any kind of remotely acceptable efficiency.
- Request complete list for every A|B combination, and merge in PHP. Too much memory usage?
And too much network latency, and too much network traffic. You're talking about megabytes being sent either way in the worst case, with intersections of hundreds of thousands (consider Men intersect Living people, with flattened categories so that Men is properly populated).
- Store "higher order" hashes as well - A|B|C, A|B|C|D... Waste of diskspace?
The most extreme would be to just store every possible intersection individually. This isn't quite as insane as it sounds, since almost all intersections would be empty, and so not stored. It would result in exponential storage requirements in the number of rows per article, however, specifically 2^n - 1 rows per article (the number of nonzero bit patterns in n bits, if you like). For an article with 50 or even 20 categories this would obviously be ludicrous, so some compromise would be necessary. If you stored up to four or five, it would only be five or four joins for even a 20-way intersection, but I haven't calculated the storage requirements of that (got to go in a minute). Nor am I sure of the efficiency in other ways.
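Back-of-the-envelope, for whoever wants to run the numbers (sketch):

<?php
// Rows per article if every category subset up to size $maxSize gets its own
// hash: sum over k = 1..$maxSize of C($numCats, $k).
function rowsPerArticle( $numCats, $maxSize ) {
    $total = 0;
    $c = 1; // C(numCats, 0)
    for ( $k = 1; $k <= $maxSize; $k++ ) {
        $c = $c * ( $numCats - $k + 1 ) / $k; // now C(numCats, k)
        $total += $c;
    }
    return $total;
}
echo rowsPerArticle( 20, 4 ), "\n";  // 6195 rows for an article in 20 categories
echo rowsPerArticle( 50, 4 ), "\n";  // 251175 for an article in 50
echo rowsPerArticle( 20, 20 ), "\n"; // 1048575 = 2^20 - 1, the full power set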
It's not like fulltext is O(1) in the number of search terms, or the number of categories per page. Maybe I'm totally wrong. But I expect it to work better, is all. It will certainly handle subtractions, which are valuable and not conceivably possible to cover in your scheme, as far as I can tell. It's also a proven technique used by more than one popular software product for tag intersection, AFAIK.
Do you think we can rescue this?
Well, I'm not going to tell you what to spend your time on. It's always good to have a plurality of ideas. It's not the approach I'd focus on first, but what do I know?
For what it's worth, the extension http://www.mediawiki.org/wiki/DynamicPageList has been in use on various Wikimedia sites for a while now with great success to allow for category intersections, and I think the latest versions support image galleries etc.
-ilya
On Fri, Mar 7, 2008 at 12:15 PM, Ilya Haykinson haykinson@gmail.com wrote:
For what it's worth, the extension http://www.mediawiki.org/wiki/DynamicPageList has been in use on various Wikimedia sites for a while now with great success to allow for category intersections, and I think the latest versions support image galleries etc.
We know. DPL is not suitable for use on large wikis.
On Fri, Mar 7, 2008 at 12:17 PM, Jared Williams jared.williams1@ntlworld.com wrote:
Yeah, did notice that, think it could be replaced with something like:
SELECT ci_page FROM {$table_categoryintersections} WHERE ci_hash IN (implode(',', $hashes)) GROUP BY ci_page HAVING COUNT(*) = count($hashes) LIMIT $this->max_hash_results
I'm not going to spend too much time parsing that, but it's an automatic filesort of the entire set included by the WHERE clause, i.e., the union of all the category intersections in question, since MySQL doesn't support loose index scans for WHERE x IN (...) GROUP BY y. Repeated join seems likely to be faster, although maybe not, I haven't benchmarked it or anything.
Yeah, I think chances of hash collisions are unlikely; what's far more likely is someone recategorizing a page after a search. Which means the double check could be removed.
It's not just unlikely, it's so unlikely as to be impossible to all intents and purposes, barring deliberately-constructed collisions (which are possible with MD5, although maybe not for such short strings, I forget). Worry about a meteor wiping out the data center before you worry about MD5 collisions by chance on sets with cardinality in the billions.
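Back-of-the-envelope (birthday bound):

<?php
// Probability of any chance collision among $n random 128-bit hashes is
// roughly $n^2 / 2^129.
$n = 1e10; // ten billion hashes, far more than any wiki will ever store
printf( "%.1e\n", ( $n * $n ) / pow( 2, 129 ) ); // ~1.5e-19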
-----Original Message----- From: wikitech-l-bounces@lists.wikimedia.org [mailto:wikitech-l-bounces@lists.wikimedia.org] On Behalf Of Simetrical Sent: 07 March 2008 18:41 To: Wikimedia developers Subject: Re: [Wikitech-l] [Commons-l] Category intersection: New extension available
On Fri, Mar 7, 2008 at 12:15 PM, Ilya Haykinson haykinson@gmail.com wrote:
For what it's worth, the extension http://www.mediawiki.org/wiki/DynamicPageList has been in use on various Wikimedia sites for a while now with great success to allow for category intersections, and I think the latest versions support image galleries etc.
We know. DPL is not suitable for use on large wikis.
On Fri, Mar 7, 2008 at 12:17 PM, Jared Williams jared.williams1@ntlworld.com wrote:
Yeah, did notice that, think it could be replaced with something like:
SELECT ci_page FROM {$table_categoryintersections} WHERE ci_hash IN (implode(',', $hashes)) GROUP BY ci_page HAVING COUNT(*) = count($hashes) LIMIT $this->max_hash_results
I'm not going to spend too much time parsing that, but it's an automatic filesort of the entire set included by the WHERE clause, i.e., the union of all the category intersections in question, since MySQL doesn't support loose index scans for WHERE x IN (...) GROUP BY y. Repeated join seems likely to be faster, although maybe not, I haven't benchmarked it or anything.
Ah yeah, I forget MySQL specifics. I wouldn't be surprised if the optimum style of query changes given the number of categories.
Yeah, I think chances of hash collisions are unlikely; what's far more likely is someone recategorizing a page after a search. Which means the double check could be removed.
It's not just unlikely, it's so unlikely as to be impossible to all intents and purposes, barring deliberately-constructed collisions (which are possible with MD5, although maybe not for such short strings, I forget). Worry about a meteor wiping out the data center before you worry about MD5 collisions by chance on sets with cardinality in the billions.
My point was that even if it was remotely likely, it doesn't really matter: as soon as the data leaves the db, it could be stale in any case. So inaccuracies should be expected, which makes the extra queries redundant.
Jared
On Fri, Mar 7, 2008 at 2:56 PM, Jared Williams jared.williams1@ntlworld.com wrote:
My point was that even if it was remotely likely, it doesn't really matter: as soon as the data leaves the db, it could be stale in any case. So inaccuracies should be expected, which makes the extra queries redundant.
Well, no. There's a major difference between data that's stale and data that's totally wrong. Getting results that have, I dunno, [[Che Guevara]] in [[Category:Terrorists]] despite the fact that someone removed him ten seconds before is rather different from a [[Category:Terrorists]] that contains [[Heart of Atlanta Motel v. United States]].
-----Original Message----- From: wikitech-l-bounces@lists.wikimedia.org [mailto:wikitech-l-bounces@lists.wikimedia.org] On Behalf Of Simetrical Sent: 07 March 2008 20:04 To: Wikimedia developers Subject: Re: [Wikitech-l] [Commons-l] Category intersection: New extension available
On Fri, Mar 7, 2008 at 2:56 PM, Jared Williams jared.williams1@ntlworld.com wrote:
My point was that even if it was remotely likely, it doesn't really matter: as soon as the data leaves the db, it could be stale in any case. So inaccuracies should be expected, which makes the extra queries redundant.
Well, no. There's a major difference between data that's stale and data that's totally wrong. Getting results that have, I dunno, [[Che Guevara]] in [[Category:Terrorists]] despite the fact that someone removed him ten seconds before is rather different from a [[Category:Terrorists]] that contains [[Heart of Atlanta Motel v. United States]].
It could be totally wrong in the database; that's why it's being changed.
Jared
On Fri, Mar 7, 2008 at 3:18 PM, Jared Williams jared.williams1@ntlworld.com wrote:
It could be totally wrong in the database; that's why it's being changed.
It's still "right" technically. The results are accurately reflecting the actual contents of the database at a definite point in time. That's all integrity asks for. Returning results that were never correct just because of bad hashing algorithms is something completely different. It's the difference between your stock ticker being five minutes outdated, and generating entirely random numbers sometimes. To say there's no difference, or that the former makes the latter okay, is ridiculous.
-----Original Message----- From: wikitech-l-bounces@lists.wikimedia.org [mailto:wikitech-l-bounces@lists.wikimedia.org] On Behalf Of Simetrical Sent: 07 March 2008 20:23 To: Wikimedia developers Subject: Re: [Wikitech-l] [Commons-l] Category intersection: New extension available
On Fri, Mar 7, 2008 at 3:18 PM, Jared Williams jared.williams1@ntlworld.com wrote:
It could be totally wrong in the database; that's why it's being changed.
It's still "right" technically. The results are accurately reflecting the actual contents of the database at a definite point in time. That's all integrity asks for. Returning results that were never correct just because of bad hashing algorithms is something completely different. It's the difference between your stock ticker being five minutes outdated, and generating entirely random numbers sometimes. To say there's no difference, or that the former makes the latter okay, is ridiculous.
But accuracy in a stock ticker and accuracy in Wikipedia category intersections are completely different things, and comparisons between the two are meaningless.
If giving an approximate answer is more scalable, then I think it's worth it.
A better comparison would be Google: with all their resources, they still approximate the number of search results they expect.
Jared