On Fri, Feb 29, 2008 at 5:28 AM, Magnus Manske wrote:
> I just had the following thought: For a tag
> * we limit queries to two intersections (show all pages with categories A and B)
> * we assume on average 5 categories per page (can someone check that?)
Actually, there are on average only 2 categories per page, it seems:
12,000,000 page rows and 25,000,000 category rows. Of course, that
doesn't just count articles.
> then we have 5*4=20 intersections per page.
You're double-counting. It's only 10 intersections: A|B is the same
as B|A. (For the purposes of storage, of course, you'd want to
consistently pick one of the two categories to go first in the hashed
string, say the ASCIIbetically first. So you'd store B|A as A|B,
since A < B. Or whatever.)
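A minimal sketch of that canonical pair hashing (the function name and the 64-bit truncation are my choices for illustration, not anything from the MediaWiki schema):

```python
import hashlib

def intersection_hash(cat_a, cat_b):
    """Map an unordered category pair to a fixed-width integer key.

    Sort the pair first so that ("B", "A") and ("A", "B") produce the
    same key, then MD5 the joined string and truncate the digest to
    64 bits, as suggested in the message.
    """
    first, second = sorted([cat_a, cat_b])
    digest = hashlib.md5(f"{first}|{second}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")  # 64-bit hash

# B|A canonicalizes to A|B, so the two orderings collide on purpose.
assert intersection_hash("B", "A") == intersection_hash("A", "B")
```

A lookup for "pages in both A and B" then becomes a single equality match on intersection_hash, which is what makes the simple case fast.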
> Now, for each intersection, we calculate
> MD5("A|B") to get an integer
> hash, and store that in a new table (page_id INTEGER,intersection_hash
> That table would be 4 times as long as the categorylinks table.
It would be smaller, probably. Categorylinks is int, varchar(255),
varchar(86), timestamp, which is 4 + ? + ? + 4 bytes, with the ?'s
being pretty large. And that's duplicated a few times for various
indexes. Even if you go for a 64-bit hash, your table data is still
only 12 bytes per row, with the primary key (page_id,
intersection_hash) taking no additional storage and a reverse key
taking another 24 (or maybe 12? is it that smart?) bytes per row.
For a point of comparison, on my old dump of simple-wiki,
categorylinks has 39,671 rows, with 3,637,248 bytes of data and
4,980,736 bytes of indexes. That makes 217 bytes per row average. The table
you propose should be only 36 bytes per row, if I'm not wrong, plus I
guess some percentage overhead for various things. I think the
percentage will be pretty large: an InnoDB table I have that's just
(int, int, int) is 53 bytes of data per row, not just 12. But still,
your table should be considerably smaller per row than categorylinks.
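The arithmetic above can be checked directly; the figures are the ones quoted from the simple-wiki dump, nothing here is measured fresh:

```python
# Quoted figures for simple-wiki's categorylinks table.
rows = 39671
data_bytes = 3637248
index_bytes = 4980736

bytes_per_row = (data_bytes + index_bytes) / rows
assert round(bytes_per_row) == 217  # ~217 bytes/row, data + indexes

# Proposed table, per row: 4-byte page_id + 8-byte 64-bit hash = 12
# bytes of data, plus roughly 24 bytes for a reverse index entry.
proposed_per_row = (4 + 8) + 24
assert proposed_per_row == 36
```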
Of course, the number of rows used is not going to be proportional to
the average number of categories per page, but to the average of the
square of that number. Specifically, it will equal the sum over all
pages of n(n-1)/2 (where n is the number of categories on that page),
so a page with one category contributes nothing, but one with 20
categories will give you an extra 190 entries. So whether it
will be more or fewer than four times the rows depends on the
distribution of categories. Still, it should be not much larger than
categorylinks, at worst.
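That per-page sum can be sketched in a couple of lines (a toy calculation over a list of per-page category counts, not anything run against a real dump):

```python
def intersection_rows(category_counts):
    """Total (page_id, intersection_hash) rows needed, given the
    number of categories on each page: sum of n*(n-1)/2 over pages."""
    return sum(n * (n - 1) // 2 for n in category_counts)

# One category contributes nothing; 20 categories contribute 190 rows.
assert intersection_rows([1]) == 0
assert intersection_rows([20]) == 190
```

So the average of n matters less than how heavy the tail of the distribution is.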
The idea is interesting, and might have merit. Certainly simple
lookups would be very fast, even lookups of three or perhaps four
categories. It's not quite as flexible as fulltext search (where you
can do arbitrarily large intersections, and subtractions too), but it
could be that it's faster. I don't *see* any pitfalls.