Okay, so we (or those few of us who spoke up about it) generally agree that category intersections will be popular. I also have the impression that SQL queries against the existing table will likely not be fast enough (Domas, Brion, anyone else care to comment more explicitly?). Based on a few tests on my wiki (only about 25,000 rows in my categorylinks table), the SQL query alone to get the intersection of 2 categories takes about two-tenths of a second, so it will certainly take longer for more categories and more rows. (Anyone have any suggestions on optimising SQL queries for use on the existing table? I tried a couple of methods, and they took the same amount of time.)
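For concreteness, the two-category case boils down to a self-join on the standard categorylinks columns (cl_from, cl_to) - something like this sketch (category names made up), with each additional category in the intersection adding another self-join:

    SELECT c1.cl_from
    FROM categorylinks AS c1
    JOIN categorylinks AS c2 ON c2.cl_from = c1.cl_from  -- same page in both categories
    WHERE c1.cl_to = 'Apples'
      AND c2.cl_to = 'Pies';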
So, *if* throwing SQL at the existing table is too slow (and it looks like it probably is), then the next alternative is to create records for the intersections themselves, so the retrieval query is a simple one. This would make the write time longer, but save a lot on the many reads. So, here's a table of how many combinations would have to be written for up to 10 categories. Anybody have a histogram of how many categories are on a given page? I'd also guess pages would tend to get *more* categories once the intersection tool is available.
  categories    distinct combinations
       1                   1
       2                   3
       3                   7
       4                  15
       5                  31
       6                  63
       7                 127
       8                 255
       9                 511
      10                1023

(That's 2^n - 1 non-empty combinations for n categories.) So you can see that the number of combinations would get to be quite a lot. I would propose writing them into the categorylinks table, and making the existing category pages smart enough to deal with them, rather than writing them into a new table. I think sorting the categories on a given page, concatenating each possible combination with some character (perhaps a tab or newline) between the names, and then posting them into the database would work. The only downside I can think of is that I'd guess it would take a *long* time to post hundreds of records. I guess I'll try it and see. Limiting the number of categories in an intersection to 3 or 4 would reduce the number of combinations very significantly.
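A sketch of what the write and the read would look like under this scheme, using '|' as the separator for readability (the real separator could be a tab or newline as above) and ignoring categorylinks' other columns:

    -- page 123 is in Apples and Pies: store each category plus the sorted combination
    INSERT INTO categorylinks (cl_from, cl_to) VALUES
      (123, 'Apples'),
      (123, 'Pies'),
      (123, 'Apples|Pies');

    -- the intersection page then needs only a simple indexed lookup
    SELECT cl_from FROM categorylinks WHERE cl_to = 'Apples|Pies';

(Long combinations would of course be limited by the width of the cl_to column.)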
Any ideas? Feedback?
Best Regards, Aerik
On 12/09/06, Aerik Sylvan <aerik@thesylvans.com> wrote:
> So, *if* throwing SQL at the existing table is too slow (and it looks like it probably is), then the next alternative is to create records for the intersections themselves, so the retrieval query is a simple one. This would make the write time longer, but save a lot on the many reads. So, here's a table of how many combinations would have to be written for up to 10 categories. Anybody have a histogram of how many categories are on a given page? I'd also guess pages would tend to get *more* categories once the intersection tool is available.
Well, yes ... How does other software with arbitrary tags do it?
- d.
On 9/12/06, Aerik Sylvan <aerik@thesylvans.com> wrote:
> So, *if* throwing SQL at the existing table is too slow (and it looks like it probably is), then the next alternative is to create records for the intersections themselves, so the retrieval query is a simple one.
Common queries would need to be stored and maintained separately, yes. This is fine and dandy as long as no one tries exotic intersections; it wouldn't be too much more load than the current system. But that's an empirical question whose answer is unknown and can't be assumed.
> So you can see that the number of combinations would get to be quite a lot. I would propose writing them into the categorylinks table
Wait, write an *exponential* number of combinations into the categorylinks table? Even a thousand categories (and Wikipedia has far more) would work out to a *googol cubed* different combinations. Not only is there not enough memory and disk space on the planet to store that, there aren't that many *particles in the universe*.
Am I misunderstanding you? Maybe only non-empty category intersections? That requires slightly more thought as to scalability. If you have a million pages, and each has two random categories, there can be a maximum of a million intersections (if no two pages have the same two categories). If you bump that up to three categories, however, the maximum possible number is three million; four is six million, etc. It's O(m*2^n), or linear-exponential time. That's also far too high to be acceptable, unless you can guarantee that n remains very small (which you can't).
Basically, the most practical solution in SQL would probably be to pick the ten thousand or hundred thousand or million most-visited intersections and maintain them as virtual categories, so they can be retrieved in the same time as normal categories. The only question is what percentage of queries this will deal with, which (as I said) is unknown at this point. Domas seemed rather skeptical of the efficiency of even this scheme, and given your benchmarks in a much smaller wiki his skepticism seems quite warranted.
Brion's idea of a full-text search seems to be the way to go at this point, at least to me. After all, searching for "apple pie" does nothing but give you the intersection of articles containing the words "apple" and "pie", and we already do that, so it can't be overwhelmingly expensive, I suppose. (But a disclaimer: I am not, in fact, particularly knowledgeable in *any* of this stuff.)
On 9/12/06, David Gerard <dgerard@gmail.com> wrote:
> Well, yes ... How does other software with arbitrary tags do it?
And that's a complementary way to go.
Simetrical <Simetrical+wikitech@...> writes:
> On 9/12/06, Aerik Sylvan <aerik@...> wrote:
>> So, *if* throwing SQL at the existing table is too slow (and it looks like it probably is), then the next alternative is to create records for the intersections themselves, so the retrieval query is a simple one.
> Common queries would need to be stored and maintained separately, yes. This is fine and dandy as long as no one tries exotic intersections; it wouldn't be too much more load than the current system. But that's an empirical question whose answer is unknown and can't be assumed.
Well, I'm actually proposing to write all the non-empty intersections (up to some limit) into the database *IF* throwing SQL at the existing structure is too inefficient. These would be written when a page is saved (perhaps only if the categories on the page have changed?).
>> So you can see that the number of combinations would get to be quite a lot. I would propose writing them into the categorylinks table
> Wait, write an *exponential* number of combinations into the categorylinks table? Even a thousand categories (and Wikipedia has far more) would work out to a *googol cubed* different combinations. Not only is there not enough memory and disk space on the planet to store that, there aren't that many *particles in the universe*.
> Am I misunderstanding you? Maybe only non-empty category intersections? That requires slightly more thought as to scalability. If you have a million pages, and each has two random categories, there can be a maximum of a million intersections (if no two pages have the same two categories). If you bump that up to three categories, however, the maximum possible number is three million; four is six million, etc. It's O(m*2^n), or linear-exponential time. That's also far too high to be acceptable, unless you can guarantee that n remains very small (which you can't).
It's not exponential actually (because we're only taking distinct *unordered* combinations). After a bunch of research I found out that the number of intersections is calculated in math-speak as "n choose k" (see http://mathworld.wolfram.com/BinomialCoefficient.html), and the way to calculate it (for each k value, which is the number of intersecting categories) is n!/((n-k)!k!). So for our purposes, I did a quick sampling of 10 English Wikipedia pages, and got an average of 3.8 categories per page. Let's round up and say 4. If we want to store intersections (with a limit of 4 intersecting categories), it would be the sum of the little equation for values of k=2, k=3, and k=4 (and n=4). This is 11 intersections, in addition to the single categories we're already storing.
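(Spelled out: 4!/(2!2!) + 4!/(1!3!) + 4!/(0!4!) = 6 + 4 + 1 = 11.)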
If the page has 5 categories, the numbers go up, and instead of 11 we'd have to store 25 combinations. (By corollary, we are doing 11 or 25 extra INSERT statements, respectively...) So, if we had a histogram of categories for Wikipedia, we could predict quite accurately how many non-empty intersections there would be. I'd guess between 11 and 25 million. What's the retrieval time for a simple select query against 25 million records?
> Basically, the most practical solution in SQL would probably be to pick the ten thousand or hundred thousand or million most-visited intersections and maintain them as virtual categories, so they can be retrieved in the same time as normal categories. The only question is what percentage of queries this will deal with, which (as I said) is unknown at this point. Domas seemed rather skeptical of the efficiency of even this scheme, and given your benchmarks in a much smaller wiki his skepticism seems quite warranted.
I suppose we could come up with some scheme to do this - cache intersections for a period of time, then keep the ones that are popular... The only thing is, by definition, this would happen at query time, and each uncached intersection would take something around (probably greater than) two-tenths of a second until it is cached. Perhaps that's the way to do it - launch it as a beta feature, fill up the cache, then release it fully.
> Brion's idea of a full-text search seems to be the way to go at this point, at least to me. After all, searching for "apple pie" does nothing but give you the intersection of articles containing the words "apple" and "pie", and we already do that, so it can't be overwhelmingly expensive, I suppose. (But a disclaimer: I am not, in fact, particularly knowledgeable in *any* of this stuff.)
> On 9/12/06, David Gerard <dgerard@...> wrote:
>> Well, yes ... How does other software with arbitrary tags do it?
> And that's a complementary way to go.
Yeah, but that's the exact same problem - how does a full text search / tagging engine do it? It's facing the same issues we're talking about. The only thing I can think of is that they are using a completely different data structure (and what would that be?) and/or a faster (compiled?) language.
Best Regards, Aerik
P.S. Sorry for the long post, but I wanted to leave everything in context.
On 9/13/06, Aerik <aerik@thesylvans.com> wrote:
> It's not exponential actually (because we're only taking distinct *unordered* combinations). After a bunch of research I found out that the number of intersections is calculated in math-speak as "n choose k" (see http://mathworld.wolfram.com/BinomialCoefficient.html), and the way to calculate it (for each k value, which is the number of intersecting categories) is n!/((n-k)!k!).
Erk, you're right. I was too hasty. It's O(n choose k), but it's easily provable that that's the same as O(n^k), which is of course very slow for even smallish k's (although not ridiculously so like exponentials). On the other hand, if *n* remains small, say ten or whatever, then it's not such a major issue, I suppose.
> So, if we had a histogram of categories for Wikipedia, we could predict quite accurately how many non-empty intersections there would be. I'd guess between 11 and 25 million. What's the retrieval time for a simple select query against 25 million records?
Looking at it that way, I'll grant that you're probably about on target with your estimate, give or take some. Of course, that will increase database size what, tenfold? We only have a couple million pages.
> Yeah, but that's the exact same problem - how does a full text search / tagging engine do it? It's facing the same issues we're talking about. The only thing I can think of is that they are using a completely different data structure (and what would that be?) and/or a faster (compiled?) language.
They're definitely using a totally different data structure. They just index every word, I assume in a hash table or something. I'm pretty sure they don't store all the intersections, but presumably however they handle that it's fast enough to be workable.
One-second delays are fairly typical for Wikipedia searches, and I would expect the category list to be somewhat faster, since it's much smaller. (Even the largest categories are substantially smaller than the number of pages that contain some indexed words, like "all" [545,636 pages] or "not" [755,149 pages], and there are more of the latter than the former.)
Faster language isn't really relevant, since the slowdown is on the database side, not the PHP side. You can bet that MySQL is written in a compiled language.
>> So you can see that the number of combinations would get to be quite a lot. I would propose writing them into the categorylinks table
> Wait, write an *exponential* number of combinations into the categorylinks table?
I don't think any more than the square is actually necessary, and if not, at most the cube.
The reason we think the SQL is too inefficient is because there can be large numbers of articles in one category. If we store all intersections of two categories, which is easy to do and doesn't take much space at all, then each intersection will likely be small enough to make the SQL for three-way intersections economical, so higher-order intersections need not be stored directly.
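A sketch of what I mean (the pair table and its column names are hypothetical; each unordered pair is stored once, with cp_cat1 sorting before cp_cat2):

    CREATE TABLE categorypairs (
      cp_page INT UNSIGNED NOT NULL,   -- page id, like cl_from
      cp_cat1 VARCHAR(255) NOT NULL,
      cp_cat2 VARCHAR(255) NOT NULL,
      INDEX cp_pair (cp_cat1, cp_cat2, cp_page)
    );

    -- intersection of A, B and C: start from the (small) stored A/B pair
    -- and check each of its pages for membership in C
    SELECT p.cp_page
    FROM categorypairs AS p
    JOIN categorylinks AS c ON c.cl_from = p.cp_page
    WHERE p.cp_cat1 = 'A' AND p.cp_cat2 = 'B'
      AND c.cl_to = 'C';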
Besides, I think you're all forgetting that if we have a table that stores, say, all two-way category intersections, we can actually get rid of the categorylinks table itself -- it would be contained within that new table and would be wholly redundant. Similarly, a table with all three-way intersections contains in it all two-way intersections as well.
Lastly, you're also forgetting that the size of the table is irrelevant as long as the hardware to store it is available. The processing power required to access such a large table is NOT, as is commonly assumed, proportional to the size of the table! This is what databases have indexes for, and it is my understanding that in modern RDBMSs, the time taken to access a record in the table is not directly correlated to the size of the table at all (only to the size of the index, but even then not proportionally).
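(Rough arithmetic: a B-tree lookup in a 25-million-row index costs on the order of log2(25,000,000), i.e. about 25 key comparisons and only a handful of disk pages, however large the table grows.)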
Timwi
On 9/14/06, Timwi <timwi@gmx.net> wrote:
> The reason we think the SQL is too inefficient is because there can be large numbers of articles in one category. If we store all intersections of two categories, which is easy to do and doesn't take much space at all, then each intersection will likely be small enough to make the SQL for three-way intersections economical, so higher-order intersections need not be stored directly.
Well, then, is anyone up for writing and benchmarking this idea? :)
> Besides, I think you're all forgetting that if we have a table that stores, say, all two-way category intersections, we can actually get rid of the categorylinks table itself -- it would be contained within that new table and would be wholly redundant. Similarly, a table with all three-way intersections contains in it all two-way intersections as well.
Yeah, but then surely you'd have to take the union of a potentially large number of tables to display a single-category view, which I suspect is going to remain a more common request than a category-intersection view. Isn't that going to give you a substantial performance hit for large categories?
(And incidentally, pages that are only in a single category will have to remain in their own table. You can't get that from intersection tables.)
> Lastly, you're also forgetting that the size of the table is irrelevant as long as the hardware to store it is available.
Not quite irrelevant. It becomes more expensive to move it around, download it, and so on. A factor of 1.5 or 2, who cares, but a factor of five or ten could be sort of inconvenient.
More to the point: if you're maintaining many tables in parallel like this, you need that many more DB calls. If I add a category to a page with a dozen categories, it will need to update the dozen intersection tables instead of one categorylinks table. That's why I suggested only maintaining some of the most commonly-requested intersection tables, rather than all of them; there's a tradeoff here between time for reads and writes. Benchmarking is needed to know which side to lean toward more.
> Yeah, but then surely you'd have to take the union of a potentially large number of tables to display a single-category view, which I suspect is going to remain a more common request than a category-intersection view. Isn't that going to give you a substantial performance hit for large categories?
I'm not sure where you get this idea that each intersection would be a separate table. Obviously it wouldn't, it would still be one single database table, containing everything. And this single table can be queried for a single category, which (assuming proper indexes are in place) would not be any less efficient than it already is currently.
> (And incidentally, pages that are only in a single category will have to remain in their own table. You can't get that from intersection tables.)
I admit I forgot about pages that are in a single category, but they can still be placed in the same table (just NULL the second category column for those).
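Sketched against the hypothetical pair table from before (with cp_cat2 made nullable for the single-category rows), single-category retrieval would be something like:

    -- all pages in category A, whether stored as (page, A, something),
    -- (page, something, A), or the NULL-padded (page, A, NULL)
    SELECT DISTINCT cp_page
    FROM categorypairs
    WHERE cp_cat1 = 'A' OR cp_cat2 = 'A';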
> More to the point: if you're maintaining many tables in parallel like this, you need that many more DB calls.
Nope. See above.
Timwi
On 9/18/06, Timwi <timwi@gmx.net> wrote:
>> Yeah, but then surely you'd have to take the union of a potentially large number of tables to display a single-category view, which I suspect is going to remain a more common request than a category-intersection view. Isn't that going to give you a substantial performance hit for large categories?
> I'm not sure where you get this idea that each intersection would be a separate table. Obviously it wouldn't, it would still be one single database table, containing everything. And this single table can be queried for a single category, which (assuming proper indexes are in place) would not be any less efficient than it already is currently.
Yes, I see now. But not only the table would be larger, you would have many more indices as well, and however you cast it you'd need to complicate queries for single-category retrieval. As to how much of a performance hit that would be, well, that's out of my league (my SQL skills tend to be limited to simple SELECT statements), so all I can do at this point is appeal to Domas' authority.
Simetrical wrote:
> On 9/18/06, Timwi <timwi@gmx.net> wrote:
>>> Yeah, but then surely you'd have to take the union of a potentially large number of tables to display a single-category view, which I suspect is going to remain a more common request than a category-intersection view. Isn't that going to give you a substantial performance hit for large categories?
>> I'm not sure where you get this idea that each intersection would be a separate table. Obviously it wouldn't, it would still be one single database table, containing everything. And this single table can be queried for a single category, which (assuming proper indexes are in place) would not be any less efficient than it already is currently.
> Yes, I see now. But not only the table would be larger, you would have many more indices as well,
Yeah, like, maybe 4 instead of 2.
> and however you cast it you'd need to complicate queries for single-category retrieval.
Yeah, by adding the incredibly long and complex "DISTINCT" keyword and nothing else.
> As to how much of a performance hit that would be, well, that's out of my league
I get the impression that it's not the only thing that's out of your league.
Timwi
On 9/19/06, Timwi <timwi@gmx.net> wrote:
> I get the impression that it's not the only thing that's out of your league.
I deserved that. :(
/me goes off to the corner to watch
Timwi,
> I'm not sure where you get this idea that each intersection would be a separate table. Obviously it wouldn't, it would still be one single database table, containing everything. And this single table can be queried for a single category, which (assuming proper indexes are in place) would not be any less efficient than it already is currently.
Sure, now there's another issue - what happens (well, it already happens with Living People) when people use categories like tags. Simply put, the number of big categories increases.
When you do an intersection of a small and a big category, you end up with a nested loop starting on the small one, checking for entry existence in the big one. Then you sort the small resultset. When you do an intersection of two big categories, you end up with a nested loop going over two big tables, and finally a big resultset to sort.
Right now we try to avoid sorting for categories at all - we use index-based sorts.
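For example, a plain category listing comes straight off the (cl_to, cl_sortkey) index, already in display order, with no filesort (category name illustrative):

    SELECT cl_from
    FROM categorylinks
    WHERE cl_to = 'Living_people'
    ORDER BY cl_sortkey;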
B+Trees are not that good for intersections. Maybe fulltext indexing in general is more suitable for such a task, but that would mean using an external facility for lookups like that (well, that was discussed in Frankfurt in 2005 ;-), and it is far more complex than throwing a few more queries at our core databases and expecting a miracle.
BR,
Domas Mituzas wrote:
> Timwi,
> Sure, now there's another issue - what happens (well, it already happens with Living People) when people use categories like tags. Simply put, the number of big categories increases.
Obviously I know that, which is why I suggested going for a three-way intersection table right from the start.
> When you do an intersection of a small and a big category, you end up with a nested loop starting on the small one, checking for entry existence in the big one. Then you sort the small resultset. When you do an intersection of two big categories, you end up with a nested loop going over two big tables, and finally a big resultset to sort. Right now we try to avoid sorting for categories at all - we use index-based sorts.
If you already know that having the proper indexes eliminates the need for sorting, why did you first claim that any sorting was necessary?
Additionally, I'm sure you're aware that if you intersect two big categories, you end up looping over only ONE big "table" (by which I think you mean resultset) and looking each value up in the index for the other (which makes its size irrelevant).
You haven't really explained to me where you see a problem :)
> B+Trees are not that good for intersections
Hence (again) why I suggested storing the intersections directly.
Allow me to repeat that I think the paranoia about the three-way intersection table becoming too large is unfounded.
Timwi
On 9/19/06, Domas Mituzas <midom.lists@gmail.com> wrote:
[snip]
> B+Trees are not that good for intersections. Maybe fulltext indexing in general is more suitable for such a task, but that would mean using an external facility for lookups like that (well, that was discussed in Frankfurt in 2005 ;-), and it is far more complex than throwing a few more queries at our core databases and expecting a miracle.
[snip]
Or another database engine that supports an inverted-index type with fast intersections.
Hey Greg,
> Or another database engine that supports an inverted-index type with fast intersections.
MyISAM? \o/
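(That is, something like this sketch - a hypothetical MyISAM table holding each page's category names as one text column, queried with a boolean-mode fulltext search; short names would need ft_min_word_len lowered:)

    CREATE TABLE categorytext (
      ct_page INT UNSIGNED NOT NULL,
      ct_cats TEXT NOT NULL,               -- e.g. 'Apples Pies Living_people'
      FULLTEXT KEY ct_cats_ft (ct_cats)
    ) ENGINE=MyISAM;

    -- the intersection is a query requiring both "words"
    SELECT ct_page
    FROM categorytext
    WHERE MATCH(ct_cats) AGAINST('+Apples +Pies' IN BOOLEAN MODE);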
Domas Mituzas wrote:
> Hey Greg,
>> Or another database engine that supports an inverted-index type with fast intersections.
> MyISAM? \o/
I am in the process of mapping MediaWiki to a file system I wrote that provides wiki-like database behaviors. I am a ways out from getting this finished, and it will be the end of the year until all of it is completed, but it will solve a lot of these scaling issues and make MediaWiki a lot thinner and a ton faster.
I'll keep everyone updated.
Jeff