I'm going to begin working on the following bugs:
* "Support collation by a certain locale (sorting order of characters)", https://bugzilla.wikimedia.org/show_bug.cgi?id=164 (only parts related to category sorting) * "Subcategory paging is not separate from article or image paging", https://bugzilla.wikimedia.org/show_bug.cgi?id=1211 * "CategoryTree is inefficient", https://bugzilla.wikimedia.org/show_bug.cgi?id=23682
As well as possibly:
* "Categories need to be structured by namespace", https://bugzilla.wikimedia.org/show_bug.cgi?id=450 * "Natural number sorting in category listings", https://bugzilla.wikimedia.org/show_bug.cgi?id=6948
There are essentially two problems here:
1) We currently sort articles on category pages by the Unicode code point of their sort key. This is terrible for anything other than English, and dodgy sometimes even for English. (This is bugs 164 and 6948.)
2) We have no way to efficiently get all items that are in a category and also in a particular namespace. Particularly, we can't retrieve all subcategories without scanning all items in the category, which is inefficient when we have a few (or no) subcategories and tons of items. (This is bugs 1211, 23682, and 450.)
One part of (2) needs to be clarified. The primary use-case is obviously that we want to be able to count subcategories efficiently, or display all of them when we only display some of the items in the category: this is bugs 1211 and 23682. Secondarily, we have a request at bug 450 to organize category pages by namespace, so main, Talk:, User:, etc. are all paginated separately.
I think the goal for (2) should be to allow efficient separate retrieval of subcategories, files, and other pages, but not to distinguish between namespaces otherwise. The major motivation is that to do this efficiently, we'll need to add namespace info to the categorylinks table, and we want this to stay consistent with the info in the page table. Categories, files, and other types of pages cannot be moved to one another, as far as I know (it would hardly make sense), so it automatically stays consistent this way. This is a big plus, because there are inevitably bugs that cause denormalized data to fall out of sync (look at cat_pages).
Furthermore, I don't think it's obvious that we want separate namespaces to display separately at all on category pages. What's a case where that would be desired? It would break up the display a lot, with a bunch of separate headers for different namespaces, when each namespace might only have a few items. Most categories whose sort appearance you'd care about (i.e., excepting maintenance categories) will have nearly everything in one namespace anyway. You could always split the category into separate ones per namespace if you want them separate.
So I propose that we keep the current category/normal page/file split, and paginate those three parts of the page separately. So you'd have up to 200 subcategories, then below that up to 200 normal pages, then below that up to 200 files. (The numbers could be adjusted. Currently they're hardcoded, which is stupid.) Paginating subcategories separately is obviously needed. Paginating files separately is not really needed, but it would be much more consistent.
The overall solution, then, would be:
1) Change the way category sortkeys are generated. Start them with a letter depending on namespace, like 'C' for category, 'P' for regular page, 'F' for file. After that first letter, append a sortkey generated by ICU or whatever. I think Tim has opinions on what would be a good choice to convert the article title into sort key -- if not, I'll have to research it and hopefully not come up with a completely incorrect answer.
2) On category pages, maintain three offsets and do three queries (or maybe UNION them together, doesn't matter), one for each of categories/regular pages/files. Because of (1), this will be efficient and will also sort less unreasonably for non-English languages.
One problem that was pointed out somewhere in the massive useless discussion on bug 164 is that we'd have to do something to display the first letter for each section. Currently it's just the first letter of the sortkey, but if that's some binary string, that becomes a problem. I'm not seeing an obvious solution, since the sortkey-generation algorithm will be opaque to us. If it sorts Á the same as A, then how do we figure out that the "canonical" first letter for the section should be "A" and not "Á"? How do we even figure out where the sections begin or end? Would that even make sense in all cases? At a first pass, I'd say we should just skip the first letter and display all the items straight from beginning to end without section divisions. I don't think that's a big problem.
This is just my initial thoughts. Feedback appreciated. If people agree with the general approach, I can start coding this up tomorrow.
2010/7/21 Aryeh Gregor Simetrical+wikilist@gmail.com:
Categories, files, and other types of pages cannot be moved to one another, as far as I know (it would hardly make sense), so it automatically stays consistent this way.
This is true for categories but not for files: http://www.mediawiki.org/w/index.php?title=Special:Log&dir=prev&offs...
So I propose that we keep the current category/normal page/file split, and paginate those three parts of the page separately. So you'd have up to 200 subcategories, then below that up to 200 normal pages, then below that up to 200 files. (The numbers could be adjusted. Currently they're hardcoded, which is stupid.) Paginating subcategories separately is obviously needed. Paginating files separately is not really needed, but it would be much more consistent.
Sounds good to me. I do think we will want to page all three of these things separately, it'd be stupidly inconsistent not to do that.
The overall solution, then, would be:
- Change the way category sortkeys are generated. Start them with a
letter depending on namespace, like 'C' for category, 'P' for regular page, 'F' for file. After that first letter, append a sortkey generated by ICU or whatever. I think Tim has opinions on what would be a good choice to convert the article title into sort key -- if not, I'll have to research it and hopefully not come up with a completely incorrect answer.
Note that different languages will want different orders. For instance, German generally sorts ä as ae, ö as oe and ü as ue, whereas the Swedish sort å, ä and ö at the end of the alphabet (so they actually say A, B, C, ... Z, Å, Ä, Ö and use the phrase "from A to Ö"). These collation schemes obviously conflict in their handling of ä and ö, and I'm sure there's crazier stuff out there.
This could be solved by having a different collation scheme for each content language (these have to be standardized *somewhere*, right?) and using {{DEFAULTSORT:}} for those rare cases where you have an article about a German person on a non-German wiki and want it to sort the German way.
- On category pages, maintain three offsets and do three queries (or
maybe UNION them together, doesn't matter),
In my personal opinion, UNION makes zero sense because you'd have to pull the data apart again after querying it, as you're displaying it separately as well. Separate queries are much cleaner in this case.
One problem that was pointed out somewhere in the massive useless discussion on bug 164 is that we'd have to do something to display the first letter for each section. Currently it's just the first letter of the sortkey, but if that's some binary string, that becomes a problem. I'm not seeing an obvious solution, since the sortkey-generation algorithm will be opaque to us. If it sorts Á the same as A, then how do we figure out that the "canonical" first letter for the section should be "A" and not "Á"? How do we even figure out where the sections begin or end? Would that even make sense in all cases? At a first pass, I'd say we should just skip the first letter and display all the items straight from beginning to end without section divisions. I don't think that's a big problem.
I agree that the first-letter thing is a nice-to-have, but I'm more worried about the general problem that sortkeys won't be human-readable strings anymore (the API currently displays them and, obviously, uses them for paging) nor possible to decode into human-readable strings (because the encoding essentially loses information when e.g. a and á are folded). It would be nice if we could store the original, unmunged sortkey in the categorylinks table, although I realize that would eat space for display and debugging purposes only.
Roan Kattouw (Catrope)
On 21 July 2010 14:49, Roan Kattouw roan.kattouw@gmail.com wrote:
2010/7/21 Aryeh Gregor Simetrical+wikilist@gmail.com:
Note that different languages will want different orders. For instance, German generally sorts ä as ae, ö as oe and ü as ue, whereas the Swedish sort å, ä and ö at the end of the alphabet (so they actually say A, B, C, ... Z, Å, Ä, Ö and use the phrase "from A to Ö"). These collation schemes obviously conflict in their handling of ä and ö, and I'm sure there's crazier stuff out there.
This could be solved by having a different collation scheme for each content language (these have to be standardized *somewhere*, right?) and using {{DEFAULTSORT:}} for those rare cases where you have an article about a German person on a non-German wiki and want it to sort the German way.
For Wiktionary, every language is included in one wiki (and even on one page) - it would be phenomenal to be able to select the collation per category. As per-page or per-wiki will not help very much at all.
- On category pages, maintain three offsets and do three queries (or
maybe UNION them together, doesn't matter),
In my personal opinion, UNION makes zero sense because you'd have to pull the data apart again after querying it, as you're displaying it separately as well. Separate queries are much cleaner in this case.
One problem that was pointed out somewhere in the massive useless discussion on bug 164 is that we'd have to do something to display the first letter for each section. Currently it's just the first letter of the sortkey, but if that's some binary string, that becomes a problem. I'm not seeing an obvious solution, since the sortkey-generation algorithm will be opaque to us. If it sorts Á the same as A, then how do we figure out that the "canonical" first letter for the section should be "A" and not "Á"? How do we even figure out where the sections begin or end? Would that even make sense in all cases? At a first pass, I'd say we should just skip the first letter and display all the items straight from beginning to end without section divisions. I don't think that's a big problem.
I agree that the first-letter thing is a nice-to-have, but I'm more worried about the general problem that sortkeys won't be human-readable strings anymore (the API currently displays them and, obviously, uses them for paging) nor possible to decode into human-readable strings (because the encoding essentially loses information when e.g. a and á are folded). It would be nice if we could store the original, unmunged sortkey in the categorylinks table, although I realize that would eat space for display and debugging purposes only.
There is no way to go from the sort-key to the first letter and indeed, you can't even put the first letter at the start of the sort key, as you need to sort the sections differently per language. The solution I use for generating the indices on Wiktionary is to store the first letter explicitly (either of the page or the user-provided sort key before they are fed into ICU). This would (in the future) allow "topical" categories, but that's juts a distraction for now.
Conrad
On Wed, Jul 21, 2010 at 5:49 PM, Roan Kattouw roan.kattouw@gmail.com wrote:
This is true for categories but not for files: http://www.mediawiki.org/w/index.php?title=Special:Log&dir=prev&offs...
Blech. Does this make any sense? Can we change it? It would simply this considerably.
Note that different languages will want different orders. For instance, German generally sorts ä as ae, ö as oe and ü as ue, whereas the Swedish sort å, ä and ö at the end of the alphabet (so they actually say A, B, C, ... Z, Å, Ä, Ö and use the phrase "from A to Ö"). These collation schemes obviously conflict in their handling of ä and ö, and I'm sure there's crazier stuff out there.
This could be solved by having a different collation scheme for each content language (these have to be standardized *somewhere*, right?) and using {{DEFAULTSORT:}} for those rare cases where you have an article about a German person on a non-German wiki and want it to sort the German way.
Yes, of course. I'm assuming that the magical sortkey-generator I'm plugging into here is locale-specific.
In my personal opinion, UNION makes zero sense because you'd have to pull the data apart again after querying it, as you're displaying it separately as well. Separate queries are much cleaner in this case.
It's pretty simple to do either way. Makes no big difference.
I agree that the first-letter thing is a nice-to-have, but I'm more worried about the general problem that sortkeys won't be human-readable strings anymore (the API currently displays them and, obviously, uses them for paging) nor possible to decode into human-readable strings (because the encoding essentially loses information when e.g. a and á are folded). It would be nice if we could store the original, unmunged sortkey in the categorylinks table, although I realize that would eat space for display and debugging purposes only.
This would also require altering the table. Why is it necessary? For paging, we can just use cl_from to stick in the URL, and retrieve cl_sortkey based on that and cl_to. That will make it be short and not look horribly ugly. When do we ever need a human-readable form of the sortkey, as opposed to a human-readable form of the title? API users should keep working when this happens with no special code changes on server or client, just they'll have horribly long and ugly URLs with encoded binary. Sortkeys are often weird and not suitable for display to humans anyway, like when "*" is used.
I'm not seeing this as worth adding a fourth field to categorylinks, which is a huge table already.
On Wed, Jul 21, 2010 at 6:04 PM, Conrad Irwin conrad.irwin@gmail.com wrote:
For Wiktionary, every language is included in one wiki (and even on one page) - it would be phenomenal to be able to select the collation per category. As per-page or per-wiki will not help very much at all.
Why won't per-page help? I'm not understanding clearly here. I don't think it would be too much trouble to add per-page and per-category parser functions to set the language used for sort keys, though.
There is no way to go from the sort-key to the first letter and indeed, you can't even put the first letter at the start of the sort key, as you need to sort the sections differently per language. The solution I use for generating the indices on Wiktionary is to store the first letter explicitly (either of the page or the user-provided sort key before they are fed into ICU). This would (in the future) allow "topical" categories, but that's juts a distraction for now.
But different articles that are sorted as though they started with the same letter might not actually start with the same letter, so how do we figure out which first letter is the correct one? This is a problem even if you're just dealing with accented letters -- I have no idea how this stuff works (or doesn't work) for CJK or whatnot. (Judging by these:
http://ja.wikipedia.org/wiki/Category:%E5%AD%98%E5%91%BD%E4%BA%BA%E7%89%A9 http://zh.wikipedia.org/wiki/Category:%E5%9C%A8%E4%B8%96%E4%BA%BA%E7%89%A9 http://zh-yue.wikipedia.org/wiki/Category:%E5%9C%A8%E4%B8%96%E4%BA%BA%E7%89%...
the strategy is just to manually force sortkeys to begin with something like "A" or "あ". Cantonese doesn't do this, and it ends up with one article per "letter" in many cases.)
On 21 July 2010 15:15, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
Why won't per-page help? I'm not understanding clearly here. I don't think it would be too much trouble to add per-page and per-category parser functions to set the language used for sort keys, though.
Because there are multiple languages on each page - so you need lots of different sort keys.
But different articles that are sorted as though they started with the same letter might not actually start with the same letter, so how do we figure out which first letter is the correct one? This is a problem even if you're just dealing with accented letters -- I have no idea how this stuff works (or doesn't work) for CJK or whatnot.
If it's sorted as starting with "a" it should appear under "a". The alternative would be to have different explicit sorting for the sections in the category than for the words in the section, which I think is unnecessary.
Conrad
2010/7/22 Aryeh Gregor Simetrical+wikilist@gmail.com:
On Wed, Jul 21, 2010 at 5:49 PM, Roan Kattouw roan.kattouw@gmail.com wrote:
This is true for categories but not for files: http://www.mediawiki.org/w/index.php?title=Special:Log&dir=prev&offs...
Blech. Does this make any sense? Can we change it? It would simply this considerably.
It doesn't make a great deal of sense and can be changed fairly easily in Title::isValidMoveTarget().
This would also require altering the table. Why is it necessary? For paging, we can just use cl_from to stick in the URL, and retrieve cl_sortkey based on that and cl_to. That will make it be short and not look horribly ugly. When do we ever need a human-readable form of the sortkey, as opposed to a human-readable form of the title? API users should keep working when this happens with no special code changes on server or client, just they'll have horribly long and ugly URLs with encoded binary. Sortkeys are often weird and not suitable for display to humans anyway, like when "*" is used.
I'm not seeing this as worth adding a fourth field to categorylinks, which is a huge table already.
Hm, you're right that even in the API there is very little value in displaying the sortkeys, so let's just drop it.
But different articles that are sorted as though they started with the same letter might not actually start with the same letter, so how do we figure out which first letter is the correct one?
Yeah, this is pretty much impossible: when you fold A and Á, there's no guarantee that the first entry (or even the majority of entries) for A will really be A's rather than Á's.
Roan Kattouw (Catrope)
Στις 21-07-2010, ημέρα Τετ, και ώρα 23:49 +0200, ο/η Roan Kattouw έγραψε:
2010/7/21 Aryeh Gregor Simetrical+wikilist@gmail.com:
The overall solution, then, would be:
- Change the way category sortkeys are generated. Start them with a
letter depending on namespace, like 'C' for category, 'P' for regular page, 'F' for file. After that first letter, append a sortkey generated by ICU or whatever. I think Tim has opinions on what would be a good choice to convert the article title into sort key -- if not, I'll have to research it and hopefully not come up with a completely incorrect answer.
Note that different languages will want different orders. For instance, German generally sorts ä as ae, ö as oe and ü as ue, whereas the Swedish sort å, ä and ö at the end of the alphabet (so they actually say A, B, C, ... Z, Å, Ä, Ö and use the phrase "from A to Ö"). These collation schemes obviously conflict in their handling of ä and ö, and I'm sure there's crazier stuff out there.
This could be solved by having a different collation scheme for each content language (these have to be standardized *somewhere*, right?) and using {{DEFAULTSORT:}} for those rare cases where you have an article about a German person on a non-German wiki and want it to sort the German way.
It's much worse than that. On the Wiktionary projects, which have the modest goal of documenting all words in all languages *on each language edition*, you can expect to find german, french, arabic, etc. lemmas on say the russian project. Even worse, you can expect to find multiple lemmas on one page if the lemma has the same orthography in more than one language. If the sort order differs between languages it's loads of fun.
Ariel
Aryeh Gregor schrieb:
- "Categories need to be structured by namespace",
https://bugzilla.wikimedia.org/show_bug.cgi?id=450
- "Natural number sorting in category listings",
While we definitly need efficient retrieval by namespace, the default sort key should *not* include the namespace prefix. it's very annoying that all files get sorted under "F" currently, or that pages from the Wikipedia namespace all end up under "W".
-- daniel
On Wed, Jul 21, 2010 at 6:18 PM, Daniel Kinzler daniel@brightbyte.de wrote:
While we definitly need efficient retrieval by namespace, the default sort key should *not* include the namespace prefix. it's very annoying that all files get sorted under "F" currently, or that pages from the Wikipedia namespace all end up under "W".
That's totally orthogonal and is like a one-line change. Probably you just have to change getPrefixedDBkey() to getDBkey() somewhere.
On Wed, Jul 21, 2010 at 6:22 PM, Conrad Irwin conrad.irwin@gmail.com wrote:
Because there are multiple languages on each page - so you need lots of different sort keys.
Could you point me to an example of some pages and categories where this is an issue? I'm not clear on how categories/pages/sort keys are being used here.
If it's sorted as starting with "a" it should appear under "a". The alternative would be to have different explicit sorting for the sections in the category than for the words in the section, which I think is unnecessary.
So if we have three pages "Áa", "Ab", "Ác" and they're sorted in the category in that order, should they be in one section? I don't see how you'd put them in two or three sections. If they're in one section, what letter do you use for it, "Á" or "A"? We can figure out "A" is correct here, but how do you do that in general automatically?
On 21 July 2010 15:28, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
On Wed, Jul 21, 2010 at 6:18 PM, Daniel Kinzler daniel@brightbyte.de wrote:
While we definitly need efficient retrieval by namespace, the default sort key should *not* include the namespace prefix. it's very annoying that all files get sorted under "F" currently, or that pages from the Wikipedia namespace all end up under "W".
That's totally orthogonal and is like a one-line change. Probably you just have to change getPrefixedDBkey() to getDBkey() somewhere.
On Wed, Jul 21, 2010 at 6:22 PM, Conrad Irwin conrad.irwin@gmail.com wrote:
Because there are multiple languages on each page - so you need lots of different sort keys.
Could you point me to an example of some pages and categories where this is an issue? I'm not clear on how categories/pages/sort keys are being used here.
I don't have an example to hand (as the page is not yet complete on Wiktionary) The Hungarian letter "cs" sorts after "c", so while in English "cs" (for centi-seconds) should come before "CV", in Hungarian the entry for the letter (which is missing) should come afterwards. Both English and Hungarian would be on the same Wiktionary page.
If it's sorted as starting with "a" it should appear under "a". The alternative would be to have different explicit sorting for the sections in the category than for the words in the section, which I think is unnecessary.
So if we have three pages "Áa", "Ab", "Ác" and they're sorted in the category in that order, should they be in one section? I don't see how you'd put them in two or three sections. If they're in one section, what letter do you use for it, "Á" or "A"? We can figure out "A" is correct here, but how do you do that in general automatically?
Some languages treat accented letters as the same primary letter, and use it only in the secondary or tertiary sort key (Which the current category table's keys of 80 bytes are in danger of truncating), others have variations on a theme, again Hungarian makes a good example, ö and ő are the one letter with two stresses, o and ó is a different letter. It should be automatically possible to extract the first letter from the words to be sorted (I don't know if ICU covers that, if not, just ask some people who speak the language, or Wikipedia) - but it's not possible to get that information from the sort keys directly, so either we store the user provided sort key, and our derived sort key, so we can use the former to find the first letter at render time, or we just store the first letter.
Conrad
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On Wed, Jul 21, 2010 at 6:45 PM, Conrad Irwin conrad.irwin@gmail.com wrote:
I don't have an example to hand (as the page is not yet complete on Wiktionary) The Hungarian letter "cs" sorts after "c", so while in English "cs" (for centi-seconds) should come before "CV", in Hungarian the entry for the letter (which is missing) should come afterwards. Both English and Hungarian would be on the same Wiktionary page.
Okay, I see. I don't think this would be terribly hard, although I don't think it's needed for an initial implementation. The major problem I see is that if the sort collation is per-category, then changing it on a preexisting large category will require reparsing all the pages, probably. (Unless we store the raw sortkeys as well.)
Some languages treat accented letters as the same primary letter, and use it only in the secondary or tertiary sort key (Which the current category table's keys of 80 bytes are in danger of truncating), others have variations on a theme, again Hungarian makes a good example, ö and ő are the one letter with two stresses, o and ó is a different letter. It should be automatically possible to extract the first letter from the words to be sorted (I don't know if ICU covers that, if not, just ask some people who speak the language, or Wikipedia) - but it's not possible to get that information from the sort keys directly, so either we store the user provided sort key, and our derived sort key, so we can use the former to find the first letter at render time, or we just store the first letter.
I don't see an answer to my question here. Given a sorted list of sortkeys, possibly including the raw sortkey as well as the one that's been put through ICU/CLDR/whatever, what algorithm do you propose to break it up into sections labeled with first letters? In particular, any such algorithm should not conflict with the sort order, in the sense that you should not have three words A, B, C sorted as A < B < C where firstLetter(A) == firstLetter(C) != firstLetter(B). Is this reasonably possible to guarantee in all alphabetic languages' conventional sort orders?
If we do store the raw sort key, we could have some Language method to retrieve the section name, and just write our own implementations for various languages. However, I'm not sure this is worth the effort.
On Wed, Jul 21, 2010 at 7:03 PM, Roan Kattouw roan.kattouw@gmail.com wrote:
It doesn't make a great deal of sense and can be changed fairly easily in Title::isValidMoveTarget().
On Thu, Jul 22, 2010 at 3:01 AM, Tim Starling tstarling@wikimedia.org wrote:
This restriction is enforced by Title::isValidMoveOperation().
Any objections to changing this so files can't be moved over non-files or vice versa?
An alternative would be to add a column to the categorylinks table, say cl_type. It could be an ENUM or some short text type. Then the index could be altered to include this field at the start of it.
Presumably the rationale for combining these two things into cl_sortkey is to avoid a schema change, and to make the paging code slightly simpler. But I worry that future generations of MediaWiki developers will curse us for laziness and obfuscation.
I'm okay with this.
Well, I've said ICU, possibly with a PHP simulation of some Western European sort key algorithm for the benefit of users without access to ICU. But I formed that opinion years ago, and I never properly surveyed all the possible solutions in the first place. It probably makes sense to do a little of your own research.
Gerard Meijssen felt strongly that we should use something based on CLDR. Apparently we have connections there and work with them a lot, and I guess he feels it's higher-quality or such.
Note that I specifically excluded the actual implementation of language-dependent sort keys from the requirements list when I wrote up this project. It could easily eat up a lot of time, and it's not necessary for a proof-of-principle implementation.
All right. Then I'll just do whatever's readily available and fits in the database column.
Work out how much space we would need to additionally store the category keys in plain text. Then we will know what sort of tradeoff we are looking at. Have you got a toolserver account you can use to do the sums?
Yes, I'm a toolserver root.
Since we won't be sorting on the plain text form anymore, we could use some tricks to save space. For instance, if the sort key is the same as the article title, we could store NULL instead of another copy of the article title. That should save 95% or so.
It doesn't seem like it would save nearly that much. On the Welsh Wikipedia (small enough database to be manageable), I get the following:
mysql> SELECT SUM(LENGTH(cl_sortkey)) FROM categorylinks JOIN page ON cl_from=page_id WHERE REPLACE(cl_sortkey, ' ', '_') != page_title; +-------------------------+ | SUM(LENGTH(cl_sortkey)) | +-------------------------+ | 551851 | +-------------------------+ 1 row in set (1.94 sec)
mysql> SELECT SUM(LENGTH(cl_sortkey)) FROM categorylinks JOIN page ON cl_from=page_id; +-------------------------+ | SUM(LENGTH(cl_sortkey)) | +-------------------------+ | 1619747 | +-------------------------+ 1 row in set (0.44 sec)
mysql> SELECT SUM(LENGTH(cl_sortkey)) FROM categorylinks JOIN page ON cl_from=page_id WHERE REPLACE(cl_sortkey, ' ', '_') != page_title AND page_namespace = 0; +-------------------------+ | SUM(LENGTH(cl_sortkey)) | +-------------------------+ | 347539 | +-------------------------+ 1 row in set (0.20 sec)
mysql> SELECT SUM(LENGTH(cl_sortkey)) FROM categorylinks JOIN page ON cl_from=page_id WHERE page_namespace = 0; +-------------------------+ | SUM(LENGTH(cl_sortkey)) | +-------------------------+ | 1067588 | +-------------------------+ 1 row in set (0.19 sec)
I filtered out the main namespace in the last two to avoid false positives from namespace prefixes. This suggests savings of maybe 50-75%. The story may be different on larger wikis. It's worth remembering, though, that a lot of these sortkeys might be set to work around deficiencies in the current default sortkey generation, so maybe it would be higher savings in the long term.
It's still not at all clear to me that saving a raw copy in the database is worth it. If we really need sectioning by first letter on category pages, we could save the first letter instead, and leave that NULL when it's the same as the first letter of the page title (all of this for some locale-specific definition of "first letter"). But I don't know if we need that.
On Thu, Jul 22, 2010 at 3:37 AM, Roan Kattouw roan.kattouw@gmail.com wrote:
There is another reason to prefer this schema, which is that the orginially proposed one is susceptible to weird transition bugs. After this feature is deployed, there will be old-format (i.e. plain) sortkeys sticking around in the database for quite some time after (they won't go away until LinksUpdate fixes them), which means that pages whose sortkey starts with a C or F will be recognized as categories and files respectively, even if they're normal pages.
The best way to mitigate that is to populate the namespace information prior to deployment. In Tim's schema, that means filling the cl_type field based on page_namespace. In the sortkey prefix schema, that means prefixing the sortkey with the relevant sortkey, but that also requires the sortkey updating code has already been updated at that time (so it doesn't overwrite new-style sortkeys with old-style ones), which means you'd have to partially deploy the code while running the population script. Yuck.
This whole problems arises for sortkey changes generally. It will be just as much of a problem when going to a new sortkey type (based on CLDR or whatever). The only way to avoid it is to create a new column, populate it while maintaining both columns at once, start using the new column once it's fully populated, and then drop the old column. That seems excessive. Remember that we can convert the current raw sortkey into ICU/CLDR/whatever without reparsing pages, as long as we can reliably tell old from new sortkeys (should be pretty easy to do heuristically). So it shouldn't take forever -- surely no more than a day or two even for enwiki.
On Thu, Jul 22, 2010 at 5:34 AM, David Gerard dgerard@gmail.com wrote:
Please don't remove the feature where the first letter of the sort key is displayed in the rendered category page, and if necessary add what it takes to keep it.
There are scripts where this will be a hard problem, but it's still much-used and much-loved in those where it isn't.
Is it? What use does it serve? We don't have it for any other type of list. We have zillions of types of page lists, and category pages are the only ones that have the first letter displayed. It makes the columns uneven, and is completely crazy for some scripts (like CJK, AFAICT).
On 22 July 2010 17:34, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
On Thu, Jul 22, 2010 at 5:34 AM, David Gerard dgerard@gmail.com wrote:
Please don't remove the feature where the first letter of the sort key is displayed in the rendered category page, and if necessary add what it takes to keep it. There are scripts where this will be a hard problem, but it's still much-used and much-loved in those where it isn't.
Is it? What use does it serve? We don't have it for any other type of list. We have zillions of types of page lists, and category pages are the only ones that have the first letter displayed. It makes the columns uneven, and is completely crazy for some scripts (like CJK, AFAICT).
I'm basing this claim on the remarkable effort people go to to get things to sort on the category page *just right*. People clearly use this feature, and work hard to use it.
Take it to the VP (technical) on en:wp and I would be amazed if people don't protest hugely at the prospect of it disappearing. Maybe I'm wrong, but asking the users would be a good idea first.
- d.
On Thu, Jul 22, 2010 at 12:38 PM, David Gerard dgerard@gmail.com wrote:
I'm basing this claim on the remarkable effort people go to to get things to sort on the category page *just right*. People clearly use this feature, and work hard to use it.
Take it to the VP (technical) on en:wp and I would be amazed if people don't protest hugely at the prospect of it disappearing. Maybe I'm wrong, but asking the users would be a good idea first.
Feel free to do so if you think it's an issue. It wouldn't be terribly hard to keep the current behavior, I don't think.
On Thu, Jul 22, 2010 at 1:25 PM, Brad Jorsch b-jorsch@northwestern.edu wrote:
"Categorymembers namespace filtering is inefficient, uses ugly hack in miser mode", https://bugzilla.wikimedia.org/show_bug.cgi?id=19640 , is also very related.
I personally don't think it's worth fixing. The likelihood of error in keeping the categorylinks table up-to-date with page moves is just too great to be worth bothering over.
2010/7/22 Aryeh Gregor Simetrical+wikilist@gmail.com:
On Thu, Jul 22, 2010 at 1:25 PM, Brad Jorsch b-jorsch@northwestern.edu wrote:
"Categorymembers namespace filtering is inefficient, uses ugly hack in miser mode", https://bugzilla.wikimedia.org/show_bug.cgi?id=19640 , is also very related.
I personally don't think it's worth fixing. The likelihood of error in keeping the categorylinks table up-to-date with page moves is just too great to be worth bothering over.
We could at least partially fix it by allowing to filter by type (file|category|page) instead of namespace, and provide a mapping from clnamespace to cltype for backwards compatibility (6 -> file, 14 -> category, any other number -> page).
Roan Kattouw (Catrope)
Aryeh Gregor wrote:
On Wed, Jul 21, 2010 at 7:03 PM, Roan Kattouw roan.kattouw@gmail.com wrote:
It doesn't make a great deal of sense and can be changed fairly easily in Title::isValidMoveTarget().
On Thu, Jul 22, 2010 at 3:01 AM, Tim Starling tstarling@wikimedia.org wrote:
This restriction is enforced by Title::isValidMoveOperation().
Any objections to changing this so files can't be moved over non-files or vice versa?
It could even make sense to move the text revisions to/from the file namespace (with an appropiate warning). I once wanted to moving a file talk page to the file namespace, to archive (delete) them together. Or you might want to move an overdescription to NS_MAIN.
...
Since we won't be sorting on the plain text form anymore, we could use some tricks to save space. For instance, if the sort key is the same as the article title, we could store NULL instead of another copy of the article title. That should save 95% or so.
It doesn't seem like it would save nearly that much. On the Welsh Wikipedia (small enough database to be manageable), I get the following:
(...)
I filtered out the main namespace in the last two to avoid false positives from namespace prefixes. This suggests savings of maybe 50-75%. The story may be different on larger wikis. It's worth remembering, though, that a lot of these sortkeys might be set to work around deficiencies in the current default sortkey generation, so maybe it would be higher savings in the long term.
It's still not at all clear to me that saving a raw copy in the database is worth it. If we really need sectioning by first letter on category pages, we could save the first letter instead, and leave that NULL when it's the same as the first letter of the page title (all of this for some locale-specific definition of "first letter"). But I don't know if we need that.
Note that even if you're not storing it, the database is likely to be reserving the space for that. It can be useful to discern explicit sortkeys when the rules for language parsing change, though.
On Thu, Jul 22, 2010 at 3:06 PM, Platonides Platonides@gmail.com wrote:
It could even make sense to move the text revisions to/from the file namespace (with an appropiate warning). I once wanted to moving a file talk page to the file namespace, to archive (delete) them together. Or you might want to move an overdescription to NS_MAIN.
...
Is this worth the architectural complications in this case? It could be done, but . . .
Or maybe this would be no extra overhead? Do we normally do links updates on page move? We must already be updating categorylinks with new sortkeys if applicable, no? We could update the namespace at that point too. Maybe I'm being too averse to this denormalization -- I got rather disillusioned when the counts in the category table started constantly falling out of whack.
Note that even if you're not storing it, the database is likely to be reserving the space for that.
The entire point of varchar/varbinary is that it uses a variable amount of space. It will use one byte per row or such if it's empty.
It can be useful to discern explicit sortkeys when the rules for language parsing change, though.
Yeah, I thought of that too. If we kept no raw copy in the database, we'd have to reparse all pages to update the collation. Plus we could use it for the initial letter. Any other uses that it has?
On 23/07/10 02:34, Aryeh Gregor wrote:
On Thu, Jul 22, 2010 at 3:01 AM, Tim Starling tstarling@wikimedia.org wrote:
This restriction is enforced by Title::isValidMoveOperation().
Any objections to changing this so files can't be moved over non-files or vice versa?
No objection. That's mostly how it is already, except when the file doesn't exist but the description page does.
Since we won't be sorting on the plain text form anymore, we could use some tricks to save space. For instance, if the sort key is the same as the article title, we could store NULL instead of another copy of the article title. That should save 95% or so.
It doesn't seem like it would save nearly that much. On the Welsh Wikipedia (small enough database to be manageable), I get the following:
Welsh is not really what I was thinking when I said get statistics. On the English Wikipedia (db38):
mysql> show table status like 'categorylinks'\G *************************** 1. row *************************** Name: categorylinks Engine: InnoDB Version: 10 Row_format: Compact Rows: 38875439 Avg_row_length: 161 Data_length: 6271123456 Max_data_length: 0 Index_length: 7946960896 Data_free: 7340032 Auto_increment: NULL Create_time: 2010-05-24 11:29:52 Update_time: NULL Check_time: NULL Collation: binary Checksum: NULL Create_options: Comment: 1 row in set (0.15 sec)
SELECT count(*), sum(length(cl_sortkey)) as raw_length, sum( if(REPLACE(cl_sortkey, ' ', '_') = page_title, 0, length(cl_sortkey) ) ) as compact_length FROM categorylinks,page WHERE cl_from=page_id and page_namespace=0 and page_id % 10 = 0
*************************** 1. row *************************** count(*): 1957629 raw_length: 34177525 compact_length: 14857665 1 row in set (19 min 26.05 sec)
So we're looking at 17 bytes per row for raw text, and 8 bytes per row for compacted text, plus 1 byte per row for the length byte. Overall, assuming the lengths are the same across all namespaces, it would be approximately 680 MB in the raw form for the English Wikipedia, and presumably several times that for all wikis. Our English Wikipedia core DB servers have between 700 GB and 2 TB of storage space, with ~450 GB currently in use. So the impact of adding an extra 1 GB or so would be minimal.
No doubt Domas will complain anyway, but without developers adding new features, I figure his volunteer DBA work would get very boring.
It's still not at all clear to me that saving a raw copy in the database is worth it. If we really need sectioning by first letter on category pages, we could save the first letter instead, and leave that NULL when it's the same as the first letter of the page title (all of this for some locale-specific definition of "first letter"). But I don't know if we need that.
Truncating after the first letter would only save about 260MB for the entire English Wikipedia. And it would limit the applications. For instance, it would prevent fast updates of the collation algorithm. Instead we would have to reparse the pages. That could take weeks, even with a dozen servers dedicated to the task.
This whole problems arises for sortkey changes generally. It will be just as much of a problem when going to a new sortkey type (based on CLDR or whatever). The only way to avoid it is to create a new column, populate it while maintaining both columns at once, start using the new column once it's fully populated, and then drop the old column. That seems excessive.
If we're going to have multiple locale-specific collation algorithms (and that seems likely), then it may make sense to add a collation ID foreign key to the categorylinks table, to track updates. Sensible sorting behaviour mid-way through an update is probably not feasible, but we can at least make it possible to track the problem.
On Thu, Jul 22, 2010 at 5:34 AM, David Gerard dgerard@gmail.com wrote:
Please don't remove the feature where the first letter of the sort key is displayed in the rendered category page, and if necessary add what it takes to keep it.
There are scripts where this will be a hard problem, but it's still much-used and much-loved in those where it isn't.
Is it? What use does it serve? We don't have it for any other type of list. We have zillions of types of page lists, and category pages are the only ones that have the first letter displayed. It makes the columns uneven, and is completely crazy for some scripts (like CJK, AFAICT).
We have zillions of lists, but category pages are by far the most visible and heavily-used, that's why so much work has been done on making them look nice, and why so many people are complaining about category sorting instead of [[Special:DeadendPages]] sorting.
The CJK issue could be fixed by making the feature optional. The uneven column issue is fixable using the multi-column layout feature in CSS 3 and more recent versions of the major browsers.
-- Tim Starling
This looks like a very important feature, and a hard one to get right. You guys are real world heros :-)
On 23 July 2010 09:24, Tei oscar.vives@gmail.com wrote:
This looks like a very important feature, and a hard one to get right. You guys are real world heros :-)
+1
- d.
On Thu, Jul 22, 2010 at 3:01 AM, Tim Starling tstarling@wikimedia.org wrote:
An alternative would be to add a column to the categorylinks table, say cl_type. It could be an ENUM or some short text type. Then the index could be altered to include this field at the start of it.
Presumably the rationale for combining these two things into cl_sortkey is to avoid a schema change, and to make the paging code slightly simpler. But I worry that future generations of MediaWiki developers will curse us for laziness and obfuscation.
One problem with this, though, that just occurred to me: won't it mean that any existing users of cl_sortkey will suddenly not be using an index? I see a whole bunch of extensions that do ORDER BY cl_sortkey. Hacking up cl_sortkey to store weird stuff in a prefix will make them sort things oddly, but they won't filesort large categories. I assume we don't want to keep an extra index just for these marginal uses.
On Thu, Jul 22, 2010 at 11:02 PM, Tim Starling tstarling@wikimedia.org wrote:
No objection. That's mostly how it is already, except when the file doesn't exist but the description page does.
Okay, done in r69802.
Welsh is not really what I was thinking when I said get statistics.
Well, it's a lot faster, particular on the toolserver. Wikimedia database servers don't have to cope with ludicrously stupid queries being run all the time. :)
So we're looking at 17 bytes per row for raw text, and 8 bytes per row for compacted text, plus 1 byte per row for the length byte. Overall, assuming the lengths are the same across all namespaces, it would be approximately 680 MB in the raw form for the English Wikipedia, and presumably several times that for all wikis. Our English Wikipedia core DB servers have between 700 GB and 2 TB of storage space, with ~450 GB currently in use. So the impact of adding an extra 1 GB or so would be minimal.
So within the range of my previous estimate in relative terms, 50% to 75% space saved.
Truncating after the first letter would only save about 260MB for the entire English Wikipedia. And it would limit the applications. For instance, it would prevent fast updates of the collation algorithm. Instead we would have to reparse the pages. That could take weeks, even with a dozen servers dedicated to the task.
Okay, I'll store the raw versions.
If we're going to have multiple locale-specific collation algorithms (and that seems likely), then it may make sense to add a collation ID foreign key to the categorylinks table, to track updates. Sensible sorting behaviour mid-way through an update is probably not feasible, but we can at least make it possible to track the problem.
One way to have sensible sorting behavior midway through collation (suggested by Philippe Verdy on bug 164, if I understood right, but at least that's the inspiration) would be to have a cl_collation to track this, and then extend the unique index on (cl_from, cl_to) to (cl_from, cl_to, cl_collation), adjust all other indexes similarly where necessary, and add WHERE cl_collation = 73 or whatever to all the queries. Then when switching collations, we could have the code start keeping them updated in parallel; do a batch job to add extra rows for the new collation where they don't exist; and when that's done, stop maintaining the old collation and DELETE it. Do you think that's a good idea? How often do we expect to have to change the collation?
We have zillions of lists, but category pages are by far the most visible and heavily-used, that's why so much work has been done on making them look nice, and why so many people are complaining about category sorting instead of [[Special:DeadendPages]] sorting.
Okay, so I'll make sure that the infrastructure is present for this to work, including for languages that don't want it to turn it off.
The uneven column issue is fixable using the multi-column layout feature in CSS 3 and more recent versions of the major browsers.
We'd have to scratch the "C cont." labels at the top of each column, though. Is that too much of an imposition on the page's prettiness? :)
On Fri, Jul 23, 2010 at 12:40 AM, Krinkle krinklemail@gmail.com wrote:
Something that comes to mind is Wikimedia Commons. Would the same category with a different uselang be sorted differently ?
No. That's not practical at all. It can only be per-category. (I don't plan to implement per-category sort orders in this implementation, unless I'm asked to.)
No objection either. However, make sure that the other way around is blocked too. Else one might accidently move a page to the File-namespace without being able to get it back.
Of course.
Okay, so I think I have enough feedback to write up an initial implementation. I'll start on that now. To begin with I'll probably just stick in a dummy collation function, like use "convert everything to uppercase" and test it with $wgCapitalLinks false. Outline of my initial implementation plans:
* Add $wgExperimentalCategorySort, default to false, for testing, so that I can commit this to trunk (I don't like using branches). * Add unindexed varchar(255) NULL cl_raw_sortkey column, to serve the same function as the current cl_sortkey, except it will be NULL if the raw sortkey is identical to the one retrieved from the name. * Add cl_collation tinyint column to track collation revisions, with an index. Add a $wgCollationVersion variable, initially set to 0 (for no collation). * Add cl_type column, an ENUM of ('page', 'subcat', 'file'). ENUMs are somewhat evil to change in MySQL, but this isn't likely to change. * Change existing indexes appropriately. * Write code to keep the new columns populated and use them on category page display. * Write a script to populate the new columns. * Write some trivial collation function and try to migrate to it.
I'm aiming to be done with this on Monday. There are still some outstanding issues I'm not sure about that I've outlined here, but they won't be much work to change later.
On Fri, Jul 23, 2010 at 1:24 PM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
Okay, so I think I have enough feedback to write up an initial implementation. I'll start on that now. To begin with I'll probably just stick in a dummy collation function, like use "convert everything to uppercase" and test it with $wgCapitalLinks false. Outline of my initial implementation plans:
- Add $wgExperimentalCategorySort, default to false, for testing, so
that I can commit this to trunk (I don't like using branches).
- Add unindexed varchar(255) NULL cl_raw_sortkey column, to serve the
same function as the current cl_sortkey, except it will be NULL if the raw sortkey is identical to the one retrieved from the name.
- Add cl_collation tinyint column to track collation revisions, with
an index. Add a $wgCollationVersion variable, initially set to 0 (for no collation).
- Add cl_type column, an ENUM of ('page', 'subcat', 'file'). ENUMs
are somewhat evil to change in MySQL, but this isn't likely to change.
- Change existing indexes appropriately.
- Write code to keep the new columns populated and use them on
category page display.
- Write a script to populate the new columns.
- Write some trivial collation function and try to migrate to it.
I'm aiming to be done with this on Monday. There are still some outstanding issues I'm not sure about that I've outlined here, but they won't be much work to change later.
I've made my first commit now: http://www.mediawiki.org/wiki/Special:Code/MediaWiki/69810. Please comment on the direction I'm taking here, as well as telling me if it causes any problems with $wgExperimentalCategorySort off. I'll keep working on this for a while longer today, then pick up work again on Sunday or Monday.
The code in trunk now has all the basic features. I encourage people to test it by setting $wgExperimentalCategorySort = true;, applying the patch in maintenance/archives/patch-categorylinks-better-collation.sql, and running maintenance/updateCollation.php. Note that:
* The code is not stable or well-tested. It needs preliminary testing right now; testing it on a production site is probably a bad idea. * The entire architecture will probably be changed significantly in the near future based on feedback. * To reverse the changes, you need to disable the config option, manually reverse the SQL patch, and then run refreshLinks.php. When things change, I'll probably include instructions in my commit messages for how best to convert your schema to the new version, but it will not be done automatically as long as the feature is unstable. Obviously, once this is enabled by default everything will be automatic.
Things that currently work:
* Uppercase and lowercase letters are sorted identically. (This is my proof-of-concept collation function, it just uppercases everything.) * Categories, files, and other pages all page separately, and you get the first 200 of each independent of the others. * When you specify the same custom sort key for multiple pages, they'll sort amongst themselves according to what their default sort key would have been, not more or less randomly as it used to be.
But probably there are bugs and such. I've made a wiki page to track my work:
http://www.mediawiki.org/wiki/User:Simetrical/Collation
I don't have any further changes planned until I get feedback on my approach from Tim, other than fixing specific problems people point out to me (please do!).
The code is currently enabled in trunk and is still awaiting review. It's basically complete, but there are some issues left:
* What sortkey algorithm to use? Currently it just ASCII uppercases the words, which is okay for a proof-of-concept but doesn't actually solve bug 164. * What should be done about compatibility, particularly given the schema change? Should I try to convert all the extensions I've broken? Should we keep the old cl_sortkey index so that out-of-tree uses don't suddenly become extremely inefficient (but then we have to maintain an extra largely pointless index)? * On a much more minor note, cl_sortkey should probably be redefined to varbinary(255), instead of varchar(70). I'd do this, but I'm not totally sure about it, so I'm awaiting feedback from Tim, since it would be annoying to convert it and then convert it back.
Дана Tuesday 17 August 2010 20:37:44 Aryeh Gregor написа:
The code is currently enabled in trunk and is still awaiting review. It's basically complete, but there are some issues left:
- What sortkey algorithm to use? Currently it just ASCII uppercases
the words, which is okay for a proof-of-concept but doesn't actually solve bug 164.
For some time now, I am thinking about a stupidly simple solution:
php -r 'for($i = 0; $i < 65536; $i++) { echo pack("nx", $i); echo "\n"; }'| iconv -f ucs-2be -t utf8 | sort | php -r 'foreach(file("php://stdin") as $v) { echo var_export(substr($v, 0, -1)) . " => "" . str_pad(base_convert($i, 10, 36), 4, 0, STR_PAD_LEFT) . "",\n"; $i++; }'
This, more or less, should:
- Print every Unicode (UCS-2 only) character on its own line - Sort that according to the current locale - Print a PHP array to replace each Unicode character (UTF-8 encoded) with appropriate base36 number
If an UTF-8 string is encoded with this array, the resulting strings should be sorted exactly the same as in the locale through mere ASCII sorting. Or am I missing something big? (Except contextual sensitivity, but it occurs relatively rarely and this should still be better than what we have now.)
On Tue, Aug 17, 2010 at 4:06 PM, Nikola Smolenski smolensk@eunet.rs wrote:
For some time now, I am thinking about a stupidly simple solution:
php -r 'for($i = 0; $i < 65536; $i++) { echo pack("nx", $i); echo "\n"; }'| iconv -f ucs-2be -t utf8 | sort | php -r 'foreach(file("php://stdin") as $v) { echo var_export(substr($v, 0, -1)) . " => "" . str_pad(base_convert($i, 10, 36), 4, 0, STR_PAD_LEFT) . "",\n"; $i++; }'
This doesn't account for how complicated proper locale-specific sorting is. Multi-character strings do not sort just based on splitting them into characters and sorting those. You can have the same character sorting differently in different contexts. There are well-established libraries for Unicode sorting, and we certainly should not try to reinvent the wheel here.
Дана Tuesday 17 August 2010 22:11:32 Aryeh Gregor написа:
On Tue, Aug 17, 2010 at 4:06 PM, Nikola Smolenski smolensk@eunet.rs wrote:
For some time now, I am thinking about a stupidly simple solution:
php -r 'for($i = 0; $i < 65536; $i++) { echo pack("nx", $i); echo "\n"; }'| iconv -f ucs-2be -t utf8 | sort | php -r 'foreach(file("php://stdin") as $v) { echo var_export(substr($v, 0, -1)) . " => "" . str_pad(base_convert($i, 10, 36), 4, 0, STR_PAD_LEFT) . "",\n"; $i++; }'
This doesn't account for how complicated proper locale-specific sorting is. Multi-character strings do not sort just based on splitting them into characters and sorting those. You can have the same character sorting differently in different contexts. There are well-established libraries for Unicode sorting, and we certainly should not try to reinvent the wheel here.
All right; but right now we are not paying attention to character context too, and not properly sorting even single characters. I mean, we are sorting Ђ before А! Surely this would be an improvement?
On Tue, Aug 17, 2010 at 4:45 PM, Nikola Smolenski smolensk@eunet.rs wrote:
All right; but right now we are not paying attention to character context too, and not properly sorting even single characters. I mean, we are sorting Ђ before А! Surely this would be an improvement?
Yes, but it would probably be easier to just use CLDR or ICU to begin with, rather than trying to write and debug our own sort function.
On 17 August 2010 13:06, Nikola Smolenski smolensk@eunet.rs wrote:
Дана Tuesday 17 August 2010 20:37:44 Aryeh Gregor написа:
The code is currently enabled in trunk and is still awaiting review. It's basically complete, but there are some issues left:
- What sortkey algorithm to use? Currently it just ASCII uppercases
the words, which is okay for a proof-of-concept but doesn't actually solve bug 164.
For some time now, I am thinking about a stupidly simple solution:
php -r 'for($i = 0; $i < 65536; $i++) { echo pack("nx", $i); echo "\n"; }'| iconv -f ucs-2be -t utf8 | sort | php -r 'foreach(file("php://stdin") as $v) { echo var_export(substr($v, 0, -1)) . " => "" . str_pad(base_convert($i, 10, 36), 4, 0, STR_PAD_LEFT) . "",\n"; $i++; }'
This, more or less, should:
- Print every Unicode (UCS-2 only) character on its own line
- Sort that according to the current locale
- Print a PHP array to replace each Unicode character (UTF-8 encoded) with
appropriate base36 number
If an UTF-8 string is encoded with this array, the resulting strings should be sorted exactly the same as in the locale through mere ASCII sorting. Or am I missing something big? (Except contextual sensitivity, but it occurs relatively rarely and this should still be better than what we have now.)
You are missing most of it :). In many cases a single "letter" is made up of multiple code-points (of which there are considerably more than 65536 by the way) - think of Hungarian gy, then there are all kinds of conventions for sorting accents - in French you sort á after a but only if the rest of the word is spelt the same (i.e ab <- áb <- ac).
There is the ICU, and it is available to PHP (in some versions) http://docs.php.net/manual/en/class.collator.php, using those sort keys should be "good enough" for now I imagine. There are languages on Wiktionary that won't be in the ICU yet (just because they are ludicrously obscure) but it's probably best to start with something manageable.
Conrad
2010/7/22 Aryeh Gregor Simetrical+wikilist@gmail.com:
On Wed, Jul 21, 2010 at 6:18 PM, Daniel Kinzler daniel@brightbyte.de wrote:
While we definitly need efficient retrieval by namespace, the default sort key should *not* include the namespace prefix. it's very annoying that all files get sorted under "F" currently, or that pages from the Wikipedia namespace all end up under "W".
That's totally orthogonal and is like a one-line change. Probably you just have to change getPrefixedDBkey() to getDBkey() somewhere.
$wgCategoryPrefixedDefaultSortkey currently defaults to true, we could make that default to false instead.
Roan Kattouw (Catrope)
On 22/07/10 07:00, Aryeh Gregor wrote:
Categories, files, and other types of pages cannot be moved to one another, as far as I know (it would hardly make sense), so it automatically stays consistent this way.
This restriction is enforced by Title::isValidMoveOperation().
- Change the way category sortkeys are generated. Start them with a
letter depending on namespace, like 'C' for category, 'P' for regular page, 'F' for file. After that first letter, append a sortkey generated by ICU or whatever.
An alternative would be to add a column to the categorylinks table, say cl_type. It could be an ENUM or some short text type. Then the index could be altered to include this field at the start of it.
Presumably the rationale for combining these two things into cl_sortkey is to avoid a schema change, and to make the paging code slightly simpler. But I worry that future generations of MediaWiki developers will curse us for laziness and obfuscation.
I think Tim has opinions on what would be a good choice to convert the article title into sort key -- if not, I'll have to research it and hopefully not come up with a completely incorrect answer.
Well, I've said ICU, possibly with a PHP simulation of some Western European sort key algorithm for the benefit of users without access to ICU. But I formed that opinion years ago, and I never properly surveyed all the possible solutions in the first place. It probably makes sense to do a little of your own research.
Note that I specifically excluded the actual implementation of language-dependent sort keys from the requirements list when I wrote up this project. It could easily eat up a lot of time, and it's not necessary for a proof-of-principle implementation.
- On category pages, maintain three offsets and do three queries (or
maybe UNION them together, doesn't matter), one for each of categories/regular pages/files. Because of (1), this will be efficient and will also sort less unreasonably for non-English languages.
One problem that was pointed out somewhere in the massive useless discussion on bug 164 is that we'd have to do something to display the first letter for each section. Currently it's just the first letter of the sortkey, but if that's some binary string, that becomes a problem. I'm not seeing an obvious solution, since the sortkey-generation algorithm will be opaque to us. If it sorts Á the same as A, then how do we figure out that the "canonical" first letter for the section should be "A" and not "Á"? How do we even figure out where the sections begin or end? Would that even make sense in all cases? At a first pass, I'd say we should just skip the first letter and display all the items straight from beginning to end without section divisions. I don't think that's a big problem.
Roan is also asking for a store of the plain text form in this thread.
Work out how much space we would need to additionally store the category keys in plain text. Then we will know what sort of tradeoff we are looking at. Have you got a toolserver account you can use to do the sums?
Since we won't be sorting on the plain text form anymore, we could use some tricks to save space. For instance, if the sort key is the same as the article title, we could store NULL instead of another copy of the article title. That should save 95% or so.
-- Tim Starling
2010/7/22 Tim Starling tstarling@wikimedia.org:
An alternative would be to add a column to the categorylinks table, say cl_type. It could be an ENUM or some short text type. Then the index could be altered to include this field at the start of it.
Presumably the rationale for combining these two things into cl_sortkey is to avoid a schema change, and to make the paging code slightly simpler. But I worry that future generations of MediaWiki developers will curse us for laziness and obfuscation.
There is another reason to prefer this schema, which is that the orginially proposed one is susceptible to weird transition bugs. After this feature is deployed, there will be old-format (i.e. plain) sortkeys sticking around in the database for quite some time after (they won't go away until LinksUpdate fixes them), which means that pages whose sortkey starts with a C or F will be recognized as categories and files respectively, even if they're normal pages.
The best way to mitigate that is to populate the namespace information prior to deployment. In Tim's schema, that means filling the cl_type field based on page_namespace. In the sortkey prefix schema, that means prefixing the sortkey with the relevant sortkey, but that also requires the sortkey updating code has already been updated at that time (so it doesn't overwrite new-style sortkeys with old-style ones), which means you'd have to partially deploy the code while running the population script. Yuck.
Roan is also asking for a store of the plain text form in this thread.
I have since conceded that it takes space for almost zero gain (as it doesn't solve the first letter problem and doesn't have much gain for the API either), see my previous post.
Roan Kattouw (Catrope)
The only thing I would ask as a user (and filer of bug 1211, all those years ago) in terms of the interface:
Please don't remove the feature where the first letter of the sort key is displayed in the rendered category page, and if necessary add what it takes to keep it.
There are scripts where this will be a hard problem, but it's still much-used and much-loved in those where it isn't.
- d.
On Wed, Jul 21, 2010 at 05:00:43PM -0400, Aryeh Gregor wrote:
- We have no way to efficiently get all items that are in a category
and also in a particular namespace. Particularly, we can't retrieve all subcategories without scanning all items in the category, which is inefficient when we have a few (or no) subcategories and tons of items. (This is bugs 1211, 23682, and 450.)
"Categorymembers namespace filtering is inefficient, uses ugly hack in miser mode", https://bugzilla.wikimedia.org/show_bug.cgi?id=19640 , is also very related.
I think the goal for (2) should be to allow efficient separate retrieval of subcategories, files, and other pages, but not to distinguish between namespaces otherwise.
That wouldn't work if you did want to also fix bug 19640. It is quite plausible that a bot might want to query all ns0 pages in a category, or all talk pages, or all non-talk pages, or all templates, etc.
After reading the entire thread I've got several responses:
On Wed Jul 21 22:27:44 UTC 2010, Ariel T. Glenn ariel at wikimedia.org wrote:
It's much worse than that. On the Wiktionary projects, which have the modest goal of documenting all words in all languages *on each language edition*, you can expect to find german, french, arabic, etc. lemmas on say the russian project. Even worse, you can expect to find multiple lemmas on one page if the lemma has the same orthography in more than one language. If the sort order differs between languages it's loads of fun.
Ariel
Something that comes to mind is Wikimedia Commons. Would the same category with a different uselang be sorted differently ?
So far in this thread I've read that the current idea is to have different sorting depending on language. The language would be set to a certain default per wiki. And it would be possible to overwrite that per page or per category with a certain parser function (I think that would mean a certain piece of Technical metadata in a "Variable" type of Magic word (as defined here: http://www.mediawiki.org/wiki/Help:Magic_words#Variables ).
Perhaps an idea is to not set a default per wiki, but a default per language. That way it doesn't have to be set for every wiki that is in the same language. And is easily overridable for multilingual wikis by the uselang-parameter and/or the user's preference. Ofcourse the magic word would override that.
Though I don't know much about how a store-order is stored. It may be possible to maintain it in an interface message. That way a Germanic category on Commons or Wiktionary could use something like:
{{SORTORDER:{{MediaWiki:Categorysortorder/de}}}}
On Wed Jul 21 22:18:02 UTC 2010, Daniel Kinzler daniel at brightbyte.de wrote:
Aryeh Gregor schrieb:
- "Categories need to be structured by namespace",
https://bugzilla.wikimedia.org/show_bug.cgi?id=450
- "Natural number sorting in category listings",
While we definitly need efficient retrieval by namespace, the default sort
key
should *not* include the namespace prefix. it's very annoying that all files
get
sorted under "F" currently, or that pages from the Wikipedia namespace all
end
up under "W".
-- daniel
A good example of this are Templates. The vast majority of templates (as far as I can see) are sorted like this, with PAGENAME: [[Category:Special templates|{{PAGENAME}}]]
Setting to false by default may be good. But that would case all types of things to go mixed up. (except ofcourse for Files, Subcategories those would be seperated from 'Pages'). But templates, articles and users would be sorted by pagename. I'm not sure if that behaviour is good. I dont think it's that bad actually, if at all.
Being able to set per namespace may be a solution. Not sure yet.
On Fri Jul 23 03:02:49 UTC 2010, Tim Starling tstarling at wikimedia.org wrote:
On 23/07/10 02:34, Aryeh Gregor wrote:
On Thu, Jul 22, 2010 at 3:01 AM, Tim Starling <tstarling at wikimedia.org>
wrote:
This restriction is enforced by Title::isValidMoveOperation().
Any objections to changing this so files can't be moved over non-files or vice versa?
No objection. That's mostly how it is already, except when the file doesn't exist but the description page does.
No objection either. However, make sure that the other way around is blocked too. Else one might accidently move a page to the File-namespace without being able to get it back.
-- Krinkle
2010/7/23 Krinkle krinklemail@gmail.com
Something that comes to mind is Wikimedia Commons. Would the same category with a different uselang be sorted differently ?
No. That would require storing a separate ordering (set of munged sortkeys) in the database for each of the ~200 languages we support, which would take way too much space. For this reason, every category has exactly one ordering. It's fine if that ordering doesn't match the content language of the wiki, or if a page is sorted differently in different categories because they have different orderings, or if a category's sorting order is changeable, as long as each category is only sorted in one order at any one time.
No objection either. However, make sure that the other way around is blocked too. Else one might accidently move a page to the File-namespace without being able to get it back.
It's also needed to ensure consistency: the goal is to ensure that if a page is of type X (category, file or page), it cannot ever change its type. For this reason, we need to disallow moves to and from the File and Category namespaces.
Roan Kattouw (Catrope)
wikitech-l@lists.wikimedia.org