Category sorting in MediaWiki has always been done wrong. Categories are not sorted alphabetically, but in Unicode code point order. I don't know when or why DEFAULTSORT was introduced, but today it is mostly used for sorting people by surname, e.g. {{DEFAULTSORT:Wales, Jimmy}}, and for sorting other topics under a word other than the first, so that "European Commission" is sorted under C rather than E in http://en.wikipedia.org/wiki/Category:Institutions_of_the_European_Union
If you look closer at that category, you see that some items are sorted under E, which probably means somebody forgot to use DEFAULTSORT there:
* European Court of Auditors (0)
* European Union Mission (1)
* European quarter of Brussels (1)
What's even more remarkable is that "quarter" is sorted after "Union". This is because lower case letters sort after all the upper case letters in ASCII and Unicode. That is how broken category sorting is in MediaWiki.
Another example of broken sorting is when whitespace is compared to letters. In ASCII and Unicode, whitespace (position 32) sorts ahead of all printable characters. This means Moon illusion sorts ahead of Moonbow in http://en.wikipedia.org/wiki/Category:Moon because the whitespace before "illusion" is compared to the b in Moonbow. I'm not sure if this is correct in English, but in Swedish it is wrong; bow should sort before illusion, regardless of the whitespace.
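To make the failure mode concrete: Python's default string sort also compares by raw code point, so it reproduces both problems described above (this is only an illustration, not MediaWiki code):

```python
# Code point order, exactly what MediaWiki's binary category sort does.
titles = ["European Court of Auditors", "European Union Mission",
          "European quarter of Brussels"]
print(sorted(titles))
# "European quarter of Brussels" lands last: 'q' (U+0071) sorts after 'U' (U+0055).

moon = ["Moonbow", "Moon illusion"]
print(sorted(moon))
# "Moon illusion" lands first: the space (U+0020) sorts before 'b' (U+0062).
```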
There is a way to avoid all such problems, namely a more aggressive use of DEFAULTSORT that lowercases every upper case letter (except the initial one) and removes all whitespace and commas from the sort key. It would mean almost every article needs a DEFAULTSORT. In the examples above:
{{DEFAULTSORT:Walesjimmy}}
{{DEFAULTSORT:Europeancourtofauditors}}
{{DEFAULTSORT:Europeanunionmission}}
{{DEFAULTSORT:Europeanquarterofbrussels}}
{{DEFAULTSORT:Moonillusion}}
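The proposed normalization is mechanical, which is why a bot could apply it. A sketch in Python (the helper name is hypothetical; it is not part of MediaWiki or any bot framework):

```python
def default_sort_key(title: str) -> str:
    """Sketch of the proposed rule: drop whitespace and commas,
    keep the initial capital, lower-case everything after it."""
    stripped = title.replace(" ", "").replace(",", "")
    return stripped[:1] + stripped[1:].lower() if stripped else ""

print(default_sort_key("Wales, Jimmy"))                # Walesjimmy
print(default_sort_key("European Court of Auditors"))  # Europeancourtofauditors
print(default_sort_key("Moon illusion"))               # Moonillusion
```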
This can be done with bots, for sure, if we agree that it should be done. Is this something we should strive for? Has any language of Wikipedia (or Wikinews or...) already started to do this?
2009/5/11 Lars Aronsson lars@aronsson.se:
Category sorting in MediaWiki has always been done wrong. Categories are not sorted alphabetically, but in Unicode order.
Sure thing. See https://bugzilla.wikimedia.org/show_bug.cgi?id=164 with 95 votes…
Another example of broken sorting is when whitespace is compared to letters. In ASCII and Unicode, whitespace (position 32) sorts ahead of all printable characters. This means Moon illusion sorts ahead of Moonbow in http://en.wikipedia.org/wiki/Category:Moon because the whitespace before "illusion" is compared to the b in Moonbow. I'm not sure if this is correct in English, but in Swedish it is wrong; bow should sort before illusion, regardless of the whitespace.
In Czech, this is correct: Czech collation works word by word. As you can see, the rules are language-specific.
There is a way to avoid all such problems, namely a more aggressive use of DEFAULTSORT that lowercases every upper case letter (except the initial one) and removes all whitespace and commas from the sort key.
The problem is much more difficult than that (see the linked bug). Commas, case sensitivity and whitespace are a trivial problem in comparison with non-ASCII letters.
-- [[cs:User:Mormegil | Petr Kadlec]]
Petr Kadlec wrote:
The problem is much more difficult than that (see the linked bug). Commas, case sensitivity and whitespace are a trivial problem in comparison with non-ASCII letters.
It is exactly this attitude of "oh, it's so difficult" that has blocked bug 164 from being fixed. I doubt it will get fixed in the coming year. But the more aggressive use of DEFAULTSORT could be applied within a year. I know this is not the "right" or most advanced solution, it's just an improvement that *can be made*.
2009/5/12 Lars Aronsson lars@aronsson.se:
Petr Kadlec wrote: It is exactly this attitude of "oh, it's so difficult" that has blocked bug 164 from being fixed.
Well, not really. Bug 164 would be fixed almost completely for Czech-language wikis by using database features designed for exactly this problem. [1] But, I guess you know the situation.
I doubt it will get fixed in the coming year. But the more aggressive use of DEFAULTSORT could be applied within a year. I know this is not the "right" or most advanced solution, it's just an improvement that *can be made*.
More aggressive use of DEFAULTSORT could be applied right now, and some people already do that. But to ask people to create strange codes (like Šácholanotvaré → SŠahωolanotvare [2]) to enforce proper sorting… I can’t believe this would be much of an improvement… (Not to mention that users would make many errors with such a complicated system.)
If Swedish sorting rules are simple enough that removing all whitespace and punctuation and converting to lower case would solve most of the problems, I would say that such a feature would not be too difficult to implement right into MediaWiki (into LanguageSv.php); writing those DEFAULTSORT codes explicitly into every article would be nonsense, IMHO. (So, go ahead with it, I won’t stop you or anything; I’m just trying to say that this is not really a solution for the Czech language.)
-- [[cs:User:Mormegil | Petr Kadlec]]
[1] http://dev.mysql.com/doc/refman/4.1/en/charset-collation-effect.html [2] Really! http://jdem.cz/beqz6 (And that is only a partial solution, implementing only about a half of the Czech sorting rules.)
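For readers without MySQL 4.1+ at hand, the idea behind [1] -- letting the database sort through a collation rather than through hand-written sort keys -- can be illustrated with SQLite's application-defined collations (Python stdlib). The collation shown is a toy that folds case and whitespace, not a real language collation:

```python
import sqlite3

def casefold_nospace(a: str, b: str) -> int:
    # Application-defined comparison rule: ignore spaces and letter case.
    ka = a.replace(" ", "").casefold()
    kb = b.replace(" ", "").casefold()
    return (ka > kb) - (ka < kb)

con = sqlite3.connect(":memory:")
con.create_collation("folded", casefold_nospace)
con.execute("CREATE TABLE page (title TEXT)")
con.executemany("INSERT INTO page VALUES (?)",
                [("Moonbow",), ("Moon illusion",)])
rows = [r[0] for r in
        con.execute("SELECT title FROM page ORDER BY title COLLATE folded")]
print(rows)  # 'Moonbow' now sorts first: 'b' compares before 'i'
```

The article text stays untouched; the ordering rule lives entirely in the database layer, which is the point of the MySQL collation approach.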
On 5/12/09 4:18 PM, Petr Kadlec wrote:
2009/5/12 Lars Aronsson lars@aronsson.se:
Petr Kadlec wrote: It is exactly this attitude of "oh, it's so difficult" that has blocked bug 164 from being fixed.
Well, not really. Bug 164 would be fixed almost completely for Czech-language wikis by using database features designed for exactly this problem. [1] But, I guess you know the situation.
* Not available until MySQL 5 migration complete (Domas, what's the status on testing?)
* Collation use for sorting needs to be double-checked to confirm it wouldn't interfere with present uniqueness constraints
* Collations not available in MySQL for all languages
* Multilingual sites possibly not well served by table-wide language-specific coding
Doing our own localized sort key encoding and adding another indexed column to sort on would avoid some dependency issues but has its own deployment and maintenance difficulties.
It would also be possible to use a separate column for the collated sorting while using MySQL 4.1+'s native collations, if the uniqueness constraints are a problem, but this is still dependent on rolling out an upgrade from 4.0.
-- brion
On Tue, May 12, 2009 at 4:38 PM, Brion Vibber brion@wikimedia.org wrote:
- Collation use for sorting needs to be double-checked to confirm it wouldn't interfere with present uniqueness constraints
Since cl_sortkey isn't part of any unique key, this appears not to be an issue for this use. Of course, it's an issue for every other sorted list of titles, but those can't have custom sort keys specified to begin with and don't seem to be included in this proposal. Perhaps they should be, though. In that case we'd probably end up needing an extra column in every single table that includes the page title, just for sorting (but we'd be able to use flexible algorithms to generate the sort key, rather than being stuck with MySQL's).
- Multilingual sites possibly not well served by table-wide language-specific coding
utf8 sorting would be a lot better than binary sorting for any site, I'm pretty sure. (I assume utf8 sorts sanely and not according to codepoint.)
Doing our own localized sort key encoding and adding another indexed column to sort on would avoid some dependency issues but has its own deployment and maintenance difficulties.
You don't need another column for categorylinks, you can use the existing cl_sortkey, so that should be relatively easy to deploy. It doesn't help with non-category use cases, of course.
It would also be possible to use a separate column for the collated sorting while using MySQL 4.1+'s native collations, if the uniqueness constraints are a problem, but this is still dependent on rolling out an upgrade from 4.0.
In that case we may as well make it like cl_sortkey and populate it ourselves, surely.
On 5/12/09 1:49 PM, Aryeh Gregor wrote:
On Tue, May 12, 2009 at 4:38 PM, Brion Vibber brion@wikimedia.org wrote:
- Collation use for sorting needs to be double-checked to confirm it wouldn't interfere with present uniqueness constraints
Since cl_sortkey isn't part of any unique key, this appears not to be an issue for this use. Of course, it's an issue for every other sorted list of titles, but those can't have custom sort keys specified to begin with and don't seem to be included in this proposal. Perhaps they should be, though. In that case we'd probably end up needing an extra column in every single table that includes the page title, just for sorting (but we'd be able to use flexible algorithms to generate the sort key, rather than being stuck with MySQL's).
As a general issue we also need to consider managing paging through collation-sorted lists, since sort keys for different inputs may produce the same result. At the moment I think category lists are paged by offset (bad!) but we should ensure this is planned for.
- Multilingual sites possibly not well served by table-wide language-specific coding
utf8 sorting would be a lot better than binary sorting for any site, I'm pretty sure. (I assume utf8 sorts sanely and not according to codepoint.)
Well, "utf8" doesn't tell you anything specific there... :) There's a "general" collation as well as a "binary" one; the latter would be the same as what we do now (except for not supporting 4-byte characters AT ALL).
http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html
For a multilingual site we'd probably end up using utf8_unicode_ci, which at least partially implements the Unicode Collation Algorithm (UCA). That sounds kind of confusing, since even a glance at http://www.unicode.org/reports/tr10/ makes it quite explicit that collation properties are language-dependent... presumably it's an un-tailored version which won't have most language-specific properties.
Doing our own localized sort key encoding and adding another indexed column to sort on would avoid some dependency issues but has its own deployment and maintenance difficulties.
You don't need another column for categorylinks, you can use the existing cl_sortkey, so that should be relatively easy to deploy. It doesn't help with non-category use cases, of course.
You would if you need to store a processed sort key index that's not in the form of displayable characters. (eg, the output of the UCA)
It would also be possible to use a separate column for the collated sorting while using MySQL 4.1+'s native collations, if the uniqueness constraints are a problem, but this is still dependent on rolling out an upgrade from 4.0.
In that case we may as well make it like cl_sortkey and populate it ourselves, surely.
For the unique case of categorylinks yes. For everything else, additional columns are not already present.
-- brion
On Tue, May 12, 2009 at 5:46 PM, Brion Vibber brion@wikimedia.org wrote:
As a general issue we also need to consider managing paging through collation-sorted lists, since sort keys for different inputs may produce the same result. At the moment I think category lists are paged by offset (bad!) but we should ensure this is planned for.
Category lists use Pager, so they're paged by index offsets, not LIMIT M, N. Note that they should probably be ordered by (cl_sortkey, cl_from) or something instead of just (cl_sortkey) -- currently, equal sortkeys will cause problems. But Pager doesn't support multi-key sort right now.
I'm not sure what you mean here, though. What does "sort keys for different inputs may produce the same result" mean? You're just talking about sort key conflicts? In that case it seems best to just disambiguate by whatever's handy, in this case cl_from (which is the page_id and so not very meaningful). If it's coming up often enough to be a problem, the sort keys should be improved!
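Ordering by (cl_sortkey, cl_from) amounts to keyset pagination over a composite key. A minimal sketch, with plain Python tuples standing in for categorylinks rows (the column meanings are only mimicked, nothing here is MediaWiki code):

```python
# Rows as (sortkey, page_id); duplicate sort keys are made unambiguous by
# page_id, so paging can never repeat or skip an entry.
rows = [("Europe", 7), ("Europe", 3), ("Austria", 12), ("Europe", 9)]
ordered = sorted(rows)  # tuple comparison = ORDER BY sortkey, page_id

def next_page(after, limit=2):
    """Return the rows strictly after the (sortkey, page_id) cursor."""
    return [r for r in ordered if r > after][:limit]

page1 = next_page(("", 0))
print(page1)                 # [('Austria', 12), ('Europe', 3)]
print(next_page(page1[-1]))  # [('Europe', 7), ('Europe', 9)]
```

The cursor for the next page is simply the last (sortkey, page_id) pair shown, which is what makes the paging stable even when many entries share a sort key.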
You don't need another column for categorylinks, you can use the existing cl_sortkey, so that should be relatively easy to deploy. It doesn't help with non-category use cases, of course.
You would if you need to store a processed sort key index that's not in the form of displayable characters. (eg, the output of the UCA)
Why? cl_sortkey isn't ever displayed to the user, so I don't see why it couldn't contain binary characters. I guess it's in the URL of links past the first page, but that's not a huge deal. Although it is a definite downside I didn't think of (it's nice to have manually-editable URLs!).
It would also be possible to use a separate column for the collated sorting while using MySQL 4.1+'s native collations, if the uniqueness constraints are a problem, but this is still dependent on rolling out an upgrade from 4.0.
In that case we may as well make it like cl_sortkey and populate it ourselves, surely.
For the unique case of categorylinks yes. For everything else, additional columns are not already present.
I was saying that if we were going to make extra columns, we may as well roll our own sort keys instead of bothering with collations, since it's not like we'd save a column. But of course if rolling our own would mean two extra columns instead of one, that would be a definite downside. Still, MySQL's collation support is unlikely to ever extend to nearly as many languages as we support, and it can't handle niceties like eliding initial "A" or "The" in English, say. So it doesn't seem like as good a solution.
On 5/12/09 3:01 PM, Aryeh Gregor wrote:
On Tue, May 12, 2009 at 5:46 PM, Brion Vibber brion@wikimedia.org wrote:
As a general issue we also need to consider managing paging through collation-sorted lists, since sort keys for different inputs may produce the same result. At the moment I think category lists are paged by offset (bad!) but we should ensure this is planned for.
Category lists use Pager, so they're paged by index offsets, not LIMIT M, N. Note that they should probably be ordered by (cl_sortkey, cl_from) or something instead of just (cl_sortkey) -- currently, equal sortkeys will cause problems. But Pager doesn't support multi-key sort right now.
Ah, even better -- it's already broken! :)
You don't need another column for categorylinks, you can use the existing cl_sortkey, so that should be relatively easy to deploy. It doesn't help with non-category use cases, of course.
You would if you need to store a processed sort key index that's not in the form of displayable characters. (eg, the output of the UCA)
Why? cl_sortkey isn't ever displayed to the user
It sure is -- the first letter of each sort key entry is displayed in a nice large type in the category list.
-- brion
On Tue, May 12, 2009 at 7:33 PM, Brion Vibber brion@wikimedia.org wrote:
Ah, even better -- it's already broken! :)
Yup. Can be fixed without schema changes, though, at least.
It sure is -- the first letter of each sort key entry is displayed in a nice large type in the category list.
Good point. That would have to be adjusted. The effective first letter could probably be extracted from the sort key somehow, even if it's binary. If not, we could cheat and make it the last character or something.
Hi
- Not available until MySQL 5 migration complete (Domas, what's the status on testing?)
I did testing just for binary schema, I'm happy about application support part, need some DBA-ish features to be merged though :)
Still, do note that MySQL 5's utf8 supports only the BMP, and will cut strings if there's anything beyond the BMP in there.
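Since MySQL 5's utf8 charset stores at most three bytes per character, any title containing a code point above U+FFFF (outside the Basic Multilingual Plane) would be cut. A quick pre-flight check for such titles (hypothetical helper, for illustration only):

```python
def beyond_bmp(s: str) -> bool:
    """True if the string contains any code point outside the BMP,
    i.e. above U+FFFF, which 3-byte utf8 cannot represent."""
    return any(ord(c) > 0xFFFF for c in s)

print(beyond_bmp("Moon illusion"))      # False: plain ASCII
print(beyond_bmp("Gothic \U00010330"))  # True: U+10330 is outside the BMP
```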
- Collation use for sorting needs to be double-checked to confirm it wouldn't interfere with present uniqueness constraints
It will. If you want proper sorting, you'll end up using case-insensitive unique constraints.
- Collations not available in MySQL for all languages
Well, we can ask people to describe those collations themselves, in theory. In practice - too much work, and people would just have to use generic UCA sorting rules.
- Multilingual sites possibly not well served by table-wide language-specific coding
Generic Unicode is good enough, usually.
It would also be possible to use a separate column for the collated sorting while using MySQL 4.1+'s native collations, if the uniqueness constraints are a problem, but this is still dependent on rolling out an upgrade from 4.0.
There's one more issue - if we go for utf8 columns, the 255-character limit is somewhat big. Are we going to reduce column sizes? What problems would that bring upon us?
Cheers, Domas http://dammit.lt/
Hoi, There is no need to ask people to define collation orders. This is exactly what the CLDR (a Unicode standard) is there for. Thanks, GerardM
Hoi, http://www.unicode.org/reports/tr10/#Introduction http://cldr.unicode.org/index/cldr-spec/collation-guidelines Thanks, GerardM
2009/5/13 Domas Mituzas midom.lists@gmail.com
Hi!
There is no need to ask people to define collation orders. This is exactly what the CLDR (a Unicode standard) is there for.
Now I'm wondering, is it you being ignorant or me.
Cheers, Domas
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Hi!
This is not CLDR; this is the general collation algorithm.
http://cldr.unicode.org/index/cldr-spec/collation-guidelines
CLDR is a repository/process for LDMLs (that's what I referred to: people sending us that data, in case the current data is wrong or missing). Currently it has mistakes and multiple versions even for the same locales - it doesn't seem to be too stable nor correct.
An example: http://unicode.org/cldr/data/common/collation/lt.xml?rev=1.26&content-ty... ;-)
Do note that such unstable changes require database rebuilds at each iteration. So, we'd have to have someone reviewing it all, comparing it with different sources, and then pushing it once every few years into some data staging environment where we do data conversions all the time? :) riiight...
Domas
Hoi, The introduction demonstrates that Unicode indeed deals with collation.
When you look at the characters in Unicode, you will find that the Unicode standard is very much a work in progress. When you look at the CLDR, you will find that it is also very much a work in progress. HOWEVER, for many languages the collation has been well defined and is unlikely to change. For African languages, there is a project called Afrigen that is collecting the relevant information necessary for inclusion in the CLDR.
I am not impressed by your argument that you will have to rebuild the sorting order when there are indeed changes to a collation order. First of all, standards like the CLDR have releases, so these iterations only happen when a new release becomes available; second of all, it seems weird to me to refuse to implement an improved collation order when the current one is wrong in the first place.
I have always been told that we develop and implement open source in order to create open content using open standards. In my opinion you have not provided any argument why any other approach is preferable. In this case the CLDR is an applicable open standard.
When, as a consequence of an improved collation order for particular languages, we have to rebuild databases every now and again, then it is tough, but it needs to be done. It is all part of normal and acceptable system management. Thanks, GerardM
http://o2.it46.se/afrigen/statistics.php
Hi!
I have always been told that we develop and implement open source in order to create open content using open standards. In my opinion you have not provided any argument why any other approach is preferable. In this case the CLDR is an applicable open standard.
I wonder why you call it 'a standard'; markup is a standard, data is not. This is what Wikipedia says:
"The Common Locale Data Repository Project, often abbreviated as CLDR, is a project of the Unicode Consortium to provide locale data in the XML format for use in computer applications. CLDR contains locale specific information that an operating system will typically provide to applications. "
I'd prefer using binary sort; then we don't have to change anything, and everything is done extremely efficiently :-) Anyway, there're lots and lots of implementation details/problems.
It is easy to point at a collection of data; it is not that easy to merge it into a production environment, handle data conflicts, staging, etc. Do you want to get a few people working full-time just on this?
Shrug, Domas
Hoi, The point of this thread is that the collation is not working well. A hack is proposed, and in my opinion it makes better sense to apply what is effectively the standard for collation. The way the CLDR is managed is very much with the same care as is usual for standards. Several of the people who I know are involved with the CLDR are also involved in standards and RFCs. Your assertion that data cannot be part of a standard... what is this based on? I know that ISO is considering the implementation of something they call "data as a standard".
You mention that you prefer binary sorts. Is this an approach where one size should fit all? Because if it is, you have an approach that is broken by design. Collation can differ from language to language. As a matter of fact, the original Dutch collation is no longer used because of the tyranny of people who did not want to take the conventions of "other" languages into consideration.
When it comes to collation, I would not mind if Oracle had people whose day job is to implement proper collation for the many languages that exist. When the WMF needs collation, it would use this functionality, which I would expect to be available as standard in MySQL.
You have to appreciate that Wikipedia is currently localised in over 300 languages, and you just cannot shrug away the complexities that come with this. If it takes a few people having a full-time job just to support our languages properly, it would be completely justified. Thanks, GerardM
Gerard Meijssen wrote:
The point of this thread is that the collation is not working well. A hack is proposed and in my opinion it makes better sense to apply what is effectively the standard for collation.
Gerard, I don't think you are being helpful. My point in proposing the DEFAULTSORT hack (which is totally ugly) is that the WMF technicians are unable to address the issue. If they were able to "apply a standard", bug 164 would not have been open for more than four years. Sure, there are standards, and they should be implemented at the database level (in MySQL, not in MediaWiki), which I think they are in MySQL version 6. It remains an open question how many more years we will have to wait for WMF to upgrade to that version. In the meantime, perhaps we should start (today?) to use DEFAULTSORT more aggressively?
Hoi, It is not that they cannot address the issue, it is that they have not addressed the issue. Given the amount of work that is on their plate and given the ambitions for MediaWiki, it is not surprising that this is the case. If you feel I am not being helpful because my position could derail your proposal, that is fine. We both agree that your proposal is a hack, and, given the reaction of Domas, it is plain that he was not even aware of this applicable standard; therefore it is equally plain that no thought has been given to implementing the collation orders available in the CLDR.
One other fine reason why it is good to oppose this hack is because it will not work for other scripts particularly some of the Indian scripts. Consequently I do not think that your proposal should be implemented at all. Thanks, GerardM
Hi!
Gerard and I had a chat about this stuff today; here is the transcript, in case anyone is interested:
[11:47:03] GerardM- Domas, could you explain to me why it would make a full time job to implement proper collations ? Would it not make sense when it is part of the standard MySQL ?
[11:47:37] domas well, a) someone would have to actually fix CLDR stuff
[11:47:56] GerardM- what is wrong with it ?
[11:47:56] domas see, if proposed/draft is better than 'stable', etc
[11:47:59] domas it has errors
[11:48:10] domas I looked at random collation, and could see errors
[11:48:48] domas example: http://unicode.org/cldr/data/common/collation/lt.xml?rev=1.26&content-ty...
[11:49:36] GerardM- what is wrong with it ? and also is there a better approach to this ?
[11:49:52] GerardM- it seems to me that having a standard for this is the way to go
[11:50:03] domas well, status quo is people maintaining this separately
[11:50:07] domas we at mysql maintain it separately
[11:50:10] domas oracle maintain it separately
[11:50:10] domas etc
[11:50:14] domas CLDR is not a standard yet
[11:50:25] domas it is a good source of information, yes
[11:50:37] GerardM- you mean it is not THE standard that everyone adheres to
[11:50:44] domas it is not standard per se
[11:50:48] domas CLDR is a project, a collection
[11:50:59] domas a repository
[11:51:16] GerardM- it is treated with the same care as other standards .. eg ISO-639
[11:51:29] domas well, probably the care is similar, yes, just the amount of data there is huge
[11:51:38] domas and it is a bit sloppy
[11:51:45] domas I'm sure ISO639 has way less data
[11:51:46] domas :)
[11:51:48] domas anyway
[11:51:52] domas b)
[11:51:56] domas actual merging of this
[11:52:14] domas this includes validating if the collation is _indexable_
[11:52:32] domas see, this defines comparison rules
[11:52:50] domas to simplify
[11:52:59] domas this is possible: a>b, b>c, c>a
[11:53:09] GerardM- given that we are talking about UTF-8 collation, there should be no problem ordering them
[11:53:19] domas no, there is always problem
[11:53:30] domas say, in some collation 'll' is same as 'l'
[11:53:35] domas (thats real fact ;-)
[11:53:55] GerardM- in proper Dutch collation ij is one character
[11:53:58] domas right
[11:54:05] domas how to index that?
[11:54:48] domas I've opened quite a few bugs on issues like that in past few years :)
[11:54:52] domas in mysql
[11:54:56] GerardM- the UTF-8 solution would be that there is a seperate character for the ij
[11:55:15] domas yes, but if people enter 'ij' it has to be treated in same way as that special character
[11:55:20] GerardM- there are many such issues
[11:55:22] GerardM- yes
[11:55:24] domas and how does prefixindex work for 'i%' ?
[11:55:30] domas should it include ij's or not?
[11:55:32] domas :)
[11:55:38] domas (it won't)
[11:55:41] domas oh well.
[11:55:44] domas or it may
[11:55:46] domas but it will break indexing
[11:56:00] domas c)
[11:56:04] GerardM- ok so the question is how can we have MySQL and the CLDR people work together
[11:56:25] GerardM- because it is in everyone's interest to get to a standard solution
[11:57:09] domas well, mysql supports LDMLs or whichever was the format
[11:57:14] domas here we come to c)
[11:57:18] domas and c) is
[11:57:23] domas deployment
[11:57:47] domas there're two major problems here
[11:58:19] domas during conversion to different charset one would have diverging datasets in different orders
[11:58:45] domas so, replication can cause data drift
[11:58:53] domas another is uniqueness
[11:59:00] domas e.g. the change I showed you in lt.xml
[11:59:09] domas it is changing unique constraints
[11:59:21] domas 'i' and 'y' were separate in 'standard', and are together in 'draft'
[12:00:06] domas so, there would have to be a way to resolve all the collisions in page titles
[12:00:15] domas which is no longer a technology problem, as community has to be involved
[12:00:50] GerardM- sorting and constraints are imho separate things
[12:00:57] domas you think so :))
[12:02:16] GerardM- when i and y are equal, you show them next to each other
[12:02:17] GerardM- when the constraint has it that there can only be one i or y it is a separate rule that can be implemented in a different way
[12:02:24] domas well
[12:02:31] domas you cannot have different constraints for ordering/uniqueness
[12:02:33] GerardM- I am sure that it can be implemented in a stored procedure
[12:02:36] domas as in, different rulesets
[12:03:51] domas tertiary weights would help here, probably, but on the other hand
[12:03:56] domas 'i' and 'y' _are_ same :)
[12:04:09] domas or, well, who is authority on uniqueness then
[12:04:13] domas if we have one on sorting
[12:04:36] GerardM- the issue you raised was how to deal with existing data where i and y both exist
[12:05:06] domas well, or 'i' and 'I' as this would lead to case insensitive unique constraints
[12:05:31] GerardM- so allowing for existing garbage means you have to be able to sort. uniqueness rules would then exist for new content
[12:05:49] domas you can't
[12:05:56] domas uniqueness rules are enforced by indexing
[12:06:11] GerardM- that is one way of doing it
[12:06:25] GerardM- technically the best
[12:06:35] domas I understand the other way
[12:06:39] domas and I don't want to hear about it ;-)
[12:06:44] GerardM- <grin>
[12:07:04] domas it would probably mean converting all the indexing into non-unique
[12:07:12] domas not having any varchar PKs
[12:07:31] domas building another layer for multiple-weight comparisons
[12:07:51] domas etc
[12:07:57] domas complexity is insane
[12:08:01] domas we don't do insanely complex things :))
[12:08:09] GerardM- So the question is, how do we get to an open standard for this ?
[12:08:23] GerardM- and how do we get some proper sorting for MediaWiki ?
[12:08:24] domas well, let's say CLDR _is_ an open standard
[12:08:38] domas even then, there's huge work to migrate all that data to MW's data store
[12:08:40] domas and maintain it there
[12:08:48] domas as by a), b), c)
[12:08:50] domas let me add d)
[12:09:01] GerardM- CLDR data IS already implemented in some of the extensions !!
[12:09:03] domas amount of DBA work, with staging
[12:09:12] domas CLDR data is easy to implement
[12:09:20] domas some of CLDR data is implemented as mysql collations, or will be
[12:09:27] domas maybe all mysql collations will move to CLDR data in future
[12:09:46] domas bugs like that get opened: http://bugs.mysql.com/bug.php?id=37898
[12:10:27] GerardM- That is cool ..
[12:10:47] GerardM- so people can ask MySQL to implement particular collations ?
[12:10:48] domas and... collation changes are _incompatible_changes_ on datasets
[12:10:54] domas of course
[12:11:04] GerardM- do they get implemented ?
[12:11:08] domas takes a while
[12:11:14] domas it goes to next alpha/development release
[12:11:28] GerardM- right but that is normal in standards as well
[12:11:50] GerardM- so how long would it take .. given a positive approach
[12:12:55] domas make mediawiki support CLDR - easy
[12:12:59] domas as in, the stack
[12:13:12] domas make the stack support CLDR as a process suitable for Wikimedia - extreme pain
[12:14:25] GerardM- on the phone
[12:14:28] domas okie
[12:14:35] domas (btw, can I publish this transcript on mailing list? :)
[12:14:45] domas it is quite informative
[12:15:14] GerardM- <grin> we are not done talking and I was to ask you if I could blog about this :)
[12:15:34] domas I'm just asking in advance!
[12:15:43] domas btw, as a fun note
[12:15:59] domas CLDR does not resolve problems, where dictionaries are needed for sorting
[12:16:07] domas my favorite example, provided by River: Welsh.
[12:16:13] domas \o/
[12:17:10] domas gonna go grab some coffee too
[12:20:29] domas hold on, switching back from GPRS to DSL
[12:20:39] * Disconnected ().
[12:20:50] * Disconnected ().
[12:21:00] * Disconnected ().
[12:21:46] * domas_ sets mode -e domas_
[12:21:46] * services. sets mode +e domas
[12:21:49] domas back!
[12:45:19] GerardM- Hoi
[12:45:25] GerardM- back from the phone
[12:46:25] GerardM- given that there are over 7000 languages and that there are over 25000 linguistic entities (including dialects and orthographies) it is certainly a "growth area"
[12:48:28] GerardM- What I am looking for is that we at least support the WMF languages properly and that is already enough of a challenge
[13:04:59] domas hehe
[13:05:02] domas as I've said
[13:05:03] domas for now
[13:05:06] domas it is huge huge effort
[13:05:35] GerardM- we can do one language at a time <grin> and start with English </grin>
[13:07:40] GerardM- What you do not say is that there is an alternative to this ...
so it is a job that needs doing [13:08:11] GerardM- and imho making sure that it fits in with what is the standard makes it a best effort [13:08:32] domas english is easiest [13:08:35] domas kind of [13:08:44] domas because we sort english properly even now %) [13:09:18] domas and to have this going forward, we need multiple people to work on this project [13:09:34] domas DBA, engine development, content conflict testing, collation evaluation, etc [13:09:44] domas + more hardware for regular staging :) [13:09:59] domas + rethinking of case sensitivity issues [13:10:15] GerardM- if we would do it TOGETHER with the Unicode people, we could get funding for it [13:10:37] domas I personally am too lazy to work on this project :)) [13:11:38] GerardM- there are other things where you can make a difference as well [13:12:19] GerardM- I learned from Siebrand at Berlin about something that would make our servers more efficient and thereby allow for growth for a couple of months ... [13:12:28] GerardM- had to do with loading messages [13:12:42] domas thats for appservers [13:12:43] domas :) [13:12:46] domas thats my plan! [13:12:49] domas my project! [13:13:12] GerardM- That is certainly important [13:13:32] GerardM- and <grin> there is certain to be much more that needs doing [13:13:37] domas yes [13:13:49] domas all this CLDR stuff is _very_ complex, when it comes to actual implementation [13:14:00] domas do note, we're already way ahead of curve of others when it comes to unicode [13:14:11] domas when I had to talk about certain charset/collation issues in a conference [13:14:19] domas the only public implementation of unicode normalization in software was mediawiki [13:14:20] domas :) [13:14:30] GerardM- <grin> you know my rants about Lingala ? We sure have a long way to go !! </grin> [13:22:03] GerardM- Would MySQL be interested in this project ? [13:22:55] domas file a bug, and see! 
:) [13:23:20] GerardM- this would be to implement CLDR collation on the Wikipedia projects [13:23:48] GerardM- for them it would be a research project ? [13:24:00] domas why research project [13:24:03] domas we know all the problems already :) [13:24:19] GerardM- the problems of implementing CLDR ? [13:24:29] domas as I've said, MySQL supports LDML [13:24:48] domas http://dev.mysql.com/doc/refman/5.0/en/adding-collation-unicode-uca.html [13:25:07] domas as I've said [13:25:09] domas CLDR is just a repo [13:25:13] GerardM- skyoing [13:25:15] domas LDML is spec, standard [13:30:04] GerardM- reading about LDML to find the relation to the collation [13:32:41] domas lol [13:32:47] domas Each collation must have a unique ID. To add a new collation, you must choose an ID value that is not currently used. The value must be in the range from 1 to 254. The collation ID that you choose will show up in these contexts: [13:32:48] domas hahahahaha [13:32:51] domas thats awesome limitation [13:33:15] domas As of MySQL 6.0.8, the range of IDs from 1024 to 2047 is reserved for user-defined collations. [13:33:18] domas thats better [13:33:23] GerardM- yes [13:33:24] GerardM- indeed [13:36:06] GerardM- so how do we currently deal with the languages from India where the order of Unicode is almost certainly to be wrong [13:36:17] domas well, currently we're using byte order [13:36:24] domas it is not any kind of unicode order [13:36:35] GerardM- so there is no proper sorting [13:36:36] domas as utf8 is variable length, offsets of character starts are different [13:43:57] GerardM- fyi Martin Benjamin is working on the Afrigen project, a project to get CLDR data for African languages [13:44:20] domas as I've said [13:44:22] domas using that data statically [13:44:28] domas on a mediawiki deployment [13:44:30] domas is very easy [13:44:34] domas the process is difficult [13:45:36] GerardM- is having this deployed in different ways per database instance going to have an impact ? 
[13:46:03] domas it isn't problem with an instance [13:46:06] domas it is problem with data [13:46:45] GerardM- right and this get you in either conversions or in inabilties to deal with them [13:47:13] domas yup [13:47:18] domas it is both problematic and labor-intensive work [13:47:32] domas anyway [13:47:38] domas switching to generic unicode sorting [13:47:41] domas would be huge step ahead [13:47:55] domas but we're not ready for that atm, simpy because until mysql6.0, there's no unicode support outside BMP [13:48:45] GerardM- so when would mysql6 be an option ? [13:49:19] domas *shrug*, it is alpha now [13:49:50] GerardM- <grin> it would be an option when you say.. I will support it </grin> [13:50:10] domas yes, kind of
On Thu, May 14, 2009 at 7:38 AM, Domas Mituzas midom.lists@gmail.com wrote:
[13:36:06] GerardM- so how do we currently deal with the languages from India where the order of Unicode is almost certainly to be wrong
[13:36:17] domas well, currently we're using byte order
[13:36:24] domas it is not any kind of unicode order
[13:36:35] GerardM- so there is no proper sorting
[13:36:36] domas as utf8 is variable length, offsets of character starts are different
Well, a binary sort of UTF-8 is code-point order. One-byte characters start with 0, two-byte characters start with 110, three-byte characters start with 1110, four-byte characters start with 11110, so they'll always sort as 1-byte < 2-byte < 3-byte < 4-byte, and the variable length makes no difference. But code point order isn't very good: even in English, Z < a, let alone languages with diacritics or whatnot.
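Both halves of that claim can be checked with a short sketch (Python here, purely for illustration; Python compares strings by code point and bytes objects bytewise):

```python
# A bytewise sort of UTF-8 gives exactly the same order as a
# code-point sort -- the lead-byte patterns guarantee it.
words = ["Apple", "zebra", "Ärger", "日本"]
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
by_codepoint = sorted(words)  # Python compares by code point
assert by_bytes == by_codepoint

# ...but code-point order is not alphabetical: every ASCII uppercase
# letter (U+0041..U+005A) sorts before every lowercase one (U+0061..),
# and accented letters land even later.
print(by_codepoint)  # ['Apple', 'zebra', 'Ärger', '日本']
assert "Zoo" < "apple"  # Z (U+005A) < a (U+0061)
```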
An interesting discussion, anyway.
I have read many technical details in this thread about how collation and sorting are extremely complex. I admit that I don't understand all of the database dependencies, collation niceties, or whatever else may be the limiting factors that play a role here. Perhaps I shouldn't participate in a tech discussion that I don't fully understand, but take me for a wiki user who spends many hours adding defaultsort statements to articles and doesn't understand why the software cannot do it by itself. Perhaps you can shed some more light on it for a dummy like me.
Here is how, in my simple mind, I think it would be solvable (I'm sure my thoughts are too simple, but I want to understand why and in what way they are too simple). As an example I take the German language:
Take the pagename and make it uppercase (it could be lowercase too, but uppercase seems better, as the first letter will show up in the category). str_replace "Ä" with "A", "Ö" with "O", "Ü" with "U" and "ß" with "SS". Also str_replace other Latin characters with diacritics with their counterparts without diacritics. And that's our sortkey. This very simple procedure should reduce the number of necessary defaultsorts (except for articles about persons) by about 90% in the German Wikipedia.
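The procedure described above is only a few lines of code; here is a rough Python sketch (the function name and the small fold table are illustrative, not anything that exists in MediaWiki):

```python
def german_sortkey(pagename: str) -> str:
    # upper() already maps ß to SS in Python; then fold the umlauts.
    # A real implementation would also fold other Latin diacritics.
    key = pagename.upper()
    for src, dst in (("Ä", "A"), ("Ö", "O"), ("Ü", "U")):
        key = key.replace(src, dst)
    return key

print(german_sortkey("Ärmelkanal"))  # ARMELKANAL
print(german_sortkey("Straße"))      # STRASSE
```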
Implement these steps directly in the software and it should fix the sorting of categories. I read a lot about uniqueness in this thread, but defaultsort isn't unique either.
Of course it only works for languages where the Unicode byte order of the basic script corresponds with the sorting order. But a solution helping 80% of the languages in 80% of all cases (and with no disadvantages for the other 20%) is better than a solution that helps 100% of all languages in 100% of all cases but does not exist yet, isn't it?
Marcus Buck User:Slomox
On Thu, May 14, 2009 at 10:34 AM, Marcus Buck wiki@marcusbuck.org wrote:
Here is, what I in my simple mind think, how it would be solvable (I'm sure my thoughts are too simple, but I want to understand, why and in what way they are too simple) . As an example I take the German language:
Take the pagename and make it uppercase (could be lowercase too, but uppercase seems better as the first letter will show up in the category). str_replace "Ä" with "A", "Ö" with "O", "Ü" with "U" and "ß" with "SS". Also str_replace other Latin characters with diacritics with their counterpart without diacritic. And that's our sortkey. This very simple procedure should reduce the number of necessary defaultsorts (except for articles about persons) by about 90% in the German wikipedia.
This would absolutely be possible as a "mostly works" solution for category sorting. It would mostly just need to have the appropriate code written. The only serious problem with it is that if the rules for automatic default sorting changed, a script of some sort would probably have to reparse all pages in some cases to figure out the original sort key provided, which would be kind of expensive.
Of course, this solution is actually woefully incomplete even for German. It only affects category pages, not the zillion other places where pages are sorted by names (Special:AllPages, etc.). But it would be a reasonable step forward if anyone wanted to do the coding required.
Aryeh Gregor wrote:
This would absolutely be possible as a "mostly works" solution for category sorting. It would mostly just need to have the appropriate code written. The only serious problem with it is that if the rules for automatic default sorting changed, a script of some sort would probably have to reparse all pages in some cases to figure out the original sort key provided, which would be kind of expensive.
Of course, this solution is actually woefully incomplete even for German. It only affects category pages, not the zillion other places where pages are sorted by names (Special:AllPages, etc.). But it would be a reasonable step forward if anyone wanted to do the coding required.
If I continue my plain layman approach, I would add a new unique column "page_sortkey" to the table "page", that consists of the (uppercased, str_replaced as described in my previous post, and with whatever replacements else necessary to make it sort properly) pagename, " " (two whitespaces) as separator, and the page_id of the page. The page_id adds uniqueness, the two whitespaces make short words always sort before long words, as " " is the first character byte-order-wise and two whitespaces can never occur in page titles.
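The double-space construction can be tried out directly; a minimal Python sketch (the folding step is assumed to have already happened, and the function name is made up):

```python
def make_page_sortkey(folded_title: str, page_id: int) -> str:
    # Space (0x20) sorts ahead of every printable character, and two
    # consecutive spaces never occur in a page title, so "X  <id>"
    # sorts before any longer title that starts with "X";
    # page_id makes the key unique.
    return f"{folded_title}  {page_id}"

pages = [("MOON ILLUSION", 17), ("MOONBOW", 42), ("MOON", 3)]
pages.sort(key=lambda p: make_page_sortkey(*p))
print([title for title, _ in pages])  # ['MOON', 'MOON ILLUSION', 'MOONBOW']
```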
Special:Allpages would then "ORDER BY 'page_sortkey'" instead of "ORDER BY 'page_title'". If the (language-specific) function that converts the pagename to the sortkey string is good, that should make Allpages sort properly. The only remaining problem would be that pairs like "Günther" and "Gunther" would always sort according to the page_id instead of first "Gunther" and then "Günther". Could perhaps be solved by also making the strpos of the first character that was replaced by our function part of the sortkey, but I guess that would be over the top ;-) But most likely I missed some problem, that makes this not work.
Marcus Buck User:Slomox
On 5/14/09 10:48 PM, Marcus Buck wrote:
If I continue my plain layman approach, I would add a new unique column "page_sortkey" to the table "page", that consists of the (uppercased, str_replaced as described in my previous post, and with whatever replacements else necessary to make it sort properly) pagename, " " (two whitespaces) as separator, and the page_id of the page. The page_id adds uniqueness, the two whitespaces make short words always sort before long words, as " " is the first character byte-order-wise and two whitespaces can never occur in page titles.
That bit's actually quite clever. :) Not sure it's always applicable, but should work in many cases.
Note that generally we don't need to roll our own case-folding and accent-folding and such -- rules and code already exist for standard language-specific sort order generation, they simply need to be implemented if we're going to roll our own index column.
-- brion
On Thu, May 14, 2009 at 1:48 PM, Marcus Buck wiki@marcusbuck.org wrote:
If I continue my plain layman approach, I would add a new unique column "page_sortkey" to the table "page", that consists of the (uppercased, str_replaced as described in my previous post, and with whatever replacements else necessary to make it sort properly) pagename, " " (two whitespaces) as separator, and the page_id of the page. The page_id adds uniqueness, the two whitespaces make short words always sort before long words, as " " is the first character byte-order-wise and two whitespaces can never occur in page titles.
Special:Allpages would then "ORDER BY 'page_sortkey'" instead of "ORDER BY 'page_title'". If the (language-specific) function that converts the pagename to the sortkey string is good, that should make Allpages sort properly. The only remaining problem would be that pairs like "Günther" and "Gunther" would always sort according to the page_id instead of first "Gunther" and then "Günther". Could perhaps be solved by also making the strpos of the first character that was replaced by our function part of the sortkey, but I guess that would be over the top ;-) But most likely I missed some problem, that makes this not work.
You don't need to append the page id redundantly to the column -- you could just append the actual page_id column to the index, and sort by it as well. It saves you a few bytes. We would be doing this already for category pagination if Pager actually supported it.
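In other words, sort on a composite key instead of embedding the id in the string. A tiny Python analogue of a trailing (sortkey, page_id) index column:

```python
# Sorting on a (sortkey, page_id) tuple is the in-memory analogue of
# appending page_id as a trailing column of the sortkey index:
# the id only breaks ties between identical sortkeys.
rows = [("GUNTHER", 7, "Günther"), ("GUNTHER", 3, "Gunther")]
rows.sort(key=lambda r: (r[0], r[1]))
print([title for _, _, title in rows])  # ['Gunther', 'Günther']
```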
Other than that, a problem with this is that page titles aren't just in page, they're scattered all over the place. We use (namespace, title) pairs for pagelinks, templatelinks, recentchanges, watchlist, logging, redirect, protected_titles, among others, and probably some extension tables as well. (This isn't necessarily denormalization -- most of those references are supposed to point to the old name if the page is moved, which the page ID wouldn't.)
Some of these tables probably have sorts by title done on them in the UI somewhere. It would be pretty scary for us to add an extra column for each one *and keep it updated*. Some of these tables can have many thousands of rows for a single page, and each row would need to be updated whenever the page's sort key changes. A change to the sorting algorithm would require changing every row of some fairly large tables (like logging) that currently are more or less insert-only. Of course, the same would be true for just categorylinks, so I guess it's not such a big deal.
We do sorts by page title on at least some of these. For instance, a sorted list of watchlist pages:
http://en.wikipedia.org/wiki/Special:Watchlist/edit
Admittedly, I can't think of many situations where it actually *matters*, or where people use the sorted list much, other than categories and perhaps page table stuff. So maybe categorylinks and page would be sufficient after all.
Aryeh Gregor <Simetrical+wikilist@gmail.com> writes:
Other than that, a problem with this is that page titles aren't just in page, they're scattered all over the place. We use (namespace, title) pairs for pagelinks, templatelinks, recentchanges, watchlist, logging, redirect, protected_titles, among others, and probably some extension tables as well. (This isn't necessarily denormalization -- most of those references are supposed to point to the old name if the page is moved, which the page ID wouldn't.)
Would it be very expensive to have a separate (namespace, title, sortkey) table, and join on that for queries that need sorting?
On Fri, May 15, 2009 at 4:22 AM, Tisza Gergő gtisza@gmail.com wrote:
Would it be very expensive to have a separate (namespace, title, sortkey) table, and join on that for queries that need sorting?
You would have to scan the *entire* table you're joining from (which may be hundreds of millions of rows). Not a possibility.
On Fri, May 15, 2009 at 5:47 AM, Tisza Gergő gtisza@gmail.com wrote:
Coding the first or second type of collation rule seems relatively simple, and already a huge gain. (Also, RFC 3454 might be worth checking out as it has language-independent rules for more than diacritics.)
I agree.
You can have a separate raw_sortkey column if that's a large concern.
That would still mean an UPDATE of many millions of rows. Plus you'd add another column to a table that's already very large -- categorylinks is ~40,000,000 rows on enwiki, and that's an extra 40m varchar(255)s clogging up the buffer pool even though they're never going to be used except for the occasional update.
Anyway, this is the same for any solution that does not rely on MySQL collation: when the rules change, you need to update the relevant column in the database.
Correct. In fact, when MySQL's rules change you also have to rebuild the index, AFAIK.
What are the chances that we get decent MySQL collation in the close future (say, next few years)?
If we don't upgrade, I'd say about 0%. :) Even if we do, there are still the uniqueness problems, and the non-BMP problem. So not very good, I'd say, for our purposes. (That's not to say MySQL collation isn't decent for other purposes).
Aryeh Gregor <Simetrical+wikilist@gmail.com> writes:
On Thu, May 14, 2009 at 10:34 AM, Marcus Buck wiki@marcusbuck.org wrote:
Take the pagename and make it uppercase (could be lowercase too, but uppercase seems better as the first letter will show up in the category). str_replace "Ä" with "A", "Ö" with "O", "Ü" with "U" and "ß" with "SS". Also str_replace other Latin characters with diacritics with their counterpart without diacritic. And that's our sortkey. This very simple procedure should reduce the number of necessary defaultsorts (except for articles about persons) by about 90% in the German wikipedia.
This would absolutely be possible as a "mostly works" solution for category sorting. It would mostly just need to have the appropriate code written.
Most of that can be done with one single language-independent algorithm. All the collation rules I've seen until now fall into one of four categories:
1. you just need to transform to lowercase and discard diacritics (and space, punctuation etc.). That is, you can do a Unicode decomposition and then throw out everything from the combining ranges. I think there are very few languages where that would be fully correct (for example it doesn't handle the German ß), but it would make sorting a lot less wrong, at least for Latin scripts. (For example, orr < őr < ott is incorrect in Hungarian as ő should be sorted after o, but it is still a lot better than putting ő way down after z.) And it only has to be written once.
2. you need a translation table with string replacement rules like ö => o, ő => o~, ß => ss. Works for most languages with Latin letters and probably a lot of others. Needs per-language rules, but it is much easier to ask language communities to provide translation rules than to ask them to write sorting code (and then review it). Most wikis probably already developed those rules and use them with DEFAULTSORT.
3. you need a multipass replacement with multiple translation tables (and you concatenate the result using some sort of separator character). Theoretically two passes should be done for a lot of languages (they define equivalence classes for the accented characters in the first pass, then sort on the accents in the second), but in practice you get the right result when you do one pass and then sort on (sortkey, page_title) in the queries. Still, there are a few languages where you need multiple passes (such as Thai, where you sort on consonants first, and only after that on vowels).
4. in some languages such as Chinese it is impossible to sort correctly without a dictionary.
Coding the first or second type of collation rule seems relatively simple, and already a huge gain. (Also, RFC 3454 might be worth checking out as it has language-independent rules for more than diacritics.)
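Both of those first two categories are small in code. A Python sketch of each (the per-language table below is a made-up illustration, not real collation data):

```python
import unicodedata

# Category 1: lowercase, decompose (NFD), drop combining marks.
# Language-independent; wrong for ß, but far less wrong than code-point order.
def strip_diacritics_key(title: str) -> str:
    decomposed = unicodedata.normalize("NFD", title.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(sorted(["Zebra", "Ärger", "apple"], key=strip_diacritics_key))
# ['apple', 'Ärger', 'Zebra']

# Category 2: per-language replacement table (illustrative entries
# borrowed from the rules mentioned above: ö => o, ő => o~, ß => ss).
TABLE = {"ö": "o", "ő": "o~", "ß": "ss"}

def table_key(title: str) -> str:
    out = title.lower()
    for src, dst in TABLE.items():
        out = out.replace(src, dst)
    return out

print(sorted(["őr", "orr", "ott"], key=table_key))  # ['orr', 'ott', 'őr']
```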
The only serious problem with it is that if the rules for automatic default sorting changed, a script of some sort would probably have to reparse all pages in some cases to figure out the original sort key provided, which would be kind of expensive.
You can have a separate raw_sortkey column if that's a large concern. Anyway, this is the same for any solution that does not rely on MySQL collation: when the rules change, you need to update the relevant column in the database.
What are the chances that we get decent MySQL collation in the near future (say, the next few years)? Bug 164 was opened 5 years ago; there is no point in waiting another 5 years for database-level collations (and if we do get them, the system proposed in this thread can be removed without any complication). Waiting forever will only result in people implementing the same solution with DEFAULTSORT, either by hand (a huge waste of resources) or with bots (even more expensive than a built-in algorithm).
On Friday 15 May 2009 11:47:56, Tisza Gergő wrote:
What are the chances that we get decent MySQL collation in the close future (say, next few years)? Bug 164 was opened 5 years ago, there is no point in waiting another 5 years for database-level collations (and we do get them, the system proposed in this thread can be removed without any complication). Waiting forever will only result in people implementing the same solution with DEFAULTSORT, either by hand (huge waste of resources) or with bots (even more expensive than a built-in algorithm).
Given that WMF is one of the largest and most visible MySQL users, wouldn't it be possible to push this a bit? Perhaps even help with developer time (that would otherwise be spent on working around this problem)?
On Fri, May 15, 2009 at 1:27 PM, Nikola Smolenski smolensk@eunet.yu wrote:
Given that WMF is one of the largest and most visible MySQL users, wouldn't it be possible to push this a bit? Perhaps even help with developer time (that would otherwise be spent on working around this problem)?
Wikimedia is nowhere *close* to the largest MySQL user. It's a mid-sized MySQL user at most. It's one of the largest *websites* using MySQL, but there are data warehouses and whatnot that have vastly larger databases, and definitely more valuable ones. Besides, we don't pay them any money, unlike some of their enterprise users.
* Aryeh Gregor Simetrical+wikilist@gmail.com [Fri, 15 May 2009 13:55:22 -0400]:
Wikimedia is nowhere *close* to the largest MySQL user. It's a mid-sized MySQL user at most. It's one of the largest *websites* using MySQL, but there are data warehouses and whatnot that have vastly larger databases, and definitely more valuable ones. Besides, we don't pay them any money, unlike some of their enterprise users.
What can be more valuable than general knowledge? Some lists of personal info or other statistical data, which will become outdated pretty soon? Only not-completely-free databases of scientific works and publications, such as ArtStor, come to mind as being really comparable to what Wikipedia offers. Dmitriy
On Sat, May 16, 2009 at 3:00 AM, Dmitriy Sintsov questpc@rambler.ru wrote:
What can be more valuable than a general knowledge?
I didn't mean some idealistic notion of what's most valuable to humanity. I meant money. There are plenty of MySQL databases whose continued efficient operation is worth many millions of dollars to their owners. Wikimedia's database is worth almost nothing commercially due to its licensing. Databases that are valuable (as in money) will have owners who are willing to spend money to get them working better, and money is what MySQL cares most about. It's a business.
On Sat, May 16, 2009 at 3:20 AM, Gerard Meijssen gerard.meijssen@gmail.com wrote:
An upgrade to MySQL 6 would make a big difference for language and collation support.
MySQL 6 is still in alpha. It is not close to suitable for production use. Upgrading to it is totally out of the question.
Hoi, It would be suitable when Domas says that we can support it. The current software may be stable, but it is keeping us from improved language support. Improved language support is what we need; it is essential for improving usability for the majority of our languages. Supporting languages is as technical as providing stability. How to weigh this is not obvious either way. Thanks, GerardM
2009/5/17 Aryeh Gregor <Simetrical+wikilist@gmail.com>
On Sat, May 16, 2009 at 3:00 AM, Dmitriy Sintsov questpc@rambler.ru wrote:
What can be more valuable than a general knowledge?
I didn't mean some idealistic notion of what's most valuable to humanity. I meant money. There are plenty of MySQL databases whose continued efficient operation is worth many millions of dollars to their owners. Wikimedia's database is worth almost nothing commercially due to its licensing. Databases that are valuable (as in money) will have owners who are willing to spend money to get them working better, and money is what MySQL cares most about. It's a business.
On Sat, May 16, 2009 at 3:20 AM, Gerard Meijssen gerard.meijssen@gmail.com wrote:
An upgrade to MySQL 6 would make a big difference for language and collation support.
MySQL 6 is still in alpha. It is not close to suitable for production use. Upgrading to it is totally out of the question.
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Which version of MySQL are we using now? The mediawiki site says that a new mediawiki installation is compatible with MySQL back to 4.0 -- does that mean we're still using 4.0 and haven't upgraded?
2009/5/17 Gerard Meijssen gerard.meijssen@gmail.com
Hoi, It would be suitable when Domas says that we can support is. The current software may be stable, but keeping us from improved language support. Improved language support is what we need. Improved language support is essential for improving usability for the majority of our languages. Supporting languages is as technical as providing stability. How to weigh this is not obvious either way. Thanks, GerardM
2009/5/17 Aryeh Gregor <Simetrical+wikilist@gmail.com>
On Sat, May 16, 2009 at 3:00 AM, Dmitriy Sintsov questpc@rambler.ru wrote:
What can be more valuable than a general knowledge?
I didn't mean some idealistic notion of what's most valuable to humanity. I meant money. There are plenty of MySQL databases whose continued efficient operation is worth many millions of dollars to their owners. Wikimedia's database is worth almost nothing commercially due to its licensing. Databases that are valuable (as in money) will have owners who are willing to spend money to get them working better, and money is what MySQL cares most about. It's a business.
On Sat, May 16, 2009 at 3:20 AM, Gerard Meijssen gerard.meijssen@gmail.com wrote:
An upgrade to MySQL 6 would make a big difference for language and collation support.
MySQL 6 is still in alpha. It is not close to suitable for production use. Upgrading to it is totally out of the question.
On Mon, May 18, 2009 at 10:21 AM, Bart banaticus@gmail.com wrote:
Which version of MySQL are we using now? The mediawiki site says that a new mediawiki installation is compatible with MySQL back to 4.0 -- does that mean we're still using 4.0 and haven't upgraded?
http://en.wikipedia.org/wiki/Special:Version says "4.0.40-wikimedia-log". It's a custom compile based off 4.0.40 -- source code is at http://svn.wikimedia.org/viewvc/mysql/.
http://en.wikipedia.org/wiki/Special:Version says "4.0.40-wikimedia-log". It's a custom compile based off 4.0.40 -- source code is at http://svn.wikimedia.org/viewvc/mysql/.
last official upstream vendor version was 4.0.30 :)
Domas
Hunh. Upgrading from that means that date formats are returned differently, so how we display dates would need to be changed; there are authentication issues, because the password field is longer and more secure, so bots would need to be fixed; and we'd have to parse the articles to make them "safe" and then "convert" them to Unicode. There'd be several problems just in going to MySQL 4.1 -- but wouldn't it be worth it in the end to continue to upgrade all the way to 5.1, for more streamlining and to get rid of possible bugs like this one: http://bugs.mysql.com/bug.php?id=32707 "It was possible to force an error message of excessive length which could lead to a buffer overflow." Is there a group that's specifically working on the issues that would need to be overcome to upgrade our MySQL?
2009/5/18 Aryeh Gregor <Simetrical+wikilist@gmail.com>
On Mon, May 18, 2009 at 10:21 AM, Bart banaticus@gmail.com wrote:
Which version of MySQL are we using now? The mediawiki site says that a new mediawiki installation is compatible with MySQL back to 4.0 -- does that mean we're still using 4.0 and haven't upgraded?
http://en.wikipedia.org/wiki/Special:Version says "4.0.40-wikimedia-log". It's a custom compile based off 4.0.40 -- source code is at http://svn.wikimedia.org/viewvc/mysql/.
Hi!
Hunh.
!!!
Upgrading from that means that date formats are returned differently so how we display dates would need to be changed,
Not really. Even though MediaWiki has automatic date formatting in both directions, we store dates now as VARCHAR(14) or CHAR(14) -- nothing would change.
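To illustrate why that insulates MediaWiki from the server's date handling: the column holds an opaque 14-character YYYYMMDDHHMMSS string, so whatever format MySQL returns DATETIME values in is irrelevant. (A sketch in Python; the helper names here are hypothetical, not actual MediaWiki functions.)

```python
# Sketch: converting between a datetime and the 14-character
# YYYYMMDDHHMMSS string that MediaWiki stores in CHAR(14)/VARCHAR(14)
# columns. Helper names are hypothetical illustrations.
from datetime import datetime

def to_mw_ts(dt):
    return dt.strftime("%Y%m%d%H%M%S")

def from_mw_ts(ts):
    return datetime.strptime(ts, "%Y%m%d%H%M%S")

ts = to_mw_ts(datetime(2009, 5, 18, 10, 21, 0))
print(ts)  # 20090518102100 -- just a string as far as the database knows
```

Because the value round-trips as a plain string, a server upgrade that changes DATETIME output formatting never touches it.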
there's authentication issues because the password field is longer and more secure so bots would need to be fixed,
Not applicable to the application side at all; this is just how MySQL clients authenticate to the MySQL server, not how MediaWiki users do.
we'd have to parse the articles to make them "safe" then "convert" them to unicode,
Unfortunately, our articles go beyond the BMP, so there'd be no way to make them 'safe' or 'convert' them to unicode (actually, no need to convert anything -- they are in unicode _at_the_moment_).
there'd be several problems just in going to MySQL 4.1 -- but wouldn't it be worth it in the end to continue to upgrade all the way to 5.1 for more streamlining and getting rid of possible bugs like this one: http://bugs.mysql.com/bug.php?id=32707 "It was possible to force an error message of excessive length which could lead to a buffer overflow."
There are various other ways to crash a server; that's why we don't allow everyone to talk to our database servers.
Is there a group that's specifically working on issues that would need to be overcome to upgrade our MySQL?
No, should there be one? We have 5.1 servers in the pool, and they seem to work with the current application, by the way.
Cheers, Domas
MySQL is open source -- what sort of changes did you have in mind?
2009/5/15 Domas Mituzas midom.lists@gmail.com
Given that WMF is one of the largest and most visible MySQL users, wouldn't it be possible to push this a bit? Perhaps even help with developer time (that would otherwise be spent on working around this problem)?
:-)
Domas
Hoi, An upgrade to MySQL 6 would make a big difference for language and collation support. We need to first make use of improvements that are out there before we consider "rolling our own". Thanks, GerardM
2009/5/16 Bart banaticus@gmail.com
MySQL is open source -- what sort of changes did you have in mind?
2009/5/15 Domas Mituzas midom.lists@gmail.com
Given that WMF is one of the largest and most visible MySQL users, wouldn't it be possible to push this a bit? Perhaps even help with developer time (that would otherwise be spent on working around this problem)?
:-)
Domas
On 5/16/09 12:20 PM, Gerard Meijssen wrote:
Hoi, An upgrade to MySQL 6 would make a big difference for language and collation support. We need to first make use of improvements that are out there before we consider "rolling our own".
I think I'd recommend against upgrading our production databases to an alpha development release at this time. :)
The addition of full UTF-8 support is pleasing, and eventually may be a factor in future upgrades when some day it's released.
-- brion vibber (brion @ wikimedia.org)
I know it's in unicode now, but I think the problem was that MySQL still assumes that it's in latin1 -- when you upgrade to MySQL 4.1, as I understand it, some of the text is "converted" to unicode which can mess things up that are already in unicode. I saw on http://peter-zaitsev.livejournal.com/12083.html "If you will configure MySQL 4.1 to use utf8 by default and just start it with MySQL 4.0 latin1 tables you're likely to trash your data, as there is no charset information in old tables and MySQL 4.1 will assume the data is in unicode. So do not change character set straight ahead but run 4.1 with same charset as 4.0 before and use ALTER TABLE to convert tables to 4.1 format - this has character set information so you should be safe."
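The failure mode Zaitsev warns about is easy to sketch: if UTF-8 bytes sitting in a latin1-declared column are "converted" to UTF-8 a second time, the text is double-encoded. (This is a Python illustration of the general mojibake problem, not of MySQL's internal conversion code.)

```python
# Sketch of the double-encoding danger: UTF-8 bytes mistaken for
# latin1 and "converted" to UTF-8 again become mojibake.
original = "Köln"                        # text already stored as UTF-8
stored_bytes = original.encode("utf-8")  # b'K\xc3\xb6ln'

# A server that wrongly assumes the column is latin1 decodes each
# byte as a separate latin1 character, then re-encodes as UTF-8:
mangled = stored_bytes.decode("latin-1").encode("utf-8")

print(stored_bytes)  # b'K\xc3\xb6ln'
print(mangled)       # b'K\xc3\x83\xc2\xb6ln' -- the data is now trashed
```

This is why the advice is to keep the 4.1 server running with the old charset assumptions first, and only convert tables explicitly with ALTER TABLE.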
2009/5/18 Brion Vibber brion@wikimedia.org
On 5/16/09 12:20 PM, Gerard Meijssen wrote:
Hoi, An upgrade to MySQL 6 would make a big difference for language and collation support. We need to first make use of improvements that are out there before we consider "rolling our own".
I think I'd recommend against upgrading our production databases to an alpha development release at this time. :)
The addition of full UTF-8 support is pleasing, and eventually may be a factor in future upgrades when some day it's released.
-- brion vibber (brion @ wikimedia.org)
Bart,
I know it's in unicode now, but I think the problem was that MySQL still assumes that it's in latin1 -- when you upgrade to MySQL 4.1, as I understand it, some of the text is "converted" to unicode which can mess things up that are already in unicode. I saw on http://peter-zaitsev.livejournal.com/12083.html "If you will configure MySQL 4.1 to use utf8 by default and just start it with MySQL 4.0 latin1 tables you're likely to trash your data, as there is no charset information in old tables and MySQL 4.1 will assume the data is in unicode. So do not change character set straight ahead but run 4.1 with same charset as 4.0 before and use ALTER TABLE to convert tables to 4.1 format - this has character set information so you should be safe."
Or you can use --character-set-server (=binary in our case).
Cheers, Domas
Lars Aronsson wrote:
This can be done with bots, for sure, if we agree that it should be done. Is this something we should strive for? Has any language of Wikipedia (or Wikinews or...) already started to do this?
Perhaps each language.php could define a function to transform the titles from whatever rules its language uses to ones sorting 'right' in binary mysql.
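As a sketch of what such a per-language transform could look like (written in Python here rather than as a Language*.php method, and implementing only Lars's lowercase/strip-spaces-and-commas idea; real Swedish rules would need more, e.g. å/ä/ö ordering, so treat the function name and rules as hypothetical):

```python
# Hypothetical per-language sortkey transform along the lines Lars
# proposed: keep the initial capital, lowercase everything else, and
# drop spaces and commas so a binary sort behaves sanely.
def sortkey_sv(title):
    stripped = "".join(c for c in title if c not in " ,")
    if not stripped:
        return stripped
    return stripped[0].upper() + stripped[1:].lower()

print(sortkey_sv("Wales, Jimmy"))                # Walesjimmy
print(sortkey_sv("European quarter of Brussels"))  # Europeanquarterofbrussels
```

With keys like these, a plain binary comparison already puts "Moonbow" before "Moon illusion", which is the order Lars wants for Swedish.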
Hoi, The rules for collation are indeed per language, and they are defined in the CLDR, the Common Locale Data Repository. CLDR data is already used in some of the extensions developed at translatewiki.net. Maintaining CLDR data only once makes sense imho. Thanks, GerardM
2009/5/12 Platonides Platonides@gmail.com
Lars Aronsson wrote:
This can be done with bots, for sure, if we agree that it should be done. Is this something we should strive for? Has any language of Wikipedia (or Wikinews or...) already started to do this?
Perhaps each language.php could define a function to transform the titles from whatever rules its language uses to ones sorting 'right' in binary mysql.
On Mon, May 11, 2009 at 3:29 PM, Lars Aronsson lars@aronsson.se wrote:
There is a way to avoid all such problems, namely by a more aggressive use of DEFAULTSORT that removes from sorting all upper case letters (except the initial one), all whitespace and all commas. It would mean almost every article needs a DEFAULTSORT. In the examples above:
{{DEFAULTSORT:Walesjimmy}} {{DEFAULTSORT:Europeancourtofauditors}} {{DEFAULTSORT:Europeanunionmission}} {{DEFAULTSORT:Europeanquarterofbrussels}} {{DEFAULTSORT:Moonillusion}}
This would be a good thing to do in the software. We could implement the framework reasonably easily, if anyone cares to, and then let each language do its thing. A basic English implementation like this would be easy enough.
Of course, any change to the sortkey beyond the first will require that all existing sort keys be changed by a batch job -- otherwise sorting will be a mess. Every change to the sortkey algorithm would either require that all pages be reparsed (very expensive), or that a special conversion script be defined to account for that exact change. Unless it's minor enough that the inconsistency is acceptable, I guess.
On Tue, May 12, 2009 at 7:18 AM, Petr Kadlec petr.kadlec@gmail.com wrote:
Well, not really. Bug 164 would be fixed almost completely for Czech-language wikis by using database features designed for exactly this problem. [1] But, I guess you know the situation. ... [1] http://dev.mysql.com/doc/refman/4.1/en/charset-collation-effect.html
Note the version. Wikimedia uses MySQL 4.0, which doesn't contain any charsets or collations other than binary. If we used a higher version, utf8 might be an option: that would use a Unicode collation, I guess, which should at least be okay for most languages, if not perfect. (But MySQL's utf8 has other downsides, like being variable-width and not supporting Unicode outside the BMP.)
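The BMP limitation is easy to demonstrate (a Python sketch of the encoding widths, not of MySQL itself): MySQL's utf8 charset stores at most three bytes per character, but anything outside the Basic Multilingual Plane needs four bytes in UTF-8.

```python
# Characters outside the BMP (code points above U+FFFF) take four
# bytes in UTF-8; MySQL's "utf8" charset caps at three bytes per
# character, so it simply cannot store them.
bmp_char = "ö"      # U+00F6, inside the BMP
astral_char = "𐌰"   # U+10330 (Gothic letter ahsa), outside the BMP

print(len(bmp_char.encode("utf-8")))     # 2
print(len(astral_char.encode("utf-8")))  # 4 -- too wide for MySQL utf8
```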
If Swedish sorting rules are simple enough that removing all whitespace and punctuation and converting to lower case would solve most of the problems, I would say that such feature would not be too difficult to implement right into MediaWiki (into LanguageSv.php), writing those DEFAULTSORT codes explicitly into every article would be nonsense, IMHO. (So, go ahead with it, I won’t stop you or anything, I’m just trying to say that this is not really a solution for Czech language.)
There's no reason this couldn't be implemented for Czech as well in the software, in principle. Ideally we'd use something based on Unicode collation as a baseline, with optional customizations per language:
Hoi, Collation based on Unicode ... what do you mean by that? Do you mean the order of the characters in UTF-8, or do you mean the Unicode CLDR order? The latter is the only sensible approach. The idea of the DEFAULTSORT is imho awful; you want people to invest heavily in something that would work properly by itself once we decide to use the appropriate standards.
The one argument against this DEFAULTSORT idea, I think, is that it is too much of a burden for the "other" languages and projects. Most projects still need to concentrate on basic content, and this is just another "must have" distraction. Again, it makes sense to implement the appropriate standards instead of applying an ugly hack. Thanks, GerardM
2009/5/12 Aryeh Gregor <Simetrical+wikilist@gmail.com>
On Mon, May 11, 2009 at 3:29 PM, Lars Aronsson lars@aronsson.se wrote:
There is a way to avoid all such problems, namely by a more aggressive use of DEFAULTSORT that removes from sorting all upper case letters (except the initial one), all whitespace and all commas. It would mean almost every article needs a DEFAULTSORT. In the examples above:
{{DEFAULTSORT:Walesjimmy}} {{DEFAULTSORT:Europeancourtofauditors}} {{DEFAULTSORT:Europeanunionmission}} {{DEFAULTSORT:Europeanquarterofbrussels}} {{DEFAULTSORT:Moonillusion}}
This would be a good thing to do in the software. We could implement the framework reasonably easily, if anyone cares to, and then let each language do its thing. A basic English implementation like this would be easy enough.
Of course, any change to the sortkey beyond the first will require that all existing sort keys be changed by a batch job -- otherwise sorting will be a mess. Every change to the sortkey algorithm would either require that all pages be reparsed (very expensive), or that a special conversion script be defined to account for that exact change. Unless it's minor enough that the inconsistency is acceptable, I guess.
On Tue, May 12, 2009 at 7:18 AM, Petr Kadlec petr.kadlec@gmail.com wrote:
Well, not really. Bug 164 would be fixed almost completely for Czech-language wikis by using database features designed for exactly this problem. [1] But, I guess you know the situation. ... [1] http://dev.mysql.com/doc/refman/4.1/en/charset-collation-effect.html
Note the version. Wikimedia uses MySQL 4.0, which doesn't contain any charsets or collations other than binary. If we used a higher version, utf8 might be an option: that would use a Unicode collation, I guess, which should at least be okay for most languages, if not perfect. (But MySQL's utf8 has other downsides, like being variable-width and not supporting Unicode outside the BMP.)
If Swedish sorting rules are simple enough that removing all whitespace and punctuation and converting to lower case would solve most of the problems, I would say that such feature would not be too difficult to implement right into MediaWiki (into LanguageSv.php), writing those DEFAULTSORT codes explicitly into every article would be nonsense, IMHO. (So, go ahead with it, I won’t stop you or anything, I’m just trying to say that this is not really a solution for Czech language.)
There's no reason this couldn't be implemented for Czech as well in the software, in principle. Ideally we'd use something based on Unicode collation as a baseline, with optional customizations per language:
http://unicode.org/reports/tr10/