On 4/13/06, Gregory Maxwell gmaxwell@gmail.com wrote:
AFAIR, most string matches in MySQL are case-insensitive, which would mean that we could have indexed case-insensitive matches quickly... but I'm guessing that our use of binary fields for titles (which is required because no version of MySQL has complete UTF-8 support) most likely breaks that.
Yes, they are, and yes, it does. Could someone explain what the exact reason is that we're using varchar binary in MediaWiki for page titles? I've been using regular varchars for my WiktionaryZ tables, and so far it seems to work fine with UTF-8. Where exactly does a non-binary varchar break?
Also, according to the MySQL 5.0 documentation: http://dev.mysql.com/doc/refman/5.0/en/binary-varbinary.html
<quote> The BINARY and VARBINARY data types are distinct from the CHAR BINARY and VARCHAR BINARY data types. For the latter types, the BINARY attribute does not cause the column to be treated as a binary string column. Instead, it causes the binary collation for the column character set to be used, and the column itself contains non-binary character strings rather than binary byte strings. For example, CHAR(5) BINARY is treated as CHAR(5) CHARACTER SET latin1 COLLATE latin1_bin, assuming that the default character set is latin1. This differs from BINARY(5), which stores 5-byte binary strings that have no character set or collation. </quote>
Since we are using VARCHAR(255) BINARY for page titles, rather than VARBINARY(255), does that mean that "it causes the binary collation for the column character set to be used"? If so, does the use of VARCHAR(255) BINARY, as opposed to a simple VARCHAR(255), affect anything but the sorting order (collation)?
Might it make sense to use non-binary strings and use binary comparisons instead where case-sensitivity is required?
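[Editorial sketch, not from the original mail: the idea of keeping non-binary strings and dropping to byte-level comparison only where case-sensitivity matters can be shown outside SQL. Python's casefold() stands in for a case-insensitive collation; the byte comparison is roughly what MySQL's BINARY cast operator does in-database.]

```python
def ci_equal(a: str, b: str) -> bool:
    """Case-insensitive match, roughly what a CI collation gives you."""
    return a.casefold() == b.casefold()

def binary_equal(a: str, b: str) -> bool:
    """Byte-for-byte match, like comparing with MySQL's BINARY operator."""
    return a.encode("utf-8") == b.encode("utf-8")

print(ci_equal("Straße", "STRASSE"))      # True under full case folding
print(binary_equal("Straße", "STRASSE"))  # False: different bytes
```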
Erik
On 4/14/06, Erik Moeller eloquence@gmail.com wrote:
On 4/13/06, Gregory Maxwell gmaxwell@gmail.com wrote:
AFAIR, most string matches in mysql are case insensitive, which would mean that we could have indexed case insensitive matches quickly... but I'm guessing that our use of binary fields for titles (which is required because no version of mysql has complete UTF-8 support) most likely breaks that.
Yes, they are, and yes, it does. Could someone explain what the exact reason is that we're using varchar binary in MediaWiki for page titles? I've been using regular varchars for my WiktionaryZ tables, and so far it seems to work fine with UTF-8. Where exactly does a non-binary varchar break?
[snip]
Might it make sense to use non-binary strings and use binary comparisons instead where case-sensitivity is required?
It is my understanding that MySQL doesn't correctly handle UTF-8 where the characters are outside the basic multilingual plane (BMP). These are the characters that cannot be represented in UCS-2 (i.e. they take more than two bytes, a surrogate pair, in the UTF-16 representation), and they include a number of characters which are required for complete support of some Asian languages.
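[Illustrative aside, not part of the original mail: a character outside the BMP takes four bytes in UTF-8, which is exactly what MySQL's three-byte-per-character "utf8" type could not store at the time, and two UTF-16 code units (a surrogate pair) rather than one.]

```python
# A CJK Extension B character (U+20021), outside the BMP.
ch = "\U00020021"

# Four bytes in UTF-8 -- one more than MySQL's old "utf8" allows.
print(len(ch.encode("utf-8")))           # 4

# Two UTF-16 code units, i.e. a surrogate pair.
print(len(ch.encode("utf-16-le")) // 2)  # 2
```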
I'm not sure how it fails (does it munge the data, or warn? since it's MySQL I'm sure it doesn't reject bad data...), since I haven't tried it personally, but that's where I understood the current issue to be.
PostgreSQL had similar issues with non-BMP characters until version 8 (prior to version 8, PG would reject inserts containing most non-BMP characters as invalid).
There are other issues as well. For example, last I checked, enwiki had some titles (and page/image links) which contained completely invalid characters, dating back to when enwiki was latin1 and more permissive about binary input not so long ago. I see that a number of bad pages have been moved to broken/whatever, so perhaps that's been cleaned up. A switch to UTF-8 might potentially run into problems with these pages, depending on how strict MySQL's input validation is (historically not strict at all, but...). I know they have always caused me trouble importing into PostgreSQL, because in PG I don't use binary fields for MediaWiki-imported data, and PG's input validation is fairly strict.
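[To make the validation point concrete (editorial illustration, not from the thread): a byte string that was a perfectly valid latin1 title fails strict UTF-8 validation of the kind PostgreSQL applies on insert.]

```python
# Bytes that were a valid latin1 title but are not valid UTF-8.
raw = b"Caf\xe9"  # latin1 encoding of "Café"

try:
    raw.decode("utf-8")   # strict validation, as PG applies on insert
except UnicodeDecodeError as e:
    print("rejected:", e.reason)

# The permissive latin1 reading still works, of course.
print(raw.decode("latin-1"))
```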
Erik Moeller wrote:
On 4/13/06, Gregory Maxwell gmaxwell@gmail.com wrote:
AFAIR, most string matches in MySQL are case-insensitive, which would mean that we could have indexed case-insensitive matches quickly... but I'm guessing that our use of binary fields for titles (which is required because no version of MySQL has complete UTF-8 support) most likely breaks that.
Yes, they are, and yes, it does. Could someone explain what the exact reason is that we're using varchar binary in MediaWiki for page titles? I've been using regular varchars for my WiktionaryZ tables, and so far it seems to work fine with UTF-8. Where exactly does a non-binary varchar break?
It could break string matching, but would definitely break sorting. (Sorting by codepoint may suck, but at least it's predictable.)
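[An editorial illustration of "predictable but sucks": sorting by raw code point puts all uppercase ASCII before lowercase, and accented Latin letters after "z", deterministically.]

```python
# Python's default string sort is by code point, like a binary collation.
titles = ["Zebra", "apple", "Éclair", "banana"]
print(sorted(titles))
# -> ['Zebra', 'apple', 'banana', 'Éclair']
#    'Z' (U+005A) sorts before 'a' (U+0061); 'É' (U+00C9) sorts after 'z'.
```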
More generally, deliberately choosing a non-binary collation which applies to a *different character set* from the one you're really using seems pretty silly. You get unpredictable, incorrect sorting and potentially have strings rejected as invalid.
Might it make sense to use non-binary strings and use binary comparisons instead where case-sensitivity is required?
When MySQL finally reaches the mid-1990s and supports full Unicode, we'll be able to do that.
-- brion vibber (brion @ pobox.com)
On 4/14/06, Brion Vibber brion@pobox.com wrote: [snip]
It could break string matching, but would definitely break sorting. (Sorting by codepoint may suck, but at least it's predictable.)
More generally, deliberately choosing a non-binary collation which applies to a *different character set* from the one you're really using seems pretty silly. You get unpredictable, incorrect sorting and potentially have strings rejected as invalid.
The collation problem is a hard problem in general, as I understand it, since there are cases where the collation of some Unicode characters changes depending on the language: for example, the position of ø in Danish versus most other languages. That said, doing it wrong but mostly right isn't too hard.
Thus supporting multiple languages correctly in a single database becomes a little difficult. I don't think it's reasonable to expect the database to allow you to magically specify a new collation on the fly for each query, since index order depends on collation.
Instead, given sufficient support in the database, you could create a function enumerate_collation(language, string) which returns an integer array (or a mangled string), with one value for the absolute collation position of each character in the string. You could then define an index on that function applied to the title column for each of the collations you will be using, and ORDER BY enumerate_collation('en', title) in your queries.
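[The proposed enumerate_collation() could be sketched like this (editorial sketch; the tiny weight tables are invented for illustration, not real locale data, and the ord() fallback is a crude stand-in for a full collation table). Sorting by the resulting arrays plays the role of the functional index plus ORDER BY.]

```python
# Toy per-language collation weights: in Danish, ø sorts after z;
# in many other languages it sorts with o.
WEIGHTS = {
    "da": {"o": 15, "z": 26, "ø": 29},
    "de": {"o": 15, "ø": 15.5, "z": 26},
}

def enumerate_collation(lang: str, s: str):
    """One absolute collation position per character, as proposed above."""
    table = WEIGHTS[lang]
    return [table.get(c.lower(), ord(c)) for c in s]

words = ["zebra", "øl", "ost"]
print(sorted(words, key=lambda w: enumerate_collation("da", w)))
# Danish:  ['ost', 'zebra', 'øl']
print(sorted(words, key=lambda w: enumerate_collation("de", w)))
# German:  ['ost', 'øl', 'zebra']
```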
wikitech-l@lists.wikimedia.org