On 4/13/06, Gregory Maxwell gmaxwell@gmail.com wrote:
AFAIR, most string matches in mysql are case insensitive, which would mean that we could have indexed case insensitive matches quickly... but I'm guessing that our use of binary fields for titles (which is required because no version of mysql has complete UTF-8 support) most likely breaks that.
Yes, they are, and yes, it does. Could someone explain what the exact reason is that we're using varchar binary in MediaWiki for page titles? I've been using regular varchars for my WiktionaryZ tables, and so far it seems to work fine with UTF-8. Where exactly does a non-binary varchar break?
Also, according to the MySQL 5.0 documentation: http://dev.mysql.com/doc/refman/5.0/en/binary-varbinary.html
<quote> The BINARY and VARBINARY data types are distinct from the CHAR BINARY and VARCHAR BINARY data types. For the latter types, the BINARY attribute does not cause the column to be treated as a binary string column. Instead, it causes the binary collation for the column character set to be used, and the column itself contains non-binary character strings rather than binary byte strings. For example, CHAR(5) BINARY is treated as CHAR(5) CHARACTER SET latin1 COLLATE latin1_bin, assuming that the default character set is latin1. This differs from BINARY(5), which stores 5-bytes binary strings that have no character set or collation. </quote>
Since we are using VARCHAR(255) BINARY for page titles, rather than VARBINARY(255), does that mean that "it causes the binary collation for the column character set to be used"? If so, does the use of VARCHAR(255) BINARY, as opposed to a simple VARCHAR(255), affect anything but the sorting order (collation)?
Might it make sense to use non-binary strings and use binary comparisons instead where case-sensitivity is required?
Erik