I have an open bug in CentralNotice [1] where I have a search function by name for old content. Naively this happens as a LIKE query on the table's name column. Though this works, it fails at being case insensitive because the table, like most(all?) other tables on the WMF cluster, has by default binary charset/collation.
Given that I want to support case insensitive searching; does anyone have any thoughts or examples on how to go about doing it in a binary table?
The only solution I can think of would be to change the collation/charset of the table in question to utf8. Would that be a bad idea? I'm led to understand the binary default came about because older versions of MySQL did not support the full multilingual plane. But this might no longer be an issue with MariaDB 5.6?
[1] https://bugzilla.wikimedia.org/show_bug.cgi?id=53751
~Matt Walker Wikimedia Foundation Fundraising Technology Team
On 10/1/13, Matthew Walker mwalker@wikimedia.org wrote:
I have an open bug in CentralNotice [1] where I have a search function by name for old content. Naively this happens as a LIKE query on the table's name column. Though this works, it fails at being case insensitive because the table, like most(all?) other tables on the WMF cluster, has by default binary charset/collation.
Given that I want to support case insensitive searching; does anyone have any thoughts or examples on how to go about doing it in a binary table?
The only solution I can think of would be to change the collation/charset of the table in question to utf8. Would that be a bad idea? I'm led to understand the binary default came about because older versions of MySQL did not support the full multilingual plane. But this might no longer be an issue with MariaDB 5.6?
[1] https://bugzilla.wikimedia.org/show_bug.cgi?id=53751
~Matt Walker Wikimedia Foundation Fundraising Technology Team _______________________________________________ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
The solution used in the past as far as I know was to have an additional column that was a lowercase version of the main column (For example TitleKey extension. The cl_sortkey field is also the same sort of principle, but with a more complex transformation of the original field)
Additionally, I'm not sure why the character set matters. It would be the collation that would control this. (I believe. Not really up on the details of mysql charsets and collations)
--bawolff
On Wed, Oct 2, 2013 at 10:22 AM, Matthew Walker mwalker@wikimedia.orgwrote:
Given that I want to support case insensitive searching; does anyone have any thoughts or examples on how to go about doing it in a binary table?
The only solution I can think of would be to change the collation/charset of the table in question to utf8.
For a single search field there is no need to change the whole table charset or collation. Simply make the field a character type (VARCHAR / CHAR / TEXT), utf8 if that's suitable, and choose a case-insenitive collation. Eg:
ALTER TABLE <tbl> MODIFY <col> VARCHAR(N) CHARACTER SET utf8 COLLATE utf8_unicode_ci;
More info on string comparison:
http://dev.mysql.com/doc/refman/5.5/en/case-sensitivity.html
However I'd like to hear the historical reasons why MW does everything binary too. There certainly were some character set issues in older MySQL versions and afaik we still have to support MySQL 5.0.2 and up.
BR Sean
On Wed, Oct 2, 2013 at 6:43 AM, Sean Pringle springle@wikimedia.org wrote:
However I'd like to hear the historical reasons why MW does everything binary too.
a) Lack of any native UTF-8 support at all before MySQL 4.1 b) Lack of complete native UTF-8 support until MySQL 5.5.something introduced utf8mb4 c) Inertia and backwards compatibility
-- brion
wikitech-l@lists.wikimedia.org