Hello,
I have got issue with ListFiles page in mediawiki 1.35.1 Filtering worked not very good, was case-sensitive and not always got text in middle of file name. I looked in DB and saw that img_name column is varbinary, but pagers/ImageListPager.php tries to do case-insensitive select with LOWERing both sides of strings. But LOWER does not work for varbinary So I think that following change will be reasonable:
--- ImageListPager.php.orig 2021-10-14 16:31:52.000000000 +0300 +++ ImageListPager.php 2021-10-14 16:00:10.127694733 +0300 @@ -90,9 +90,10 @@
if ( $nt ) { $dbr = wfGetDB( DB_REPLICA ); - $this->mQueryConds[] = 'LOWER(img_name)' . + $this->mQueryConds[] = 'LOWER(CONVERT(img_name USING utf8))' . $dbr->buildLike( $dbr->anyString(), - strtolower( $nt->getDBkey() ), $dbr->anyString() ); + mb_strtolower( $nt->getDBkey() ), $dbr->anyString() ); + } }
@@ -161,9 +162,9 @@ $nt = Title::newFromText( $this->mSearch ); if ( $nt ) { $dbr = wfGetDB( DB_REPLICA ); - $conds[] = 'LOWER(' . $prefix . '_name)' . + $conds[] = 'LOWER(CONVERT(' . $prefix . '_name USING utf8))' . $dbr->buildLike( $dbr->anyString(), - strtolower( $nt->getDBkey() ), $dbr->anyString() ); + mb_strtolower( $nt->getDBkey() ), $dbr->anyString() ); } }
Hi Sergey,
On Thu, 2021-10-14 at 18:16 +0300, Sergey Dorofeev wrote:
I have got issue with ListFiles page in mediawiki 1.35.1 text in middle of file name. I looked in DB and saw that img_name column is varbinary, but pagers/ImageListPager.php tries to do case-insensitive select with LOWERing both sides of strings. But LOWER does not work for varbinary So I think that following change will be reasonable:
Thanks for looking at the code! If you think that you found a software bug, please feel free to report it in the issue tracker: https://www.mediawiki.org/wiki/How_to_report_a_bug Feel also very welcome to upload the patch into Gerrit for review: https://www.mediawiki.org/wiki/Gerrit/Tutorial
Again, thanks a lot! andre
I agree that LOWER doesn't make much sense in binary collation.
Sadly, a utf8 (3-byte UTF-8) conversion may fail for 4-byte characters, so at the very least it should be utf8mb4 (4-byte UTF-8). I am not so familiar with ListPager to say if there could be other issues arising from that- sending a code review would be easier for better context.
On Thu, Oct 14, 2021 at 5:16 PM Sergey Dorofeev sergey@fidoman.ru wrote:
Hello,
I have got issue with ListFiles page in mediawiki 1.35.1 Filtering worked not very good, was case-sensitive and not always got text in middle of file name. I looked in DB and saw that img_name column is varbinary, but pagers/ImageListPager.php tries to do case-insensitive select with LOWERing both sides of strings. But LOWER does not work for varbinary So I think that following change will be reasonable:
--- ImageListPager.php.orig 2021-10-14 16:31:52.000000000 +0300 +++ ImageListPager.php 2021-10-14 16:00:10.127694733 +0300 @@ -90,9 +90,10 @@
if ( $nt ) { $dbr = wfGetDB( DB_REPLICA );
$this->mQueryConds[] = 'LOWER(img_name)'
.
$this->mQueryConds[] =
'LOWER(CONVERT(img_name USING utf8))' . $dbr->buildLike( $dbr->anyString(),
strtolower(
$nt->getDBkey() ), $dbr->anyString() );
mb_strtolower(
$nt->getDBkey() ), $dbr->anyString() );
} }
@@ -161,9 +162,9 @@ $nt = Title::newFromText( $this->mSearch ); if ( $nt ) { $dbr = wfGetDB( DB_REPLICA );
$conds[] = 'LOWER(' . $prefix . '_name)'
.
$conds[] = 'LOWER(CONVERT(' . $prefix .
'_name USING utf8))' . $dbr->buildLike( $dbr->anyString(),
strtolower(
$nt->getDBkey() ), $dbr->anyString() );
mb_strtolower(
$nt->getDBkey() ), $dbr->anyString() ); } }
-- Sergey _______________________________________________ Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org To unsubscribe send an email to wikitech-l-leave@lists.wikimedia.org https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
Thank you, did not know about it. Real UTF-8 in mysql is utf8mb4, I think it should be used here.
--- Sergey
Jaime Crespo писал 2021-10-14 18:32:
I agree that LOWER doesn't make much sense in binary collation.
Sadly, a utf8 (3-byte UTF-8) conversion may fail for 4-byte characters, so at the very least it should be utf8mb4 (4-byte UTF-8). I am not so familiar with ListPager to say if there could be other issues arising from that- sending a code review would be easier for better context.
On Thu, Oct 14, 2021 at 5:16 PM Sergey Dorofeev sergey@fidoman.ru wrote:
Hello,
I have got issue with ListFiles page in mediawiki 1.35.1 Filtering worked not very good, was case-sensitive and not always got text in middle of file name. I looked in DB and saw that img_name column is varbinary, but pagers/ImageListPager.php tries to do case-insensitive select with LOWERing both sides of strings. But LOWER does not work for varbinary So I think that following change will be reasonable:
--- ImageListPager.php.orig 2021-10-14 16:31:52.000000000 +0300 +++ ImageListPager.php 2021-10-14 16:00:10.127694733 +0300 @@ -90,9 +90,10 @@
if ( $nt ) { $dbr = wfGetDB( DB_REPLICA );
$this->mQueryConds[] = 'LOWER(img_name)'
.
$this->mQueryConds[] =
'LOWER(CONVERT(img_name USING utf8))' . $dbr->buildLike( $dbr->anyString(),
strtolower(
$nt->getDBkey() ), $dbr->anyString() );
mb_strtolower(
$nt->getDBkey() ), $dbr->anyString() );
} }
@@ -161,9 +162,9 @@ $nt = Title::newFromText( $this->mSearch ); if ( $nt ) { $dbr = wfGetDB( DB_REPLICA );
$conds[] = 'LOWER(' . $prefix . '_name)'
.
$conds[] = 'LOWER(CONVERT(' . $prefix .
'_name USING utf8))' . $dbr->buildLike( $dbr->anyString(),
strtolower(
$nt->getDBkey() ), $dbr->anyString() );
mb_strtolower(
$nt->getDBkey() ), $dbr->anyString() ); } }
-- Sergey _______________________________________________ Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org To unsubscribe send an email to wikitech-l-leave@lists.wikimedia.org https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
--
Jaime Crespo http://wikimedia.org _______________________________________________ Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org To unsubscribe send an email to wikitech-l-leave@lists.wikimedia.org https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
It's years since I discovered that mysql's utf8 is broken in this way, but I can still feel the pain. What part of "universal" did they not understand? The mysql docs more or less say that "utf8" is deprecated, certainly not future-proof, and suggest you use utf8mb4. See https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8.html https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8.html)
On Oct 14, 2021, at 1:17 PM, Sergey Dorofeev sergey@fidoman.ru wrote:
Thank you, did not know about it. Real UTF-8 in mysql is utf8mb4, I think it should be used here.
Sergey
Jaime Crespo писал 2021-10-14 18:32:
I agree that LOWER doesn't make much sense in binary collation.
Sadly, a utf8 (3-byte UTF-8) conversion may fail for 4-byte characters, so at the very least it should be utf8mb4 (4-byte UTF-8). I am not so familiar with ListPager to say if there could be other issues arising from that- sending a code review would be easier for better context.
On Thu, Oct 14, 2021 at 5:16 PM Sergey Dorofeev <sergey@fidoman.ru mailto:sergey@fidoman.ru> wrote: Hello,
I have got issue with ListFiles page in mediawiki 1.35.1 Filtering worked not very good, was case-sensitive and not always got text in middle of file name. I looked in DB and saw that img_name column is varbinary, but pagers/ImageListPager.php tries to do case-insensitive select with LOWERing both sides of strings. But LOWER does not work for varbinary So I think that following change will be reasonable:
--- ImageListPager.php.orig 2021-10-14 16:31:52.000000000 +0300 +++ ImageListPager.php 2021-10-14 16:00:10.127694733 +0300 @@ -90,9 +90,10 @@
if ( $nt ) { $dbr = wfGetDB( DB_REPLICA );
$this->mQueryConds[] = 'LOWER(img_name)'
.
$this->mQueryConds[] =
'LOWER(CONVERT(img_name USING utf8))' . $dbr->buildLike( $dbr->anyString(),
strtolower(
$nt->getDBkey() ), $dbr->anyString() );
mb_strtolower(
$nt->getDBkey() ), $dbr->anyString() );
} }
@@ -161,9 +162,9 @@ $nt = Title::newFromText( $this->mSearch ); if ( $nt ) { $dbr = wfGetDB( DB_REPLICA );
$conds[] = 'LOWER(' . $prefix . '_name)'
.
$conds[] = 'LOWER(CONVERT(' . $prefix .
'_name USING utf8))' . $dbr->buildLike( $dbr->anyString(),
strtolower(
$nt->getDBkey() ), $dbr->anyString() );
mb_strtolower(
$nt->getDBkey() ), $dbr->anyString() ); } }
-- Sergey _______________________________________________ Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org mailto:wikitech-l@lists.wikimedia.org To unsubscribe send an email to wikitech-l-leave@lists.wikimedia.org mailto:wikitech-l-leave@lists.wikimedia.org https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/ https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
-- Jaime Crespo <http://wikimedia.org http://wikimedia.org/>
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org mailto:wikitech-l@lists.wikimedia.org To unsubscribe send an email to wikitech-l-leave@lists.wikimedia.org mailto:wikitech-l-leave@lists.wikimedia.org https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/ https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/_______________________________________________
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org To unsubscribe send an email to wikitech-l-leave@lists.wikimedia.org https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
I don't want to defend MySQL development decisions- in fact PHP made some similarly bad ones, but it would be unfair to judge them too harsly with the "power of hindsight" [0]- but... /pedantic on
On Thu, Oct 14, 2021 at 7:37 PM Roy Smith roy@panix.com wrote:
What part of "universal" did they not understand?
... several years ago, during the end of the century/start of a new one, no one used UTF-8 [1] and PHP didn't even support multi-byte strings. The original spec for UTF-8 called for up to 6 bytes[2]. The BMP, however (3 bytes) contained characters for most modern languages [3], which was a waste of space and performance because at the time, MySQL worked much faster with fixed-width columns, which would be a waste of space (double!). My guess is that someone said "this is probably good enough", and would it be too outrageous to think that we may not need as many extra characters as stars in our Galaxy, when less than 65K were practically needed?
3 things changed after that: * Unicode limited UTF-8 to encoding for 21 bits in 2003 [4], requiring only 4 bytes- only one more than on MySQL's utf8 * Apple wanted to sell iPhones in Japan, so they were added to unicode in 2010, and its subsequent popularity * MySQL/InnoDB has been highly optimized for the fast handling of variable-length strings
However, you cannot just arbitrarily break backwards compatibility and rename the meaning of configuration- specially with storage software that has been continuously supporting incremental upgrades as long as I can remember. You can just support the new standard and encourage its usage, make it the default, etc.
This is a bit offtopic here (feel free to PM to continue the conversation), and just to be clear, I am _not fully justifying the decisions_, just giving historical context, but I want to end with some relevant lessons to the list:
* It is very difficult to build future-proof applications- PHP, MySQL, Mediawiki, they have a long history and we should be gentle when we judge them from the future. My work, involving backups, makes sometimes supporting storage of stuff for over 5 years (unchanged) challenging, because encryption algorithms are found to be weak, or end up being unsupported/unavailable in just 2 releases of the operating system! * Standards also change, they are not as "universal" as we may want to believe (there have been 32 extra unicode versions since 1991). I expect new collations to be needed in the future that are currently not implemented, too. * It is ok to make "mistakes", as long as we learn from them and improve upon them :-)
Sorry for the text block.
[0] url:https://powerlisting.fandom.com/wiki/Hindsight [1] url:https://commons.wikimedia.org/wiki/File:Utf8webgrowth.svg [2] url:https://www.rfc-editor.org/rfc/rfc2279 [3] url: https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane [4] url:https://www.rfc-editor.org/rfc/rfc3629
s/they/emojis/
On Fri, Oct 15, 2021 at 2:12 PM Jaime Crespo jcrespo@wikimedia.org wrote:
I don't want to defend MySQL development decisions- in fact PHP made some similarly bad ones, but it would be unfair to judge them too harsly with the "power of hindsight" [0]- but... /pedantic on
On Thu, Oct 14, 2021 at 7:37 PM Roy Smith roy@panix.com wrote:
What part of "universal" did they not understand?
... several years ago, during the end of the century/start of a new one, no one used UTF-8 [1] and PHP didn't even support multi-byte strings. The original spec for UTF-8 called for up to 6 bytes[2]. The BMP, however (3 bytes) contained characters for most modern languages [3], which was a waste of space and performance because at the time, MySQL worked much faster with fixed-width columns, which would be a waste of space (double!). My guess is that someone said "this is probably good enough", and would it be too outrageous to think that we may not need as many extra characters as stars in our Galaxy, when less than 65K were practically needed?
3 things changed after that:
- Unicode limited UTF-8 to encoding for 21 bits in 2003 [4], requiring
only 4 bytes- only one more than on MySQL's utf8
- Apple wanted to sell iPhones in Japan, so they were added to unicode in
2010, and its subsequent popularity
- MySQL/InnoDB has been highly optimized for the fast handling of
variable-length strings
However, you cannot just arbitrarily break backwards compatibility and rename the meaning of configuration- specially with storage software that has been continuously supporting incremental upgrades as long as I can remember. You can just support the new standard and encourage its usage, make it the default, etc.
This is a bit offtopic here (feel free to PM to continue the conversation), and just to be clear, I am _not fully justifying the decisions_, just giving historical context, but I want to end with some relevant lessons to the list:
- It is very difficult to build future-proof applications- PHP, MySQL,
Mediawiki, they have a long history and we should be gentle when we judge them from the future. My work, involving backups, makes sometimes supporting storage of stuff for over 5 years (unchanged) challenging, because encryption algorithms are found to be weak, or end up being unsupported/unavailable in just 2 releases of the operating system!
- Standards also change, they are not as "universal" as we may want to
believe (there have been 32 extra unicode versions since 1991). I expect new collations to be needed in the future that are currently not implemented, too.
- It is ok to make "mistakes", as long as we learn from them and improve
upon them :-)
Sorry for the text block.
[0] url:https://powerlisting.fandom.com/wiki/Hindsight [1] url:https://commons.wikimedia.org/wiki/File:Utf8webgrowth.svg [2] url:https://www.rfc-editor.org/rfc/rfc2279 [3] url: https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane [4] url:https://www.rfc-editor.org/rfc/rfc3629
wikitech-l@lists.wikimedia.org