Hello,
On most wikis, MediaWiki is configured to convert the first letter of a title to uppercase, but apparently it doesn't convert every Unicode character: for example, on frwiki ɽ https://fr.wikipedia.org/w/index.php?title=%C9%BD&redirect=no is a different article than Ɽ https://fr.wikipedia.org/wiki/%E2%B1%A4, even though the second character is the uppercase version of the first one in Unicode.
So, which characters are actually converted to uppercase by the title normalization?
I need to know this information to stop reporting some false positives in WPCleaner https://fr.wikipedia.org/wiki/Wikip%C3%A9dia:WPCleaner.
Thanks, Nico
MediaWiki uses PHP's mb_strtoupper.
I believe this uses the normal Unicode uppercase algorithm. However, the result can vary depending on the version of Unicode. We are currently in the process of switching to PHP 7, but for the moment we are still using HHVM's uppercasing code. There's a list of differences between HHVM and PHP 7.2 uppercasing at https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-con... [All this is probably subject to change]
However, I am at a loss as to why HHVM and PHP < 5.6 [1] wouldn't map that character, since the ɽ -> Ɽ mapping has been present since Unicode 5 (2006). I guess it was using really old Unicode data or something.
See also bug T219279 [2]
-- Brian
[1] https://3v4l.org/GHt3b [2] https://phabricator.wikimedia.org/T219279
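To illustrate the point that the uppercase mapping depends on the Unicode data a runtime ships with, here is a minimal sketch in Python (not the PHP code MediaWiki runs). The helper name describe_upper is hypothetical; it only reports what the local interpreter does, which may differ from HHVM or any given PHP version.

```python
import unicodedata


def describe_upper(ch: str) -> str:
    """Report how this Python runtime uppercases a single character."""
    up = ch.upper()
    # List the code points of the result, since some mappings yield
    # more than one character.
    codes = " ".join("U+%04X" % ord(c) for c in up)
    return "U+%04X %s -> %s (%s)" % (ord(ch), ch, up, codes)


# The Unicode version behind str.upper() in this interpreter.
print("Unicode data version:", unicodedata.unidata_version)
print(describe_upper("ɽ"))
```

A runtime built on Unicode data older than 5.0 would leave ɽ unchanged, which is one plausible explanation for the behaviour described above.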
Thanks Brian,
Thanks for the link to Php72ToUpper.php! I think I understand it now: for example, the first line says 'ƀ' => 'ƀ', which should mean that this letter shouldn't be converted to uppercase by MW? That's one of the letters I found that wasn't converted to uppercase and that was generating a false positive in my code: so it's because specific MW code is preventing the conversion :-)
Nico
Hi!
No, that file is a temporary measure during a transition between two versions of PHP.
In HHVM and PHP 5.x, calling mb_strtoupper("ƀ") would give the erroneous result "ƀ".
In PHP 7.x, the result is the correct capitalization.
The issue is that the titles of wiki articles get normalized, so under PHP 7 we would have
ƀar => Ƀar
which would prevent you from being able to reach the existing page stored under the lowercase title.
Once we're done with the transition and we go through the process of converting the (several hundred) pages/users that have the wrong title normalization, we will remove that table and obtain the correct behaviour.
I think you just need to subscribe to https://phabricator.wikimedia.org/T219279 and wait for its resolution. Most Unicode horrors are fixed in recent versions of PHP, including the one you were citing.
Cheers,
Giuseppe
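For an offline tool, the override-table idea described above could be sketched roughly as follows. This is Python, not the actual MediaWiki code; the names LEGACY_UPPER_OVERRIDES and upper_first are hypothetical, and the table holds only the single entry quoted in this thread rather than the real contents of Php72ToUpper.php. Note also that Python's str.upper() can expand some characters (for example "ß" becomes "SS"), so it is only an approximation of any particular PHP version.

```python
from typing import Mapping

# Illustrative excerpt only: the thread quotes a single entry ('ƀ' => 'ƀ').
# The real table is in Php72ToUpper.php and is not reproduced here.
LEGACY_UPPER_OVERRIDES: Mapping[str, str] = {
    "ƀ": "ƀ",  # U+0180 is left unchanged, as HHVM / PHP 5.x did
}


def upper_first(title: str,
                overrides: Mapping[str, str] = LEGACY_UPPER_OVERRIDES) -> str:
    """Uppercase only the first character of a title, honouring overrides."""
    if not title:
        return title
    first, rest = title[0], title[1:]
    # Look in the override table first; otherwise fall back to the
    # uppercase mapping of the local Unicode database.
    return overrides.get(first, first.upper()) + rest


print(upper_first("ƀar"))      # override applies, title is unchanged
print(upper_first("ƀar", {}))  # no overrides, plain Unicode uppercasing
```

Once the table is retired upstream, passing an empty mapping (or deleting the table) would restore plain Unicode behaviour without touching the rest of the code.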
Thanks Giuseppe !
I've subscribed to T219279 to know when the pages are properly converted, and when I can remove the hack in my code.
Nico
Last question (I believe): I've implemented something similar to Php72ToUpper in WPCleaner, and it seems to work fine for removing false positives. I have only one left on frwiki: ⅷ https://fr.wikipedia.org/w/index.php?title=%E2%85%B7&redirect=no. My code still converts it to uppercase, but on frwiki there is one page for the lowercase letter and one page for the uppercase letter, so this letter is not converted to uppercase by the current MediaWiki version. Is it missing from Php72ToUpper, where an entry would prevent it from being converted with PHP 7.2?
Nico
Apparently that will change in php7.3, which we will move to eventually but probably not anytime soon: https://3v4l.org/W7TiC
-- bawolff
Hi Nico, if possible, can your tool actually use the MW API to normalize titles? It's a very quick API call, and you can do multiple titles at once; it will save you a lot of grief over incompatibilities. --Yuri
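As a rough sketch of that suggestion, the following Python builds an action=query request for several titles and reads the "normalized" list that the API returns for titles it changed. The helper names build_normalize_url and normalized_map are hypothetical, the frwiki endpoint is used only as an example, and no network call is made here; fetching and JSON-decoding the response is left to the caller.

```python
from typing import Dict, Iterable
from urllib.parse import urlencode


def build_normalize_url(api_url: str, titles: Iterable[str]) -> str:
    """Build an action=query URL asking about several titles at once."""
    params = {
        "action": "query",
        "format": "json",
        "titles": "|".join(titles),  # titles are separated by a pipe
    }
    return api_url + "?" + urlencode(params)


def normalized_map(response: dict) -> Dict[str, str]:
    """Map each requested title to its normalized form.

    Titles that needed no change do not appear in the "normalized"
    list, so they are absent from the returned mapping.
    """
    entries = response.get("query", {}).get("normalized", [])
    return {entry["from"]: entry["to"] for entry in entries}


url = build_normalize_url("https://fr.wikipedia.org/w/api.php", ["ɽ", "foo bar"])
print(url)
```

Asking the wiki itself avoids having to mirror its uppercasing table, at the cost of requiring network access.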
Thanks Yuri,
I know of the normalization done through the API, but it doesn't work for the case I'm working on: it's a dump analysis, and I want it to be able to work offline...
Nico
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l