For title normalization, what characters are converted to uppercase?


Nicolas Vervelle

3 Aug 2019

4:56 p.m.

Hello,

On most wikis, MediaWiki is configured to convert the first letter of a title to uppercase, but apparently it does not convert every Unicode character: for example, on frwiki ɽ <https://fr.wikipedia.org/w/index.php?title=%C9%BD&redirect=no> is a different article than Ɽ <https://fr.wikipedia.org/wiki/%E2%B1%A4>, even though the second character is the uppercase version of the first one in Unicode.

So, which characters are actually converted to uppercase by title normalization? I need this information to stop reporting some false positives in WPCleaner <https://fr.wikipedia.org/wiki/Wikip%C3%A9dia:WPCleaner>.

Thanks,
Nico


bawolff

4 Aug

1:32 a.m.

MediaWiki uses PHP's mb_strtoupper. I believe this uses the normal Unicode uppercase algorithm; however, the result can vary depending on the version of Unicode. We are currently in the process of switching to PHP 7, but for the moment we are still using HHVM's uppercasing code. There's a list of differences between HHVM and PHP 7.2 uppercasing at https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-co… [All this is probably subject to change]

However, I am at a loss as to why HHVM and PHP < 5.6 [1] wouldn't map that character, since the ɽ -> Ɽ mapping has been present since Unicode 5 (2006). I guess it was using really old Unicode data or something.

See also bug T219279 [2].

--
Brian

[1] https://3v4l.org/GHt3b
[2] https://phabricator.wikimedia.org/T219279
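To make the Unicode-version dependence concrete, here is a minimal Python sketch (not MediaWiki's code) of first-letter capitalization. It assumes Python's `str.upper()` as a stand-in for `mb_strtoupper`; the helper name `ucfirst` is made up for this example. The result depends on the Unicode data bundled with the interpreter, which `unicodedata.unidata_version` reports.

```python
import unicodedata


def ucfirst(title: str) -> str:
    """Uppercase only the first character of a title (hypothetical helper).

    Uses the Unicode case mapping built into the running Python, which is an
    assumption standing in for PHP's mb_strtoupper.
    """
    if not title:
        return title
    return title[0].upper() + title[1:]


if __name__ == "__main__":
    # The mapping tables come from this Unicode version, so the output can
    # differ between runtimes, just as it does between HHVM and PHP 7.
    print("Unicode data version:", unicodedata.unidata_version)
    for title in ["ɽ", "ƀar", "paris"]:
        print(repr(title), "->", repr(ucfirst(title)))
```

A tool that compares titles offline would have to match the mapping of the runtime that serves the wiki, not merely the latest Unicode tables.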

Nicolas Vervelle

11:33 a.m.

Thanks Brian,

Thanks for the link to Php72ToUpper.php! I think I understand it: for example, the first line says 'ƀ' => 'ƀ', which should mean that this letter shouldn't be converted to uppercase by MW? That's one of the letters I found that wasn't converted to uppercase and that was generating a false positive in my code: so it's because specific MW code is preventing the conversion :-)

Nico

_______________________________________________ Wikitech-l mailing list Wikitech-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Giuseppe Lavagetto

5 Aug

7:03 a.m.

Hi!

No, that file is a temporary measure during a transition between two versions of PHP.

In HHVM and PHP 5.x, calling mb_strtoupper("ƀ") would give the erroneous result "ƀ". In PHP 7.x, the result is the correct capitalization.

The issue is that the titles of wiki articles get normalized, so under PHP 7 we would have

ƀar => Ƀar

which would prevent you from being able to reach the page.

Once we're done with the transition and have gone through the process of converting the (several hundred) pages/users that have the wrong title normalization, we will remove that table and obtain the correct behaviour.

You just need to subscribe to https://phabricator.wikimedia.org/T219279 and wait for its resolution, I think. Most Unicode horrors are fixed in recent versions of PHP, including the one you were citing.

Cheers,
Giuseppe

--
Giuseppe Lavagetto
Principal Site Reliability Engineer, Wikimedia Foundation
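A minimal Python sketch of the mechanism described in this reply: a table of overrides is consulted before the runtime's own case mapping, so characters that the old runtime left unchanged keep behaving that way. Only the 'ƀ' => 'ƀ' entry is taken from this thread; the names `LEGACY_OVERRIDES` and `ucfirst_legacy` are invented for illustration, and Python's `str.upper()` stands in for PHP 7's `mb_strtoupper`.

```python
# Characters that the old runtime did not uppercase map to themselves.
# Only this entry is quoted in the thread; the real table is longer.
LEGACY_OVERRIDES = {
    "ƀ": "ƀ",
}


def ucfirst_legacy(title: str, overrides: dict = LEGACY_OVERRIDES) -> str:
    """Uppercase the first character, honouring the override table first."""
    if not title:
        return title
    first, rest = title[0], title[1:]
    # Fall back to the runtime's Unicode mapping when there is no override.
    return overrides.get(first, first.upper()) + rest


if __name__ == "__main__":
    print(ucfirst_legacy("ƀar"))                # override applies, title unchanged
    print(ucfirst_legacy("ƀar", overrides={}))  # plain Unicode mapping
    print(ucfirst_legacy("bar"))                # ordinary ASCII case
```

Removing the table, as planned once the affected pages are renamed, corresponds to passing an empty `overrides` mapping.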

Nicolas Vervelle

8:45 a.m.

Thanks Giuseppe!

I've subscribed to T219279 to know when the pages are properly converted, and when I can remove the hack in my code.

Nico

Nicolas Vervelle

9:31 p.m.

Last question (I believe): I've implemented something similar to Php72ToUpper in WPCleaner, and it seems to work fine for removing false positives. I have only one left on frwiki: ⅷ <https://fr.wikipedia.org/w/index.php?title=%E2%85%B7&redirect=no>. My code still converts it to uppercase, but on frwiki there is one page for the lowercase letter and one page for the uppercase letter, so this letter is not converted to uppercase by the current MediaWiki version. Is an entry missing in Php72ToUpper to prevent it from being converted with PHP 7.2?

Nico

bawolff

11:24 p.m.

Apparently that will change in PHP 7.3, which we will move to eventually but probably not anytime soon: https://3v4l.org/W7TiC

--
bawolff
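Nico's observation about ⅷ can be turned into an offline check. The following Python sketch assumes a wiki with first-letter capitalization enabled and a plain iterable of page titles (without namespace prefixes) taken from a dump; any title whose first character still has a different uppercase form must begin with a character the wiki did not convert. The function name `unconverted_initials` and the sample titles are made up for this example, and Python's `str.upper()` stands in for whatever mapping the checking tool uses.

```python
from typing import Iterable, Set


def unconverted_initials(titles: Iterable[str]) -> Set[str]:
    """Return first characters that the wiki evidently left in lowercase.

    Assumes the wiki capitalizes the first letter of titles, so a surviving
    initial whose uppercase form differs was not converted by the wiki.
    """
    found = set()
    for title in titles:
        if not title:
            continue
        first = title[0]
        if first.upper() != first:
            found.add(first)
    return found


if __name__ == "__main__":
    # Hypothetical sample; in practice the titles would come from a dump.
    sample_titles = ["Paris", "ⅷ", "Ⅷ", "ɽ", "Ɽ", "1984"]
    print(sorted(unconverted_initials(sample_titles)))
```

The resulting set could seed an override table for the tool while it has to run without API access.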

Yuri Astrakhan

4 Aug

2:11 a.m.

Hi Nico,

If possible, can your tool actually use the MW API to normalize titles? It's a very quick API call, you can do multiple titles at once, and it will save you a lot of grief over incompatibilities.

--Yuri
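As a rough illustration of this suggestion, the Python sketch below builds a single `action=query` request for several titles and reads the `normalized` list that the MediaWiki API returns for titles it changed. The helper names and the sample response are illustrative assumptions, and no network call is made here.

```python
import json
from urllib.parse import urlencode

# frwiki endpoint, chosen because the thread's examples are from frwiki.
API_URL = "https://fr.wikipedia.org/w/api.php"


def build_normalize_url(titles, api_url=API_URL):
    """Build one query URL that asks the API about several titles at once."""
    params = {
        "action": "query",
        "format": "json",
        "titles": "|".join(titles),  # multiple titles are pipe-separated
    }
    return api_url + "?" + urlencode(params)


def parse_normalized(response_text):
    """Map each original title to its normalized form.

    Titles that needed no change do not appear in the 'normalized' list.
    """
    data = json.loads(response_text)
    entries = data.get("query", {}).get("normalized", [])
    return {entry["from"]: entry["to"] for entry in entries}


if __name__ == "__main__":
    print(build_normalize_url(["paris", "ɽ"]))
    # Illustrative response body, hand-written for this example.
    sample = '{"query": {"normalized": [{"from": "paris", "to": "Paris"}]}}'
    print(parse_normalized(sample))
```

Because the wiki itself performs the normalization, this approach follows whatever uppercasing rules the server currently applies, at the cost of requiring network access.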

Nicolas Vervelle

10:58 a.m.

Thanks Yuri,

I know about the normalization done through the API, but it doesn't work for the case I'm working on: it's a dump analysis, and I want it to be able to work offline...

Nico

