Making $wgLegalTitleChars easier to use outside PHP - Wikitech-l

10 Dec 2021

Hi,

Erutuon and I are working on a Rust library[1] that validates and
normalizes MediaWiki page titles, initially for use in bots and tools,
though it has other potential use cases too.

There is some prior art for this: a mediawiki-title npm package[2] is
used by various Node.js services, and there's a mediawiki.Title
ResourceLoader module in core too. All of these reimplement MediaWiki's
title parsing, validation and normalization routines, basically
MediaWikiTitleCodec and what it calls.

One problem area is $wgLegalTitleChars. It's part of a PHP regex that 
gets used to check if any invalid characters are present by inverting 
it, basically [^{legalchars}]. It's also exposed via the meta=siteinfo 
API, which suggests that it's useful for external callers, but in its 
current form it's really not.

The current value is:
  $wgLegalTitleChars = " %!\"$&'()*,\\-.\\/0-9:;=?@A-Z\\\\^_`a-z~\\x80-\\xFF+";

It is a mix of literal characters, some escaped and some not, and
various ranges. Specifically, it's tuned for PHP regexes: for example,
/ is escaped because it delimits the regex (an escape that is
unnecessary, and therefore problematic, in Rust), and it uses syntax
like \x80-\xFF, while in JavaScript we want \u0080-\uFFFF. We'd also
like to use the Unicode escape class syntax in Rust, but for now have
worked around it by using the byte class.
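
To make the byte class vs. Unicode class problem concrete, here is a
rough sketch in Python (picked only because its re module can match on
both bytes and str; it is not one of the engines discussed above).
LEGAL is assumed to be the text PHP hands to its regex engine after
PHP's own string unescaping, and the function names are made up for
illustration:

```python
import re

# Assumption: the current $wgLegalTitleChars value after PHP string unescaping.
LEGAL = r""" %!"$&'()*,\-.\/0-9:;=?@A-Z\\^_`a-z~\x80-\xFF+"""

# Inverted class, i.e. [^{legalchars}], compiled once per matching mode.
BYTE_RE = re.compile(b"[^" + LEGAL.encode("ascii") + b"]")  # matches on UTF-8 bytes
CHAR_RE = re.compile("[^" + LEGAL + "]")                    # matches on code points


def has_illegal_bytes(title: str) -> bool:
    """Byte-wise check: \\x80-\\xFF covers every byte of a multi-byte UTF-8 sequence."""
    return BYTE_RE.search(title.encode("utf-8")) is not None


def has_illegal_chars_naive(title: str) -> bool:
    """Code-point check with the same class: \\x80-\\xFF now only means U+0080..U+00FF."""
    return CHAR_RE.search(title) is not None


print(has_illegal_bytes("日本語"))        # False
print(has_illegal_chars_naive("日本語"))  # True
```

The same class string accepts a title when matching on bytes and
rejects it when matching on code points, which is why the range has to
be rewritten for any Unicode-aware engine.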

In fact, we have functions that parse the byte class and turn it into a
Unicode class: Title::convertByteClassToUnicodeClass() in PHP and a
JavaScript version too[3]. This seems entirely unnecessary to me, given
that we could just write this sequence in a more regex-neutral manner to 
make it more portable:

  $wgLegalTitleCharacters = [
      'characters' => " %!\"$&'()*,-./:;=?@\\\\^_`~",
      // +
      'plus' => true,
      // A-Z, a-z, 0-9
      'alphanumeric' => true,
      // \x80-\xFF
      'non-ascii' => true,
  ];

The 'characters' entry holds literal characters to be run through
preg_quote or mw.util.escapeRegExp, and the ranges are to be specified
in whatever format the specific regex engine expects.
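
As a rough illustration of how a consumer might use the proposed array,
here is a sketch in Python, with re.escape standing in for preg_quote /
mw.util.escapeRegExp. The dict is a hypothetical mirror of the PHP
array, build_illegal_char_regex is a made-up name, the list is assumed
to hold each literal (including the backslash) once, and the upper
bound U+10FFFF for the non-ASCII range is my assumption for a
code-point-based engine:

```python
import re

# Hypothetical Python mirror of the proposed $wgLegalTitleCharacters array.
LEGAL_TITLE_CHARACTERS = {
    "characters": " %!\"$&'()*,-./:;=?@\\^_`~",  # literals, one backslash
    "plus": True,
    "alphanumeric": True,
    "non-ascii": True,
}


def build_illegal_char_regex(cfg: dict) -> "re.Pattern[str]":
    """Build the inverted class [^...] in this engine's own syntax."""
    # Literals are escaped by the engine's own helper, never by hand.
    parts = [re.escape(cfg["characters"])]
    if cfg.get("plus"):
        parts.append(re.escape("+"))
    if cfg.get("alphanumeric"):
        parts.append("A-Za-z0-9")
    if cfg.get("non-ascii"):
        # Assumption: everything from U+0080 up to the last code point.
        parts.append("\\u0080-\\U0010FFFF")
    return re.compile("[^" + "".join(parts) + "]")


illegal = build_illegal_char_regex(LEGAL_TITLE_CHARACTERS)
print(illegal.search("Main Page"))        # None
print(illegal.search("A|B").group())      # |
```

This only covers the character check; the rest of title validation
(length limits, namespaces, relative paths and so on) is out of scope
here.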

But this also raises the question: is there a valid use case for
customizing $wgLegalTitleChars anymore? We already have a comment that
says "Don't change this unless you know what you're doing". There's
also one that says "In some rare cases you may wish to remove + for
compatibility with old links." Is that still a consideration today?

I would think that for easy importing/exporting across various MediaWiki 
wikis we want the set of legal title characters to be rather static. And 
if someone wants to ban some character from being used, 
Extension:TitleBlacklist (bundled) provides a much less invasive way to 
do so.

So to recap:
1. Can we get rid of the ability to customize legal title characters?
2. If #1 is no, any objections to the breaking change of swapping out
    $wgLegalTitleChars (string) for $wgLegalTitleCharacters (array)?
    Note that extensions can still read from the old global; it just
    can't be overridden in LocalSettings anymore.

The actual patch to review is 
<https://gerrit.wikimedia.org/r/c/mediawiki/core/+/745386>.

[1] https://gitlab.com/mwbot-rs/mwbot/-/tree/master/mwtitle
[2] https://github.com/wikimedia/mediawiki-title/
[3] https://github.com/wikimedia/mediawiki-title/blob/master/lib/utils.js