Hi,
Erutuon and I are working on a Rust library[1] that validates and normalizes MediaWiki page titles, initially for use in bots and tools, but it has other potential use cases too.
There is some prior art for this: a mediawiki-title npm package[2] is used by various Node.js services, and there's a mediawiki.Title ResourceLoader module in core too. All of these reimplement MediaWiki's title parsing, validation and normalization routines... basically MediaWikiTitleCodec and what it calls.
One problem area is $wgLegalTitleChars. It's part of a PHP regex that gets used to check if any invalid characters are present by inverting it, basically [^{legalchars}]. It's also exposed via the meta=siteinfo API, which suggests that it's useful for external callers, but in its current form it's really not.
The current value is:

$wgLegalTitleChars = " %!"$&'()*,\-.\/0-9:;=?@A-Z\\^_`a-z~\x80-\xFF+";
It's a mix of literal characters, some escaped and some not, and various ranges. Specifically, it's tuned for PHP regexes: / is escaped because it delimits PHP regexes (an unnecessary escape and therefore problematic in Rust), and it uses syntax like \x80-\xFF, whereas in JavaScript we want \u0080-\uFFFF. We'd also like to use the Unicode escape class syntax in Rust, but for now we have worked around it by using the byte class.
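To make the byte class vs. Unicode class difference concrete, here is a rough Python sketch (not taken from MediaWiki or from our library). It wraps the value above in [^...] and runs it three ways: against UTF-8 bytes, against text as-is, and against text with the last range widened. The names and the \U0010FFFF upper bound are my own assumptions for illustration, and real title validation does much more than this one check.

```python
import re

# The value quoted above, as a raw string so the backslashes reach the regex engine.
LEGAL_TITLE_CHARS = r""" %!"$&'()*,\-.\/0-9:;=?@A-Z\\^_`a-z~\x80-\xFF+"""

# Byte mode: match against UTF-8 bytes, which is what \x80-\xFF was written for.
ILLEGAL_BYTES = re.compile(b"[^" + LEGAL_TITLE_CHARS.encode("ascii") + b"]")

# Naive text mode: \x80-\xFF now only covers code points U+0080..U+00FF.
ILLEGAL_NAIVE = re.compile("[^" + LEGAL_TITLE_CHARS + "]")

# Widened text mode. Assumption: the upper bound is the last Unicode code point.
ILLEGAL_UNICODE = re.compile(
    "[^" + LEGAL_TITLE_CHARS.replace(r"\x80-\xFF", r"\u0080-\U0010FFFF") + "]"
)


def has_illegal_bytes(title: str) -> bool:
    """Hypothetical helper: check the UTF-8 encoding of a title in byte mode."""
    return ILLEGAL_BYTES.search(title.encode("utf-8")) is not None


if __name__ == "__main__":
    for title in ["Main Page", "Foo[bar]", "日本語"]:
        print(
            repr(title),
            "bytes:", has_illegal_bytes(title),
            "naive:", ILLEGAL_NAIVE.search(title) is not None,
            "unicode:", ILLEGAL_UNICODE.search(title) is not None,
        )
```

The naive text-mode pattern rejects anything above U+00FF, which is why every non-PHP consumer ends up rewriting the last range before it can use the string.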
In fact, we have functions that parse the byte class and turn it into a unicode class, Title::convertByteClassToUnicodeClass() in PHP and a JavaScript version too[3]. This seems entirely unnecessary to me, given that we could just write this sequence in a more regex-neutral manner to make it more portable:
$wgLegalTitleCharacters = [
    'characters' => " %!\"$&'()*,-./:;=?@\\^_`~",
    // +
    'plus' => true,
    // A-Z, a-z, 0-9
    'alphanumeric' => true,
    // \x80-\xFF
    'non-ascii' => true,
];
Characters are literal characters to run through preg_quote or mw.util.escapeRegExp, and the ranges are to be specified in whatever format the specific regex engine would like.
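As a rough illustration of how a consumer might use the structured form, here is a Python sketch that escapes the literal characters with the engine's own quoting function and spells the ranges in that engine's syntax. The function name, the dict layout (mirroring the proposed array) and the \U0010FFFF upper bound are assumptions on my part, not part of the patch.

```python
import re

# Python mirror of the proposed $wgLegalTitleCharacters array (illustrative only).
LEGAL_TITLE_CHARACTERS = {
    "characters": " %!\"$&'()*,-./:;=?@\\^_`~",
    "plus": True,
    "alphanumeric": True,
    "non-ascii": True,
}


def build_illegal_char_regex(config: dict) -> "re.Pattern[str]":
    """Hypothetical helper: build an inverted character class for Python's re."""
    # Literal characters go through the engine's own escaping function,
    # the counterpart of preg_quote or mw.util.escapeRegExp.
    parts = [re.escape(config["characters"])]
    if config["plus"]:
        parts.append(re.escape("+"))
    if config["alphanumeric"]:
        parts.append("A-Za-z0-9")
    if config["non-ascii"]:
        # Assumption: everything from U+0080 to the last code point is legal.
        parts.append("\\u0080-\\U0010FFFF")
    return re.compile("[^" + "".join(parts) + "]")


if __name__ == "__main__":
    illegal = build_illegal_char_regex(LEGAL_TITLE_CHARACTERS)
    for title in ["Main Page", "C++", "Foo|bar", "日本語"]:
        print(repr(title), "has illegal characters:", illegal.search(title) is not None)
```

Nothing here needs to parse or rewrite a regex fragment; each consumer only decides how to write three fixed ranges.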
But this also opens up the question - is there a valid use case for customizing $wgLegalTitleChars anymore? We already have a comment that says "Don't change this unless you know what you're doing". There's also one that says "In some rare cases you may wish to remove + for compatibility with old links." - is that still a consideration today?
I would think that for easy importing/exporting across various MediaWiki wikis we want the set of legal title characters to be rather static. And if someone wants to ban some character from being used, Extension:TitleBlacklist (bundled) provides a much less invasive way to do so.
So to recap:
1. Can we get rid of the ability to customize legal title characters?
2. If #1 is no, any objections to the breaking change of swapping out $wgLegalTitleChars (string) for $wgLegalTitleCharacters (array)? Note that extensions can still read from the old global; it just can't be overridden in LocalSettings anymore.
The actual patch to review is https://gerrit.wikimedia.org/r/c/mediawiki/core/+/745386.
[1] https://gitlab.com/mwbot-rs/mwbot/-/tree/master/mwtitle
[2] https://github.com/wikimedia/mediawiki-title/
[3] https://github.com/wikimedia/mediawiki-title/blob/master/lib/utils.js
wikitech-l@lists.wikimedia.org