Hi,
Erutuon and I are writing a Rust library[1] that
validates and normalizes MediaWiki page titles, initially for use in
bots and tools, but it has other potential use cases too.
There is some prior art for this: a mediawiki-title npm package[2] is
used by various Node.js services, and there's a mediawiki.Title
ResourceLoader module in core too. All of these reimplement MediaWiki's
title parsing, validation and normalization routines, basically
MediaWikiTitleCodec and what it calls.
One problem area is $wgLegalTitleChars. It's part of a PHP regex that
gets used to check if any invalid characters are present by inverting
it, basically [^{legalchars}]. It's also exposed via the meta=siteinfo
API, which suggests that it's useful for external callers, but in its
current form it's really not.
The current value is:
$wgLegalTitleChars = " %!\"$&'()*,\\-.\\/0-9:;=?@A-Z\\\\^_`a-z~\\x80-\\xFF+";
It's a mix of literal characters, some escaped and some not, and
various ranges. It's also tuned specifically for PHP regexes: / is
escaped because it delimits PHP regexes (an unnecessary escape, and
therefore problematic, in Rust), and it uses byte syntax like
\x80-\xFF, while in JavaScript we want \u0080-\uFFFF. We'd also like to
use the Unicode escape class syntax in Rust, but for now we have worked
around it by using the byte class.
In fact, we have functions that parse the byte class and turn it into a
Unicode class: Title::convertByteClassToUnicodeClass() in PHP and a
JavaScript version too[3]. This seems entirely unnecessary to me, given
that we could just write this sequence in a more regex-neutral manner to
make it more portable:
$wgLegalTitleCharacters = [
    'characters' => " %!\"$&'()*,-./:;=?@\\\\^_`~",
    // +
    'plus' => true,
    // A-Z, a-z, 0-9
    'alphanumeric' => true,
    // \x80-\xFF
    'non-ascii' => true,
];
The 'characters' entry holds literal characters to run through
preg_quote or mw.util.escapeRegExp, and the ranges can then be written
in whatever format the specific regex engine wants.
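
To illustrate why the structured form is easier to consume, here is a
rough sketch of how a Rust caller might check a title against it without
any regex at all. The struct, field and function names are hypothetical
(they just mirror the array keys proposed above), and this is not how
mwtitle currently does it. It assumes 'non-ascii' means "anything outside
ASCII", which is what the \x80-\xFF byte range amounts to for UTF-8 text.

```rust
/// Hypothetical Rust mirror of the proposed $wgLegalTitleCharacters array.
struct LegalTitleChars<'a> {
    /// Literal characters, with no regex escaping applied.
    characters: &'a str,
    /// Whether '+' is allowed.
    plus: bool,
    /// Whether A-Z, a-z and 0-9 are allowed.
    alphanumeric: bool,
    /// Whether characters outside ASCII are allowed
    /// (the \x80-\xFF byte range, seen from UTF-8).
    non_ascii: bool,
}

impl LegalTitleChars<'_> {
    /// Returns true if `c` is allowed by this configuration.
    fn is_legal(&self, c: char) -> bool {
        (self.alphanumeric && c.is_ascii_alphanumeric())
            || (self.non_ascii && !c.is_ascii())
            || (self.plus && c == '+')
            || self.characters.contains(c)
    }

    /// Returns the first character of `title` that is not allowed, if any.
    /// This plays the role of matching against [^{legalchars}].
    fn first_illegal(&self, title: &str) -> Option<char> {
        title.chars().find(|&c| !self.is_legal(c))
    }
}

/// The values from the proposed default, written as plain Rust literals.
fn default_legal_title_chars() -> LegalTitleChars<'static> {
    LegalTitleChars {
        characters: " %!\"$&'()*,-./:;=?@\\^_`~",
        plus: true,
        alphanumeric: true,
        non_ascii: true,
    }
}
```

With these defaults, first_illegal("Foo[bar]") reports '[', and setting
plus to false makes "C++" fail on '+'. A regex-based consumer would
instead escape 'characters' with its engine's own quoting function and
append the ranges in its own syntax.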
But this also opens up the question - is there a valid use case for
customizing $wgLegalTitleChars anymore? We already have a comment that
says "Don't change this unless you know what you're doing". There's
also one that says "In some rare cases you may wish to remove + for
compatibility with old links." - is that still a consideration today?
I would think that for easy importing/exporting across various MediaWiki
wikis we want the set of legal title characters to be rather static. And
if someone wants to ban some character from being used,
Extension:TitleBlacklist (bundled) provides a much less invasive way to
do so.
So to recap:
1. Can we get rid of the ability to customize legal title characters?
2. If #1 is no, any objections to the breaking change of swapping out
$wgLegalTitleChars (string) for $wgLegalTitleCharacters (array)?
Note that extensions can still read from the old global; it just
can't be overridden in LocalSettings anymore.
The actual patch to review is
<https://gerrit.wikimedia.org/r/c/mediawiki/core/+/745386>.
[1] https://gitlab.com/mwbot-rs/mwbot/-/tree/master/mwtitle
[2] https://github.com/wikimedia/mediawiki-title/
[3] https://github.com/wikimedia/mediawiki-title/blob/master/lib/utils.js