Edward Z. Yang wrote:
Comments on message input would be very valuable, not necessarily from the wiki side, but from the PHP file for translators side. Specifically, now that your switch has been made, are you happy with the new implementation of the translator? Also, what do you think of the way extensions add localizable strings to the MessageCache? Would decentralizing the messages be a better policy for even the core parts of MediaWiki?
The performance of our extension system is poor: currently about 20% of CPU time on the cluster is used by extension registration of various kinds. Extensions in PHP need to be arranged so that on a normal startup, no code needs to be executed from any of the extensions. It should be a matter of checking the file modification times, and loading a pre-merged message array from a cache.
The input text could indeed be decentralized while maintaining good performance, with careful design. Whether or not to do this in any given application should probably be at the discretion of the application designer.
How about just decentralizing English messages, so they're never out of sync with the code?
They're very rarely out of sync. I think most of us have found that it's convenient to have all the English messages in the same place. Revising the English text can be done without reference to the code, just like translation can. Programmers sometimes make very poor choices for the default text.
Policy on how you reuse messages would be interesting too: how much would a message have to change before you just say, "Okay, we need to give it a new name so old customizations don't clobber it."? As well as naming conventions for the messages: what would you imagine a good policy would be?
The translators encourage us to avoid reuse. Just because two concepts can be expressed with the same word in English doesn't mean they can be expressed with the same word in every language. "OK" is a good example: we use it for a generic button label, but in some languages, they prefer to use a button label related to the operation which will be performed by the button.
An easy way to duplicate messages across all languages would reduce the programmer's need to reuse messages. Preferably you would have a reference rather than a full copy, to reduce maintenance.
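To make the reference idea concrete, here is a minimal sketch. It assumes a made-up convention in which a message whose text starts with "@" points at another message key; neither that convention nor the resolveMessage() helper exists in MediaWiki, they are only for illustration.

```php
<?php
// Hypothetical convention: a value starting with "@" is a reference to
// another message key in the same language's array.
function resolveMessage(array $messages, string $key, int $maxHops = 5): ?string {
    for ($hop = 0; $hop <= $maxHops; $hop++) {
        if (!array_key_exists($key, $messages)) {
            return null; // unknown key
        }
        $value = $messages[$key];
        if ($value === '' || $value[0] !== '@') {
            return $value; // plain text, nothing more to follow
        }
        $key = substr($value, 1); // follow the reference
    }
    return null; // too many hops, probably a reference loop
}

// English keeps a single text and points the second key at it.
$en = [
    'ok'                    => 'OK',
    'delete-confirm-button' => '@ok',
];

// Stand-in for a translation that prefers to name the operation:
// it simply overrides the reference with its own text.
$other = [
    'ok'                    => 'OK',
    'delete-confirm-button' => 'Delete',
];
```

With this arrangement the programmer creates a new key for every distinct use, English pays no maintenance cost for the duplicate, and a translator can break the link in one language without touching the others.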
Architecture is hard to figure out from just looking at code, so a real quick, high-level lowdown would be appreciated (doesn't have to be very long). Caching, I understand, is just lots of unserialize and filemtime calls.
Parameter substitution (from what I see) seems at first to be a very easy thing to do, but quickly gets very complicated. I'd like to know what the major pitfalls with substitution in obscure languages are (so I can determine whether, for my needs, I have to work around them or can blithely ignore them).
We didn't have plural support for a long time; I tend to think of it as eye candy. English needs plurals just like any other language, but programmers routinely abuse the language for our convenience. We now do have plurals, which I suppose makes for a nicer-looking product. For example:
'undelete_short' => 'Undelete {{PLURAL:$1|one edit|$1 edits}}',
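A rough sketch of how such a message could be expanded, for the English two-form case only. This is not MediaWiki's parser; expandPlural() is a hypothetical helper, and it assumes the parameter values contain no "|" or "}" characters.

```php
<?php
// Sketch only: substitutes $1, $2, ... and then resolves two-form
// {{PLURAL:n|singular|plural}} blocks using the English rule (n == 1).
function expandPlural(string $message, array $params): string {
    // Substitute the numbered parameters first, so PLURAL sees the number.
    $replacements = [];
    foreach (array_values($params) as $i => $value) {
        $replacements['$' . ($i + 1)] = (string)$value;
    }
    $text = strtr($message, $replacements);

    return preg_replace_callback(
        '/\{\{PLURAL:\s*(\d+)\s*\|([^|}]*)\|([^|}]*)\}\}/',
        function (array $m): string {
            return ((int)$m[1] === 1) ? $m[2] : $m[3];
        },
        $text
    );
}

echo expandPlural('Undelete {{PLURAL:$1|one edit|$1 edits}}', [1]), "\n"; // Undelete one edit
echo expandPlural('Undelete {{PLURAL:$1|one edit|$1 edits}}', [3]), "\n"; // Undelete 3 edits
```

Languages with more than two plural forms would need more alternatives in the message and a per-language rule in place of the n == 1 test.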
Grammatical transformations for agglutinative languages came a bit earlier. I implemented them for Finnish, when I was told that it was absolutely necessary if we wished to make our language files site-independent, i.e. to remove the Wikipedia references. In Finnish, "about Wikipedia" becomes "Tietoja Wikipediasta" and "you can upload it to Wikipedia" becomes "Voit tallentaa tiedoston Wikipediaan". Suffixes are added depending on how the word is used, plus minor modifications to the base. There is a long list of exceptions, but since we only needed to translate a few words, such as the site name, we didn't need to include it.
We now have grammatical transformation functions for 18 languages. Some of these are just dictionaries for Wikimedia site names, but many have proper algorithms.
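The dictionary style of transformation can be sketched in a few lines. The two Finnish forms are the ones quoted above; the case labels, the {{GRAMMAR:case|word}} notation as handled here, and both function names are assumptions made for the example, not a description of the actual MediaWiki code.

```php
<?php
// Dictionary-style transformation: case => [ word => inflected form ].
// Only the site name is covered, which is all the messages needed.
$fiGrammar = [
    'elative'  => ['Wikipedia' => 'Wikipediasta'],
    'illative' => ['Wikipedia' => 'Wikipediaan'],
];

// Hypothetical helper: look the word up, fall back to the bare word.
function convertGrammarSketch(array $table, string $word, string $case): string {
    return $table[$case][$word] ?? $word;
}

// Hypothetical helper: expand every {{GRAMMAR:case|word}} block in a message.
function expandGrammar(string $message, array $table): string {
    return preg_replace_callback(
        '/\{\{GRAMMAR:([^|}]+)\|([^|}]+)\}\}/',
        function (array $m) use ($table): string {
            return convertGrammarSketch($table, trim($m[2]), trim($m[1]));
        },
        $message
    );
}

echo expandGrammar('Tietoja {{GRAMMAR:elative|Wikipedia}}', $fiGrammar), "\n"; // Tietoja Wikipediasta
```

A language with a proper algorithm would replace the table lookup with suffix rules, keeping the dictionary only for the exceptions.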
Even before we had arbitrary grammatical transformation, we had a nominative/genitive distinction for month names. This distinction is absolutely necessary if you wish to substitute month names into sentences.
The other (much simpler) issue with parameter substitution is HTML escaping. Despite being much simpler, MediaWiki does a pretty poor job of it. We have a plethora of poorly-named wfMsg*() functions, including the multitasking wfMsgExt(), with lots of ways to slip up and let through unescaped user input. I'd like to do some work in that area at some stage.
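For illustration, one way to make the safe path the default is to escape every parameter at the point of substitution, so that callers cannot forget. msgHtml() below is a hypothetical helper, not one of the existing wfMsg*() functions, and it assumes the message text itself is trusted HTML.

```php
<?php
// Hypothetical helper: the template is trusted, the parameters are not.
// Every parameter is HTML-escaped before it is placed into the message.
function msgHtml(string $template, array $params): string {
    $replacements = [];
    foreach (array_values($params) as $i => $value) {
        $replacements['$' . ($i + 1)] =
            htmlspecialchars((string)$value, ENT_QUOTES, 'UTF-8');
    }
    // strtr() tries the longest key first, so "$10" is not mistaken for "$1".
    return strtr($template, $replacements);
}

echo msgHtml('<b>$1</b> has been deleted.', ['<script>alert(1)</script>']), "\n";
// <b>&lt;script&gt;alert(1)&lt;/script&gt;</b> has been deleted.
```

A caller that really does want to pass pre-built HTML would then have to ask for it explicitly, which is easier to audit than the reverse.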
In another post:
First, you have a Language object. This object contains all the localisable message strings, as well as other important language-specific settings and custom behavior (uppercasing, lowercasing, printing dates, formatting numbers, etc.)
There's also the MessageCache class, which handles input of text via the MediaWiki namespace. And there's the wfMsg*() functions in GlobalFunctions.php. When I wrote MessageCache.php, I intended the wfMsg*() functions to remain as simple shortcuts to functions in the MessageCache class. That was not how it turned out; I wasn't paying enough attention. Now we have large amounts of message retrieval code in GlobalFunctions.php.
Expiration checking consists of ensuring that all dependencies have filemtime values that match the ones bundled with the cached copy. Similar checking could be implemented for serialized versions, as it seems that they are not updated until manually recompiled.
The manual recompilation model was a mistake, and I intend to remove it at the earliest opportunity. It's inconvenient for site administrators. Caching is much more versatile, especially when you add dependency checking. The only problem is that you need to have a data store which is both fast to read and writable by the webserver. Such a store is not always available.
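A minimal sketch of caching with dependency checking, under stated assumptions: each source file is a PHP file that returns a message array, and a plain file on disk stands in for the fast, webserver-writable store. The function names are made up for the example and this is not the MediaWiki implementation.

```php
<?php
// Collect the current modification time of every dependency.
function currentMtimes(array $files): array {
    $mtimes = [];
    foreach ($files as $file) {
        clearstatcache(true, $file); // avoid PHP's cached stat results
        $mtimes[$file] = filemtime($file);
    }
    return $mtimes;
}

// Return the merged message array, rebuilding the cache only when the
// mtimes bundled with the cached copy no longer match the files on disk.
function loadMergedMessages(array $sourceFiles, string $cacheFile): array {
    $deps = currentMtimes($sourceFiles);

    if (is_file($cacheFile)) {
        $cached = unserialize(file_get_contents($cacheFile));
        if (is_array($cached)
            && isset($cached['deps'], $cached['messages'])
            && $cached['deps'] === $deps
        ) {
            return $cached['messages']; // fast path: no source file is loaded
        }
    }

    // Slow path: load every source file and merge, later files winning.
    $messages = [];
    foreach ($sourceFiles as $file) {
        $fileMessages = require $file; // each file is assumed to return an array
        $messages = array_merge($messages, $fileMessages);
    }

    file_put_contents(
        $cacheFile,
        serialize(['deps' => $deps, 'messages' => $messages]),
        LOCK_EX
    );
    return $messages;
}
```

On the fast path the only work is one filemtime() call per dependency plus a single unserialize(), and no manual recompilation step is needed: editing a source file changes its mtime and the next request rebuilds the cache.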
-- Tim Starling