Hi, I'm currently developing an I18N system for a moderately large PHP software library and would like to model it off of MediaWiki's setup, being an extremely well-tested installation (even though it is an application, not a library).
However, I do realize that the stability that comes with age also means that there might be a little bit of legacy cruft stuck in the files, esp. when concerning fundamental architectural changes.
I'd like to ask all and any developers who are familiar with MediaWiki's I18N structure to comment on your opinions of the strengths and weaknesses of the system (the code will tell me how it works) and what you would do different if you had the chance to rewrite the whole thing. Consider it documentation for an extremely undocumented section of the code: not a single result turns up for a Wikitech-l search of "i18n."
Thanks for your time! I understand you're all quite busy people.
Edward Z. Yang wrote:
> Hi, I'm currently developing an I18N system for a moderately large PHP software library and would like to model it off of MediaWiki's setup, being an extremely well-tested installation (even though it is an application, not a library).
Will this library be free?
> However, I do realize that the stability that comes with age also means that there might be a little bit of legacy cruft stuck in the files, esp. when concerning fundamental architectural changes.
> I'd like to ask all and any developers who are familiar with MediaWiki's I18N structure to comment on your opinions of the strengths and weaknesses of the system (the code will tell me how it works) and what you would do different if you had the chance to rewrite the whole thing. Consider it documentation for an extremely undocumented section of the code: not a single result turns up for a Wikitech-l search of "i18n."
I've made a number of posts to wikitech-l on this subject, but I don't use the term "i18n". As I once quipped on IRC, MediaWiki isn't internationalised, it's international. It's always been international, we never had to go through the process of internationalisation to get there. Or more precisely, it's multilingual. But also I don't like substituting digits for letters.
There are a number of aspects to our interface translation system:
* Message input
* Architecture, caching
* Parameter substitution: agglutination and escaping
Which of these are you interested in? Presumably our message input model (i.e. via a wiki) wouldn't be universally applicable.
-- Tim Starling
Tim Starling wrote:
> Will this library be free?
Of course! (maybe a little too free, being licensed under LGPL). A link, perhaps, would speak better: http://hp.jpsband.org/
> I've made a number of posts to wikitech-l on this subject, but I don't use the term "i18n". As I once quipped on IRC, MediaWiki isn't internationalised, it's international. It's always been international, we never had to go through the process of internationalisation to get there. Or more precisely, it's multilingual. But also I don't like substituting digits for letters.
For your benefit, I reran the search under "international", "language" and "Tim Starling". Some interesting threads were brought up:
http://mail.wikipedia.org/mailman/htdig/wikitech-l/2002-March/012496.html - extremely old thread on switching to UTF-8
http://mail.wikipedia.org/mailman/htdig/wikitech-l/2006-July/037090.html - your proposal for rewriting the language files
> There are a number of aspects to our interface translation system:
> - Message input
> - Architecture, caching
> - Parameter substitution: agglutination and escaping
> Which of these are you interested in? Presumably our message input model (i.e. via a wiki) wouldn't be universally applicable.
Comments on message input would be very valuable, not necessarily from the wiki side, but from the PHP file for translators side. Specifically, now that your switch has been made, are you happy with the new implementation of the translator? Also, what do you think of the method by which extensions add localizable strings to the MessageCache? Would decentralizing the messages be a better policy for even the core parts of MediaWiki? How about just decentralizing English messages, so they're never out of sync with the code?
Policy on how you reuse messages would be interesting too: how much would a message have to change before you just say, "Okay, we need to give it a new name so old customizations don't clobber it."? As well as naming conventions for the messages: what would you imagine a good policy would be?
Architecture is hard to figure out from just looking at code, so a real-quick, high level lowdown would be appreciated (doesn't have to be very long). Caching, I understand, is just lots of unserialize and filemtime calls.
Parameter substitution (from what I see) seems at first to be a very easy thing to do, but quickly gets very complicated. I'd like to know what the major pitfalls with substitution in obscure languages are (so I can determine whether or not, for my needs, I have to work around them, or I can blithely ignore them).
I suppose this is just a big jumble of questions that I'd like answered. I hope none of them are too time-consuming. Thanks for your help!
Edward Z. Yang wrote:
> Comments on message input would be very valuable, not necessarily from the wiki side, but from the PHP file for translators side. Specifically, now that your switch has been made, are you happy with the new implementation of the translator? Also, what do you think of the method by which extensions add localizable strings to the MessageCache? Would decentralizing the messages be a better policy for even the core parts of MediaWiki?
The performance of our extension system is poor, currently about 20% of CPU time on the cluster is used by extension registration of various kinds. Extensions in PHP need to be arranged so that on a normal startup, no code needs to be executed from any of the extensions. It should be a matter of checking the file modification times, and loading a pre-merged message array from a cache.
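As a rough illustration of that idea (not MediaWiki's actual code), the sketch below lets extensions register nothing but file paths; the files are only executed when a modification-time check shows the cached, pre-merged array is stale. The function names `loadMergedMessages()` and `cacheIsFresh()`, the cache layout, and the assumption that each message file simply returns an array are all hypothetical.

```php
<?php
// Hypothetical helpers; none of these names exist in MediaWiki.

/**
 * Return the merged messages for the registered files. The files are only
 * executed when the cache is missing or a recorded mtime no longer matches.
 */
function loadMergedMessages(array $messageFiles, string $cacheFile): array
{
    // Fast path: unserialize the cache and compare the recorded mtimes.
    if (is_file($cacheFile)) {
        $cached = unserialize(file_get_contents($cacheFile));
        if (is_array($cached) && isset($cached['deps'], $cached['messages'])
            && cacheIsFresh($cached['deps'], $messageFiles)) {
            return $cached['messages'];
        }
    }

    // Slow path: run every file, merge, and store the result with its dependencies.
    clearstatcache();
    $messages = [];
    $deps = [];
    foreach ($messageFiles as $file) {
        $fileMessages = require $file; // assumed to "return [...]"
        $messages = array_merge($messages, $fileMessages);
        $deps[$file] = filemtime($file);
    }
    file_put_contents($cacheFile, serialize(['deps' => $deps, 'messages' => $messages]));
    return $messages;
}

/** True if the same files are registered and none has changed on disk. */
function cacheIsFresh(array $deps, array $messageFiles): bool
{
    if (array_keys($deps) !== array_values($messageFiles)) {
        return false; // the set of registered files changed
    }
    foreach ($deps as $file => $mtime) {
        clearstatcache(true, $file);
        if (!is_file($file) || filemtime($file) !== $mtime) {
            return false;
        }
    }
    return true;
}
```

On a warm cache this costs one unserialize plus one filemtime call per registered file, and no extension code runs at all.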
The input text could indeed be decentralized while maintaining good performance, with careful design. Whether or not to do this in any given application should probably be at the discretion of the application designer.
> How about just decentralizing English messages, so they're never out of sync with the code?
They're very rarely out of sync. I think most of us have found that it's convenient to have all the English messages in the same place. Revising the English text can be done without reference to the code, just like translation can. Programmers sometimes make very poor choices for the default text.
> Policy on how you reuse messages would be interesting too: how much would a message have to change before you just say, "Okay, we need to give it a new name so old customizations don't clobber it."? As well as naming conventions for the messages: what would you imagine a good policy would be?
The translators encourage us to avoid reuse. Just because two concepts can be expressed with the same word in English doesn't mean they can be expressed with the same word in every language. "OK" is a good example: we use it for a generic button label, but in some languages, they prefer to use a button label related to the operation which will be performed by the button.
An easy way to duplicate messages across all languages would reduce the programmer's need to reuse messages. Preferably you would have a reference rather than a full copy, to reduce maintenance.
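One way such a reference could work is sketched below. The `@other-key` syntax and the `resolveMessage()` function are invented for illustration; MediaWiki has no such feature as far as this thread says. A language that is happy to reuse "OK" stores a reference, while a language that wants a more specific label simply overrides the entry with real text.

```php
<?php
// Hypothetical convention: a value of the form "@other-key" refers to another message.
function resolveMessage(array $messages, string $key, int $maxDepth = 5): ?string
{
    for ($depth = 0; $depth <= $maxDepth; $depth++) {
        if (!isset($messages[$key])) {
            return null; // unknown key
        }
        $value = $messages[$key];
        if ($value === '' || $value[0] !== '@') {
            return $value; // plain text, not a reference
        }
        $key = substr($value, 1); // follow the reference
    }
    return null; // chain too long, probably a cycle
}

// The default text reuses "ok"; another language could replace the
// reference with its own wording without touching the code.
$defaults = ['ok' => 'OK', 'delete-confirm-button' => '@ok'];
echo resolveMessage($defaults, 'delete-confirm-button'), "\n"; // prints "OK"
```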
> Architecture is hard to figure out from just looking at code, so a real-quick, high level lowdown would be appreciated (doesn't have to be very long). Caching, I understand, is just lots of unserialize and filemtime calls.
> Parameter substitution (from what I see) seems at first to be a very easy thing to do, but quickly gets very complicated. I'd like to know what the major pitfalls with substitution in obscure languages are (so I can determine whether or not, for my needs, I have to work around them, or I can blithely ignore them).
We didn't have plural support for a long time, I tend to think of it as eye candy. English needs plurals just like any other language, but programmers routinely abuse the language for our convenience. We now do have plurals, which I suppose makes for a nicer-looking product. For example:
'undelete_short' => 'Undelete {{PLURAL:$1|one edit|$1 edits}}',
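To show how a message like this might be expanded, here is a minimal sketch that handles only a two-form (singular/plural) rule. The function name `expandMessage()` is hypothetical and this is not MediaWiki's parser; languages with more than two plural forms would need a per-language rule instead of the `=== 1` test.

```php
<?php
// Hypothetical function; supports only {{PLURAL:n|singular|plural}}.
function expandMessage(string $message, array $params): string
{
    // Substitute $1, $2, ... first, so PLURAL sees the actual number.
    $text = preg_replace_callback('/\$(\d+)/', function ($m) use ($params) {
        $index = (int)$m[1] - 1;
        return array_key_exists($index, $params) ? (string)$params[$index] : $m[0];
    }, $message);

    // Then pick the singular or plural branch.
    return preg_replace_callback(
        '/\{\{PLURAL:(\d+)\|([^|}]*)\|([^|}]*)\}\}/',
        function ($m) {
            return ((int)$m[1] === 1) ? $m[2] : $m[3];
        },
        $text
    );
}

echo expandMessage('Undelete {{PLURAL:$1|one edit|$1 edits}}', [1]), "\n"; // Undelete one edit
echo expandMessage('Undelete {{PLURAL:$1|one edit|$1 edits}}', [3]), "\n"; // Undelete 3 edits
```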
Grammatical transformations for agglutinative languages came a bit earlier. I implemented them for Finnish, when I was told that it was absolutely necessary if we wished to make our language files site-independent, i.e. to remove the Wikipedia references. In Finnish, "about Wikipedia" becomes "Tietoja Wikipediasta" and "you can upload it to Wikipedia" becomes "Voit tallentaa tiedoston Wikipediaan". Suffixes are added depending on how the word is used, plus minor modifications to the base. There is a long list of exceptions, but since we only needed to translate a few words, such as the site name, we didn't need to include it.
We now have grammatical transformation functions for 18 languages. Some of these are just dictionaries for Wikimedia site names, but many have proper algorithms.
Even before we had arbitrary grammatical transformation, we had a nominative/genitive distinction for month names. This distinction is absolutely necessary if you wish to substitute month names into sentences.
The other (much simpler) issue with parameter substitution is HTML escaping. Despite being much simpler, MediaWiki does a pretty poor job of it. We have a plethora of poorly-named wfMsg*() functions, including the multitasking wfMsgExt(), with lots of ways to slip up and let through unescaped user input. I'd like to do some work in that area at some stage.
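A minimal sketch of the safer pattern, assuming the message text itself is trusted and only the parameters may carry user input: escape each parameter at the moment it is substituted, so there is no separate step to forget. `msgHtml()` is a hypothetical name, not one of the wfMsg*() functions.

```php
<?php
// Hypothetical helper: message text is trusted, parameters are not.
function msgHtml(string $message, array $params): string
{
    return preg_replace_callback('/\$(\d+)/', function ($m) use ($params) {
        $index = (int)$m[1] - 1;
        if (!array_key_exists($index, $params)) {
            return $m[0]; // leave unknown placeholders untouched
        }
        // Escape at substitution time so raw input never reaches the output.
        return htmlspecialchars((string)$params[$index], ENT_QUOTES, 'UTF-8');
    }, $message);
}

echo msgHtml('<b>$1</b> was deleted', ['<script>x</script>']), "\n";
// prints: <b>&lt;script&gt;x&lt;/script&gt;</b> was deleted
```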
In another post:
> First, you have a Language object. This object contains all the localisable message strings, as well as other important language-specific settings and custom behavior (uppercasing, lowercasing, printing dates, formatting numbers, etc.)
There's also the MessageCache class, which handles input of text via the MediaWiki namespace. And there's the wfMsg*() functions in GlobalFunctions.php. When I wrote MessageCache.php, I intended the wfMsg*() functions to remain as simple shortcuts to functions in the MessageCache class. That was not how it turned out; I wasn't paying enough attention. Now we have large amounts of message retrieval code in GlobalFunctions.php.
> Expiration checking consists of ensuring that all dependencies have a filemtime that matches the one bundled with the cached copy. Similar checking could be implemented for serialized versions, as it seems that they are not updated until manually recompiled.
The manual recompilation model was a mistake, and I intend to remove it at the earliest opportunity. It's inconvenient for site administrators. Caching is much more versatile, especially when you add dependency checking. The only problem is that you need to have a data store which is both fast to read and writable by the webserver. Such a store is not always available.
-- Tim Starling
I did quite a bit of thinking on what you posted here, and some of it is quite fabulous. I still have a few questions:
- If you had to rename every message in MediaWiki, would you have enforced a definite naming convention? Namespaced them?
> Grammatical transformations for agglutinative languages came a bit earlier. I implemented them for Finnish, when I was told that it was absolutely necessary if we wished to make our language files site-independent, i.e. to remove the Wikipedia references. In Finnish, "about Wikipedia" becomes "Tietoja Wikipediasta" and "you can upload it to Wikipedia" becomes "Voit tallentaa tiedoston Wikipediaan". Suffixes are added depending on how the word is used, plus minor modifications to the base. There is a long list of exceptions, but since we only needed to translate a few words, such as the site name, we didn't need to include it.
> We now have grammatical transformation functions for 18 languages. Some of these are just dictionaries for Wikimedia site names, but many have proper algorithms.
If they're just dictionaries, I wonder whether or not you couldn't have had wfMsg("sitename-$form"), just like the nominative/genitive distinction for month names? The current setup makes it difficult to add your own name to the dictionaries without editing PHP files (not that, I imagine, anyone would care).
> There's also the MessageCache class, which handles input of text via the MediaWiki namespace.
The MediaWiki namespace is quite an innovation on your part, and extremely convenient. However, that's out of the scope for my purposes.
Once again, thanks!
Edward Z. Yang wrote:
> I did quite a bit of thinking on what you posted here, and some of it is quite fabulous. I still have a few questions:
> - If you had to rename every message in MediaWiki, would you have enforced a definite naming convention? Namespaced them?
Sure, I would have a naming convention. We've already more or less standardised on using prefixes for extension messages. Namespaces for core messages would probably be useful too. My main bugbear with the current system is punctuation style: Lee used strings of unbroken lowercase letters, which I think is hard to read. They're always quoted, so why not have spaces between words? Or underscores, if that's too progressive.
> Grammatical transformations for agglutinative languages came a bit earlier. I implemented them for Finnish, when I was told that it was absolutely necessary if we wished to make our language files site-independent, i.e. to remove the Wikipedia references. In Finnish, "about Wikipedia" becomes "Tietoja Wikipediasta" and "you can upload it to Wikipedia" becomes "Voit tallentaa tiedoston Wikipediaan". Suffixes are added depending on how the word is used, plus minor modifications to the base. There is a long list of exceptions, but since we only needed to translate a few words, such as the site name, we didn't need to include it.
> We now have grammatical transformation functions for 18 languages. Some of these are just dictionaries for Wikimedia site names, but many have proper algorithms.
> If they're just dictionaries, I wonder whether or not you couldn't have had wfMsg("sitename-$form"), just like the nominative/genitive distinction for month names? The current setup makes it difficult to add your own name to the dictionaries without editing PHP files (not that, I imagine, anyone would care).
If you have an algorithm, then you only have to enter the site name once, not six times. That's very convenient, so I would encourage people to implement algorithms rather than dictionaries. Even a comprehensive dictionary isn't going to be much good for site names which are invented words.
Most conversion functions respect $wgGrammarForms, which is an exception list for grammatical transforms that can be customised from LocalSettings.php, just like the site name.
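The combination of an algorithm plus an exception list could look roughly like the sketch below. It is deliberately naive: it reproduces the two Finnish forms quoted earlier in the thread for "Wikipedia" but ignores vowel harmony, stem changes and non-ASCII final vowels, and it is not MediaWiki's Finnish implementation. The function name is hypothetical, and the `$exceptions[case][word]` layout is an assumption about how one language's slice of an exception list might be shaped.

```php
<?php
// Grossly simplified and hypothetical; real Finnish inflection is far more involved.
function convertGrammarSketch(string $word, string $case, array $exceptions = []): string
{
    // An entry in the exception list always wins over the algorithm.
    if (isset($exceptions[$case][$word])) {
        return $exceptions[$case][$word];
    }
    switch ($case) {
        case 'elative':  // "about X" / "from X"
            return $word . 'sta';
        case 'illative': // "into X": lengthen the final vowel and add "n" (ASCII only)
            return $word . substr($word, -1) . 'n';
        default:
            return $word; // unknown case: leave the word unchanged
    }
}

echo 'Tietoja ', convertGrammarSketch('Wikipedia', 'elative'), "\n";   // Tietoja Wikipediasta
echo convertGrammarSketch('Wikipedia', 'illative'), "\n";              // Wikipediaan
```

With an algorithm, a site administrator enters the site name once; the exception list only has to cover the words the algorithm gets wrong.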
-- Tim Starling
On 11/24/06, Tim Starling tstarling@wikimedia.org wrote:
> Edward Z. Yang wrote:
> > I did quite a bit of thinking on what you posted here, and some of it is quite fabulous. I still have a few questions:
> > - If you had to rename every message in MediaWiki, would you have enforced a definite naming convention? Namespaced them?
> Sure, I would have a naming convention. We've already more or less standardised on using prefixes for extension messages. Namespaces for core messages would probably be useful too. My main bugbear with the current system is punctuation style: Lee used strings of unbroken lowercase letters, which I think is hard to read. They're always quoted, so why not have spaces between words? Or underscores, if that's too progressive.
Someone suggested on IRC once that something might break if we use spaces, because the message names also serve as page names and maybe whatever creates the pages will expect underscores. I usually use hyphens, myself . . . I find them less obtrusive and easier to type than underscores.
On 24/11/06, Simetrical Simetrical+wikitech@gmail.com wrote:
> Someone suggested on IRC once that something might break if we use spaces, because the message names also serve as page names and maybe whatever creates the pages will expect underscores. I usually use hyphens, myself . . . I find them less obtrusive and easier to type than underscores.
That's bollocks, spaces in message names are fine and there's no reason we can't use them. We have, I think, a few messages with spaces in their names at present as it is.
Rob Church
On 11/24/06, Rob Church robchur@gmail.com wrote:
> On 24/11/06, Simetrical Simetrical+wikitech@gmail.com wrote:
> > Someone suggested on IRC once that something might break if we use spaces, because the message names also serve as page names and maybe whatever creates the pages will expect underscores. I usually use hyphens, myself . . . I find them less obtrusive and easier to type than underscores.
> That's bollocks, spaces in message names are fine and there's no reason we can't use them. We have, I think, a few messages with spaces in their names at present as it is.
There aren't, actually. Only in $dateFormats and $bookstoreList, not $messages. Doesn't mean they won't work, though.
Rob Church wrote:
> On 24/11/06, Simetrical Simetrical+wikitech@gmail.com wrote:
> > Someone suggested on IRC once that something might break if we use spaces, because the message names also serve as page names and maybe whatever creates the pages will expect underscores. I usually use hyphens, myself . . . I find them less obtrusive and easier to type than underscores.
> That's bollocks, spaces in message names are fine and there's no reason we can't use them. We have, I think, a few messages with spaces in their names at present as it is.
I think I used ucfirst instead of Title::newFromText() to convert from message names to DB keys, for efficiency reasons. That could be changed, of course. The efficiency of the message cache system has been completely blown to hell anyway, by the invention of parsed messages. Here I am worrying about a few more lines of PHP, and then someone adds a shell out to Tidy.
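To make the concern concrete: page titles are stored with underscores in place of spaces, so a conversion that only uppercases the first letter would leave any spaces in a message name untouched. The sketch below shows a cheap conversion that also normalises spaces, without going through a full title parser. `messageNameToDbKey()` is a hypothetical name, and ucfirst() only handles single-byte first letters.

```php
<?php
// Hypothetical function: uppercase the first letter and turn spaces into
// underscores, which is the minimum needed for names that contain spaces.
function messageNameToDbKey(string $name): string
{
    return ucfirst(str_replace(' ', '_', $name));
}

echo messageNameToDbKey('undelete_short'), "\n"; // Undelete_short
echo messageNameToDbKey('undelete short'), "\n"; // Undelete_short
```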
-- Tim Starling
On 25/11/06, Tim Starling tstarling@wikimedia.org wrote:
> I think I used ucfirst instead of Title::newFromText() to convert from message names to DB keys, for efficiency reasons. That could be changed, of course. The efficiency of the message cache system has been completely blown to hell anyway, by the invention of parsed messages. Here I am worrying about a few more lines of PHP, and then someone adds a shell out to Tidy.
In an ideal world, we wouldn't need to run it through Tidy...
Rob Church
Here is my analysis of MediaWiki's I18N system:
== Structure ==
First, you have a Language object. This object contains all the localisable message strings, as well as other important language-specific settings and custom behavior (uppercasing, lowercasing, printing dates, formatting numbers, etc.)
The object is constructed from two sources: subclassed versions of itself (classes) and Message files (messages).
== General use ==
You load a language object by calling the Language::factory() function. This function loads the class file for the object (taking into account fallback languages by using the fallback language's object but overloading the language key) and returns that object. Nothing else happens.
When a message/etc is requested, a lazy load initializer is called. Now the real work starts. We're first going to take the scenario that the language is not cached. The system loads the Messages file by:
require( $filename );
$cache = compact( self::$mLocalisationKeys );
...where self::$mLocalisationKeys is the list of variable names that can be used in the localization file. This lets you use things like:
$fallback = false;
$rtl = false;
...and easily siphon them into arrays.
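A self-contained sketch of that require-then-compact pattern follows. It is not the MediaWiki code: `loadLocalisationFile()` is a hypothetical function, the key list is passed in as an argument, and the extra filtering step is there because compact() complains about undefined variable names on recent PHP versions.

```php
<?php
// Hypothetical loader. Assumes the allowed key list never contains the
// local names 'filename', 'keys' or 'defined', and that the file is trusted.
function loadLocalisationFile(string $filename, array $keys): array
{
    // The file's top-level assignments land in this function's local scope.
    require $filename;

    // Keep only the allowed names that the file actually defined.
    $defined = array_intersect($keys, array_keys(get_defined_vars()));

    // compact() turns the variable names into a name => value array.
    return compact($defined);
}
```

Given a file containing `$fallback = false; $rtl = true;` and the key list `['fallback', 'rtl', 'messages']`, the result is `['fallback' => false, 'rtl' => true]`; any other variable the file sets is ignored.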
Then, we load the $fallback language (if not set, English) to fill in the gaps in the messages. There is specialized behavior for certain keys, as they can be mergeable maps, lists or alias lists (not sure what the last one is).
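The gap-filling step might be sketched as below, assuming each language's data is an array keyed by the localisation keys. `mergeWithFallback()` and the `$mergeableKeys` parameter are hypothetical names, and only the "mergeable map" case is shown; lists and alias lists would need their own handling.

```php
<?php
// Hypothetical function; $mergeableKeys stands in for whatever list of
// map-valued keys the real code treats as mergeable.
function mergeWithFallback(array $cache, array $fallbackCache, array $mergeableKeys): array
{
    foreach ($fallbackCache as $key => $fallbackValue) {
        if (!array_key_exists($key, $cache)) {
            // Key missing entirely: take the fallback's value as-is.
            $cache[$key] = $fallbackValue;
        } elseif (in_array($key, $mergeableKeys, true)
            && is_array($cache[$key]) && is_array($fallbackValue)) {
            // Map-valued key: our entries win, the fallback fills the gaps.
            $cache[$key] = $cache[$key] + $fallbackValue;
        }
    }
    return $cache;
}
```

The array union operator keeps the left-hand value whenever both sides define the same key, which is exactly the "translation first, fallback second" rule.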
== Caching ==
MediaWiki has lots of caching mechanisms built in, which make the code somewhat more difficult to understand. Before doing any loading, MediaWiki will check the following places to see if we can be lazy:
1. $mLocalisationCache[$code] - just a variable where it may have been stashed
2. serialized/$code.ser - compiled serialized language file
3. Memcached version of file (with expiration checking)
Expiration checking consists of ensuring that all dependencies have a filemtime that matches the one bundled with the cached copy. Similar checking could be implemented for serialized versions, as it seems that they are not updated until manually recompiled.
== Behavior ==
Things that are localizable:
- Weekdays (and abbrev)
- Months (and abbrev)
- Bookstores
- Skin names
- Math names
- Date preferences
- Date format
- Default date format
- Date preference migration map
- Default user option overrides
- Language names
- Timezones
- Character encoding conversion via iconv
- UpperLowerCase first (needs casemaps for some)
- UpperLowerCase
- Uppercase words
- Uppercase word breaks
- Case folding
- Strip punctuation for MySQL search
- Get first character
- Alternate encoding
- Recoding for edit (and then recode input)
- RTL
- Direction mark character depending on RTL
- Arrow depending on RTL
- Languages where italics cannot be used
- Number formatting (commafy, transform digits, transform separators)
- Truncate (multibyte)
- Grammar conversions for inflected languages
- Plural transformations
- Formatting expiry times
- Segmenting for diffs (Chinese)
- Convert to variants of language
- Language specific user preference options
- Link trails [[foo]]bar
- Language code (RFC 3066)
Neat functionality:
- I18N sprintfDate
- Roman numeral formatting
wikitech-l@lists.wikimedia.org