Edward Z. Yang wrote:
Comments on message input would be very valuable, not necessarily from
the wiki side, but from the PHP file for translators side.
Specifically, now that your switch has been made, are you happy with the
new implementation of the translator? Also, what do you think of the
method by which extensions add localizable strings to the MessageCache?
Would decentralizing the messages be a better policy for even the core
parts of MediaWiki?
The performance of our extension system is poor: currently about 20% of
CPU time on the cluster is used by extension registration of various kinds.
Extensions in PHP need to be arranged so that on a normal startup, no code
needs to be executed from any of the extensions. It should be a matter of
checking the file modification times, and loading a pre-merged message array
from a cache.
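A minimal sketch of that startup path, assuming each extension ships a
message file that simply returns an array. The function names, the cache
layout and the file paths are hypothetical, not MediaWiki's actual API:

```php
<?php
// Hypothetical helper: return the merged messages of all extensions.
// On a normal request only filemtime() checks and one unserialize() run;
// extension message files are executed only when the cache is stale.
function loadMergedMessages(array $sourceFiles, string $cacheFile): array {
    clearstatcache();
    if (is_readable($cacheFile)) {
        $cached = unserialize(
            file_get_contents($cacheFile),
            ['allowed_classes' => false]
        );
        if (is_array($cached)
            && isset($cached['deps'], $cached['messages'])
            && isCacheFresh($cached['deps'], $sourceFiles)
        ) {
            return $cached['messages']; // fast path
        }
    }

    // Slow path: execute every message file and rebuild the cache.
    $messages = [];
    $deps = [];
    foreach ($sourceFiles as $file) {
        $deps[$file] = filemtime($file);
        $fileMessages = require $file; // file is assumed to return an array
        $messages = array_merge($messages, $fileMessages);
    }
    file_put_contents(
        $cacheFile,
        serialize(['deps' => $deps, 'messages' => $messages]),
        LOCK_EX
    );
    return $messages;
}

// The cache is fresh if it covers exactly the same files, in the same
// order, and every recorded modification time still matches.
function isCacheFresh(array $deps, array $sourceFiles): bool {
    if (array_keys($deps) !== array_values($sourceFiles)) {
        return false;
    }
    foreach ($deps as $file => $mtime) {
        if (!is_file($file) || filemtime($file) !== $mtime) {
            return false;
        }
    }
    return true;
}
```

The cache file has to live somewhere the webserver can write to, which is
the main practical constraint of this approach.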
The input text could indeed be decentralized while maintaining good
performance, with careful design. Whether or not to do this in any given
application should probably be at the discretion of the application designer.
How about just decentralizing English messages, so they're never out of
sync with the code?
They're very rarely out of sync. I think most of us have found that it's
convenient to have all the English messages in the same place. Revising the
English text can be done without reference to the code, just like
translation can. Programmers sometimes make very poor choices for the
default text.
Policy on how you reuse messages would be interesting too: how much
would a message have to change before you just say, "Okay, we need to
give it a new name so old customizations don't clobber it."? As well as
naming conventions for the messages: what would you imagine a good
policy would be?
The translators encourage us to avoid reuse. Just because two concepts can
be expressed with the same word in English doesn't mean they can be
expressed with the same word in every language. "OK" is a good example: we
use it for a generic button label, but in some languages, they prefer to use
a button label related to the operation which will be performed by the button.
An easy way to duplicate messages across all languages would reduce the
programmer's need to reuse messages. Preferably you would have a reference
rather than a full copy, to reduce maintenance.
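One way such a reference could work, as a sketch only: a message whose
text starts with "@" points at another key, and a translator can replace
the reference with real text at any time. The "@" convention, the key
names and resolveMessage() are all hypothetical:

```php
<?php
// Hypothetical alias lookup: "@other-key" means "use the text of
// other-key". Returns null for unknown keys and for alias chains
// longer than $maxDepth (most likely a cycle).
function resolveMessage(array $messages, string $key, int $maxDepth = 5): ?string {
    for ($i = 0; $i < $maxDepth; $i++) {
        if (!isset($messages[$key])) {
            return null;
        }
        $text = $messages[$key];
        if ($text === '' || $text[0] !== '@') {
            return $text; // ordinary message text
        }
        $key = substr($text, 1); // follow the reference
    }
    return null;
}

// Default: the confirmation button simply refers to the generic label.
$defaults = [
    'ok' => 'OK',
    'delete-confirm-button' => '@ok',
];

// A language that prefers operation-specific labels overrides the
// reference. (Placeholder text, not a real translation.)
$override = [
    'ok' => 'OK',
    'delete-confirm-button' => 'Delete page',
];
```

With these arrays, resolveMessage($defaults, 'delete-confirm-button')
yields "OK", while the same call against $override yields the
operation-specific label, without the programmer having reused the 'ok'
key in the code.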
Architecture is hard to figure out from just looking at code, so a
real-quick, high-level lowdown would be appreciated (doesn't have to be
very long). Caching, I understand, is just lots of unserialize and
filemtime calls.
Parameter substitution (from what I see) seems at first to be a very
easy thing to do, but quickly gets very complicated. I'd like to know
what the major pitfalls with substitution in obscure languages are (so I can
determine whether or not, for my needs, I have to work around them, or I
can blithely ignore them).
We didn't have plural support for a long time; I tend to think of it as eye
candy. English needs plurals just like any other language, but programmers
routinely abuse the language for our convenience. We now do have plurals,
which I suppose makes for a nicer-looking product. For example:
'undelete_short' => 'Undelete {{PLURAL:$1|one edit|$1 edits}}',
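To illustrate how a message like this expands, here is a simplified
stand-in for the real parser. It assumes exactly two plural forms and the
English rule (singular only for 1); expandMessage() is a hypothetical
name, and other languages would need their own selection rules:

```php
<?php
// Simplified sketch: substitute $1, $2, ... and then resolve two-form
// {{PLURAL:n|singular|plural}} constructs using the English rule.
function expandMessage(string $template, array $params): string {
    // Step 1: replace $N with the N-th parameter (1-based).
    $text = preg_replace_callback(
        '/\$(\d+)/',
        function (array $m) use ($params): string {
            $index = (int)$m[1] - 1;
            return array_key_exists($index, $params)
                ? (string)$params[$index]
                : $m[0]; // leave unknown placeholders untouched
        },
        $template
    );

    // Step 2: pick the singular or plural form based on the number.
    return preg_replace_callback(
        '/\{\{PLURAL:([^|}]*)\|([^|}]*)\|([^|}]*)\}\}/',
        function (array $m): string {
            return ((int)$m[1] === 1) ? $m[2] : $m[3];
        },
        $text
    );
}
```

Calling expandMessage() on the 'undelete_short' text with [1] gives
"Undelete one edit", and with [3] gives "Undelete 3 edits".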
Grammatical transformations for agglutinative languages came a bit earlier.
I implemented them for Finnish, when I was told that it was absolutely
necessary if we wished to make our language files site-independent, i.e. to
remove the Wikipedia references. In Finnish, "about Wikipedia" becomes
"Tietoja Wikipediasta" and "you can upload it to Wikipedia" becomes
"Voit tallentaa tiedoston Wikipediaan". Suffixes are added depending on how the
word is used, plus minor modifications to the base. There is a long list of
exceptions, but since we only needed to translate a few words, such as the
site name, we didn't need to include it.
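A rough sketch of what such a transformation can look like for the two
Finnish forms quoted above. This is not the actual MediaWiki code:
convertGrammarFi() and its exceptions parameter are hypothetical, the
vowel harmony test is simplified, and stem changes are ignored entirely,
so it is only adequate for a handful of site names:

```php
<?php
// Suffix-based case conversion for simple Finnish words (sketch).
// $exceptions maps case => word => inflected form for words the
// rules below would get wrong.
function convertGrammarFi(string $word, string $case, array $exceptions = []): string {
    if (isset($exceptions[$case][$word])) {
        return $exceptions[$case][$word];
    }

    // Simplified vowel harmony: words containing a, o or u take "a"
    // in the suffix, all other words take "ä".
    $a = preg_match('/[aou]/i', $word) ? 'a' : 'ä';

    switch ($case) {
        case 'elative': // Wikipedia -> Wikipediasta
            return $word . 'st' . $a;
        case 'illative': // Wikipedia -> Wikipediaan
            // Double the final vowel and add "n"; words that do not
            // end in a vowel are returned unchanged in this sketch.
            if (preg_match('/([aeiouyäö])$/u', $word, $m)) {
                return $word . $m[1] . 'n';
            }
            return $word;
        default:
            return $word;
    }
}
```

A message would then ask for the site name in a given case instead of
embedding an inflected "Wikipedia" directly, which is what makes the
language file site-independent.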
We now have grammatical transformation functions for 18 languages. Some of
these are just dictionaries for Wikimedia site names, but many have proper
algorithms.
Even before we had arbitrary grammatical transformation, we had a
nominative/genitive distinction for month names. This distinction is
absolutely necessary if you wish to substitute month names into sentences.
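As an illustration, Polish uses the genitive of the month name inside a
full date. The sketch below keeps both forms per month and picks the
genitive when building a date; the array layout and function names are
hypothetical, and only three months are listed:

```php
<?php
// Both forms of each month name (Polish, first three months only).
$monthNames = [
    1 => ['nominative' => 'styczeń', 'genitive' => 'stycznia'],
    2 => ['nominative' => 'luty',    'genitive' => 'lutego'],
    3 => ['nominative' => 'marzec',  'genitive' => 'marca'],
];

// Look up a month name in the requested case.
function monthName(array $names, int $month, string $case = 'nominative'): string {
    return $names[$month][$case];
}

// A date inside a sentence needs the genitive form of the month.
function formatDate(array $names, int $day, int $month, int $year): string {
    return $day . ' ' . monthName($names, $month, 'genitive') . ' ' . $year;
}
```

Here formatDate($monthNames, 5, 1, 2006) produces "5 stycznia 2006",
whereas a standalone month heading would use monthName() with the
default nominative case.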
The other (much simpler) issue with parameter substitution is HTML escaping.
Despite being much simpler, MediaWiki does a pretty poor job of it. We have
a plethora of poorly-named wfMsg*() functions, including the multitasking
wfMsgExt(), with lots of ways to slip up and let through unescaped user
input. I'd like to do some work in that area at some stage.
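One possible shape for a safer interface, sketched under the assumption
that the message text itself is trusted HTML and every parameter is
untrusted. msgHtml() is a hypothetical helper, not one of the existing
wfMsg*() functions:

```php
<?php
// Substitute parameters into a trusted HTML template, escaping every
// parameter so user input cannot inject markup.
function msgHtml(string $template, ...$params): string {
    $replacements = [];
    foreach ($params as $i => $value) {
        $replacements['$' . ($i + 1)] =
            htmlspecialchars((string)$value, ENT_QUOTES, 'UTF-8');
    }
    if (!$replacements) {
        return $template;
    }
    // strtr() replaces in a single pass and tries the longest keys
    // first, so "$10" is not mistaken for "$1" followed by "0", and
    // escaped parameter text is never substituted into again.
    return strtr($template, $replacements);
}
```

Because escaping happens in one place by default, a caller has to go out
of their way to emit raw user input, rather than having to remember to
escape it at every call site.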
In another post:
First, you have a Language object. This object contains all the
localisable message strings, as well as other important
language-specific settings and custom behavior (uppercasing,
lowercasing, printing dates, formatting numbers, etc.)
There's also the MessageCache class, which handles input of text via the
MediaWiki namespace. And there's the wfMsg*() functions in
GlobalFunctions.php. When I wrote MessageCache.php, I intended the wfMsg*()
functions to remain as simple shortcuts to functions in the MessageCache
class. That was not how it turned out; I wasn't paying enough attention. Now
we have large amounts of message retrieval code in GlobalFunctions.php.
Expiration checking consists of ensuring that all dependencies have
filemtimes that match the ones bundled with the cached copy. Similar
checking could be implemented for serialized versions, as it seems that
they are not updated until manually recompiled.
The manual recompilation model was a mistake, and I intend to remove it at
the earliest opportunity. It's inconvenient for site administrators. Caching
is much more versatile, especially when you add dependency checking. The
only problem is that you need to have a data store which is both fast to
read and writable by the webserver. Such a store is not always available.
-- Tim Starling