2016-04-12 14:01 GMT+03:00 Adrian Heine <adrian.heine(a)wikimedia.de>:
Hi everyone,
as some of you might know, I'm a software developer at Wikimedia
Deutschland, working on Wikidata. I'm currently focusing on improving
Wikidata's support for languages we as a team are not using on a daily
basis. As part of my work I stumbled over a shortcoming in MediaWiki's
message system that – as far as I see it – prevents me from doing the right
thing(tm). I'm asking you to verify that the issue I see indeed is an issue
and that we want to fix it. Subsequently, I'm interested in hearing your
plans or goals for MediaWiki's message system so that I can align my
implementation with them. Finally, I am hoping to find someone who is
willing to help me fix it.
First of all, thanks for working on this issue. It is a real issue,
but a fix is not often requested. I think that is because manually
checking in every place whether the language code is unexpected
(different from the one in the current context) would be cumbersome,
and always outputting language codes on every tag would be bloated.
Ideally this check would be automated in a templating library, but so
far templating hasn't seen much adoption in MediaWiki core. Of course,
this information needs to be exposed first, which is what I understand
you are doing.
== The issue ==
On Wikidata, we regularly have content in different languages on the same
page. We use the HTML lang and dir attributes accordingly. For example, we
have a table with terms for an entity in different languages. For missing
terms, we would display a message in the UI language within this table. The
corresponding HTML (simplified) might look like this:
<div id="mw-content-text" lang="UILANG" dir="UILANG_DIR">
<table class="entity-terms">
<tr class="entity-terms-for-OTHERLANG1" lang="OTHERLANG1" dir="OTHERLANG1_DIR">
<td class="entity-terms-for-OTHERLANG1-label">
<div class="wb-empty" lang="UILANG" dir="UILANG_DIR">
<!-- missing label message -->
</div>
</td>
</tr>
</table>
</div>
This works great as long as the missing label message is available in the UI
language. If that is not the case, though, the message is translated
according to the defined language fallbacks. In that case, we might end up
with something like this:
<div class="wb-empty" lang="arc" dir="rtl">No label defined</div>
That's obviously wrong, and I'd like to fix it.
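
To illustrate the intended result: if the message lookup also reported the language the text actually came from, the wrapper element could be marked up with that language instead of the UI language. The following is only a minimal sketch in Python, not MediaWiki code; the function name `render_empty_label` and the `RTL_LANGS` set are hypothetical and the set is deliberately incomplete.

```python
from html import escape

# Hypothetical, incomplete set of right-to-left language codes,
# used here only for illustration.
RTL_LANGS = {"ar", "arc", "fa", "he"}


def render_empty_label(text, message_lang):
    """Wrap a message in a div whose lang/dir describe the message text itself.

    message_lang is the language the text was actually taken from,
    which may differ from the requested UI language after fallback.
    """
    direction = "rtl" if message_lang in RTL_LANGS else "ltr"
    return '<div class="wb-empty" lang="{}" dir="{}">{}</div>'.format(
        escape(message_lang, quote=True), direction, escape(text)
    )


# UI language is "arc", but the message fell back to English:
print(render_empty_label("No label defined", "en"))
```

With the fallback language passed in, the English text is labelled as `lang="en" dir="ltr"` rather than `lang="arc" dir="rtl"`.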
== Fixing it ==
For fixing this, I tried to make MessageCache provide the language a message
was taken from [1]. That's not too straightforward to begin with, but while
working on it I realized that MessageCache is only responsible for following
the language fallback chain for database translations. For file-based
translations, the fallbacks are directly merged in by LocalisationCache, so
the information is not there anymore at the time of translating a message. I
see some ways to fix this:
* Don't merge messages in LocalisationCache, but perform the fallback on
request (possibly caching the result)
* Tag message strings in LocalisationCache with the language they are in
(sounds expensive to me)
* Tag message strings as being a fallback in LocalisationCache (that way we
could follow the fallback until we find a language in which the message
string is not tagged as being a fallback)
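
The first two options can be sketched roughly as follows. This is a simplified Python model, not the actual LocalisationCache or MessageCache code: the `MESSAGES` and `FALLBACKS` tables, the message key `missing-label`, and both function names are assumptions made up for the example.

```python
# Hypothetical per-language message files and fallback chains.
MESSAGES = {
    "arc": {},
    "en": {"missing-label": "No label defined"},
}
FALLBACKS = {"arc": ["en"], "en": []}


def lookup_on_request(key, lang):
    """Option 1: do not merge; walk the fallback chain at lookup time.

    Returns (text, source_language), or (None, None) if nothing is found.
    """
    for code in [lang] + FALLBACKS.get(lang, []):
        text = MESSAGES.get(code, {}).get(key)
        if text is not None:
            return text, code
    return None, None


def build_tagged_cache(lang):
    """Option 2: merge fallbacks when building the cache, but tag each
    string with the language it came from."""
    merged = {}
    # Apply the most distant fallback first so closer languages override it.
    for code in reversed([lang] + FALLBACKS.get(lang, [])):
        for key, text in MESSAGES.get(code, {}).items():
            merged[key] = (text, code)
    return merged


print(lookup_on_request("missing-label", "arc"))
print(build_tagged_cache("arc")["missing-label"])
```

Both variants yield the text together with its source language. The difference is where the cost lands: option 1 pays at lookup time (unless the result is cached), while option 2 stores an extra language code per string in each merged cache.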
What do you think?
The current localisation cache implementation quite obviously trades
space for speed. In this light I would suggest option two, to tag the
actual language the string is in.
However, this trade-off might not make sense anymore, as we have more
languages and more messages, resulting in caches of almost a gigabyte.
See, for example, https://phabricator.wikimedia.org/T99740. I added
wikitech-l to CC in the hope that people who have worked on the
localisation cache more recently will comment on whether option one,
not merging messages, would make more sense nowadays.
-Niklas