Hi everyone,
as some of you might know, I'm a software developer at Wikimedia Deutschland, working on Wikidata. I'm currently focusing on improving Wikidata's support for languages we as a team are not using on a daily basis. As part of my work I stumbled over a shortcoming in MediaWiki's message system that – as far as I see it – prevents me from doing the right thing(tm). I'm asking you to verify that the issue I see indeed is an issue and that we want to fix it. Subsequently, I'm interested in hearing your plans or goals for MediaWiki's message system so that I can align my implementation with them. Finally, I am hoping to find someone who is willing to help me fix it.
== The issue ==
On Wikidata, we regularly have content in different languages on the same page. We use the HTML lang and dir attributes accordingly. For example, we have a table with terms for an entity in different languages. For missing terms, we would display a message in the UI language within this table. The corresponding HTML (simplified) might look like this:
<div id="mw-content-text" lang="UILANG" dir="UILANG_DIR">
  <table class="entity-terms">
    <tr class="entity-terms-for-OTHERLANG1" lang="OTHERLANG1" dir="OTHERLANG1_DIR">
      <td class="entity-terms-for-OTHERLANG1-label">
        <div class="wb-empty" lang="UILANG" dir="UILANG_DIR">
          <!-- missing label message -->
        </div>
      </td>
    </tr>
  </table>
</div>
This works great as long as the missing label message is available in the UI language. If that is not the case, though, the message is translated according to the defined language fallbacks. In that case, we might end up with something like this:
<div class="wb-empty" lang="arc" dir="rtl">No label defined</div>
That's obviously wrong, and I'd like to fix it.
== Fixing it ==
To fix this, I tried to make MessageCache provide the language a message was taken from [1]. That's not too straightforward to begin with, but while working on it I realized that MessageCache is only responsible for following the language fallback chain for database translations. For file-based translations, the fallbacks are directly merged in by LocalisationCache, so the information is no longer there at the time of translating a message. I see some ways to fix this:
* Don't merge messages in LocalisationCache, but perform the fallback on request (possibly caching the result)
* Tag message strings in LocalisationCache with the language they are in (sounds expensive to me)
* Tag message strings as being a fallback in LocalisationCache (that way we could follow the fallback until we find a language in which the message string is not tagged as being a fallback)
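To make the difference between these options concrete, here is a minimal Python sketch. It is not MediaWiki code: the names `FALLBACKS`, `MESSAGES`, `merge_plain` and `merge_tagged`, the message key `missing-label`, and the sample data are all invented for illustration. It shows how a plain merge loses the source language, and how tagging each string (the second option) would keep it.

```python
# Hypothetical sketch, not MediaWiki's LocalisationCache.
# All names, keys and data below are invented for illustration.

# Fallback chain per language code.
FALLBACKS = {"arc": ["en"], "en": []}

# File-based translations that actually exist for each language.
MESSAGES = {
    "arc": {},  # the message has no Aramaic translation
    "en": {"missing-label": "No label defined"},
}


def merge_plain(lang):
    """Merge fallbacks into one dict; the source language is lost."""
    merged = {}
    # Apply the last fallback first so that closer languages override it.
    for code in reversed([lang] + FALLBACKS[lang]):
        merged.update(MESSAGES[code])
    return merged


def merge_tagged(lang):
    """Merge fallbacks, but keep (text, source language) for each key."""
    merged = {}
    for code in reversed([lang] + FALLBACKS[lang]):
        for key, text in MESSAGES[code].items():
            merged[key] = (text, code)
    return merged


print(merge_plain("arc")["missing-label"])   # text only, language unknown
print(merge_tagged("arc")["missing-label"])  # text plus the language it is in
```

Under the same assumptions, the third option would store only a boolean (`code != lang`) instead of the language code, and the first option would skip the merge and walk `FALLBACKS` at lookup time.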
What do you think?
[1] https://gerrit.wikimedia.org/r/282133
Thanks,
Hi Adrian,
your diagnosis is completely right. By the way, I have been filing bugs about this kind of mess for a few years. Things have improved only gradually :-(
In my opinion, the message object needs to be able to return a direction and a language code (a BCP 47 code, to be more precise) that reflect the true values for fallback messages etc. Currently I do not see a real use case for the question "Is this message a fallback message?", but I bet someone will find one, so I suggest making that queryable, too.
LocalisationCache should keep "is_fallback" and "has language code X" for messages. Alternatively, for the latter, a pointer to a language object might do as well.
We have no chance of producing correct HTML with mixed languages if we do not even know what language a string is in. For every language string, however, we must either
- check the DOM for the current language, and enclose the message in a proper language wrapper if needed, or
- emit language wrappers unconditionally and have tidy clean them up.
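The first of these two approaches could look roughly like the following Python sketch. It is only an illustration under stated assumptions: `render_message`, `direction` and the `RTL_LANGS` sample set are hypothetical helpers, not MediaWiki APIs, and the surrounding language is passed in explicitly rather than read from a DOM.

```python
from html import escape

# Hypothetical helpers, not MediaWiki APIs.
# Deliberately incomplete sample of right-to-left language codes.
RTL_LANGS = {"ar", "arc", "fa", "he"}


def direction(lang):
    """Return the text direction assumed for a language code."""
    return "rtl" if lang in RTL_LANGS else "ltr"


def render_message(text, message_lang, context_lang):
    """Wrap the text in a lang/dir span only if it differs from the context."""
    if message_lang == context_lang:
        return escape(text)
    return '<span lang="{}" dir="{}">{}</span>'.format(
        escape(message_lang), direction(message_lang), escape(text)
    )


# An English fallback string shown inside an Aramaic context gets a wrapper.
print(render_message("No label defined", "en", "arc"))
```

This only works if the message system can report `message_lang`, which is exactly the information that is currently lost for file-based fallbacks.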
Purodha
On 12.04.2016 13:01, Adrian Heine wrote:
Adrian Heine né Lang SOFTWARE DEVELOPER
Wikimedia Deutschland e.V. | Tempelhofer Ufer 23-24 | 10963 Berlin Phone: +49 (0)30 219 158 26-0 http://wikimedia.de
Imagine a world, in which every single human being can freely share in the sum of all knowledge. That's our commitment.
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V. Registered in the register of associations of the Amtsgericht Berlin-Charlottenburg under number 23855 B. Recognized as a non-profit organization by the Finanzamt für Körperschaften I Berlin, tax number 27/681/51985.
Mediawiki-i18n mailing list Mediawiki-i18n@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/mediawiki-i18n
2016-04-12 14:01 GMT+03:00 Adrian Heine adrian.heine@wikimedia.de:
First of all, thanks for working on this issue. It is a real issue, but not one that is often raised. I think that is because manually checking in every place whether the language code is unexpected (different from the one in the current context) would be cumbersome, and always outputting language codes on every tag would be bloated. Ideally this check would be automated in a templating library, but so far templating has not been widely adopted in MediaWiki core. Of course, this information needs to be exposed first, which is what I understand you are doing.
The current localisation cache implementation quite obviously trades space for speed. In this light I would suggest option two, to tag the actual language the string is in.
However, this trade-off might not make sense anymore, as we have more languages and more messages, resulting in caches of almost a gigabyte. See also for example https://phabricator.wikimedia.org/T99740. I added wikitech-l to CC in hopes that people who have worked on localisation cache more recently would comment on whether option one, to not merge messages, would make more sense nowadays.
-Niklas
Hi everyone,
Niklas brought this message[1] to my attention as something that probably deserves more attention than it has gotten, and I trust he's correct. What he said back in April: "I added wikitech-l to CC in hopes that people who have worked on localisation cache more recently would comment on whether [Adrian's proposed option to not merge messages in LocalisationCache] would make more sense nowadays."
Adrian: assuming Niklas and I are correct, my suggestion for moving this forward would be to turn your design thoughts into an ArchCom-RFC[2] for more explicit consideration by ArchCom. My attempt at abstract for those who haven't followed this:
Adrian was trying to figure out how to output pages that need to have multiple languages in a single page, which becomes difficult when fallbacks are missing. It results in some oddball behavior where placeholder text is output with incorrect i18n attributes in the surrounding div. He provided several alternatives for how to solve the problem in his mail to mediawiki-i18n. Niklas replied, providing the "if we stick with the status quo" answer (tag message strings in LocalisationCache with the correct language), and is now trying to figure out whether the status quo makes the right space vs speed tradeoff given the quantity of languages and messages we have in 2016.
Adrian & Niklas: did I get the gist of it?
Rob
[1]: The April 2016 mediawiki-i18n thread: "Providing the effective language of messages" https://lists.wikimedia.org/pipermail/mediawiki-i18n/2016-April/thread.html#1059
[2]: My unofficial guide on how to turn something into an ArchCom-RFC: https://www.mediawiki.org/wiki/User:RobLa-WMF/ArchCom-RFC
Tagging language and directionality on fallback entries sounds reasonable, and not a huge addition.
I'm not sure why the fallbacks are cached in the main body instead of letting them through to the backing language's cache, but if that's a necessary or useful choice I've no reason to hop in and change it. I also am a bit mystified why we moved from a filesystem-backed database to a giant in-memory cache that requires deserializing the entire language cache on every request (if I'm reading correctly) but apparently that performs better than I think it should, so take me with a grain of salt... :)
I would recommend getting someone from Performance to weigh in on this to confirm the proposed solution has no surprises. Probably best to have at least a preliminary patch which can be run through some paces...
More generally, we should capture the other note in the mediawiki-i18n thread about having the templating system pick up language and directionality and fill out the necessary bits for tagging and isolation on the HTML end.
We may want to think about a class to represent a bit of text or HTML that's tagged with a language and an isolation factor (not sure if necessary), which can be mixed and matched with Message objects and passed into Html class generators or template things. And of course, an equivalent on the JS side.
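As a rough illustration of such a class, here is a small Python sketch. Everything in it is hypothetical: `TaggedText`, its fields and `join_html` are invented names, and a real version would live in PHP and JS next to the Message and Html classes. The `isolate` flag switches to the HTML `bdi` element for bidirectional isolation.

```python
from dataclasses import dataclass
from html import escape


@dataclass(frozen=True)
class TaggedText:
    """Hypothetical value object: a piece of text plus its language info."""
    text: str
    lang: str
    direction: str = "ltr"
    isolate: bool = False

    def to_html(self):
        # Use <bdi> when the fragment should be isolated from surrounding
        # bidirectional text, and a plain <span> otherwise.
        tag = "bdi" if self.isolate else "span"
        return '<{0} lang="{1}" dir="{2}">{3}</{0}>'.format(
            tag, escape(self.lang), self.direction, escape(self.text)
        )


def join_html(parts):
    """Concatenate tagged fragments, each carrying its own attributes."""
    return "".join(part.to_html() for part in parts)


print(join_html([
    TaggedText("No label defined", "en"),
    TaggedText("שלום", "he", direction="rtl", isolate=True),
]))
```

This variant emits the wrappers unconditionally; whether to skip them when the fragment matches the surrounding language is a separate design choice.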
-- brion
_______________________________________________ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Hi everyone,
thanks for your encouraging feedback. I'm leaving Wikimedia Deutschland today, though, and thus the Wikidata team. I won't have time to push this topic further. I'm happy to help, review and answer questions, if somebody else wants to pick this up. Please contact me at mail@adrianheine.de in this case.
Bye,