Hi everyone,
as some of you might know, I'm a software developer at Wikimedia
Deutschland, working on Wikidata. I'm currently focusing on improving
Wikidata's support for languages we as a team are not using on a daily
basis. As part of my work I stumbled upon a shortcoming in MediaWiki's
message system that, as far as I can see, prevents me from doing the
right thing(tm). I'm asking you to verify that the issue I see is indeed
an issue and that we want to fix it. Beyond that, I'm interested in
hearing your plans or goals for MediaWiki's message system so that I can
align my implementation with them. Finally, I am hoping to find someone
who is willing to help me fix it.
== The issue ==
On Wikidata, we regularly have content in different languages on the
same page. We use the HTML lang and dir attributes accordingly. For
example, we have a table with terms for an entity in different
languages. For missing terms, we would display a message in the UI
language within this table. The corresponding HTML (simplified) might
look like this:
<div id="mw-content-text" lang="UILANG" dir="UILANG_DIR">
  <table class="entity-terms">
    <tr class="entity-terms-for-OTHERLANG1" lang="OTHERLANG1"
        dir="OTHERLANG1_DIR">
      <td class="entity-terms-for-OTHERLANG1-label">
        <div class="wb-empty" lang="UILANG" dir="UILANG_DIR">
          <!-- missing label message -->
        </div>
      </td>
    </tr>
  </table>
</div>
This works great as long as the missing label message is available in
the UI language. If that is not the case, though, the message text is
taken from a language further down the defined fallback chain, while the
lang and dir attributes still describe the UI language. In that case, we
might end up with something like this:
<div class="wb-empty" lang="arc" dir="rtl">No label defined</div>
That's obviously wrong: the text is English, yet it is marked up as
Aramaic (arc) and right-to-left. I'd like to fix it.
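To make the failure mode concrete, here is a rough Python sketch. It is not MediaWiki code: the message key, the dictionaries and the function names are made up for illustration, and I am assuming that arc falls back directly to en. It mimics a cache that merges fallback messages into each language when the cache is built, so the renderer can only label the string with the UI language:

```python
from html import escape

# Hypothetical message store: only English defines the message.
MESSAGES = {
    "en": {"missing-label": "No label defined"},
    "arc": {},
}
# Assumed fallback chains and text directions, for illustration only.
FALLBACKS = {"arc": ["en"], "en": []}
DIRECTIONS = {"en": "ltr", "arc": "rtl"}


def build_merged_cache(lang):
    """Merge fallback messages into `lang`, discarding where each came from."""
    merged = {}
    # Apply the last fallback first so that closer languages override it.
    for code in reversed([lang] + FALLBACKS[lang]):
        merged.update(MESSAGES[code])
    return merged


def render_empty_label(ui_lang):
    """Render the placeholder; the only language known here is the UI language."""
    text = build_merged_cache(ui_lang)["missing-label"]
    return (
        f'<div class="wb-empty" lang="{ui_lang}" '
        f'dir="{DIRECTIONS[ui_lang]}">{escape(text)}</div>'
    )


print(render_empty_label("arc"))
```

With the data above this reproduces the mislabelled element: English text inside a div that claims to be Aramaic and right-to-left.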
== Fixing it ==
To fix this, I tried to make MessageCache provide the language a
message was taken from [1]. That's not too straightforward to begin
with, but while working on it I realized that MessageCache is only
responsible for following the language fallback chain for database
translations. For file-based translations, the fallbacks are directly
merged in by LocalisationCache, so the information is not there anymore
at the time of translating a message. I see some ways to fix this:
* Don't merge messages in LocalisationCache, but perform the fallback on
request (possibly caching the result)
* Tag message strings in LocalisationCache with the language they are in
(sounds expensive to me)
* Tag message strings as being a fallback in LocalisationCache (that way
we could follow the fallback until we find a language in which the
message string is not tagged as being a fallback)
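As a rough illustration of the first and third option, here is a Python sketch. Again, this is not MediaWiki code: the data, the message key and all function names are hypothetical, and the fallback from arc to en is an assumption. Both variants return the message text together with the language it actually came from, which is what the renderer needs to set lang and dir correctly:

```python
from html import escape

# Hypothetical message store: only English defines the message.
MESSAGES = {
    "en": {"missing-label": "No label defined"},
    "arc": {},
}
# Assumed fallback chains and text directions, for illustration only.
FALLBACKS = {"arc": ["en"], "en": []}
DIRECTIONS = {"en": "ltr", "arc": "rtl"}


def get_with_language(key, lang):
    """Option 1: walk the fallback chain on request, return (text, source)."""
    for code in [lang] + FALLBACKS[lang]:
        if key in MESSAGES[code]:
            return MESSAGES[code][key], code
    raise KeyError(key)


def build_tagged_cache(lang):
    """Option 3: merged cache whose entries carry an is_fallback flag."""
    cache = {}
    for code in reversed(FALLBACKS[lang]):
        for key, text in MESSAGES[code].items():
            cache[key] = (text, True)   # inherited from a fallback language
    for key, text in MESSAGES[lang].items():
        cache[key] = (text, False)      # defined in `lang` itself
    return cache


def resolve_tagged(key, lang):
    """Follow the chain until a language holds the message untagged."""
    for code in [lang] + FALLBACKS[lang]:
        entry = build_tagged_cache(code).get(key)
        if entry is not None and not entry[1]:
            return entry[0], code
    raise KeyError(key)


def render_empty_label(ui_lang):
    """Render the placeholder using the language the text really is in."""
    text, source = get_with_language("missing-label", ui_lang)
    return (
        f'<div class="wb-empty" lang="{source}" '
        f'dir="{DIRECTIONS[source]}">{escape(text)}</div>'
    )


print(render_empty_label("arc"))
```

The sketch rebuilds the tagged cache on every lookup only to stay short; a real implementation would of course keep the per-language caches around.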
What do you think?
[1] https://gerrit.wikimedia.org/r/282133
Thanks,
--
Adrian Heine né Lang
SOFTWARE DEVELOPER
Wikimedia Deutschland e.V. | Tempelhofer Ufer 23-24 | 10963 Berlin
Phone: +49 (0)30 219 158 26-0
http://wikimedia.de
Imagine a world, in which every single human being can freely share in
the sum of all knowledge. That's our commitment.
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
(Society for the Promotion of Free Knowledge). Registered in the
register of associations of the Amtsgericht (district court)
Berlin-Charlottenburg under the number 23855 B. Recognized as a
charitable organization by the Finanzamt für Körperschaften I Berlin
(tax office for corporations), tax number 27/681/51985.
-------- Forwarded message --------
Subject: [discovery] Better search results on wiki via TextCat
Date: Tue, 19 Jul 2016 19:42:27 -0600
From: Deborah Tankersley
To: A public mailing list about Wikimedia Search and Discovery projects
<discovery(a)lists.wikimedia.org>
We're happy to announce that after numerous tests and analyses[1] and a
fully operational demo[2], the Discovery Team is ready to release
TextCat[3] into production on wiki.
What is TextCat? It detects the language that the search query was
written in, which allows us to look for results on a different wiki.
TextCat is a language detection library based on n-grams[4]. During a
search, TextCat will only kick in when all three of the following hold:
1. fewer than 3 results are returned from the query on the current
wiki
2. language detection is successful (meaning that TextCat is
reasonably certain what language the query is in, and that it is
different from the language of the current wiki)
3. the other wiki (in the detected language) has results
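The three conditions above can be sketched as a small decision function. This Python sketch is only an illustration, not the production search code: the function name, its parameters and the two callbacks (`detect_language`, `count_results`) are hypothetical stand-ins for the language detector and the search backend.

```python
from typing import Callable, Optional

MIN_RESULTS = 3  # "fewer than 3 results" from the list above


def pick_second_wiki(
    query: str,
    current_lang: str,
    current_result_count: int,
    detect_language: Callable[[str], Optional[str]],
    count_results: Callable[[str, str], int],
) -> Optional[str]:
    """Return the language of the wiki to show extra results from, or None."""
    # 1. The current wiki must have returned fewer than 3 results.
    if current_result_count >= MIN_RESULTS:
        return None
    # 2. Detection must succeed and differ from the current wiki's language.
    detected = detect_language(query)
    if detected is None or detected == current_lang:
        return None
    # 3. The wiki in the detected language must have results for the query.
    if count_results(detected, query) == 0:
        return None
    return detected


if __name__ == "__main__":
    # Stub detector and backend, for demonstration only.
    detect = lambda q: "ru" if q == "первым экспериментом" else None
    count = lambda lang, q: 5 if lang == "ru" else 0
    print(pick_second_wiki("первым экспериментом", "en", 0, detect, count))
```

The checks are ordered so that the cheap test (the result count) runs before detection, and the second search only runs when detection has already succeeded; whether production orders them this way is an assumption.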
Our analysis of the A/B test[5] (for English, French, Spanish, Italian
and German Wikipedias) showed that:
"...The test groups not only had a substantially lower zero results
rate (57% in control group vs 46% in the two test groups), but they
had a higher clickthrough rate (44% in the control group vs 49-50%
in the two test groups), indicating that we may be providing users
with relevant results that they would not have gotten otherwise."
This update will be scheduled for production release during the week of
July 25, 2016 on the following Wikipedias:
* English [6]
* German [7]
* Spanish [8]
* Italian [9]
* French [10]
TextCat will then be added to this next group of Wikipedias at a later
date:
* Portuguese[11]
* Russian[12]
* Japanese[13]
This is a huge step forward in creating a search mechanism that is able
to detect - with a high level of accuracy - the language that was used
and produce results in that language. Looking ahead, we are also
investigating a confidence-measuring algorithm[14] for TextCat, to ensure
that the language detection results are the best they can be.
We will also be doing more[15] A/B tests using TextCat on non-Wikipedia
sites, such as Wikibooks and Wikivoyage. These new tests will give us
insight into whether applying the same language detection configuration
across projects would be helpful.
Please let us know on the TextCat discussion page[16] if you have any
questions or concerns. For screenshots of what this update will look
like, please see this one[17], showing an existing search on enwiki for
the Russian query "первым экспериментом", and this one[18], showing what
it will look like once TextCat is in production on enwiki.
Thanks!
[1] https://phabricator.wikimedia.org/T118278
[2] https://tools.wmflabs.org/textcatdemo/
[3] https://www.mediawiki.org/wiki/TextCat
[4] https://en.wikipedia.org/wiki/N-gram
[5]
https://commons.wikimedia.org/wiki/File:Report_on_Cirrus_Search_TextCat_AB_…
[6] https://en.wikipedia.org/
[7] https://de.wikipedia.org/
[8] https://es.wikipedia.org/
[9] https://it.wikipedia.org/
[10] https://fr.wikipedia.org/
[11] https://pt.wikipedia.org/
[12] https://ru.wikipedia.org/
[13] https://ja.wikipedia.org/
[14] https://phabricator.wikimedia.org/T140289
[15] https://phabricator.wikimedia.org/T140292
[16] https://www.mediawiki.org/wiki/Talk:TextCat
[17] https://commons.wikimedia.org/wiki/File:Existing-search_no-textcat.png
[18] https://commons.wikimedia.org/wiki/File:New-search_with-textcat.png
--
Deb Tankersley
Product Manager, Discovery
IRC: debt
Wikimedia Foundation