Small update: I went through the language list at
https://github.com/mkroetzsch/wda/blob/master/includes/epTurtleFileWriter.py#L472
and added a number of TODOs to the most obvious problematic cases.
Typical problems are:
* Malformed language codes ('tokipona')
* Correctly formed language codes without any official meaning (e.g.,
'cbk-zam')
* Correctly formed codes with the wrong meaning (e.g., 'sr-ec': Serbian
from Ecuador?!)
* Language codes with redundant information (e.g., 'kk-cyrl' should be
the same as 'kk' according to IANA, but we have both)
* Use of macrolanguages instead of languages (e.g., "zh" is not
"Mandarin" but just "Chinese"; I guess we mean Mandarin; less sure about
Kurdish ...)
* Language codes with incomplete information (e.g., "sr" should be
"sr-Cyrl" or "sr-Latn", both of which already exist; same for "zh" and
"zh-Hans"/"zh-Hant", but also for "zh-HK" [is this simplified or
traditional?]).
I invite any language experts to look at the file and add
comments/improvements. Some of the issues should possibly also be
considered on the implementation side: we don't want two distinct codes
for the same thing.
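To illustrate what such a deduplication could look like on the implementation side, here is a minimal sketch of a normalization table applied before export. The entries are only examples drawn from the cases above, not a vetted list:

```python
# Sketch: normalize language codes to a single preferred code before
# export, so that two distinct codes never denote the same thing.
# The table entries below are illustrative examples only.
NORMALIZE = {
    'be-x-old': 'be-tarask',   # MediaWiki message code for the same language
    'zh-classical': 'lzh',
    'zh-yue': 'yue',
    'kk-cyrl': 'kk',           # redundant per IANA (see the list above)
}

def normalize_code(code):
    """Return the preferred code; leave unknown codes untouched."""
    return NORMALIZE.get(code, code)
```

Anything not in the table passes through unchanged, so adding entries is safe and incremental.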
Cheers,
Markus
On 04/08/13 16:35, Markus Krötzsch wrote:
> On 04/08/13 13:17, Federico Leva (Nemo) wrote:
>> Markus Krötzsch, 04/08/2013 12:32:
>>> * Wikidata uses "be-x-old" as a code, but MediaWiki messages for this
>>> language seem to use "be-tarask" as a language code. So there must be a
>>> mapping somewhere. Where?
>>
>> Where I linked it.
>
> Are you sure? The file you linked has mappings from site ids to language
> codes, not from language codes to language codes. Do you mean to say:
> "If you take only the entries of the form 'XXXwiki' in the list, and
> extract a language code from the XXX, then you get a mapping from
> language codes to language codes that covers all exceptions in
> Wikidata"? This approach would give us:
>
> 'als' : 'gsw',
> 'bat-smg': 'sgs',
> 'be_x_old' : 'be-tarask',
> 'crh': 'crh-latn',
> 'fiu_vro': 'vro',
> 'no' : 'nb',
> 'roa-rup': 'rup',
> 'zh-classical' : 'lzh',
> 'zh-min-nan': 'nan',
> 'zh-yue': 'yue'
>
> Each of the codes on the left here also occurs as a language tag in
> Wikidata, so if we map them, we would use the same tag for things that
> Wikidata keeps distinct. For example, Q27 has a label for yue but
> also for zh-yue [1]. It seems wrong to export both of these with the
> same language tag if Wikidata uses them for different purposes.
>
> Maybe this is a bug in Wikidata and we should simply not export texts
> with any of the above codes at all (since the same texts are always
> available under another tag anyway)?
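For what it is worth, the extraction step described above could be sketched as follows. The sample data is illustrative; I am assuming a dict like the siteLanguageCodes variable mentioned later, mapping site ids to language codes:

```python
# Sketch: derive a code-to-code mapping from site-id entries of the
# form 'XXXwiki', as described above. The sample dict is illustrative.
siteLanguageCodes = {
    'be_x_oldwiki': 'be-tarask',
    'zh_yuewiki': 'yue',
    'enwiki': 'en',
}

def site_code_mapping(site_language_codes):
    """Collect cases where the code embedded in the site id differs
    from the language code the site actually uses."""
    mapping = {}
    for site_id, lang in site_language_codes.items():
        if site_id.endswith('wiki'):
            # site ids use '_' where language codes use '-'
            code = site_id[:-len('wiki')].replace('_', '-')
            if code != lang:
                mapping[code] = lang
    return mapping
```

Applied to the full site list, this should reproduce exactly the exception table quoted above.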
>
>>
>>> * MediaWiki's
>>> http://www.mediawiki.org/wiki/Manual:$wgDummyLanguageCodes
>>> provides some mappings. For example, it maps "zh-yue" to "yue". Yet,
>>> Wikidata uses both of these codes. What does this mean?
>>>
>>> Answers to Nemo's points inline:
>>>
>>> On 04/08/13 06:15, Federico Leva (Nemo) wrote:
>>>> Markus Krötzsch, 03/08/2013 15:48:
>
> ...
>
>>>> Apart from the above, doesn't wgLanguageCode in
>>>>
>>>> https://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php
>>>>
>>>> have what you need?
>>>
>>> Interesting. However, the list there does not contain all 300 sites
>>> that we currently find in Wikidata dumps (and it contains some that
>>> we do not find there, including entries like dkwiki that seem to be
>>> outdated). The full list of sites we support is also found in the
>>> file I mentioned above, just after the language list (variable
>>> siteLanguageCodes).
>>
>> Of course not all wikis are there, that configuration is needed only
>> when the subdomain is "wrong". It's still not clear to me what codes you
>> are considering wrong.
>
> Well, the obvious: if a language used in Wikidata labels or on Wikimedia
> sites has an official IANA code [2], then we should use this code. Every
> other code would be "wrong". For languages that do not have any accurate
> code, we should probably use a private-use code, following the
> requirements of BCP 47 for private-use subtags (in particular, the tag
> must contain the singleton subtag 'x').
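A rough check for that requirement might look like this. This is a deliberately simplified test for the private-use 'x' singleton, not a full BCP 47 validator:

```python
import re

def has_private_use(tag):
    """True if the tag contains the BCP 47 private-use singleton 'x'
    as a subtag of its own (case-insensitive)."""
    return 'x' in tag.lower().split('-')

def is_wellformed_private(tag):
    """Rough well-formedness check for tags like 'x-tokipona' or
    'qaa-x-foo': an 'x' singleton followed by at least one subtag of
    1-8 alphanumeric characters. Simplified, not full BCP 47."""
    return bool(re.fullmatch(r'([a-z0-9]{1,8}-)*x(-[a-z0-9]{1,8})+',
                             tag.lower()))
```

So a code like 'tokipona' fails both checks, while 'x-tokipona' would be an acceptable private-use form.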
>
> This does not seem to be done correctly by my current code. For example,
> we now map 'map_bmswiki' to 'map-bms'. While both 'map' and 'bms' are
> IANA language subtags, I am not sure that their combination makes sense.
> The language should be Basa Banyumasan, but 'bms' stands for Bilma
> Kanuri (and it is a language subtag, not a dialect code). Note that
> map-bms does not occur in the file you linked to, so I guess there is
> some more work to do.
>
> Markus
>
> [1]
> http://www.wikidata.org/wiki/Special:Export/Q27
> [2]
>
> http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
>
>
>
_______________________________________________
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l