Re: [Wikidata-l] Wikidata RDF export available

4 Aug 2013

Markus Krötzsch, 04/08/2013 17:35:
...
 Are you sure? The file you linked has mappings from site ids to language
 codes, not from language codes to language codes. Do you mean to say:
 "If you take only the entries of the form 'XXXwiki' in the list, and
 extract a language code from the XXX, then you get a mapping from
 language codes to language codes that covers all exceptions in
 Wikidata"? 
Yes. You said Wikidata just uses the subdomain and the subdomain is 
contained in the database names used by the config. Sorry if I only implied 
the removal of the wik* suffix and the conversion from _ to -.
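
A minimal sketch of the extraction described above, assuming only entries of the form 'XXXwiki' are wanted. The function name site_id_to_subdomain is made up for illustration and is not part of any Wikidata or MediaWiki code. Note that it says nothing about whether the result is a valid language code: 'commonswiki' would happily yield 'commons'.

```python
import re


def site_id_to_subdomain(site_id):
    """Return the subdomain-style code for an 'XXXwiki' site id, else None.

    Hypothetical helper: strips the 'wiki' suffix and turns '_' into '-'.
    """
    # Only plain 'XXXwiki' ids match; 'XXXwiktionary' and friends are skipped.
    match = re.fullmatch(r"(.+)wiki", site_id)
    if match is None:
        return None
    return match.group(1).replace("_", "-")


if __name__ == "__main__":
    for site_id in ("map_bmswiki", "be_x_oldwiki", "enwiktionary"):
        print(site_id, "->", site_id_to_subdomain(site_id))
```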

...
  This approach would give us:

 'als' : 'gsw',
 'bat-smg': 'sgs',
 'be_x_old' : 'be-tarask',
 'crh': 'crh-latn',
 'fiu_vro': 'vro',
 'no' : 'nb',
 'roa-rup': 'rup',
 'zh-classical' : 'lzh',
 'zh-min-nan': 'nan',
 'zh-yue': 'yue'

 Each of the values on the left here also occurs as a language tag in
 Wikidata, so if we map them, we use the same tag for things that
 Wikidata has distinct tags for. For example, Q27 has a label for yue but
 also for zh-yue [1]. It seems to be wrong to export both of these with
 the same language tag if Wikidata uses them for different purposes.

 Maybe this is a bug in Wikidata and we should just not export texts with
 any of the above codes at all (since they always are given by another
 tag directly)? 
Sorry, I don't know why both can appear. I would have said that one is a 
sitelink and the other some value added on wiki with the correct 
language code (entry label?) but my limited json reading skills seem to 
indicate otherwise.
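
To make the collision concrete, here is a rough sketch that flags entities which carry a label under both a mapped tag and its target. The mapping is copied from the list quoted above; find_collisions and the sample labels dict are hypothetical, and the label texts are placeholders rather than real Wikidata values.

```python
# Exception mapping as quoted in this thread (keys are the tags used in
# Wikidata, values are the proposed replacements).
EXCEPTIONS = {
    "als": "gsw",
    "bat-smg": "sgs",
    "be_x_old": "be-tarask",
    "crh": "crh-latn",
    "fiu_vro": "vro",
    "no": "nb",
    "roa-rup": "rup",
    "zh-classical": "lzh",
    "zh-min-nan": "nan",
    "zh-yue": "yue",
}


def find_collisions(labels, mapping=EXCEPTIONS):
    """Return (old_tag, new_tag) pairs where the entity has labels for both.

    Hypothetical helper: 'labels' is assumed to be a dict from language tag
    to label text, which is a simplification of the real JSON.
    """
    return sorted(
        (old, new)
        for old, new in mapping.items()
        if old in labels and new in labels
    )


if __name__ == "__main__":
    # Placeholder labels for an entity that has both 'yue' and 'zh-yue'.
    sample = {"en": "label-en", "yue": "label-yue", "zh-yue": "label-zh-yue"}
    print(find_collisions(sample))
```

Any pair reported this way would end up with two labels under one tag after mapping, which is exactly the case that needs a decision (drop one, or keep the tags distinct).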

...
  [...]

 Well, the obvious: if a language used in Wikidata labels or on Wikimedia
 sites has an official IANA code [2], 
(And all of them are supposed to, except rare exceptions with pre-2006 
wikis.)

...
  then we should use this code. Every
 other code would be "wrong". For languages that do not have any accurate
 code, we should probably use a private code, following the requirements
 of BCP 47 for private use subtags (in particular, they should have a
 single x somewhere).

 This does not seem to be done correctly by my current code. For example,
 we now map 'map_bmswiki' to 'map-bms'. While both 'map' and 'bms' are
 IANA language tags, I am not sure that their combination makes sense.
 The language should be Basa Banyumasan, but bms is for Bilma Kanuri (and
 it is a language code, not a dialect code). Note that map-bms does not
 occur in the file you linked to, so I guess there is some more work to do. 
Indeed, that appears to be one of the exceptions. :) I don't know how it 
should be tracked, you could file a bug in 
MediaWiki>Internationalisation asking to find a proper code for this 
language.
What was unclear to me is why you implied there were many such cases; 
that would surprise me.
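
For spotting such cases, a simplified check for the "single x" requirement could look like the sketch below. It only tests whether a tag contains an 'x' singleton followed by at least one subtag of 1 to 8 alphanumeric characters; it does not validate the part before the 'x' and does not consult the IANA registry. The name has_private_use is made up for this example.

```python
import re

# A private-use subtag is 1 to 8 ASCII letters or digits.
_SUBTAG = re.compile(r"[A-Za-z0-9]{1,8}")


def has_private_use(tag):
    """Return True if the tag has an 'x' singleton followed by subtags.

    Simplified sketch, not a full BCP 47 validator.
    """
    subtags = tag.split("-")
    lowered = [s.lower() for s in subtags]
    if "x" not in lowered:
        return False
    # Everything after the first 'x' must be a well-formed private-use subtag.
    tail = subtags[lowered.index("x") + 1:]
    return bool(tail) and all(_SUBTAG.fullmatch(s) for s in tail)


if __name__ == "__main__":
    for tag in ("be-x-old", "map-bms", "x-foo"):
        print(tag, has_private_use(tag))
```

With this check, 'be-x-old' counts as carrying a private-use part while 'map-bms' does not, so the latter would have to be justified by registered subtags alone.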

Nemo
