The current spec of the data model states that an L-Item has a lemma, a language, and several forms, and the forms in turn have representations.
https://www.mediawiki.org/wiki/Extension:WikibaseLexeme/Data_Model
The language is a Q-Item, the lemma and the representations are Multilingual Texts. Multilingual texts are sets of pairs of strings and UserLanguageCodes.
My question is about the relation between representing a language as a Q-Item and as a UserLanguageCode.
A previous proposal treated lemmas and representations as raw strings, with the language pointing to the Q-Item being the only language information. This is now gone, and the lemma and representation carry their own language information.
How do they interact? The set of languages referenceable through Q-Items is much larger than the set of languages with a UserLanguageCode, and indeed, the intention was to allow every language to be representable in Wikidata, not only those with a UserLanguageCode.
I sense quite a problem here.
I see two possible ways to resolve this:
- return to the original model and use strings instead of Multilingual texts (with all the negative implications for variants)
- use Q-Items instead of UserLanguageCodes for Multilingual texts (which would be quite a migration)
I don't think restricting Wiktionary4Wikidata support to the list of languages with a UserLanguageCode is a viable solution, which would happen if we implement the data model as currently suggested, if I understand it correctly.
Cheers, Denny
An example using the second suggestion:
Suppose I would like to query all L-items that contain a certain combination of letters, then narrow the results by fetching the Q-items of their languages and keeping only those languages that have Latin influences.
In my imagination this would work better using the second suggestion. Also, the flexibility of "what is a language" and "what is a dialect" would seem easier to handle if we could attach statements to the UserLanguageCode or to the Q-item of the language. A sketch of such a query follows below.
-Tobias
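As a rough illustration of the query Tobias describes - assuming some future lexeme mapping in the Wikidata Query Service, which did not exist at the time - here is a sketch in Python. The predicates wikibase:lemma and wikibase:lexemeLanguage are hypothetical placeholders, and P737 ("influenced by") pointing at Latin (Q397) stands in for however "Latin influences" would actually be modelled:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical lexeme query: L-items whose lemma contains "tion",
# restricted to languages influenced by Latin (Q397 via P737).
# wd:/wdt:/wikibase: prefixes are predefined on the WDQS endpoint.
QUERY = """
SELECT ?lexeme ?lemma ?language WHERE {
  ?lexeme wikibase:lemma ?lemma ;             # hypothetical predicate
          wikibase:lexemeLanguage ?language . # hypothetical predicate
  FILTER(CONTAINS(STR(?lemma), "tion"))       # "a combination of letters"
  ?language wdt:P737 wd:Q397 .                # influenced by: Latin
}
LIMIT 100
"""

sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["lexeme"]["value"], row["lemma"]["value"])
```

The point of the sketch: the language filter only works if the language is an item one can attach statements to, which is exactly Tobias' argument.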
Tobias' comment made me realize that I did not clarify one very important distinction: there are two kinds of places where a "language" is needed in the Lexeme data model https://www.mediawiki.org/wiki/Extension:WikibaseLexeme/Data_Model:
1) the "lexeme language". This can be any item, with or without a language code. This is what Tobias would have to use in his query.
2) the language codes used in the MultilingualTextValues (lemma, representation, and gloss). This is where my "hybrid" approach comes in: use a standard language code augmented by an item ID to identify the variant.
To make it easy to create new Lexemes, the lexeme language can serve as a default for lemma, representation, and gloss - but only if it has a language code. If it does not have one, the user will have to specify one for use in MultilingualTextValues.
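A minimal sketch of how this hybrid identifier and the default behaviour could be modelled, assuming the "code+item" serialization from this thread; the class and function names are hypothetical, not the actual Wikibase implementation:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class VariantLanguage:
    """A standard language code plus an optional item ID identifying
    the exact variant, e.g. "de" + "Q1980305"."""
    code: Optional[str]     # standard language code, or None if there is none
    item_id: Optional[str]  # Wikidata item of the (variant) language

    def internal(self) -> str:
        # Internal serialization as discussed in this thread: "de+Q1980305",
        # or "mis+Q7654321" for a language without its own code.
        base = self.code or "mis"
        return f"{base}+{self.item_id}" if self.item_id else base

def default_text_language(lexeme_language: VariantLanguage) -> Optional[VariantLanguage]:
    """The lexeme language serves as the default for lemma, representation
    and gloss - but only if it actually has a language code."""
    return lexeme_language if lexeme_language.code else None

print(VariantLanguage("de", "Q1980305").internal())  # de+Q1980305
print(VariantLanguage(None, "Q7654321").internal())  # mis+Q7654321
```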
So assume we enter a new Lexeme in Examplarian (which has a Q-Item), but Examplarian has no language code for whatever reason. What language code would they enter in the MultilingualTextValue?
On 10.04.2017 at 18:12, Denny Vrandečić wrote:
So assume we enter a new Lexeme in Examplarian (which has a Q-Item), but Examplarian has no language code for whatever reason. What language code would they enter in the MultilingualTextValue?
My plan is: it will be "mis+Q7654321" internally, which will be exposed in HTML and RDF as "mis".
We will want to distinguish "a known language not on this list (mis)" from "an unknown language (und)" and "translingual" (Wiktionary uses "mul" for translingual, but that's not technically correct).
Hi!
We will want to distinguish "a known language not on this list (mis)" from "an unknown language (und)" and "translingual" (Wiktionary uses "mul" for translingual, but that's not technically correct).
I think "mul" is for "text in more than one language" and there's also "zxx" is for "text that is not defined as being in any language at all".
BTW, BCP 47 also says:
The 'mis' (Uncoded) primary language subtag identifies content whose language is known but that does not currently have a corresponding subtag. This subtag SHOULD NOT be used. Because the addition of other codes in the future can render its application invalid, it is inherently unstable and hence incompatible with the stability goals of BCP 47. It is always preferable to use other subtags: either 'und' or (with prior agreement) private use subtags.
So maybe using und would be a good idea.
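To keep the special-purpose codes from this exchange straight, here is a toy helper summarizing them per ISO 639 / BCP 47; which code Wikidata should actually use for code-less languages is exactly the open question here:

```python
def special_language_code(known: bool, multiple: bool, linguistic: bool) -> str:
    """Pick an ISO 639 special code, per the discussion above."""
    if not linguistic:
        return "zxx"  # no linguistic content at all
    if multiple:
        return "mul"  # text in more than one language
    if known:
        return "mis"  # known language lacking its own code (BCP 47 discourages this)
    return "und"      # undetermined language
```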
Hoi, The standard for the identification of a language should suffice. As long as we follow the standard and insist on identification in this manner, it is always possible to provide an identification. When you insist on an item ID, that item ID needs to have a language code, and this language code must never change.
Without this there is no interoperability. Thanks, GerardM
On 10.04.2017 at 18:56, Gerard Meijssen wrote:
Hoi, The standard for the identification of a language should suffice.
I know no standard that would be sufficient for our use case.
For instance, we not only need identifiers for German, Swiss German, and Austrian German. We also need identifiers for German German before and after the spelling reform of 1901, and before and after the spelling reform of 1996. We will also need identifiers for the "language" of mathematical notation. And for various variants of ancient languages: not just Sumerian, but Sumerian from different regions and periods.
The only system I know that gives us that flexibility is Wikidata. For interoperability, we should provide a standard language code (aka subtag). But a language code alone is not going to be sufficient to distinguish the different variants we will need.
Daniel, I agree, but isn't that what Multilingual Text requires? A language code?
I.e. how does the current model plan to solve that?
I assume most of it is hidden behind mini-wizards like "Create a new lexeme", which make sure the multitext language and the language property are set consistently. In that case I can see this working.
On 10.04.2017 at 19:24, Denny Vrandečić wrote:
Daniel, I agree, but isn't that what Multilingual Text requires? A language code?
Yes. Well, internally, it just has to be *some* unique code. But for interoperability, we want it to be a standard code. So I propose to internally use something like "de+Q1980305", and expose that as "de" externally. This allows us to distinguish however many variants of German we want internally, and tag them all as "de" in HTML and RDF, so standard tools can use the language information.
I assume most of it is hidden behind mini-wizards like "Create a new lexeme", which make sure the multitext language and the language property are set consistently. In that case I can see this working.
Yes, that is exactly the plan for the NewLexeme page.
We'll still have to come up with a nifty UI for "add a lemma, select a language, and optionally an item identifying a variant of that language".
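A minimal sketch of the internal/external mapping Daniel describes, assuming the "+" separator from this thread; rdflib is used only to show that the stripped tag works with standard RDF tooling:

```python
from rdflib import Literal

def external_tag(internal_code: str) -> str:
    """Drop the item-ID suffix for HTML/RDF output: "de+Q1980305" -> "de"."""
    return internal_code.split("+", 1)[0]

# Standard tools only ever see the plain language tag:
lemma = Literal("Haus", lang=external_tag("de+Q1980305"))
print(lemma.language)                # de
print(external_tag("mis+Q7654321"))  # mis
```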
Hoi, The standard is flexible. It allows you to add user-defined parts. It allows for languages that have no recognised language code. The point is that the solution for external parties cannot be found in Wikidata itself. We have to use the standards if we want interoperability. We need interoperability, and we need to define what it is that is expressed. Once we decide that a specific expression of language is in use, we stick with that definition. It can only be deprecated if that is what people want. Thanks, GerardM
Hi!
For instance, we not only need identifiers for German, Swiss German, and Austrian German. We also need identifiers for German German before and after the spelling reform of 1901, and before and after the spelling reform of 1996. We will also
Theoretically, BCP 47 should be able to handle this? E.g. they have sl-IT-rozaj-biske-1994 as an example. But we probably shouldn't try to construct these tags ourselves, but instead let editors specify them.
need identifiers for the "language" of mathematical notation. And for various
That's where using zxx-math or something like that would be useful? Or we could omit any tags from those at all.
The only system I know that gives us that flexibility is Wikidata. For interoperability, we should provide a standard language code (aka subtag). But a language code alone is not going to be sufficient to distinguish the different variants we will need.
I think it can be, using BCP 47 extensions, but Wikidata team should not be taking care of it - instead, Wikidata editors should do it by assigning language tag properties to specific Wikidata items.
This plan sounds great! Thank you!
A question about the tags used: would it be possible, instead of having "mis+Q7654321" internally and "mis" externally, to use a private use subtag [1] like "mis-x-Q7654321" or "de-x-Q1980305" (or maybe "mis-x-wd-Q7654321" and "de-x-wd-Q1980305") that would be used both internally and externally? It has the advantage of being a valid BCP 47 tag and of allowing RDF users to extract the exact language (and not only the much less informative "mis"). A variant would be to use "x-Q7654321" instead of "mis-x-Q7654321" to avoid the "mis" tag entirely.
Another possible way to go: just store the Q-id internally and retrieve the language tag from the item to build something like de-x-wd-Q1980305 when generating the output (or maybe just "de" if the output user does not want custom extensions). A sketch of this tag construction follows below the reference.
Thomas
[1] https://tools.ietf.org/html/bcp47#section-2.2.7
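A sketch of Thomas' private-use variant, under the assumption that the item ID is carried as a private-use subtag. Note that BCP 47 limits each subtag to 8 alphanumeric characters, so "Q1980305" (Q plus seven digits) just fits, while items with longer IDs would need splitting:

```python
import re

# Simplified shape check: a primary language subtag plus optional
# private-use subtags (ignores script/region/variant subtags for brevity).
TAG_SHAPE = re.compile(r"^[a-zA-Z]{2,3}(-x(-[a-zA-Z0-9]{1,8})+)?$")

def tag_with_item(code: str, item_id: str) -> str:
    """Build a BCP 47 private-use tag like "de-x-wd-Q1980305"."""
    tag = f"{code}-x-wd-{item_id}"
    if not TAG_SHAPE.match(tag):
        raise ValueError(f"not a valid private-use tag: {tag}")
    return tag

print(tag_with_item("de", "Q1980305"))   # de-x-wd-Q1980305
print(tag_with_item("mis", "Q7654321"))  # mis-x-wd-Q7654321
```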
On 11 Apr 2017 at 03:01, Stas Malyshev smalyshev@wikimedia.org wrote:
That's where using zxx-math or something like that would be useful? Or we could omit any tags from those at all.
There is the Zmth script subtag for that.
Hi!
- use Q-Items instead of UserLanguageCodes for Multilingual texts (which would be quite a migration)
I foresee that might be a bit of a problem for external tools consuming this data - how would they figure out what language it is if it doesn't have a code? We could of course generate fake codes like mis-x-q12345; maybe that would work.
I don't think restricting Wiktionary4Wikidata support to the list of languages with a UserLanguageCode is a viable solution, which would happen if we implement the data model as currently suggested, if I understand it correctly.
Aren't we limiting it right now this way in Wikidata?
On Thu, Apr 6, 2017, 16:16 Stas Malyshev smalyshev@wikimedia.org wrote:
Hi!
- use Q-Items instead of UserLanguageCodes for Multilingual texts (which would be quite a migration)
I foresee that might be a bit of a problem for external tools consuming this data - how would they figure out what language it is if it doesn't have a code? We could of course generate fake codes like mis-x-q12345; maybe that would work.
Q-items for languages already have a property to state their language code. It's just an extra hop away.
I don't think restricting Wiktionary4Wikidata support to the list of languages with a UserLanguageCode is a viable solution, which would happen if we implement the data model as currently suggested, if I understand it correctly.
Aren't we limiting it right now this way in Wikidata?
For labels and descriptions of items yes, and I think that was sensible. It might be time to revisit that decision though.
But for supporting Wiktionary that would be extremely limiting. French Wiktionary supports words in more than a thousand languages currently. Limiting the supported languages of the lemmas is, IMHO, unacceptable.
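The "extra hop" Denny mentions could look roughly like this for a consumer, using the real wbgetentities API. P305 (IETF language tag) and the ISO 639 properties P218/P220 are the obvious candidate properties, though which one to prefer is an assumption here:

```python
from typing import Optional
import requests

def language_code_of(item_id: str) -> Optional[str]:
    """Resolve a language item's code via its statements - one extra hop."""
    data = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={"action": "wbgetentities", "ids": item_id,
                "props": "claims", "format": "json"},
    ).json()
    claims = data["entities"][item_id].get("claims", {})
    for prop in ("P305", "P218", "P220"):  # IETF tag, ISO 639-1, ISO 639-3
        if prop in claims:
            return claims[prop][0]["mainsnak"]["datavalue"]["value"]
    return None  # a language without any code - the problematic case

print(language_code_of("Q188"))  # Q188 (German) -> "de"
```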
Hoi, There are many valid possibilities for describing something that is not a language, and the language used may be one that does not have a language code. There is a standard for indicating languages; it allows for something like "US-American Spanish" by combining a country code and a language code. This is well known.
The problem with everything that has not been recognised / standardised / defined as a language is that it is highly political. The practical side is that we can use an x in a code to indicate a special use. However, then calling it a language is problematic, because being a language ought to imply that its understanding is mutually exclusive.
Calling it a language code and using "expressed in" would imho work for any form of language. When the Wiktionaries' content is imported into Wikidata, we first have to agree on these language codes. Importing the bulk first is no problem. It puts pressure on the resolution of such issues, and that is not half bad. Thanks, GerardM
On 07.04.2017 at 01:34, Denny Vrandečić wrote:
I foresee that might be a bit of a problem for external tools consuming this data - how would they figure out what language it is if it doesn't have a code? We could of course generate fake codes like mis-x-q12345; maybe that would work.
Q-items for languages already have a property to state their language code. It's just an extra hop away.
We want ISO codes (or rather, IANA language subtags [1]), so we can use them in HTML lang attributes, and in RDF literals. This allows interoperability with standard tools.
For this reason, I also favor a mixed approach that allows standard language tags to be used whenever possible. I have some ideas on how that could work, but no definite plan yet.
Something like de+Q1980305 could work; when generating HTML or RDF, we'd just drop the suffix. For translingual entries (e.g. for the number symbol i), we could use e.g. mis+Q1140046.
[1] https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
Hi!
Something like de+Q1980305 could work; when generating HTML or RDF, we'd just drop the suffix. For translingual entries (e.g. for the number symbol i), we could use e.g. mis+Q1140046.
I think for those that are not in a particular language, und or zxx could be better. mis, as I read it, is for "this is in a specific language, but we don't have a code for it". See https://en.wikipedia.org/wiki/ISO_639
I tried to see how the ISO codes and IANA language subtags compare with Glottolog's 8,444 entries under languages ( http://glottolog.org/glottolog/language) and Ethnologue's 7,099 living languages (https://www.ethnologue.com/), but couldn't find any comparisons or comparative lists.
Will it be possible with these new developments in Wikidata to query for these possibilities, and leave the options open for a growing list of languages, as well as a universal translator?
And how will invented languages be added, such as Krell, Elvish and Klingon (and even other species' languages in emergent interspecies communications), possibly per OpenNMT (Neural Machine Translation) - http://opennmt.net/ (and possibly GNMT); see also Peter Norvig's recent article regarding OpenNMT and invented languages - https://medium.com/@peternorvig/last-tweets-of-the-krell-82b8cb74c320 (and per http://scott-macleod.blogspot.com/2017/04/falco-peregrinus-smartphone-that-could.html ).
Scott
Scott,
I assume you realized that the article by Norvig you cited was rather intentionally published on April 1st.
Cheers, Denny
Denny,
Yes, yet the timing is good for these great developments you're making with languages in Wikidata4Wiktionary.
Cheers, Scott
Hi!
Q-items for languages already have a property to state their language code. It's just an extra hop away.
Right, but what if there's nothing there? You're saying we have more languages than codes, so it's inevitable some of them won't have codes?
Personally I would prefer a mixed approach, where there is a list of authorized top-level items and we verify that the item used is a subclass of one of those items. Whether those constraints are hard-enforced or just supervised could be a topic of discussion, but IMHO the more automated, the better.
Regarding the codes, a code can be generated from the code of the top-level item plus the Q number of the item used. If someone wants to use only one part or the other, it should be quite easy to strip. A sketch of such a check follows below.
Cheers, David
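A sketch of the constraint check David suggests, assuming "language" (Q34770) as one authorized root; which roots the community would actually authorize is an open question:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

def is_authorized_language(item_id: str, root: str = "Q34770") -> bool:
    """ASK whether the item is an instance of some subclass of the root."""
    sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
    sparql.setQuery(f"ASK {{ wd:{item_id} wdt:P31/wdt:P279* wd:{root} . }}")
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()["boolean"]

print(is_authorized_language("Q188"))  # Q188 (German) -> True
```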