ISO-639 + Glossaries / vocabulary lists / thematical lists


Sabine Cretella

4 Sep 2004

5:19 p.m.

Hi Gerard and all of you,

thinking about the code I was just considering some points.

What I noted on the page you gave me for the Italian version of the ISO code is that you use a mixed scheme for language identifiers: the two-letter code and, where there is no two-letter code, the three-letter code. Is this correct? I also noted that not all languages are present in the ISO three-letter code, so they are standardised, but not completely. This would obviously lead to a Wiktionary standard of its own.

I am asking because I thought about compiling a list of the language codes used for Wiktionary and then adding the translations of the language names, asking friends and colleagues to complete the list. Normally in the translation world the two-letter code is used.

I'll then add the list to my SourceForge project (wsi-glossary: http://sourceforge.net/projects/wsi-glossary/). You can see who is contributing additions to the lists right now here: http://wiki.wesolveitnet.com/wakka.php?wakka=WsiGlossaryContributors.

I should change the licensing to GNU FDL (mine up to now was the same as the one used for the OmegaT manual). I have to check whether this is possible without problems on SourceForge. I am too new to OpenContent to know all about this, so it is another thing to be done immediately.

If you are working on a multilanguage list, e.g. of trees, birds, vegetables etc., please seriously consider having these lists completed by other people as well, and make them available somewhere for download or just integrate them into wsi-glossary. Certain kinds of work can even be done by schools in language lessons: e.g. the Italian Thesaurus for OpenOffice.org was created with the help of a school where the teachers were the team leaders and during the classes the pupils did something that made "sense" to them. Having them work directly in Wiktionary online is impossible for most schools, as the computers don't have Internet access (or only a few of them do), so working on tables is much easier.

If you prefer not to hand out the list, give out single terms or groups of terms like this:

I need these term(s) house cat mouse etc.

in the following languages: German French Italian etc.

I can then either publish these parts on my portal or send the request to different lists of translators, so step by step it is possible to complete and improve the lists.
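As a rough illustration of the "working on tables" idea above, a request for terms in several languages could be handed out as a blank table that contributors fill in offline. The sketch below is only an assumption about what such a table might look like; the function name `request_table` and the CSV layout are hypothetical and not part of wsi-glossary.

```python
import csv
import io


def request_table(terms, languages):
    """Build a blank CSV table: one row per term, one empty column per language."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    # Header row: the source term followed by the requested languages.
    writer.writerow(["term"] + list(languages))
    for term in terms:
        # Empty cells are left for the translators to fill in.
        writer.writerow([term] + [""] * len(languages))
    return buf.getvalue()


# Example request, using the terms and languages named in the message.
table = request_table(["house", "cat", "mouse"], ["German", "French", "Italian"])
print(table)
```

A table like this opens in any spreadsheet program, so it can be filled in on computers without Internet access and merged back later.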

Best wishes from Italy,

Sabine

-- Sabine Cretella s.cretella@wordsandmore.it www.wordsandmore.it Meetingplace for translators www.wesolveitnet.com


Gerard Meijssen

4 Sep

5:45 p.m.


Sabine Cretella wrote:

...

Wikimedia does use two letter ISO 639 codes and when they do not exist they do use the three letter codes. There are missing ISO codes. There are also the SIL codes but personally I think mixing these three codes makes a mess. Preferably ISO adds missing codes for languages.

For cooperation to work best, things like XML can be considered. GEMET uses it, and they have people knowledgeable about thesauri, XML and open content.

When you have an application that can import and export XML data, you can work offline locally and export the data at the end of the day. A start might be to take the Italian OpenOffice list, add definitions in Italian, export it and share it with the world. An even better start might be words in another wikipedia with an Italian translation; the translations TO Italian are then already known.

The most important thing is to prevent double work and the continued checking of the stuff that is available. Start with producing definitions for all the Languages. Many translations are available on the nl:wiktionary. The articles can be copied to it:wiktionary just add content to some templates. They do need checking as well... :)

Thanks, Gerard

Sabine Cretella

6:16 p.m.


Hi Gerard,

...

Wikimedia does use two letter ISO 639 codes and when they do not exist they do use the three letter codes. There are missing ISO codes. There are also the SIL codes but personally I think mixing these three codes makes a mess. Preferably ISO adds missing codes for languages.

I completely agree to this.

...

For cooperation to work best, things like XML can be considered. GEMET uses it, and they have people knowledgeable about thesauri, XML and open content.

XML is one of the best solutions, as the standards for CAT (computer-aided translation) software are TMX for translation memories and TBX for glossaries, and these are nothing other than well-defined XML formats.

...

When you have an application that can import and export XML data, you can work offline locally and export the data at the end of the day. A start might be to take the Italian OpenOffice list, add definitions in Italian, export it and share it with the world. An even better start might be words in another wikipedia with an Italian translation; the translations TO Italian are then already known.

Hmmm ... the data needs to be multilingual, doesn't it? Or at least it must be identified by language tags. To edit bilingual data we could use OmegaT (which adds language tags to the TMX 1.1 file); then we could use the TMX file to import data. When new terms are added to a list, using the old TMX file means the existing terms are automatically shown as translated, so the translator only needs to translate the missing part, which is stored in a new translation memory file in TMX format. OmegaT is Java-based, therefore platform independent, and Open Source ... the files created for single "words" could, on the other hand, then be used as glossary files. Maybe we could try this out with the translations of the ISO languages table.
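To make the workflow described above more concrete, here is a minimal sketch of a language-tagged translation memory and of finding the terms that still need translating. It is deliberately simplified: the element names (`tmx`, `header`, `body`, `tu`, `tuv`, `seg`) and the `lang` attribute follow TMX as commonly described, but the header is incomplete and the output is not validated against the TMX specification. The helper names `build_tmx` and `untranslated` are hypothetical and say nothing about how OmegaT works internally.

```python
import xml.etree.ElementTree as ET


def build_tmx(pairs, src_lang, tgt_lang):
    """Build a simplified TMX-like tree from (source, target) term pairs."""
    tmx = ET.Element("tmx", version="1.1")
    # A real TMX header needs more attributes; only srclang is set here.
    ET.SubElement(tmx, "header", srclang=src_lang)
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per term
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", lang=lang)  # language tag
            ET.SubElement(tuv, "seg").text = text
    return tmx


def untranslated(terms, tmx, src_lang):
    """Return the terms that have no entry in the existing memory."""
    known = {
        tuv.findtext("seg")
        for tuv in tmx.iter("tuv")
        if tuv.get("lang") == src_lang
    }
    return [term for term in terms if term not in known]


# Old memory: English language names already translated into Italian.
memory = build_tmx([("German", "tedesco"), ("French", "francese")], "EN", "IT")
# A new name is added to the list; only that one is left for the translator.
print(untranslated(["German", "French", "Hawaiian"], memory, "EN"))
print(ET.tostring(memory, encoding="unicode"))
```

The point of the sketch is only that language tags on each variant make it possible to reuse an old file and hand the translator nothing but the missing part.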

...

The most important thing is to prevent double work and the continued checking of the stuff that is available. Start with producing definitions for all the Languages. Many translations are available on the nl:wiktionary. The articles can be copied to it:wiktionary just add content to some templates. They do need checking as well... :)

I'll do that right now - I have just created a table with the codes and am now inserting the missing English names.

Ciao, Sabine

...

Andrew Dunbar

5 Sep

10 a.m.


--- Gerard Meijssen gerardm@myrealbox.com wrote:

...

Sabine Cretella wrote:

...
Hi Gerard and all of you,

thinking about the code I was just considering some points.

What I noted on the page you gave me for the Italian version of the ISO-code is that you use a mixed version for language identifiers - the two letter code and where there's no two letter code the three letter code - is this correct? I also noted that not all languages are present in the ISO-3-letter-code - so they are standardised, but not completely. This would obviously lead to an own wiktionary standard.

Wikimedia does use two letter ISO 639 codes and when they do not exist they do use the three letter codes. There are missing ISO codes. There are also the SIL codes but personally I think mixing these three codes makes a mess. Preferably ISO adds missing codes for languages.

There are omissions, mergers, splits, and other differences between ISO and SIL. ISO is more likely to include artificial languages. SIL is more likely to include very rare and obscure human languages. Neither includes Klingon yet. ISO tends to merge many languages together - every Australian aboriginal language is squashed together into a single code. There is much disagreement about what is a language and what is a dialect. ISO usually takes a more political definition. SIL takes a more linguistic definition. ISO includes Norwegian Bokmaal, Norwegian Nynorsk, and just plain old Norwegian! Neither takes script differences into account. Serbian has one code whether it is in Cyrillic or Latin. Punjabi has one code whether it is in Gurmukhi, Shahmukhi, or Devanagari.

I think a very flexible approach would be:

2-letter ISO if it exists: "en"

then 3-letter ISO if it exists: "haw"

then SIL with a prefix if it exists: "sil-PJT" or "sil:PJT"

then make up something temporary if we have to: "Klingon"
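The fallback order above can be sketched as a small function. This is only an illustration under stated assumptions: the function name `pick_language_code` and its parameters are hypothetical, the "sil-" prefix is one of the two spellings suggested above, and the caller is assumed to already know which codes exist for a language.

```python
from typing import Optional


def pick_language_code(iso2: Optional[str] = None,
                       iso3: Optional[str] = None,
                       sil: Optional[str] = None,
                       temporary: Optional[str] = None) -> str:
    """Return the first identifier available, in the order proposed above."""
    if iso2:
        return iso2.lower()            # 2-letter ISO 639 code, e.g. "en"
    if iso3:
        return iso3.lower()            # 3-letter ISO 639 code, e.g. "haw"
    if sil:
        return "sil-" + sil.upper()    # SIL code with a prefix, e.g. "sil-PJT"
    if temporary:
        return temporary               # made-up temporary name
    raise ValueError("no identifier available for this language")


print(pick_language_code(iso2="en", iso3="eng"))
print(pick_language_code(iso3="haw"))
print(pick_language_code(sil="PJT"))
print(pick_language_code(temporary="Klingon"))
```

Script differences are not handled here, in line with the open question in the message.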

I'm not sure what to do about different scripts.

Apologies about not staying 100% on-topic.

Andrew (hippietrail)

===== http://linguaphile.sf.net/cgi-bin/translator.pl http://www.abisource.com


Muke Tever

1:15 p.m.


On Sun, 5 Sep 2004 03:00:00 +0100 (BST), Andrew Dunbar hippietrail@yahoo.com wrote:

...

...
Wikimedia does use two letter ISO 639 codes and when they do not exist they do use the three letter codes. There are missing ISO codes. There are also the SIL codes but personally I think mixing these three codes makes a mess. Preferably ISO adds missing codes for languages.

There are omissions, mergers, splits, and other differences between ISO and SIL. ISO is more likely to include artificial languages. SIL is more likely to include very rare and obscure human languages. Neither includes Klingon yet.

[snip other stuff, with which I agree]

Klingon is tlh in ISO 639. http://www.loc.gov/standards/iso639-2/langcodes.html

Constructed and ancient languages are out of scope for the Ethnologue but there is an effort to extend the Ethnologue list and produce standardized codes for them ("LINGUIST codes"); in that list Klingon is apparently CKLN (though at least one page refers to it as CKLI). http://www.language-archives.org/wg/language-codes/linguist-20020219.html http://cf.linguistlist.org/cfdocs/new-website/LL-WorkingDirs/forms/langs/Get... http://cf.linguistlist.org/cfdocs/new-website/LL-WorkingDirs/forms/langs/Get...

*Muke!

-- website: http://frath.net/ LiveJournal: http://kohath.livejournal.com/ deviantArt: http://kohath.deviantart.com/ FrathWiki, a conlang and conculture wiki: http://wiki.frath.net/

Sabine Cretella

2:30 p.m.

New subject: language codes

Hmmm ... so this issue is getting more and more complicated ... I really thought for quite a long time about how to go ahead.

We first of all need the codes for known languages - and the better they are known, the better it is. Therefore the first option is for ISO-639 two letter.

Let's see:

1) ISO 639 two letter

2) ISO 639 three letter

3) SIL (adding the prefix "sil-")

4) linguist.org codes for ancient languages (adding prefix "lnga-")

5) linguist.org codes for constructed languages (adding prefix "lngc-")

6) wiktionary internal code (adding prefix "wkt-")
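As a rough sketch of how identifiers in this scheme could be told apart again later, the function below splits an identifier into its source and the bare code. The prefixes are the ones listed above; the function name `classify`, the labels it returns and the rule that any unprefixed two- or three-letter string counts as ISO 639 are assumptions made for illustration only (the function does not check that a code is actually assigned).

```python
from typing import Tuple

# Prefixes as proposed in the list above.
PREFIXES = {
    "sil": "SIL",
    "lnga": "linguist.org (ancient)",
    "lngc": "linguist.org (constructed)",
    "wkt": "wiktionary internal",
}


def classify(identifier: str) -> Tuple[str, str]:
    """Split an identifier into (source, code) according to its prefix."""
    prefix, sep, rest = identifier.partition("-")
    if sep and prefix.lower() in PREFIXES:
        if not rest:
            raise ValueError("prefix without a code: %r" % identifier)
        return PREFIXES[prefix.lower()], rest
    # No known prefix: assume an ISO 639 code, told apart by its length.
    if identifier.isalpha() and len(identifier) == 2:
        return "ISO 639 two letter", identifier.lower()
    if identifier.isalpha() and len(identifier) == 3:
        return "ISO 639 three letter", identifier.lower()
    raise ValueError("unrecognised identifier: %r" % identifier)


print(classify("it"))
print(classify("haw"))
print(classify("sil-PJT"))
print(classify("lngc-CKLN"))
```

Because the prefixed forms always contain a hyphen, they cannot collide with plain two- or three-letter ISO codes.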

Could this be a feasible solution? For now I created a list made up of ISO 639 two + three letter codes (only in English yesterday).

Now I'd like to proceed with adding translations into my languages (German/Italian) and maybe also some others.

In a second stage I'd then opt for inserting SIL codes where possible (and step by step also the other codes).

Going back to work...

Ciao, Sabine

*****

Sabine Cretella s.cretella@wordsandmore.it www.wordsandmore.it Meetingplace for translators www.wesolveitnet.com

Muke Tever

18 Sep

11:32 p.m.


On Sun, 05 Sep 2004 08:30:44 +0200, Sabine Cretella sabine_cretella@yahoo.it wrote:

...

We first of all need the codes for known languages - and the better they are known, the better it is. Therefore the first option is for ISO-639 two letter.

Let's see:

ISO 639 two letter

ISO 639 three letter

SIL (adding the prefix "sil-")

linguist.org codes for ancient languages (adding prefix "lnga-")

linguist.org codes for constructed languages (adding prefix "lngc-")

wiktionary internal code (adding prefix "wkt-")

Could this be a feasible solution? For now I created a list made up of ISO 639 two + three letter codes (only in English yesterday).

Actually we may not need to do this. It seems that the SIL codes and the linguist.org codes are being incorporated wholesale into ISO 639-3 ("Alpha-3 code for comprehensive coverage of languages").

*Muke!

-- website: http://frath.net/ LiveJournal: http://kohath.livejournal.com/ deviantArt: http://kohath.deviantart.com/ FrathWiki, a conlang and conculture wiki: http://wiki.frath.net/

wiktionary-l@lists.wikimedia.org
