Case sensitivity on Afrikaans Wiktionary - Wikitech-l

Ashar Voultoiz

15 Jul

2:45 a.m.

Ray Saintonge wrote:

A request by nl:Gebruiker:Jcwf <http://nl.wiktionary.org/wiki/Gebruiker:Jcwf> has been posted on the en:Beer parlour to have first letter case sensitivity on the inactive Afrikaans Wiktionary. Ec

Please give some references on af.wikipedia.org with users supporting this change. Thanks

-- Ashar Voultoiz - WP++++ http://en.wikipedia.org/wiki/User:Hashar http://www.livejournal.com/community/wikitech/ IM: hashar(a)jabber.org ICQ: 15325080

Ævar Arnfjörð Bjarmason

7:51 p.m.

On 7/15/05, Timwi <timwi(a)gmx.net> wrote:

...

If that Wiktionary is inactive

It has approximately 20,000 articles.

Ævar Arnfjörð Bjarmason

8:25 p.m.

On 7/15/05, Angela <beesley(a)gmail.com> wrote:

...

Dutch has almost 20,000, but Afrikaans, which is the one requesting case sensitivity changes, has just 19.

Righto, sorry about that.

Mark Williamson

11:06 p.m.

Because they speak Afrikaans, and would be able to give the view of a native speaker.

Mark

On 15/07/05, Timwi <timwi(a)gmx.net> wrote:

...

Angela wrote:

Ashar's request was to ask af.wikipedia, not Wiktionary about this, and they have 4000 articles and a reasonably active, though small, community.

If none of them participates in Wiktionary, why does their opinion count any more than anyone else's?

_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)wikimedia.org
http://mail.wikipedia.org/mailman/listinfo/wikitech-l

-- SI HOC LEGERE SCIS NIMIVM ERVDITIONIS HABES QVANTVM MATERIAE MATERIETVR MARMOTA MONAX SI MARMOTA MONAX MATERIAM POSSIT MATERIARI ESTNE VOLVMEN IN TOGA AN SOLVM TIBI LIBET ME VIDERE

Ray Saintonge

16 Jul

1:19 p.m.

Gerard Meijssen wrote:

...

As to capitalisation: a paper dictionary does not capitalise its entries unless the words are capitalised as a rule. It is due to some unfortunate history that it took so long to change the English Wiktionary. There are currently 19 articles in the Afrikaans Wiktionary, and Jcwf may want to use the same system of templates that is used in many of the other Wiktionaries. To do this it helps to have capitalisation turned off. I second his request to turn first-letter capitalisation off on af.wiktionary.

Personally, I was just passing on a request that appeared on the en:wiktionary Beer parlour. Naturally, I support it just as I supported the change in English. A more important and more general question might be what the default situation should be when a new Wiktionary is started. This will even be important for Wiktionaries in scripts where capitalization is unknown, since they too will include foreign words.

Ec
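As a rough illustration of what the request changes, here is a minimal Python sketch of first-letter title handling. The function `normalize_title` and its `capital_links` flag are assumptions made up for this example; the flag only mirrors the idea behind MediaWiki's `$wgCapitalLinks` setting and is not MediaWiki code.

```python
def normalize_title(title: str, capital_links: bool) -> str:
    """Return the page title a wiki would store for a typed title.

    Hypothetical helper: capital_links=True models a wiki that forces
    the first letter to upper case, False models a case-sensitive one.
    """
    title = title.strip().replace("_", " ")
    if capital_links and title:
        # Only the first letter is changed; the rest is left untouched.
        return title[0].upper() + title[1:]
    return title


# With capitalisation on, "apple" and "Apple" end up on the same page.
print(normalize_title("apple", capital_links=True))
# With it off, a dictionary can keep "apple" and "Apple" as separate entries.
print(normalize_title("apple", capital_links=False))
```

With the flag off, lower-case headwords stay lower-case, which is what a template system shared with other Wiktionaries would rely on.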

Timwi

20 Jul

7:41 p.m.

Gerard Meijssen wrote:

...

I would welcome your comments about the ERD that I posted here http://commons.wikimedia.org/wiki/Image:ERD.jpg

Looks interesting, but is extremely bare. It would do well with a bit of documentation. For much of it, the purpose isn't entirely clear. I'm particularly confused as to why "Language", "Word" and "Meaning" are each duplicated.

Mark Williamson

10:43 p.m.

Hi Gerard,

Signed languages are completely independent systems, and are separate languages from spoken languages, different in grammar, syntax, and the like. Unfortunately, there is no universally agreed-upon method for transcribing signed languages. There are a few possibilities here.

1) Choose a particular transcription system. Top of the list would be Stokoe and Sutton SignWriting; HamNoSys is also possible but is mostly used by linguists, while the other two are used more widely by people who use a signed language as their everyday language.

2) Use multimedia. We can upload videos of people signing a particular word. Note that some signed languages also have conjugations and inflections. However, this will leave a problem of headwords -- how do you look up a word in a signed language? Which leads to the third option,

3) Introduce our own notation system. This is impractical and unlikely to work well. I suggest that instead we adopt HamNoSys for lookup purposes; although it is not represented in Unicode, we can try an ASCII implementation.

Regarding "dialects" of Chinese and Arabic, that is very simple: treat them as separate languages. While people certainly most often write in "Standard Arabic" or "Standard Chinese", it is also possible to write in the local vernacular. This tends to be done more with Arabic, but is possible with either. With Chinese, you see it often only with Cantonese; other varieties are occasionally written, but you are more likely to find a Bible translation in them than a newspaper.

I hope very much that you will not restrict languages to those which appear on the ISO 639-3 list. It has many shortcomings and is very, very, very disappointing -- it would not allow for separate entries for Yavapai, Hualapai, and Havasupai (it has only one code for them all), even though they are very much different languages, and it by no means includes all the languages of the world. It also separates Moroccan, Tunisian, and Algerian Arabic, when they're really nearly identical.

Mark

On 20/07/05, Gerard Meijssen <gerard.meijssen(a)gmail.com> wrote:

...

Timwi wrote:

Gerard Meijssen wrote:

I would welcome your comments about the ERD that I posted here http://commons.wikimedia.org/wiki/Image:ERD.jpg

Looks interesting, but is extremely bare. It would do well with a bit of documentation. For much of it, the purpose isn't entirely clear. I'm particularly confused as to why "Language", "Word" and "Meaning" are each duplicated.

Hoi, There is some documentation here: http://meta.wikimedia.org/wiki/Ultimate_Wiktionary_data_design and http://meta.wikimedia.org/wiki/Ultimate_Wiktionary_decisions_on_its_usage

The duplication reflects that there is at least one table that has two relations to the same table. Language refers to itself for dialects; Word refers through Conju/Decli (conjugation or declension) to a headword and derived words; Meaning is related to itself through "Relations", which allows for thesaurus-like structures. One reason why it is not as well documented as I would like is that I am still working on the structure. At this moment I am thinking hard about how to include signed languages and the spoken dialects of the Chinese and Arabic written languages.

Thanks, GerardM
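The "two relations to the same table" pattern described here can be sketched with an in-memory SQLite database in Python. All table and column names below (`parent_id`, `headword_id`, `derived_id`, `form`) are assumptions chosen for illustration and are not taken from the actual ERD.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE language (
    language_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    parent_id   INTEGER REFERENCES language(language_id)  -- NULL unless a dialect
);
CREATE TABLE word (
    word_id     INTEGER PRIMARY KEY,
    language_id INTEGER NOT NULL REFERENCES language(language_id),
    spelling    TEXT NOT NULL
);
-- Link table with two foreign keys into the same table (word).
CREATE TABLE inflection (
    headword_id INTEGER NOT NULL REFERENCES word(word_id),
    derived_id  INTEGER NOT NULL REFERENCES word(word_id),
    form        TEXT NOT NULL
);
""")
conn.execute("INSERT INTO language VALUES (1, 'English', NULL)")
conn.execute("INSERT INTO language VALUES (2, 'Australian English', 1)")
conn.execute("INSERT INTO word VALUES (1, 1, 'white')")
conn.execute("INSERT INTO word VALUES (2, 1, 'whiter')")
conn.execute("INSERT INTO inflection VALUES (1, 2, 'comparative')")

# Joining word to itself through the link table: "Word" appears twice.
rows = conn.execute("""
    SELECT h.spelling, d.spelling, i.form
    FROM inflection AS i
    JOIN word AS h ON h.word_id = i.headword_id
    JOIN word AS d ON d.word_id = i.derived_id
""").fetchall()

# Joining language to itself: a dialect points at its parent language.
dialects = conn.execute("""
    SELECT d.name, p.name
    FROM language AS d
    JOIN language AS p ON p.language_id = d.parent_id
""").fetchall()
print(rows, dialects)
```

In a diagram, each self-join is usually drawn by showing the table twice, which would explain why "Language", "Word" and "Meaning" look duplicated.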

Andrew Dunbar

11:41 a.m.

Hi guys. I'd just like to insert a few comments...

On 7/21/05, Gerard Meijssen <gerard.meijssen(a)gmail.com> wrote:

...

Mark Williamson wrote:

Hi Gerard,

When you look at the ERD it is explicitly indicated by the Conju/Decli-Word table what the Headword is. This is done for any

Why not call it the Inflection table, since conjugation and declension are just types of inflection? It's a lot easier to pronounce.

...

language and it means that it is by practice and not design that in most languages the infinitive will be the Headword. Consequently, when one word is found the other will also be shown, in a format that may need some screen design per type of conjugation or declension.

Many languages do not have a concept of "infinitive". Many languages use some other form as the headword in dictionaries. Such forms are known as citation forms. The other extremely common form used by many languages as the verbal citation form is the 3rd person singular present indicative. If verbs have gender then it is the masculine form. I am not aware of another verb form used as the citation form but there's a good chance there are others in exotic languages. Also it's worth noting that a few languages have more than one infinitive form.

Hippietrail

Gerard Meijssen

12:25 p.m.

Nikola Smolenski wrote:

...

On Wednesday 20 July 2005 21:29, Gerard Meijssen wrote:

Timwi wrote:

Gerard Meijssen wrote:

I would welcome your comments about the ERD that I posted here http://commons.wikimedia.org/wiki/Image:ERD.jpg

I haven't seen it before, and I was thinking about it on my own, so I'd like to comment on it :)

Looks interesting, but is extremely bare. It would do well with a bit of documentation. For much of it, the purpose isn't entirely clear. I'm particularly confused as to why "Language", "Word" and "Meaning" are each duplicated.

Hoi, There is some documentation here: http://meta.wikimedia.org/wiki/Ultimate_Wiktionary_data_design and http://meta.wikimedia.org/wiki/Ultimate_Wiktionary_decisions_on_its_usage

The duplication reflects that there is at least one table that has two relations to the same table. Language refers to itself for dialects; Word refers through Conju/Decli (conjugation or declension) to a headword and derived words; Meaning is related to itself through "Relations", which allows for thesaurus-like structures.

First, I'd suggest a cosmetic change, to use "script" instead of "charset" in table and column names. When I first saw it, I thought that you are referring to computer charsets (ISO 8859-1, Windows-1251...) and was wondering why wouldn't you simply use UTF-8 :)

Done

...

Related to this, I'd suggest to add "ISO15924" column to "characterset" (or future "script") table. This way a script can be formally specified and looked up, regardless of its name.

Done; however, the name of the script is a record in the database in its own right, so it may have as many translations as we care to enter. The code is just to anchor it. I understand from Erik's notes that a language can be indicated as the default value. The default value will be English. So, add a translation to the English word and from then on the User Interface will show it localised.
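A minimal sketch of this idea in Python with SQLite: the ISO 15924 code anchors the script, the names live in a separate table with one row per language, and the lookup falls back to English. The table layout, the two-letter language keys and the helper `script_name` are assumptions for illustration only, and the sample names are illustrative data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE script (
    script_id INTEGER PRIMARY KEY,
    iso15924  TEXT UNIQUE          -- four-letter code, e.g. 'Latn'
);
CREATE TABLE script_name (
    script_id INTEGER NOT NULL REFERENCES script(script_id),
    lang      TEXT NOT NULL,       -- language this name is written in
    name      TEXT NOT NULL,
    PRIMARY KEY (script_id, lang)
);
""")
conn.execute("INSERT INTO script VALUES (1, 'Latn')")
conn.execute("INSERT INTO script VALUES (2, 'Cyrl')")
conn.executemany(
    "INSERT INTO script_name VALUES (?, ?, ?)",
    [(1, "en", "Latin"), (1, "nl", "Latijns"), (2, "en", "Cyrillic")],
)


def script_name(code, ui_lang, default_lang="en"):
    """Name of a script in ui_lang, falling back to default_lang."""
    row = conn.execute(
        """
        SELECT n.name FROM script AS s
        JOIN script_name AS n ON n.script_id = s.script_id
        WHERE s.iso15924 = ? AND n.lang IN (?, ?)
        ORDER BY n.lang = ? DESC   -- prefer the requested language
        LIMIT 1
        """,
        (code, ui_lang, default_lang, ui_lang),
    ).fetchone()
    return row[0] if row else None


print(script_name("Latn", "nl"))  # a Dutch name exists
print(script_name("Cyrl", "nl"))  # no Dutch name yet, English is used
```

Once someone adds a Dutch name for 'Cyrl', the same lookup returns it without any change to the interface code.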

...

Why is there column "gender" in table "word"? If a word can exist in multiple genders, shouldn't that rather be represented in "inflection" table? If a word has a gender on its own, wouldn't that rather be represented in WordType table? If not, there are other properties of words (for example, number) which can also be represented in "word" table, why is gender singled out?

When a word is inflected to a particular form, that word is a word in its own right and consequently will be found in the UW. The inflection is there because it does provide information and this information is relevant for the inflections and the headword. A Wordtype indicates a noun, a verb, an adjective etc.

...

I'd strongly suggest adding a column "inflection" to either the "wordtype" or the "word" table; this would specify which inflection a word uses, and whether it is regular or irregular. If it is known which inflection a word uses and if it is regular, then all its inflected forms could be generated automatically. I see that there is a column "languageid" in the "meaningtext" table. If I understand this, it means that the meaning of a word could be written down in various languages, and I second this. But I wonder how you are going to do the same for other data which might need to be translated (for example, the column "characterset" of the "characterset" table; I understand that this is pure text?). Were you thinking about this?
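The suggestion that regular forms could be generated automatically can be sketched as follows. The paradigm name `en-adj-er`, the suffix table and the function `inflect` are all hypothetical, and the stem rule is a toy that covers only a handful of English adjectives.

```python
# Hypothetical paradigms: each maps a form label to a suffix.
PARADIGMS = {
    "en-adj-er": {"positive": "", "comparative": "er", "superlative": "est"},
}

# Irregular words store their forms explicitly instead of using a paradigm.
IRREGULAR = {
    "good": {"positive": "good", "comparative": "better", "superlative": "best"},
}


def inflect(word: str, paradigm: str) -> dict:
    """Return all forms of word, generated unless the word is irregular."""
    if word in IRREGULAR:
        return dict(IRREGULAR[word])
    # Toy stem rule: drop a final "e" before a suffix (white -> whit + er).
    stem = word[:-1] if word.endswith("e") else word
    forms = {}
    for label, suffix in PARADIGMS[paradigm].items():
        forms[label] = word if suffix == "" else stem + suffix
    return forms


print(inflect("white", "en-adj-er"))
print(inflect("good", "en-adj-er"))
```

The "regular or irregular" flag in the proposal corresponds here to whether a word is looked up in `IRREGULAR` or run through a paradigm.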

The fields "Sign" :) "Gender" "WordType" all relate to meaning. Ultimate Wiktionary will eat its own dogfood: when a translation of a word like "noun" is added, as I did for Afrikaans recently, that translation is the one that will be used in the User Interface.

...

Were you thinking about a way to register examples of use, similar to meaning? Or would examples of use be simply a raw text in "meaningtext" table?

Idioms and proverbs will be a "wordtype" as much as noun is. Therefore a proverb will relate to a keyword through WordRelation. I have updated the table Relation with a new field "SameLanguageOnly"; this ensures that the relation is applicable only within the same language. So the relation would be "proverb" and it would combine "apple" with "an apple a day keeps the doctor away". MeaningText would be just the definition of a meaning in a given language.

...

One reason why it is not as much documented as I would like is, because I am still working on the structure. At this moment I am thinking hard on how to include signed languages and the spoken dialects of the Chinese and Arabic written language.

The last one is easy. Add ISO3166_1 and ISO3166_2 columns to language table. That way, you could formally represent a dialect within a certain country (as for Arabic) or even a region (as for Chinese). Perhaps even better solution would be to include a RegionID column which would point to table "regions" (relation 1-many), which would have RegionID, ISO3166_1 and ISO3166_2 columns; that way you could specify wider regions in which a dialect is spoken, even if they go over country boundaries.

The country code is irrelevant as far as this database is concerned. This database is about words in languages and dialects.

...

As a sidenote, I'd also suggest to rename ISO639-2 and ISO639-3 columns to ISO639_2 and ISO639_3, respectively. ISO639-2 might be interpreted as ISO639 minus 2 which can lead to all sorts of confusions.

Done

...

Were you thinking about a way to formally define a dialect? Ideally, beside a region (by the way, ISO3166-2 is not granular enough, and there should be a better way of expressing region in which a dialect is spoken, perhaps even to the level of a village), there should be specified a time period in which a dialect was spoken, social layer which was speaking it, and perhaps even a particular person or entity using it. OK, I got carried away a bit but such things might be important :) Though some of this should perhaps be (also) tied to a word and not a dialect.

Ultimate Wiktionary is about words (written, spoken or signed); that is the starting point. There will be a need for some formality; once a dialect is recognised, it will be hard to take it away. Therefore, in my opinion, dialects will be recognised only after some discussion, and they will be added as such by an admin. I would think that we need to consider what it takes before we add a dialect. Tentatively I would go for at least 100 words defined as such. With a dialect I would assume that words that are not defined are those of the higher-level language. Things like where it is spoken and in which time periods all sound to me like etymological content, and that is where I would have it.

Thanks, GerardM

Nikola Smolenski

7:23 a.m.

On Friday 22 July 2005 13:25, Gerard Meijssen wrote:

...

Nikola Smolenski wrote:

On Wednesday 20 July 2005 21:29, Gerard Meijssen wrote:

Related to this, I'd suggest to add "ISO15924" column to "characterset" (or future "script") table. This way a script can be formally specified and looked up, regardless of its name.

Done; however, the name of the script is a record in the database in its own right, so it may have as many translations as we care to enter. The code is just to anchor it. I understand from Erik's notes that a language can be indicated as the default value. The default value will be English. So, add a translation to the English word and from then on the User Interface will show it localised.

I'll talk about this below.

...

Why is there column "gender" in table "word"? If a word can exist in multiple genders, shouldn't that rather be represented in "inflection" table? If a word has a gender on its own, wouldn't that rather be represented in WordType table? If not, there are other properties of words (for example, number) which can also be represented in "word" table, why is gender singled out?

When a word is inflected to a particular form, that word is a word in its own right and consequently will be found in the UW. The inflection is there because it does provide information and this information is

Now I'm not so sure that I understand which table is for what. Could you give an example? For example, the word "white" is a base word and the word "whiter" is its inflection. How would these two words fit into the database?

...

relevant for the inflections and the headword. A Wordtype indicates a noun, a verb, an adjective etc.

I still don't understand why is gender singled out of all properties a word could have. For example, a verb could be transitive or intransitive, and this information is important. To give an example:

word: horse gender: male partofspeech: noun
word: ship gender: female partofspeech: noun
word: to drive gender: none partofspeech: transitive verb
word: to swim gender: none partofspeech: intransitive verb

See what I mean? If you have to specify transitivity of a verb in "partofspeech" table, you may as well specify gender of a noun in that table. It would be consistent either to remove "gender" column from "word" table:

word: horse partofspeech: male noun
word: ship partofspeech: female noun
word: to drive partofspeech: transitive verb
word: to swim partofspeech: intransitive verb

or to rename it to, for example, "subtype":

word: horse subtype: male partofspeech: noun
word: ship subtype: female partofspeech: noun
word: to drive subtype: transitive partofspeech: verb
word: to swim subtype: intransitive partofspeech: verb

If you are going to change this, I'd suggest the first solution. Firstly, because there may be words which would have more than one subtype; secondly, because it eliminates the possibility of having invalid mix of subtypes (horse: intransitive noun...).
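The difference between the two layouts can be made concrete with a small Python sketch. The enum `PartOfSpeech`, the mapping `ALLOWED_SUBTYPES` and the function `is_valid` are hypothetical names used only to show where an invalid mix such as "intransitive noun" can or cannot arise.

```python
from enum import Enum


class PartOfSpeech(Enum):
    # First layout: gender or transitivity is folded into the value itself,
    # so a combination like "intransitive noun" simply has no member.
    MALE_NOUN = "male noun"
    FEMALE_NOUN = "female noun"
    TRANSITIVE_VERB = "transitive verb"
    INTRANSITIVE_VERB = "intransitive verb"


# Second layout: "subtype" and "partofspeech" are separate columns,
# so the valid pairs have to be checked explicitly.
ALLOWED_SUBTYPES = {
    "noun": {"male", "female"},
    "verb": {"transitive", "intransitive"},
}


def is_valid(partofspeech: str, subtype: str) -> bool:
    """True if subtype makes sense for the given part of speech."""
    return subtype in ALLOWED_SUBTYPES.get(partofspeech, set())


print(PartOfSpeech("male noun"))
print(is_valid("noun", "male"), is_valid("noun", "intransitive"))
```

The first layout needs no validation step but grows one value per combination; the second stays compact at the cost of an extra check.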

...

I'd strongly suggest adding a column "inflection" to either the "wordtype" or the "word" table; this would specify which inflection a word uses, and whether it is regular or irregular. If it is known which inflection a word uses and if it is regular, then all its inflected forms could be generated automatically. I see that there is a column "languageid" in the "meaningtext" table. If I understand this, it means that the meaning of a word could be written down in various languages, and I second this. But I wonder how you are going to do the same for other data which might need to be translated (for example, the column "characterset" of the "characterset" table; I understand that this is pure text?). Were you thinking about this?

The fields "Sign" :) "Gender" "WordType" all relate to meaning. Ultimate Wiktionary will eat its own dogfood: when a translation of a word like "noun" is added, as I did for Afrikaans recently, that translation is the one that will be used in the User Interface.

OK, but what if you have a longer phrase as a table field? For example, an "inflection" in table "inflection" might be "male genitive superlative" or "3rd person plural female past". I don't think it makes sense to add such phrases to the dictionary as proper entries, only so that the dictionary would have translations of them. Are some table fields inherently translatable? Is this what you had in mind above?

...

Were you thinking about a way to register examples of use, similar to meaning? Or would examples of use be simply a raw text in "meaningtext" table?

Idioms and proverbs will be a "wordtype" as much as noun is. Therefore a proverb will relate to a keyword through WordRelation. I have updated the table Relation with a new field "SameLanguageOnly"; this ensures that the relation is applicable only within the same language. So the relation would be "proverb" and it would combine "apple" with "an apple a day keeps the doctor away". MeaningText would be just the definition of a meaning in a given language.

I was thinking about something else; for example, on http://en.wiktionary.org/wiki/account there is this example: "A beggarly account of empty boxes. - Shakespeare, Romeo and Juliet, V-i"; but I understand now it is going to be just a part of "meaningtext". I'm not so certain, but maybe it would be good to create a separate table for examples, because same examples could (and probably will) be used in "meaningtext"s in different languages. It would also make it easier to automatically add new examples (for example, by grepping Project Gutenberg ;)
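A separate examples table along these lines might look like the sketch below, again using SQLite from Python. The `example` table, its columns and the placeholder definitions are assumptions; only "meaningtext" and "languageid" are names mentioned in the discussion.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE meaning (meaning_id INTEGER PRIMARY KEY);
CREATE TABLE meaningtext (
    meaning_id INTEGER NOT NULL REFERENCES meaning(meaning_id),
    languageid TEXT NOT NULL,
    definition TEXT NOT NULL
);
-- Hypothetical table: an example hangs off the meaning, not off one text.
CREATE TABLE example (
    example_id INTEGER PRIMARY KEY,
    meaning_id INTEGER NOT NULL REFERENCES meaning(meaning_id),
    quote      TEXT NOT NULL,
    source     TEXT
);
""")
conn.execute("INSERT INTO meaning VALUES (1)")
conn.executemany(
    "INSERT INTO meaningtext VALUES (?, ?, ?)",
    [(1, "en", "(definition in English)"), (1, "nl", "(definition in Dutch)")],
)
conn.execute(
    "INSERT INTO example VALUES (1, 1, ?, ?)",
    ("A beggarly account of empty boxes.", "Shakespeare, Romeo and Juliet, V-i"),
)

# The single stored example is reachable from the definition in each language.
rows = conn.execute("""
    SELECT t.languageid, e.quote
    FROM meaningtext AS t
    JOIN example AS e ON e.meaning_id = t.meaning_id
    ORDER BY t.languageid
""").fetchall()
print(rows)
```

Because the quote is stored once, correcting or extending it updates what readers of every definition language see.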

...

One reason why it is not as much documented as I would like is, because I am still working on the structure. At this moment I am thinking hard on how to include signed languages and the spoken dialects of the Chinese and Arabic written language.

The last one is easy. Add ISO3166_1 and ISO3166_2 columns to language table. That way, you could formally represent a dialect within a certain country (as for Arabic) or even a region (as for Chinese). Perhaps even better solution would be to include a RegionID column which would point to table "regions" (relation 1-many), which would have RegionID, ISO3166_1 and ISO3166_2 columns; that way you could specify wider regions in which a dialect is spoken, even if they go over country boundaries.

The country code is irrelevant as far as this database is concerned. This database is about words in languages and dialects.

I still think that this would be a useful way of formally specifying a dialect. For example, British English would have ISO639_2 code "en" and ISO3166_1 code "uk" while Australian English would have ISO639_2 code "en" but ISO3166_1 code "au".
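As a sketch of this proposal, a variety can be keyed by a language code plus an optional country code. The class `LanguageVariety` and its `tag` method are hypothetical. Note that in the published standards "en" is the ISO 639-1 code for English (the ISO 639-2 code is "eng") and the ISO 3166-1 alpha-2 code for the United Kingdom is "GB", so the example below uses "GB" where the message wrote "uk".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LanguageVariety:
    """Hypothetical key for a language or a country-specific variety of it."""

    language: str                   # language code, e.g. "en"
    country: Optional[str] = None   # ISO 3166-1 alpha-2 code, None for the base language

    def tag(self) -> str:
        # Join the two codes in the familiar "en-GB" style.
        if self.country is None:
            return self.language
        return f"{self.language}-{self.country}"


british = LanguageVariety("en", "GB")
australian = LanguageVariety("en", "AU")
print(british.tag(), australian.tag(), LanguageVariety("en").tag())
```

A country code alone cannot express a variety spoken across borders or within one region, which is the limitation the RegionID idea in the same message tries to address.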

...

Were you thinking about a way to formally define a dialect? Ideally, beside a region (by the way, ISO3166-2 is not granular enough, and there should be a better way of expressing region in which a dialect is spoken, perhaps even to the level of a village), there should be specified a time period in which a dialect was spoken, social layer which was speaking it, and perhaps even a particular person or entity using it. OK, I got carried away a bit but such things might be important :) Though some of this should perhaps be (also) tied to a word and not a dialect.

Ultimate Wiktionary is about words (written, spoken or signed); that is the starting point. There will be a need for some formality; once a dialect is recognised, it will be hard to take it away. Therefore, in my opinion, dialects will be recognised only after some discussion, and they will be added as such by an admin. I would think that we need to consider what it takes before we add a dialect. Tentatively I would go for at least 100 words defined as such. With a dialect I would assume that words that are not defined are those of the higher-level language.

When I was referring to "dialect", I did not have in mind a dialect that is officially recognised, but simply a set of words which could be identified as belonging to a certain group. So if you want to say that this word was part of London dockworkers' slang in 1800s, you should be able to do so, and not just stamp it with "British English". As a simple example, in Serbia, there are several publishing houses that were publishing Asterix, and in some translations "Idefix" is named "Garoviks" and "Panoramix" is named "Aspiriniks" while in others "Idefix" is named "Idefiks" and "Panoramix" is named "Panoramiks"; and this is consistent. If you are going to translate something about Asterix to Serbian, you should pick one of the translations, but you should be consistent in using only the words from the translation which you have picked, and they should somehow be marked as belonging to the same translation. There surely are more important things than Asterix where similar might apply.

Gerard Meijssen

9:13 a.m.

Nikola Smolenski wrote:

...

On Friday 22 July 2005 13:25, Gerard Meijssen wrote:

Nikola Smolenski wrote:

On Wednesday 20 July 2005 21:29, Gerard Meijssen wrote:

Related to this, I'd suggest to add "ISO15924" column to "characterset" (or future "script") table. This way a script can be formally specified and looked up, regardless of its name.

Done; however, the name of the script is a record in the database in its own right, so it may have as many translations as we care to enter. The code is just to anchor it. I understand from Erik's notes that a language can be indicated as the default value. The default value will be English. So, add a translation to the English word and from then on the User Interface will show it localised.

I'll talk about this below.

Why is there column "gender" in table "word"? If a word can exist in multiple genders, shouldn't that rather be represented in "inflection" table? If a word has a gender on its own, wouldn't that rather be represented in WordType table? If not, there are other properties of words (for example, number) which can also be represented in "word" table, why is gender singled out?

When a word is inflected to a particular form, that word is a word in its own right and consequently will be found in the UW. The inflection is there because it does provide information and this information is

Now I'm not so sure that I understand which table is for what. Could you give an example? For example, the word "white" is a base word and the word "whiter" is its inflection. How would these two words fit into the database?

Both words will exist as a Spelling, as a Word and they may share a Meaning. When the inflections are added, in the Inflection-Word, all the missing words will be created and they will all be related to each other through this table. Contrary to a paper dictionary we want them all.

...

relevant for the inflections and the headword. A Wordtype indicates a noun, a verb, an adjective etc.

I still don't understand why is gender singled out of all properties a word could have. For example, a verb could be transitive or intransitive, and this information is important. To give an example:

word: horse gender: male partofspeech: noun
word: ship gender: female partofspeech: noun
word: to drive gender: none partofspeech: transitive verb
word: to swim gender: none partofspeech: intransitive verb

See what I mean? If you have to specify transitivity of a verb in "partofspeech" table, you may as well specify gender of a noun in that table. It would be consistent either to remove "gender" column from "word" table:

word: horse partofspeech: male noun
word: ship partofspeech: female noun
word: to drive partofspeech: transitive verb
word: to swim partofspeech: intransitive verb

or to rename it to, for example, "subtype":

word: horse subtype: male partofspeech: noun
word: ship subtype: female partofspeech: noun
word: to drive subtype: transitive partofspeech: verb
word: to swim subtype: intransitive partofspeech: verb

If you are going to change this, I'd suggest the first solution. Firstly, because there may be words which would have more than one subtype; secondly, because it eliminates the possibility of having invalid mix of subtypes (horse: intransitive noun...).

At this moment in time I would not have intransitive verbs or transitive verbs at all. To me they are verbs. When they are transitive, they have a different meaning from when they are intransitive, so to me the distinction is in the meaning.

...

I'd strongly suggest adding a column "inflection" to either the "wordtype" or the "word" table; this would specify which inflection a word uses, and whether it is regular or irregular. If it is known which inflection a word uses and if it is regular, then all its inflected forms could be generated automatically. I see that there is a column "languageid" in the "meaningtext" table. If I understand this, it means that the meaning of a word could be written down in various languages, and I second this. But I wonder how you are going to do the same for other data which might need to be translated (for example, the column "characterset" of the "characterset" table; I understand that this is pure text?). Were you thinking about this?

The fields "Sign" :) "Gender" "WordType" all relate to meaning. Ultimate Wiktionary will eat its own dogfood: when a translation of a word like "noun" is added, as I did for Afrikaans recently, that translation is the one that will be used in the User Interface.

OK, but what if you have a longer phrase as a table field? For example, an "inflection" in table "inflection" might be "male genitive superlative" or "3rd person plural female past". I don't think it makes sense to add such phrases to the dictionary as proper entries, only so that the dictionary would have translations of them. Are some table fields inherently translatable? Is this what you had in mind above?

Most if not all text fields will be inherently translatable; this is what I have very much in mind. The name of a font will not be translated, but that is the only one at this point in time. It makes perfect sense to have this in the UW as it allows us to have a self-learning User Interface. The thing is, it has a function.

...

Were you thinking about a way to register examples of use, similar to meaning? Or would examples of use be simply a raw text in "meaningtext" table?

Idioms and proverbs will be a "wordtype" as much as noun is. Therefore a proverb will relate to a keyword through WordRelation. I have updated the table Relation with a new field "SameLanguageOnly"; this ensures that the relation is applicable only within the same language. So the relation would be "proverb" and it would combine "apple" with "an apple a day keeps the doctor away". MeaningText would be just the definition of a meaning in a given language.

I was thinking about something else; for example, on http://en.wiktionary.org/wiki/account there is this example: "A beggarly account of empty boxes. - Shakespeare, Romeo and Juliet, V-i"; but I understand now it is going to be just a part of "meaningtext". I'm not so certain, but maybe it would be good to create a separate table for examples, because same examples could (and probably will) be used in "meaningtext"s in different languages. It would also make it easier to automatically add new examples (for example, by grepping Project Gutenberg ;)

"A beggarly account of empty boxes" is a quote, so why not have it as a separate Word and mark it as such? It would be an idiom for "account" and this is linked through Relation. Many famous quotes have been translated and we could have them all. (Een paard, een paard, een koninkrijk voor een paard) I do not think grepping Project Gutenberg makes much sense. If anything it helps you find occurrences of the word, but you have to be selective about what to include. That is an editorial process, and just the fact that a word is used does not make for a good idiom in the UW.

...

One reason why it is not as much documented as I would like is, because I am still working on the structure. At this moment I am thinking hard on how to include signed languages and the spoken dialects of the Chinese and Arabic written language.

The last one is easy. Add ISO3166_1 and ISO3166_2 columns to language table. That way, you could formally represent a dialect within a certain country (as for Arabic) or even a region (as for Chinese). Perhaps even better solution would be to include a RegionID column which would point to table "regions" (relation 1-many), which would have RegionID, ISO3166_1 and ISO3166_2 columns; that way you could specify wider regions in which a dialect is spoken, even if they go over country boundaries.

The country code is irrelevant as far as this database is concerned. This database is about words in languages and dialects.

I still think that this would be a useful way of formally specifying a dialect. For example, British English would have ISO639_2 code "en" and ISO3166_1 code "uk" while Australian English would have ISO639_2 code "en" but ISO3166_1 code "au".

Even the ISO-639 codes in the table are there to connect what we are doing in the Wikipedias and other projects. As it is a standard I added it but in the database the ISO 639 fields are not compulsory, the "WMF key" is. If we "need" these ISO639_2 codes, then we would adhere to the principle that a language is a dialect with an army. Have a look at http://nl.wiktionary.org/wiki/WikiWoordenboek:Lijst_van_messages#Schrijfwij… and you will see how we do some of the uk and au stuff for you. This is however not a great example because it is a mix of different spelling but also vocabulary and scripts. As I was not content with this I came up with the current ERD.

...

Were you thinking about a way to formally define a dialect? Ideally, beside a region (by the way, ISO3166-2 is not granular enough, and there should be a better way of expressing region in which a dialect is spoken, perhaps even to the level of a village), there should be specified a time period in which a dialect was spoken, social layer which was speaking it, and perhaps even a particular person or entity using it. OK, I got carried away a bit but such things might be important :) Though some of this should perhaps be (also) tied to a word and not a dialect.

Ultimate wiktionary is about words (written, spoken or signed) that is the starting point. There will be a need for some formality; once a dialect is recognised, it will be hard to take it away. Therefore in my opinion it will be after some discussion. They will be added as such by an admin. I would think that we need to consider what it takes before we add a dialect. Tentatively I would go for at least 100 words defined as such. With a dialect I would assume that words that are not defined are those of the higher level language.

When I was referring to "dialect", I did not have in mind a dialect that is officially recognised, but simply a set of words which could be identified as belonging to a certain group. So if you want to say that this word was part of London dockworkers' slang in 1800s, you should be able to do so, and not just stamp it with "British English".

When there are words that are specific to London dockworkers in the 1800s, I would not call it a dialect because like many professions they have their own vocabulary. These I would mark within a collection, as the bulk of what they say would be London English of the 1800s. Now there is one thing that is relevant: the UW wants all words of all languages, but its primary purpose is to have the current vocabulary. So yes, these words exist and have their place, but when they are not used anymore they should be marked as such.

...

As a simple example, in Serbia, there are several publishing houses that were publishing Asterix, and in some translations "Idefix" is named "Garoviks" and "Panoramix" is named "Aspiriniks" while in others "Idefix" is named "Idefiks" and "Panoramix" is named "Panoramiks"; and this is consistent. If you are going to translate something about Asterix to Serbian, you should pick one of the translations, but you should be consistent in using only the words from the translation which you have picked, and they should somehow be marked as belonging to the same translation. There surely are more important things than Asterix where similar might apply.

Garoviks and Idefiks are for the Serbian language synonyms and as such I do not have to choose because they are both correct. As a matter of interest you could explain things either in the etymology or in the meaning of the word.


Nikola Smolenski

24 Jul

8:49 a.m.

On Saturday 23 July 2005 11:42, Mark Williamson wrote:

...

I still think that this would be a useful way of formally specifying a dialect. For example, British English would have ISO639_2 code "en" and ISO3166_1 code "uk" while Australian English would have ISO639_2 code "en" but ISO3166_1 code "au".

Even the ISO-639 codes in the table are there to connect what we are doing in the Wikipedias and other projects. As it is a standard I added it but in the database the ISO 639 fields are not compulsory, the "WMF key" is. If we "need" these ISO639_2 codes, then we would adhere to the principle that a language is a dialect with an army. Have a look at http://nl.wiktionary.org/wiki/WikiWoordenboek:Lijst_van_messages#Schrijfwijzen_binnen_een_taal_2 and you will see how we do some of the uk and au stuff for you. This is however not a great example because it is a mix of different spelling but also vocabulary and scripts. As I was not content with this I came up with the current ERD.

Nikola, if you'll look up information on English dialects, you'll find that the division between British, Australian, American, etc. is all very exaggerated. The British, Australian, and American standard languages are all based on the same dialect.

I knew this (I've been told that there is much more difference in pronunciation than there is in vocabulary); however, if there are words specific to one dialect, they should be marked as belonging to it.

...

For example, "to starve" is the same in British, Australian, and American standard English, while in Yorkshire dialect it's "to clem". Similarly, "mouth", which is the same in British, Australian, and American English (as far as the standard languages go), is "flep" in Yorkshire. Incidentally, "flep" also refers to the lips.

I didn't know that, these are interesting examples :)

...

This could perhaps be compared to Serbo-Croatian: Serbian, Croatian, and Bosnian are all based on Stovakian, and there's not much variance

Stokavian.

...

between them; the real variance in Southwest Slavic is between Stovakian, Cakavian, and Kajkavian. In this example, Australian, British, and American correspond to Serbian, Croatian, and Bosnian, while true dialects such as those of Yorkshire, Northumbria, or Liverpool correspond to Stovakian, Cakavian, and Kajkavian.

Congratulations! :) You just made a huge cultural faux-pas, claiming that Serbo-Croatian is in fact Croatian :) Better comparison would be: Australian/British/American correspond to Serbian/Croatian/Bosnian, while Yorkshirian/Northumbrian/Liverpoolian correspond to Vojvodinian/Slavonian/Herzegovinian (the latter are dialects of Stokavian).


Gerard Meijssen

10:04 a.m.

Nikola Smolenski wrote:

...

On Saturday 23 July 2005 10:13, Gerard Meijssen wrote:

Nikola Smolenski wrote:

On Friday 22 July 2005 13:25, Gerard Meijssen wrote:

Nikola Smolenski wrote:

Why is there column "gender" in table "word"? If a word can exist in multiple genders, shouldn't that rather be represented in "inflection" table? If a word has a gender on its own, wouldn't that rather be represented in WordType table? If not, there are other properties of words (for example, number) which can also be represented in "word" table, why is gender singled out?

When a word is inflected to a particular form, that word is a word in its own right and consequently will be found in the UW. The inflection is there because it does provide information and this information is

Now I'm not so sure that I understand which table is for what. Could you give an example? For example, the word "white" is a base word and the word "whiter" is its inflection. How would these two words fit into the database?

Both words will exist as a Spelling, as a Word and they may share a Meaning. When the inflections are added, in the Inflection-Word, all the missing words will be created and they will all be related to each other through this table. Contrary to a paper dictionary we want them all.

Then I have misunderstood the database design :( I believed at first that inflections would be stored in "inflection" table. Now when I understand the design better, I don't think that it is a good idea to have separate "word" for each inflection because it brings a lot of unnecessary redundancy, and much room for error. For example, it would be possible to mark "whiter" as an adverb and "white" as a verb! And then, imagine the horror which would ensue if someone would use wrong PartOfSpeech for base word and now it has to be changed for 100 inflections... Though this would be a crucial change, please think about it. I think that "word" table should contain only lemmas.

Right, well this is very much a design decision. The inflections will have to be entered by hand. And if some poor sod does enter all these inflections and they are wrong, there will be the need for another poor sod to remove them.

...

relevant for the inflections and the headword. A Wordtype indicates a noun, a verb, an adjective etc.

I still don't understand why is gender singled out of all properties a word could have. For example, a verb could be transitive or intransitive, and this information is important. To give an example:

[...]

At this moment in time I would not have intransitive verbs or transitive verbs at all. To me they are verbs. When they are transitive, they have a different meaning from when they are intransitive, so to me the distinction is in the meaning.

OK, for a better example, why not number? Perhaps transitivity doesn't, but number also affects inflection, much as gender does.

When it comes to meaning, all the inflections can share the same meaning. The number (first, second, third person) will be implied by the Inflection in the table Inflection-Word. (at this moment it still says Conjugation in this table)

...

OK, but what if you have a longer phrase as a table field? For example, an "inflection" in table "inflection" might be "male genitive superlative" or "3rd person plural female past". I don't think it makes sense to add such phrases to the dictionary as proper entries, only so that the dictionary would have translations of them. Are some table fields inherently translatable? Is this what you had in mind above?

Most if not all text fields will be inherently translatable, this is what I have very much in mind. The name of a font will not be translated but that is the only one at this point in time. It makes perfect sense to have this in the UW as it allows us to have a self learning User Interface. The thing is; it has function.

OK, so this solves it :)

I was thinking about something else; for example, on http://en.wiktionary.org/wiki/account there is this example: "A beggarly account of empty boxes. - Shakespeare, Romeo and Juliet, V-i"; but I understand now it is going to be just a part of "meaningtext". I'm not so certain, but maybe it would be good to create a separate table for examples, because same examples could (and probably will) be used in "meaningtext"s in different languages. It would also make it easier to automatically add new examples (for example, by grepping Project Gutenberg ;)

"A beggarly account of empty boxes" is a quote and why not have it as a seperate Word and marked as such ?? It would be a idiom for "account" and this is linked through Relation. Many famous quotes have been translated and we could have them all. (Een paard , een paard, een koninkrijk voor een paard)

Because, ideally each word (in each language) should have an example or two, and so the number of examples would approach the number of words; and, it would become impossible to distinguish between notable quotes (Kingdom for a horse!), which occur frequently, need a description, and need to be canonically translated, and non-notable quotes, which are in the wiktionary only to be used as examples of use for other words, need not have a description, and translators won't encounter them at all.

The idioms, proverbs and quotes will be "Word" records in their own right. So we have to be selective in the idiom that we choose. What is new ?? That is what the editorial process is for. For instance for the Dutch French speaking people the phrase "Papa fume un pipe" is famous and as such it is noteworthy but its significance will bewilder the French.. :)

...

I do not think grepping Project Gutenberg makes much sense. If anything it helps you find occurrences of the word, but you have to be selective about what to include. That is an editorial process, and the mere fact that a word is used does not make for a good idiom in the UW.

I think it would make sense for rarer words, which might occur once or a few times in the entire Gutenberg corpus. Of course, at the end a human editor has to decide whether a quote is really relevant. Related to grepping Project Gutenberg, have you considered adding information on word frequency? Only a single new table is needed, "frequency", with fields "spellingID", "corpus" and "frequency"; eventually "corpus" could become "corpusID".

There is more to frequency than that. If anything grep may find it but you still need to know the meaning of the word in that text. When the word gets a new meaning, that is what you want to know. I will speak to the people of Rotterdam CS (developers of Lucene) about just these kinds of issues.

...

Once the UW is up and running, how hard would it be to make such changes?

This is the time when it is easy to make fundamental changes to the design of the UW; it is also still the time to come up with an alternative to the design I propose and, as you have noticed, I do change things when there is a good argument to do so. When the UW is live, changing the software will be more difficult.

...

When I was referring to "dialect", I did not have in mind a dialect that is officially recognised, but simply a set of words which could be identified as belonging to a certain group. So if you want to say that this word was part of London dockworkers' slang in 1800s, you should be able to do so, and not just stamp it with "British English".

When there are words that are specific to London dockworkers in the 1800s, I would not call it a dialect because like many professions they have their own vocabulary. These I would mark within a collection as the bulk of what they say would be London English of the 1800s. Now there is

I agree, it is not a dialect, but if some words are recognisable as belonging to a distinctive group of words, they should somehow be marked as belonging to it, and I was suggesting that they are marked in a same way they would be marked as belonging to a certain dialect. Another solution would be to use "wordrelation" table instead, even though it isn't meant to be used in that way :)

Collection is the mechanism of choice for this. Relation is to indicate thesaurus-like structures, including antonyms.

...

one thing that is relevant: the UW wants all words of all languages, but its primary purpose is to have the current vocabulary. So yes, these words exist and have their place, but when they are not used anymore they should be marked as such.

Well, just replace 1800s with 2000s and you still have the same problem :)

These words are still welcome and the Collection is there for it.

...

As a simple example, in Serbia, there are several publishing houses that were publishing Asterix, and in some translations "Idefix" is named "Garoviks" and "Panoramix" is named "Aspiriniks" while in others "Idefix" is named "Idefiks" and "Panoramix" is named "Panoramiks"; and this is consistent. If you are going to translate something about Asterix to Serbian, you should pick one of the translations, but you should be consistent in using only the words from the translation which you have picked, and they should somehow be marked as belonging to the same translation. There surely are more important things than Asterix where similar might apply.

Garoviks and Idefiks are for the Serbian language synonyms and as such I do not have to choose because they are both correct. As a matter of interest you could explain things either in the etymology or in the meaning of the word.

They are synonyms, but they are stylistically marked: it would be wrong to translate Idefix first as Garoviks and later as Idefiks, or to consistently translate Idefix with Idefiks but Panoramix with Aspiriniks, much as it would be wrong to write "I recognise you recognized me"; a translator has to choose and make the choice consistent.

A translator has to make a consistent choice; Collections of translated names of Asterix characters can be used for that. We have the technology. :)

...

Unrelated to any of the above, could you move "word" table a bit to the right, because currently it is hard to see what is relation between "word", "spelling" and "etymology" tables, the lines overlap.

I did put the table Word out of whack to show its importance. I put Collection on the same level as Meaning because that one too is very important for several applications. Table is technically challenging and that is why it is also given some prominence. Thanks, GerardM


Nikola Smolenski

27 Jul

7:52 a.m.

On Monday 25 July 2005 11:04, Gerard Meijssen wrote:

...

Nikola Smolenski wrote:

On Saturday 23 July 2005 10:13, Gerard Meijssen wrote:

Nikola Smolenski wrote:

On Friday 22 July 2005 13:25, Gerard Meijssen wrote:

Nikola Smolenski wrote:

Then I have misunderstood the database design :( I believed at first that inflections would be stored in "inflection" table. Now when I understand the design better, I don't think that it is a good idea to have separate "word" for each inflection because it brings a lot of unnecessary redundancy, and much room for error. For example, it would be possible to mark "whiter" as an adverb and "white" as a verb! And then, imagine the horror which would ensue if someone would use wrong PartOfSpeech for base word and now it has to be changed for 100 inflections... Though this would be a crucial change, please think about it. I think that "word" table should contain only lemmas.

Right, well this is very much a design decision. The inflections will have to be entered by hand. And if some poor sod does enter all these inflections and they are wrong, there will be the need for another poor sod to remove them.

Well, I see it as a bad design decision. First, the inflections don't have to be entered by hand. If a word is not irregular, the inflections could, and should, be entered automatically. Second, I don't understand this boasting of a flaw. If a problem with database structure is noticed, it should be solved. At the very very least it should be concluded that the problem can't be solved. Instead you are telling me that users will have to work around the problem. I knew that already, but do you see a solution?

...

relevant for the inflections and the headword. A Wordtype indicates a noun, a verb, an adjective etc.

I still don't understand why is gender singled out of all properties a word could have. For example, a verb could be transitive or intransitive, and this information is important. To give an example:

[...]

At this moment in time I would not have intransitive verbs or transitive verbs at all. To me they are verbs. When they are transitive, they have a different meaning from when they are intransitive, so to me the distinction is in the meaning.

OK, for a better example, why not number? Perhaps transitivity doesn't, but number also affects inflection, much as gender does.

When it comes to meaning, all the inflections can share the same meaning. The number (first, second, third person) will be implied by the Inflection in the table Inflection-Word. (at this moment it still says Conjugation in this table)

By number I meant singular/plural. But regardless, why then wouldn't gender be specified in inflection-word?

...

I was thinking about something else; for example, on http://en.wiktionary.org/wiki/account there is this example: "A beggarly account of empty boxes. - Shakespeare, Romeo and Juliet, V-i"; but I understand now it is going to be just a part of "meaningtext". I'm not so certain, but maybe it would be good to create a separate table for examples, because same examples could (and probably will) be used in "meaningtext"s in different languages. It would also make it easier to automatically add new examples (for example, by grepping Project Gutenberg ;)

"A beggarly account of empty boxes" is a quote and why not have it as a seperate Word and marked as such ?? It would be a idiom for "account" and this is linked through Relation. Many famous quotes have been translated and we could have them all. (Een paard , een paard, een koninkrijk voor een paard)

Because, ideally each word (in each language) should have an example or two, and so the number of examples would approach the number of words; and, it would become impossible to distinguish between notable quotes (Kingdom for a horse!), which occur frequently, need a description, and need to be canonically translated, and non-notable quotes, which are in the wiktionary only to be used as examples of use for other words, need not have a description, and translators won't encounter them at all.

The idioms, proverbs and quotes will be "Word" records in their own right. So we have to be selective in the idiom that we choose. What is new ?? That is what the editorial process is for. For instance for the Dutch French speaking people the phrase "Papa fume un pipe" is famous and as such it is noteworthy but its significance will bewilder the French.. :)

The problem is, for the vast majority of words we will have to choose a non-notable quote as an example. Maybe we don't understand each other: maybe this isn't the case with other languages, but in a dictionary of Serbian that I have, *EACH* word has at least one, usually two, sometimes even more examples, from common words like "what" to rare and complex words. At least for Serbian and other languages with the same lexicographic tradition we will want to do the same in the Wiktionary.

...

I do not think grepping Project Gutenberg makes much sense. If anything it helps you find occurrences of the word, but you have to be selective about what to include. That is an editorial process, and the mere fact that a word is used does not make for a good idiom in the UW.

I think it would make sense for rarer words, which might occur once or a few times in the entire Gutenberg corpus. Of course, at the end a human editor has to decide whether a quote is really relevant. Related to grepping Project Gutenberg, have you considered adding information on word frequency? Only a single new table is needed, "frequency", with fields "spellingID", "corpus" and "frequency"; eventually "corpus" could become "corpusID".

There is more to frequency than that. If anything grep may find it but you still need to know the meaning of the word in that text. When the

This is why "frequency" is related to "spelling" and not to "meaning". Change of meaning is not the only useful thing which could be gathered from a frequency analysis.

...

word gets a new meaning, that is what you want to know. I will speak to the people of Rotterdam CS (developers of Lucene) about just these kinds of issues.

A corpus could (would) be as small as a single text, usually a book. So, you would be able to extract frequency in any desired timespan, or observe how it changes over time.
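The "frequency" table proposed in this exchange, with "corpus" already promoted to "corpusID", might look roughly like the sketch below. It uses Python's built-in sqlite3 module. The field names spellingID, corpusID and frequency come from the message; the "corpus" table layout, its "year" column and all counts are made-up placeholders for illustration, not real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE corpus (
    corpusID INTEGER PRIMARY KEY,
    title    TEXT NOT NULL,     -- hypothetical column
    year     INTEGER NOT NULL   -- hypothetical column: publication year
);
CREATE TABLE frequency (
    spellingID INTEGER NOT NULL,
    corpusID   INTEGER NOT NULL REFERENCES corpus(corpusID),
    frequency  INTEGER NOT NULL,
    PRIMARY KEY (spellingID, corpusID)
);
""")

# Placeholder rows: three single-book corpora and one spelling (ID 42).
conn.executemany("INSERT INTO corpus VALUES (?, ?, ?)",
                 [(1, "Book A", 1850), (2, "Book B", 1890), (3, "Book C", 1950)])
conn.executemany("INSERT INTO frequency VALUES (?, ?, ?)",
                 [(42, 1, 3), (42, 2, 5), (42, 3, 1)])

def frequency_in_span(spelling_id, first_year, last_year):
    """Sum the occurrences of a spelling over all corpora dated in the span."""
    row = conn.execute("""
        SELECT COALESCE(SUM(f.frequency), 0)
        FROM frequency f JOIN corpus c ON c.corpusID = f.corpusID
        WHERE f.spellingID = ? AND c.year BETWEEN ? AND ?
    """, (spelling_id, first_year, last_year)).fetchone()
    return row[0]
```

Because each corpus can be as small as one dated book, frequency over any timespan falls out of a simple join, and the table stays tied to spellings rather than meanings.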

...

I agree, it is not a dialect, but if some words are recognisable as belonging to a distinctive group of words, they should somehow be marked as belonging to it, and I was suggesting that they are marked in a same way they would be marked as belonging to a certain dialect. Another solution would be to use "wordrelation" table instead, even though it isn't meant to be used in that way :)

Collection is the mechanism of choice for this. Relation is to indicate thesaurus-like structures, including antonyms.

Wait, "collection" is related to "meaning" and not to "word". I don't see how it could be used for such things. It would be possible to have names of Asterix characters/Disney characters/whatever grouped together, and that is good. But it still isn't possible to distinguish between two groups of translations of names of Asterix characters. It would be possible to have all words related to seamanship grouped together, but it would not be possible to mark which of these are dockworkers' slang, which are sailors' slang, and which are not slang.
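The consistency requirement argued for in this thread can be illustrated with a small Python sketch. The four Serbian names are the ones given earlier in the thread; the set labels "translation A" and "translation B" and the function name are hypothetical, and nothing here reflects how UW collections are actually stored.

```python
# Word-level grouping: each set records which translated names belong together.
TRANSLATION_SETS = {
    "translation A": {"Idefix": "Garoviks", "Panoramix": "Aspiriniks"},
    "translation B": {"Idefix": "Idefiks", "Panoramix": "Panoramiks"},
}

def consistent_sets(names_used):
    """Return the labels of the sets that contain every name used in a text."""
    used = set(names_used)
    return [label for label, mapping in TRANSLATION_SETS.items()
            if used <= set(mapping.values())]
```

A text that mixes "Idefiks" with "Aspiriniks" matches no set, which is the kind of check that needs the grouping to be attached to words rather than to meanings.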


Lars Aronsson

29 Jul

8:40 p.m.

Gerard Meijssen wrote:

...

Because it is important to know for a noun what its gender is.

Sorry if I've not been following UW discussions, but the gender of words can change, and has changed, over time. How will UW address this? Can I, for example, indicate that a certain word had a male gender in the 17th, 18th, 19th century, but neuter in the 20th? And that in 1890-1920 the percentage of people who used male or neuter gradually shifted?

In a Wikipedia or Wiktionary article, these complex relations and exceptional cases can be described in plain text, as I did in the previous paragraph. But how do you express them in a relational database schema? And what if you discover such complex relations as the project develops, what is the UW strategy for modifying the schema over time? Right now you seem to be designing an "ultimate" schema that will then be frozen and kept static for all time. The very name of the UW project suggests this kind of thinking, and to me that is about as foreign as Marxism.

-- Lars Aronsson (lars(a)aronsson.se) Aronsson Datateknik - http://aronsson.se


Tomer Chachamu

9:42 p.m.

On 27/07/05, Gerard Meijssen <gerard.meijssen(a)gmail.com> wrote:

...

First of all, if creating inflections is done programmatically, it is not part of the database design. The database design says that there will be a record for each inflection. The inflections are translated as every other word is; there is a Spelling for each. This means that these words have an importance in their own right, and that is more than just the sharing of the meaning with a headword. So I do not share your argument at all. Yes, we can generate inflections, but this WILL result in new Spelling - Word - Meaning records. And as long as we do not have software to do this for us, we will have to do it by hand.

Perhaps this software can simply be awk, sed or one of those things? And perhaps we can call for people to start writing them? :)
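As a rough idea of what such a script might look like, here is a sketch in Python rather than awk or sed. It is a deliberately naive rule for regular English adjective comparison, using the "white"/"whiter" example from earlier in the thread; it ignores consonant doubling, "-y" adjectives and irregular forms, and the function names and row layout are hypothetical, not part of the UW design.

```python
def regular_comparison(lemma):
    """Return {inflection name: spelling} for a regular English adjective.

    Naive rule: drop a final "e" before adding "-er" / "-est".
    """
    stem = lemma[:-1] if lemma.endswith("e") else lemma
    return {
        "positive": lemma,
        "comparative": stem + "er",
        "superlative": stem + "est",
    }

def inflection_word_rows(lemma):
    """One (lemma, inflection, spelling) row per generated form.

    Each spelling would still become its own record, as described above;
    the rows only record how the forms relate to the lemma.
    """
    return [(lemma, name, spelling)
            for name, spelling in regular_comparison(lemma).items()]
```

For example, `inflection_word_rows("white")` yields rows for "white", "whiter" and "whitest"; a human editor would still need to review the output and handle the irregular words by hand.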


Tomer Chachamu

11:09 p.m.

On 30/07/05, Lars Aronsson <lars(a)aronsson.se> wrote:

...

Oops, Hungarian has two kinds of plural, let's change the schema. Oops, Thai has genders for verbs, let's change the schema. Oops... The axiom that it "can't be planned for" sounds like a recipe for failure.

Those things will not need a change in the schema.

...

Are there indeed any non-free (commercial or research) projects that have attempted anything like this? Wikipedia has many precursors such as Britannica, Brockhaus, etc. And so does Wiktionary, of course. But which precursors does UW have? What kind of data model or database schema do they apply? Or is UW a piece of original research in computational linguistics?

Something was mentioned on Slashdot which just wanted to be a cross-language dictionary (i.e. not providing definitions as such) but I doubt they got off the ground. UW puts a lot more in the database anyway.


Lars Aronsson

4:10 p.m.

New subject: UW database stuff (nothing to do with Afrikaans)

Gerard Meijssen wrote:

...

As the Dutch spelling will change in August 2006, we added things like spelling authorities.

Great, then you can introduce a new authority for every century of a language, so the 19th century Swedish can be told apart from 20th century Swedish. But how are spelling authorities different from languages? Couldn't "19th century Swedish" and "Dutch after 2006" be treated as languages of their own?

...

So the design will be set for some time allowing us to learn what it is we have. It will be changed when we know what to change and why. The requirements for content will be minimal and improvements will happen because of a collective effort.

Have you started to populate the database yet? What cycle span do you plan for between evaluations and redesigns? -- Lars Aronsson (lars(a)aronsson.se) Aronsson Datateknik - http://aronsson.se


Andrew Dunbar

31 Jul

8:24 p.m.

New subject: UW database stuff (nothing to do with Afrikaans)

What is the solution for varied spellings which do not depend on an authority such as color vs. colour in English? What about when certain aspects of spelling are independent? colourise, colourize, colorize all exist in English, colorise seems not to exist. What about the other orthographical variations I mentioned in an earlier post such as ASCII vs curved apostrophe in most languages or Hebrew and Arabic with varying degrees of pointing?

Andrew Dunbar (hippietrail)

On 7/30/05, Gerard Meijssen <gerard.meijssen(a)gmail.com> wrote:

...

Lars Aronsson wrote:

Gerard Meijssen wrote:

As the Dutch spelling will change in August 2006, we added things like spelling authorities.

Great, then you can introduce a new authority for every century of a language, so the 19th century Swedish can be told apart from 20th century Swedish. But how are spelling authorities different from languages? Couldn't "19th century Swedish" and "Dutch after 2006" be treated as languages of their own?

No, an authority is just that. When a Spelling is deprecated, you change the record in ValidSpelling. The reason for this extra table is that the word "paardenbloem" will be deprecated and the older spelling "paardebloem" will become valid again. The Spelling does carry a date, but that one records when it was introduced. A spelling authority is an organisation that decides on a specific spelling. The NTU is such an organisation for the Dutch language; the change in 2006 follows the change of 1996. As I said in a previous mail, this was created because I have to accommodate this change in a contemporary language. When 19th century Swedish changed to a more modern version, the old spelling can be deprecated by adding the ValidUntil in a ValidSpelling record. The old Spelling and the new Spelling are related by a Relation record.
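A minimal sketch of how the ValidSpelling mechanism described here could work, using Python's built-in sqlite3 module. The names Spelling, ValidSpelling and ValidUntil, the authority "NTU" and the two Dutch spellings come from the message; the remaining columns (SpellingID, Form, Authority, ValidFrom) are assumptions, and the day-level dates are placeholders, since the thread only gives 1996 and August 2006.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Spelling (
    SpellingID INTEGER PRIMARY KEY,
    Form       TEXT NOT NULL           -- hypothetical column name
);
CREATE TABLE ValidSpelling (
    SpellingID INTEGER NOT NULL REFERENCES Spelling(SpellingID),
    Authority  TEXT NOT NULL,          -- assumed column
    ValidFrom  TEXT,                   -- assumed column; NULL = no known start
    ValidUntil TEXT                    -- NULL = still valid
);
""")
conn.executemany("INSERT INTO Spelling VALUES (?, ?)",
                 [(1, "paardebloem"), (2, "paardenbloem")])
# Placeholder dates: only the years 1996 and 2006 are taken from the thread.
conn.executemany("INSERT INTO ValidSpelling VALUES (?, ?, ?, ?)", [
    (1, "NTU", None, "1996-01-01"),          # old spelling, deprecated in 1996
    (2, "NTU", "1996-01-01", "2006-08-01"),  # 1996 spelling, deprecated in 2006
    (1, "NTU", "2006-08-01", None),          # old spelling valid again
])

def valid_spellings(on_date):
    """Return the spellings an authority accepts on an ISO date (YYYY-MM-DD)."""
    rows = conn.execute("""
        SELECT s.Form
        FROM Spelling s JOIN ValidSpelling v ON v.SpellingID = s.SpellingID
        WHERE (v.ValidFrom IS NULL OR v.ValidFrom <= ?)
          AND (v.ValidUntil IS NULL OR v.ValidUntil > ?)
        ORDER BY s.Form
    """, (on_date, on_date)).fetchall()
    return [form for (form,) in rows]
```

Keeping validity in a separate table means a spelling can be deprecated and later reinstated by adding records, without touching the Spelling row itself.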

So the design will be set for some time allowing us to learn what it is we have. It will be changed when we know what to change and why. The requirements for content will be minimal and improvements will happen because of a collective effort.

Have you started to populate the database yet? What cycle span do you plan for between evaluations and redesigns?

At this moment I am working on the database design, Erik is working on Wikidata, and when he has finished his part we will work on the implementation as described on Meta. As to data, there are several sources that will be included in the Ultimate Wiktionary. The content of the Wiktionaries is important among these. We have a wordlist in Stellingwerfs of 18.865 words waiting in the wings, enough to rank it as the sixth biggest Wiktionary. I have been informed that another version will have some 20K+ articles. We have some 222.930 correctly spelled Dutch words that we may use (spelling with hyphenation only). We have the content of several glossaries and thesauri. So at this stage questions on the database stuff are very welcome. We do not expect to start including data until sometime in September. But when we do, it will be available for everybody to play with. Thanks, Gerard

-- http://linguaphile.sf.net
