A request by nl:Gebruiker:Jcwf http://nl.wiktionary.org/wiki/Gebruiker:Jcwf has been posted on the en:Beer parlour to have first letter case sensitivity on the inactive Afrikaans Wiktionary.
Ec
Ray Saintonge wrote:
A request by nl:Gebruiker:Jcwf http://nl.wiktionary.org/wiki/Gebruiker:Jcwf has been posted on the en:Beer parlour to have first letter case sensitivity on the inactive Afrikaans Wiktionary.
Ec
Please give some references on af.wikipedia.org with users supporting this change.
Thanks
Ashar Voultoiz wrote:
Ray Saintonge wrote:
A request by nl:Gebruiker:Jcwf http://nl.wiktionary.org/wiki/Gebruiker:Jcwf has been posted on the en:Beer parlour to have first letter case sensitivity on the inactive Afrikaans Wiktionary.
Please give some references on af.wikipedia.org with users supporting this change.
If that Wiktionary is inactive, then obviously there won't be such users. And that's all the more reason to change it now so that people don't start creating yet another broken Wiktionary and later require a complicated conversion like en.wiktionary did.
Timwi
On 7/15/05, Timwi timwi@gmx.net wrote:
If that Wiktionary is inactive
It has approximately 20,000 articles.
On 7/15/05, Ævar Arnfjörð Bjarmason avarab@gmail.com wrote:
On 7/15/05, Timwi timwi@gmx.net wrote:
If that Wiktionary is inactive
It has approximately 20,000 articles.
Dutch has almost 20,000, but Afrikaans, which is the one requesting case sensitivity changes has just 19 (just 19, not 19,000). Ashar's request was to ask af.wikipedia, not Wiktionary about this, and they have 4000 articles and a reasonably active, though small, community.
Angela.
On 7/15/05, Angela beesley@gmail.com wrote:
Dutch has almost 20,000, but Afrikaans, which is the one requesting case sensitivity changes has just 19
Righto, sorry about that.
Because they speak Afrikaans, and would be able to give the view of a native speaker.
Mark
On 15/07/05, Timwi timwi@gmx.net wrote:
Angela wrote:
Ashar's request was to ask af.wikipedia, not Wiktionary about this, and they have 4000 articles and a reasonably active, though small, community.
If none of them participates in Wiktionary, why does their opinion count any more than anyone else's?
Wikitech-l mailing list Wikitech-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l
Mark, you may know something about languages, but you have no clue about Wiktionary. And if you did, I would welcome your comments about the ERD that I posted here: http://commons.wikimedia.org/wiki/Image:ERD.jpg It is a work in progress and it is working towards the Ultimate Wiktionary. I can put dialects in there, but I am not yet happy about this aspect as I cannot truly enter simplified Chinese in there in a proper way. It also puts etymology in a different place from where it is traditionally placed in the Wiktionaries, because there are some that differ depending on the meaning of a word.
As to capitalisation: a paper dictionary does not capitalise its entries unless the words are capitalised as a rule. It is due to some unfortunate history that it took so long to change the English Wiktionary. There are currently 19 articles in the Afrikaans Wiktionary, and Jcwf may want to use the same system of templates that is used in many of the other Wiktionaries. To do this it helps to have capitalisation turned off.
I second his request to turn capitalisation on the af.wiktionary off.
Thanks, GerardM
Gerard Meijssen wrote:
As to capitalisation: a paper dictionary does not capitalise its entries unless the words are capitalised as a rule. It is due to some unfortunate history that it took so long to change the English Wiktionary. There are currently 19 articles in the Afrikaans Wiktionary, and Jcwf may want to use the same system of templates that is used in many of the other Wiktionaries. To do this it helps to have capitalisation turned off.
I second his request to turn capitalisation on the af.wiktionary off.
Personally, I was just passing on a request that appeared on the en:wiktionary Beer parlour. Naturally, I support it just as I supported the change in English. A more important and more general question might be what the default situation should be when a new Wiktionary is started. This will even be important for Wiktionaries in scripts where capitalization is unknown, since they too will include foreign words.
Ec
Hi,
Le Saturday 16 July 2005 15:19, Ray Saintonge a écrit :
Gerard Meijssen wrote:
As to capitalisation: a paper dictionary does not capitalise its entries unless the words are capitalised as a rule. It is due to some unfortunate history that it took so long to change the English Wiktionary. There are currently 19 articles in the Afrikaans Wiktionary, and Jcwf may want to use the same system of templates that is used in many of the other Wiktionaries. To do this it helps to have capitalisation turned off.
I second his request to turn capitalisation on the af.wiktionary off.
Personally, I was just passing on a request that appeared on the en:wiktionary Beer parlour. Naturally, I support it just as I supported the change in English. A more important and more general question might be what the default situation should be when a new Wiktionary is started. This will even be important for Wiktionaries in scripts where capitalization is unknown, since they too will include foreign words.
Yes, I agree with that. All new Wiktionaries should have capitalization off by default.
Ec
Yann
Gerard Meijssen wrote:
I would welcome your comments about the ERD that I posted here http://commons.wikimedia.org/wiki/Image:ERD.jpg
Looks interesting, but is extremely bare. It would do well with a bit of documentation. For much of it, the purpose isn't entirely clear. I'm particularly confused as to why "Language", "Word" and "Meaning" are each duplicated.
Timwi wrote:
Gerard Meijssen wrote:
I would welcome your comments about the ERD that I posted here http://commons.wikimedia.org/wiki/Image:ERD.jpg
Looks interesting, but is extremely bare. It would do well with a bit of documentation. For much of it, the purpose isn't entirely clear. I'm particularly confused as to why "Language", "Word" and "Meaning" are each duplicated.
Hoi, There is some documentation here: http://meta.wikimedia.org/wiki/Ultimate_Wiktionary_data_design and here: http://meta.wikimedia.org/wiki/Ultimate_Wiktionary_decisions_on_its_usage The duplication reflects that there is at least one table that has two relations to the same table. Language refers to itself for dialects; Word refers through Conju/Decli (conjugation or declension) to a headword and derived words; Meaning is related through "Relations", which allows for thesaurus-like structures.
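As an illustration of the duplication described above, here is a minimal sketch in Python with SQLite. It is not the actual ERD: the table names, column names and sample rows are all assumptions made for this example. It only shows the two patterns mentioned, a table that refers to itself (language to parent language) and a link table with two relations into the same table (headword and inflected form).

```python
import sqlite3

# Hypothetical schema, for illustration only; none of these names come from the ERD.
SCHEMA = """
CREATE TABLE language (
    language_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    parent_id   INTEGER REFERENCES language(language_id)  -- NULL unless this is a dialect
);
CREATE TABLE word (
    word_id     INTEGER PRIMARY KEY,
    language_id INTEGER NOT NULL REFERENCES language(language_id),
    spelling    TEXT NOT NULL
);
-- Link table with two foreign keys into the same table (word -> word).
CREATE TABLE inflection (
    headword_id INTEGER NOT NULL REFERENCES word(word_id),
    form_id     INTEGER NOT NULL REFERENCES word(word_id),
    description TEXT NOT NULL
);
"""


def build_demo():
    """Create an in-memory database with a few made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO language VALUES (?, ?, ?)",
        [(1, "Limburgish", None), (2, "Valkenburgs", 1), (3, "English", None)],
    )
    conn.executemany(
        "INSERT INTO word VALUES (?, ?, ?)",
        [(1, 3, "white"), (2, 3, "whiter")],
    )
    conn.execute("INSERT INTO inflection VALUES (1, 2, 'comparative')")
    return conn


def dialects_of(conn, language_name):
    """Join the language table to itself to list the dialects of a language."""
    rows = conn.execute(
        "SELECT d.name FROM language d "
        "JOIN language p ON d.parent_id = p.language_id "
        "WHERE p.name = ? ORDER BY d.name",
        (language_name,),
    )
    return [row[0] for row in rows]


def headword_of(conn, spelling):
    """Join the word table twice to go from an inflected form to its headword."""
    row = conn.execute(
        "SELECT h.spelling FROM inflection i "
        "JOIN word f ON f.word_id = i.form_id "
        "JOIN word h ON h.word_id = i.headword_id "
        "WHERE f.spelling = ?",
        (spelling,),
    ).fetchone()
    return row[0] if row else None
```

In a diagram, each of these self-joins tends to be drawn with the same table appearing twice, which would explain why "Language", "Word" and "Meaning" look duplicated.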
One reason why it is not as much documented as I would like is, because I am still working on the structure. At this moment I am thinking hard on how to include signed languages and the spoken dialects of the Chinese and Arabic written language.
Thanks, GerardM
Hi Gerard,
Signed languages are completely independent systems, and are separate languages from spoken languages, different in grammar, syntax, and the like. Unfortunately, there is no universally-agreed-upon method for transcribing signed languages.
There are a few possibilities here.
1) Choose a particular transcription system. Top of the list would be Stokoe, and Sutton Sign Writing; HamNoSys is also possible but is mostly used by linguists while the other two are used more widely by people who use it as their everyday language.
2) Use multimedia. We can upload videos of people signing a particular word. Note that some signed languages also have conjugations and inflections. However, this will leave a problem of headwords -- how do you look up a word in a signed language? Which leads to the third option,
3) Introduce our own notation system. This is impractical and unlikely to work well. I suggest that instead, we adopt HamNoSys for lookup purposes, although it is not represented by Unicode, we can try an ASCII implementation.
Regarding "dialects" of Chinese and Arabic, that is very simple. Treat them as separate languages. While certainly most often people write in "Standard Arabic" or "Standard Chinese", it is also possible to write in the local vernacular. This tends to be done more with Arabic, but is possible with either. With Chinese, you only see it very often with Cantonese; other varieties are occasionally written, but you are more likely to find a Bible translation in them than a newspaper.
I hope very much that you will not restrict languages to those which appear on the ISO 639-3 list. It has many shortcomings and is very, very, very disappointing -- it would not allow for separate entries for Yavapai, Hualapai, and Havasupai (it has only one code for them all), even though they are very much different languages, and it by no means includes all the languages of the world. It also separates between Moroccan, Tunisian, and Algerian Arabic, when they're really nearly identical.
Mark
Mark Williamson wrote:
Hi Gerard,
Signed languages are completely independent systems, and are separate languages from spoken languages, different in grammar, syntax, and the like. Unfortunately, there is no universally-agreed-upon method for transcribing signed languages.
There are a few possibilities here.
- Choose a particular transcription system. Top of the list would be Stokoe, and Sutton Sign Writing; HamNoSys is also possible but is mostly used by linguists while the other two are used more widely by people who use it as their everyday language.
It being the Ultimate Wiktionary, I would prefer not to make a choice and have them all.
- Use multimedia. We can upload videos of people signing a particular word. Note that some signed languages also have conjugations and inflections. However, this will leave a problem of headwords -- how do you look up a word in a signed language? Which leads to the third option,
When you look at the ERD it is explicitly indicated by the Conju/Decli-Word table what the Headword is. This is done for any language and it means that it is by practice and not design that in most languages the infinitive will be the Headword. Consequently, when one word is found the others will also be shown, in a format that may need some screen design per type of conjugation or declension.
- Introduce our own notation system. This is impractical and unlikely to work well. I suggest that instead, we adopt HamNoSys for lookup purposes, although it is not represented by Unicode, we can try an ASCII implementation.
Introducing our own notation system is not an option. To a very large extent I want deaf people involved and they can sort it out. The only involvement I may have is to see if it is possible to implement it within the confines of UW.
Regarding "dialects" of Chinese and Arabic, that is very simple. Treat them as separate languages. While certainly most often people write in "Standard Arabic" or "Standard Chinese", it is also possible to write in the local vernacular. This tends to be done more with Arabic, but is possible with either. With Chinese, you only see it very often with Cantonese; other varieties are occasionally written, but you are more likely to find a Bible translation in them than a newspaper.
One way I am thinking is that there are often transcriptions of these dialects / languages. I am happy to have these as well, as long as I can quote an authority who did the transcribing. Bible translations are one of the most important resources for rare languages; we will at some stage have a lot of Bible texts that we will analyse for their content. It may help us create large translation memories and translation glossaries.
I hope very much that you will not restrict languages to those which appear on the ISO 639-3 list. It has many shortcomings and is very, very, very disappointing -- it would not allow for separate entries for Yavapai, Hualapai, and Havasupai (it has only one code for them all), even though they are very much different languages, and it by no means includes all the languages of the world. It also separates between Moroccan, Tunisian, and Algerian Arabic, when they're really nearly identical.
With the design of the UW I have no technical restriction on what languages and dialects I add. I would insert Valkenburgs (a Limburgian dialect) as readily as any other language. If there is an interest and if it is no original research (i.e. not a newly created language) I am happy to have it. When people can distinguish Moroccan from Tunisian etc. I am happy to include them, if only to have the different pronunciation. The only thing I need is to explicitly have an agreed code for the languages/dialects that the WMF agrees on. This code will be very much internal but it needs to exist and be agreed upon. I also am thinking of a project that should have us hear many dialects from the translation of one A4 written text. The thing is to find an interesting text, a short story preferably, and have it spoken and recorded in the dialects / languages that we are going to include.
Thanks Mark, Gerard
Hi guys. I'd just like to insert a few comments...
On 7/21/05, Gerard Meijssen gerard.meijssen@gmail.com wrote:
Mark Williamson wrote:
Hi Gerard,
When you look at the ERD it is explicitly indicated by the Conju/Decli-Word table what the Headword is. This is done for any
Why not call it the Inflection table since conjugation and declension are just types of inflection. It's a lot easier to pronounce.
language and it means that it is by practice and not design that in most languages the infinitive will be the Headword. Consequently, when one word is found the others will also be shown, in a format that may need some screen design per type of conjugation or declension.
Many languages do not have a concept of "infinitive". Many languages use some other form as the headword in dictionaries. Such forms are known as citation forms. The other extremely common form used by many languages as the verbal citation form is the 3rd person singular present indicative. If verbs have gender then it is the masculine form. I am not aware of another verb form used as the citation form but there's a good chance there are others in exotic languages. Also it's worth noting that a few languages have more than one infinitive form.
Hippietrail
On Thu, 21 Jul 2005 21:41:21 +1000, Andrew Dunbar hippytrail@gmail.com wrote:
Many languages do not have a concept of "infinitive". Many languages use some other form as the headword in dictionaries. Such forms are known as citation forms. The other extremely common form used by many languages as the verbal citation form is the 3rd person singular present indicative. If verbs have gender then it is the masculine form. I am not aware of another verb form used as the citation form but there's a good chance there are others in exotic languages.
FWIW, (Modern) Greek uses first person singular present indicative. I believe that Ancient Greek and Latin dictionaries often use the first person singular present indicative as the citation form as well, even though both languages have infinitives.
Arabic uses third person singular masculine *perfect* (usually past), rather than imperfect ("present"), as does Maltese by tradition, though I'm told that some Maltese consider the second person singular imperfect to be the preferred citation form.
--Ph.
On Wednesday 20 July 2005 21:29, Gerard Meijssen wrote:
Timwi wrote:
Gerard Meijssen wrote:
I would welcome your comments about the ERD that I posted here http://commons.wikimedia.org/wiki/Image:ERD.jpg
I haven't seen it before, and I was thinking about it on my own, so I'd like to comment on it :)
Looks interesting, but is extremely bare. It would do well with a bit of documentation. For much of it, the purpose isn't entirely clear. I'm particularly confused as to why "Language", "Word" and "Meaning" are each duplicated.
Hoi, There is some documentation here: http://meta.wikimedia.org/wiki/Ultimate_Wiktionary_data_design and here: http://meta.wikimedia.org/wiki/Ultimate_Wiktionary_decisions_on_its_usage The duplication reflects that there is at least one table that has two relations to the same table. Language refers to itself for dialects; Word refers through Conju/Decli (conjugation or declension) to a headword and derived words; Meaning is related through "Relations", which allows for thesaurus-like structures.
First, I'd suggest a cosmetic change, to use "script" instead of "charset" in table and column names. When I first saw it, I thought that you are referring to computer charsets (ISO 8859-1, Windows-1251...) and was wondering why wouldn't you simply use UTF-8 :)
Related to this, I'd suggest to add "ISO15924" column to "characterset" (or future "script") table. This way a script can be formally specified and looked up, regardless of its name.
Why is there column "gender" in table "word"? If a word can exist in multiple genders, shouldn't that rather be represented in "inflection" table? If a word has a gender on its own, wouldn't that rather be represented in WordType table? If not, there are other properties of words (for example, number) which can also be represented in "word" table, why is gender singled out?
I'd strongly suggest to add a column "inflection", to either "wordtype" or "word" table; this would specify which inflection does a word use, and whether it is regular or irregular. If it is known which inflexion a word uses and if it is regular, then all its inflected forms could be generated automatically.
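To make the suggestion above concrete, here is a rough sketch in Python of how a stored inflection class plus a regular/irregular flag could drive automatic generation of forms. The paradigm names ("en-adj-e", "en-noun-s"), the rule format and the override mechanism are all invented for this example; nothing here is taken from the ERD.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    label: str   # name of the inflected form, e.g. "comparative"
    strip: str   # ending removed from the headword before adding the suffix
    add: str     # suffix appended to the remaining stem


# Hypothetical inflection classes; a real system would store these per language.
PARADIGMS = {
    "en-adj-e": [Rule("comparative", "e", "er"), Rule("superlative", "e", "est")],
    "en-noun-s": [Rule("plural", "", "s")],
}


def inflect(headword, paradigm, regular=True, overrides=None):
    """Return a dict of inflected forms for a headword.

    Regular words get every form generated from the paradigm's rules.
    Irregular words (regular=False) get nothing generated and rely on
    explicitly entered forms passed in as overrides.
    """
    forms = {}
    if regular:
        for rule in PARADIGMS[paradigm]:
            if not headword.endswith(rule.strip):
                raise ValueError(f"{headword!r} does not fit paradigm {paradigm!r}")
            stem = headword[: len(headword) - len(rule.strip)]
            forms[rule.label] = stem + rule.add
    forms.update(overrides or {})
    return forms
```

For example, inflect("white", "en-adj-e") would generate "whiter" and "whitest", while an irregular adjective such as "good" would be stored with regular=False and its forms entered by hand.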
I see that there is column "languageid" in "meaningtext" table. If I understand this, it means that meaning of a word could be written down in various languages, and I second this. But I wonder how are you going to do the same for other data which might need to be translated (for example, the column "characterset" of the "characterset" table, which I understand is pure text)? Were you thinking about this?
Were you thinking about a way to register examples of use, similar to meaning? Or would examples of use be simply a raw text in "meaningtext" table?
One reason why it is not as much documented as I would like is, because I am still working on the structure. At this moment I am thinking hard on how to include signed languages and the spoken dialects of the Chinese and Arabic written language.
The last one is easy. Add ISO3166_1 and ISO3166_2 columns to language table. That way, you could formally represent a dialect within a certain country (as for Arabic) or even a region (as for Chinese). Perhaps even better solution would be to include a RegionID column which would point to table "regions" (relation 1-many), which would have RegionID, ISO3166_1 and ISO3166_2 columns; that way you could specify wider regions in which a dialect is spoken, even if they go over country boundaries.
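A minimal sketch of the RegionID idea, in Python with SQLite, might look as follows. The table layout follows the description above (a region table holding RegionID, ISO3166_1 and ISO3166_2, with several rows allowed per RegionID), but the exact names, the sample languages and the sample region rows are assumptions for illustration only.

```python
import sqlite3

# Hypothetical schema: one region_id groups any number of country/subdivision rows,
# so a single dialect can span several countries.
SCHEMA = """
CREATE TABLE region (
    region_id INTEGER NOT NULL,
    iso3166_1 TEXT NOT NULL,   -- country code
    iso3166_2 TEXT             -- subdivision code, NULL if the whole country is meant
);
CREATE TABLE language (
    language_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    region_id   INTEGER        -- matches one or more rows in region
);
"""


def build_demo():
    """Create an in-memory database with illustrative sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO region VALUES (?, ?, ?)",
        [(1, "NL", "NL-LI"), (1, "BE", "BE-VLI"), (2, "MA", None)],
    )
    conn.executemany(
        "INSERT INTO language VALUES (?, ?, ?)",
        [(1, "Limburgish", 1), (2, "Moroccan Arabic", 2)],
    )
    return conn


def countries_for(conn, language_name):
    """List the country codes of the region attached to a language."""
    rows = conn.execute(
        "SELECT DISTINCT r.iso3166_1 FROM language l "
        "JOIN region r ON r.region_id = l.region_id "
        "WHERE l.name = ? ORDER BY r.iso3166_1",
        (language_name,),
    )
    return [row[0] for row in rows]
```

Because region_id is not unique in the region table, it cannot be declared as a foreign key target here; a fuller design would probably add a separate table with one row per region.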
As a sidenote, I'd also suggest to rename ISO639-2 and ISO639-3 columns to ISO639_2 and ISO639_3, respectively. ISO639-2 might be interpreted as ISO639 minus 2 which can lead to all sorts of confusions.
Were you thinking about a way to formally define a dialect? Ideally, beside a region (by the way, ISO3166-2 is not granular enough, and there should be a better way of expressing region in which a dialect is spoken, perhaps even to the level of a village), there should be specified a time period in which a dialect was spoken, social layer which was speaking it, and perhaps even a particular person or entity using it. OK, I got carried away a bit but such things might be important :) Though some of this should perhaps be (also) tied to a word and not a dialect.
Nikola Smolenski wrote:
First, I'd suggest a cosmetic change, to use "script" instead of "charset" in table and column names. When I first saw it, I thought that you are referring to computer charsets (ISO 8859-1, Windows-1251...) and was wondering why wouldn't you simply use UTF-8 :)
Done
Related to this, I'd suggest to add "ISO15924" column to "characterset" (or future "script") table. This way a script can be formally specified and looked up, regardless of its name.
Done, however the name of the script is a record in the database in its own right so it may have as many translations as we care to enter. The code is just to anchor it. I understand from Erik's notes that a language can be indicated as the default value. The default value will be English. So, add a translation to the English word and from then on the User Interface will show it localised.
Why is there column "gender" in table "word"? If a word can exist in multiple genders, shouldn't that rather be represented in "inflection" table? If a word has a gender on its own, wouldn't that rather be represented in WordType table? If not, there are other properties of words (for example, number) which can also be represented in "word" table, why is gender singled out?
When a word is inflected to a particular form, that word is a word in its own right and consequently will be found in the UW. The inflection is there because it does provide information and this information is relevant for the inflections and the headword. A Wordtype indicates a noun, a verb, an adjective, etc.
I'd strongly suggest to add a column "inflection", to either "wordtype" or "word" table; this would specify which inflection does a word use, and whether it is regular or irregular. If it is known which inflexion a word uses and if it is regular, then all its inflected forms could be generated automatically.
I see that there is column "languageid" in "meaningtext" table. If I understand this, it means that meaning of a word could be written down in various languages, and I second this. But I wonder how are you going to do the same for other data which might need to be translated (for example, the column "characterset" of the "characterset" table, which I understand is pure text)? Were you thinking about this?
The fields "Sign" :) "Gender" "WordType" all relate to meaning; Ultimate Wiktionary will eat its own dogfood: when a translation of a word like "noun" is added, as I did for Afrikaans recently, that translation is the one that will be used in the User Interface.
Were you thinking about a way to register examples of use, similar to meaning? Or would examples of use be simply a raw text in "meaningtext" table?
Idioms and proverbs will be a "wordtype" as much as noun is. Therefore a proverb will relate to a keyword through WordRelation. I have updated the table Relation with a new field, "SameLanguageOnly"; this ensures that the relation is applicable only within the same language, so the relation would be "proverb" and it would combine "apple" with "an apple a day keeps the doctor away".
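As a small illustration of what a "SameLanguageOnly" flag could enforce, here is a sketch in Python. The class names, the language codes and the "translation" relation are assumptions added for the example; only the "proverb" relation and the apple example come from the message above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    spelling: str
    language: str  # hypothetical language code, e.g. "en"


@dataclass(frozen=True)
class Relation:
    name: str
    same_language_only: bool


# "proverb" must stay within one language; "translation" (an assumed
# counter-example) is allowed to cross languages.
PROVERB = Relation("proverb", same_language_only=True)
TRANSLATION = Relation("translation", same_language_only=False)


def relate(relation, source, target):
    """Return a (relation, source, target) triple, rejecting cross-language
    pairs when the relation is flagged as same-language-only."""
    if relation.same_language_only and source.language != target.language:
        raise ValueError(
            f"relation {relation.name!r} only applies within one language"
        )
    return (relation.name, source.spelling, target.spelling)
```

With this, relating "apple" to "an apple a day keeps the doctor away" as a proverb succeeds when both are English, and the same call with a word from another language is rejected.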
MeaningText would be just the definition of a meaning in a given language.
One reason why it is not as much documented as I would like is, because I am still working on the structure. At this moment I am thinking hard on how to include signed languages and the spoken dialects of the Chinese and Arabic written language.
The last one is easy. Add ISO3166_1 and ISO3166_2 columns to language table. That way, you could formally represent a dialect within a certain country (as for Arabic) or even a region (as for Chinese). Perhaps even better solution would be to include a RegionID column which would point to table "regions" (relation 1-many), which would have RegionID, ISO3166_1 and ISO3166_2 columns; that way you could specify wider regions in which a dialect is spoken, even if they go over country boundaries.
The country code is irrelevant as far as this database is concerned. This database is about words in languages and dialects.
As a sidenote, I'd also suggest to rename ISO639-2 and ISO639-3 columns to ISO639_2 and ISO639_3, respectively. ISO639-2 might be interpreted as ISO639 minus 2 which can lead to all sorts of confusions.
Done
Were you thinking about a way to formally define a dialect? Ideally, beside a region (by the way, ISO3166-2 is not granular enough, and there should be a better way of expressing region in which a dialect is spoken, perhaps even to the level of a village), there should be specified a time period in which a dialect was spoken, social layer which was speaking it, and perhaps even a particular person or entity using it. OK, I got carried away a bit but such things might be important :) Though some of this should perhaps be (also) tied to a word and not a dialect.
Ultimate Wiktionary is about words (written, spoken or signed); that is the starting point. There will be a need for some formality; once a dialect is recognised, it will be hard to take it away. Therefore, in my opinion, dialects will be added as such by an admin, after some discussion. I would think that we need to consider what it takes before we add a dialect. Tentatively I would go for at least 100 words defined as such. With a dialect I would assume that words that are not defined are those of the higher level language.
Things like where it is spoken and time periods all sound to me like etymological content. And that is where I would have it.
Thanks, GerardM
On 7/22/05, Gerard Meijssen gerard.meijssen@gmail.com wrote:
Idioms and proverbs will be a "wordtype" as much as noun is. Therefore a proverb will relate to a keyword through WordRelation. I have updated the table Relation with a new field, "SameLanguageOnly"; this ensures that the relation is applicable only within the same language, so the relation would be "proverb" and it would combine "apple" with "an apple a day keeps the doctor away".
In this case how will you prevent loss of which function an idiom has: "to eat one's hat" is a verb and an idiom. A "red herring" is an idiom and a noun.
Can you guys change the subject on this thread? af.wiktionary.org was changed to case-sensitive mode several days ago and none of this thread has been about it in ages. :D
-- brion vibber (brion @ pobox.com)
On Friday 22 July 2005 13:25, Gerard Meijssen wrote:
Nikola Smolenski wrote:
On Wednesday 20 July 2005 21:29, Gerard Meijssen wrote: Related to this, I'd suggest to add "ISO15924" column to "characterset" (or future "script") table. This way a script can be formally specified and looked up, regardless of its name.
Done, however the name of the script is a record in the database in its own right so it may have as many translations as we care to enter. The code is just to anchor it. I understand from Erik's notes that a language can be indicated as the default value. The default value will be English. So, add a translation to the English word and from then on the User Interface will show it localised.
I'll talk about this below.
Why is there column "gender" in table "word"? If a word can exist in multiple genders, shouldn't that rather be represented in "inflection" table? If a word has a gender on its own, wouldn't that rather be represented in WordType table? If not, there are other properties of words (for example, number) which can also be represented in "word" table, why is gender singled out?
When a word is inflected to a particular form, that word is a word in its own right and consequently will be found in the UW. The inflection is there because it does provide information and this information is
Now I'm not so sure that I understand which table is for what. Could you give an example? For example, the word "white" is a base word and the word "whiter" is its inflection. How would these two words fit into the database?
relevant for the inflections and the headword.. A Wordtype indicates a noun a verb an adjective etc.
I still don't understand why is gender singled out of all properties a word could have. For example, a verb could be transitive or intransitive, and this information is important. To give an example:
word: horse, gender: male, partofspeech: noun
word: ship, gender: female, partofspeech: noun
word: to drive, gender: none, partofspeech: transitive verb
word: to swim, gender: none, partofspeech: intransitive verb
See what I mean? If you have to specify transitivity of a verb in "partofspeech" table, you may as well specify gender of a noun in that table. It would be consistent either to remove "gender" column from "word" table:
word: horse, partofspeech: male noun
word: ship, partofspeech: female noun
word: to drive, partofspeech: transitive verb
word: to swim, partofspeech: intransitive verb
or to rename it to, for example, "subtype":
word: horse, subtype: male, partofspeech: noun
word: ship, subtype: female, partofspeech: noun
word: to drive, subtype: transitive, partofspeech: verb
word: to swim, subtype: intransitive, partofspeech: verb
If you are going to change this, I'd suggest the first solution. Firstly, because there may be words which would have more than one subtype; secondly, because it eliminates the possibility of having invalid mix of subtypes (horse: intransitive noun...).
I'd strongly suggest to add a column "inflection", to either "wordtype" or "word" table; this would specify which inflection does a word use, and whether it is regular or irregular. If it is known which inflexion a word uses and if it is regular, then all its inflected forms could be generated automatically.
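As a rough illustration of the suggestion above, here is a minimal Python sketch of generating regular forms from a named inflection pattern. The pattern name `adj_regular_e` and its suffix rules are hypothetical and are not part of the UW schema; irregular words would still have to be entered by hand.

```python
# Hypothetical paradigm table: paradigm name -> {inflection label: suffix}.
# The names and rules below are illustrative assumptions, not UW data.
PARADIGMS = {
    "adj_regular_e": {  # English adjectives ending in a silent "e"
        "positive": "",
        "comparative": "r",
        "superlative": "st",
    },
}


def generate_forms(lemma, paradigm, regular=True):
    """Return {inflection label: form} for a regular word, or {} if irregular."""
    if not regular:
        return {}  # irregular forms cannot be derived and must be entered by hand
    return {label: lemma + suffix for label, suffix in PARADIGMS[paradigm].items()}


forms = generate_forms("white", "adj_regular_e")
print(forms)
```

The point of the sketch is only that a word needs to carry two extra facts (which pattern, and whether it is regular) for its forms to be derivable.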
I see that there is column "languageid" in "meaningtext" table. If I understand this, it means that meaning of a word could be written down in various languages, and I second this. But I wonder how are you going to do the same for other data which might need to be translated (for example, column "characterset" of "characterset" table - I understand that this is pure text)? Were you thinking about this?
The fields "Sign" :) "Gender" "WordType" all relate to meaning; Ultimate Wiktionary will eat its own dogfood: when a translation of a word like "noun" is added, as I did for Afrikaans recently, that translation is the one that will be used in the User Interface.
OK, but what if you have a longer phrase as a table field? For example, an "inflection" in table "inflection" might be "male genitive superlative" or "3rd person plural female past". I don't think it makes sense to add such phrases to the dictionary as proper entries, only so that the dictionary would have translations of them.
Are some table fields inherently translatable? Is this what you had in mind above?
Were you thinking about a way to register examples of use, similar to meaning? Or would examples of use be simply a raw text in "meaningtext" table?
Idioms and proverbs will be a "wordtype" as much as noun is. Therefore a proverb will relate to a keyword through WordRelation. I have updated the table Relation with a new field "SameLanguageOnly"; this ensures that the relation is applicable within the same language, so the relation would be "proverb" and it would combine "apple" with "an apple a day keeps the doctor away".
MeaningText would be just the definition of a meaning in a given language.
I was thinking about something else; for example, on http://en.wiktionary.org/wiki/account there is this example: "A beggarly account of empty boxes. - Shakespeare, Romeo and Juliet, V-i"; but I understand now it is going to be just a part of "meaningtext". I'm not so certain, but maybe it would be good to create a separate table for examples, because same examples could (and probably will) be used in "meaningtext"s in different languages. It would also make it easier to automatically add new examples (for example, by grepping Project Gutenberg ;)
One reason why it is not as much documented as I would like is, because I am still working on the structure. At this moment I am thinking hard on how to include signed languages and the spoken dialects of the Chinese and Arabic written language.
The last one is easy. Add ISO3166_1 and ISO3166_2 columns to language table. That way, you could formally represent a dialect within a certain country (as for Arabic) or even a region (as for Chinese). Perhaps even better solution would be to include a RegionID column which would point to table "regions" (relation 1-many), which would have RegionID, ISO3166_1 and ISO3166_2 columns; that way you could specify wider regions in which a dialect is spoken, even if they go over country boundaries.
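To make the RegionID idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names follow the wording above, but the exact layout, the sample dialect and its country rows are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE language (
    LanguageID INTEGER PRIMARY KEY,
    Name       TEXT NOT NULL,
    ISO639_2   TEXT,               -- optional standard code
    RegionID   INTEGER             -- points to one or more rows in "regions"
);
CREATE TABLE regions (
    RegionID   INTEGER NOT NULL,   -- several rows may share one RegionID
    ISO3166_1  TEXT NOT NULL,      -- country code
    ISO3166_2  TEXT                -- optional subdivision code
);
""")

# Illustrative sample data: one dialect whose region spans three countries.
conn.execute("INSERT INTO language VALUES (1, 'Levantine Arabic', 'ara', 10)")
conn.executemany(
    "INSERT INTO regions VALUES (10, ?, NULL)",
    [("SY",), ("LB",), ("JO",)],
)

countries = [
    row[0]
    for row in conn.execute(
        """SELECT r.ISO3166_1
           FROM language l JOIN regions r ON r.RegionID = l.RegionID
           WHERE l.Name = ?
           ORDER BY r.ISO3166_1""",
        ("Levantine Arabic",),
    )
]
print(countries)
```

Because several rows in "regions" can share one RegionID, a dialect is not tied to a single country boundary.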
The country code is irrelevant as far as this database is concerned. This database is about words in languages and dialects.
I still think that this would be a useful way of formally specifying a dialect. For example, British English would have ISO639_2 code "en" and ISO3166_1 code "uk" while Australian English would have ISO639_2 code "en" but ISO3166_1 code "au".
Were you thinking about a way to formally define a dialect? Ideally, beside a region (by the way, ISO3166-2 is not granular enough, and there should be a better way of expressing region in which a dialect is spoken, perhaps even to the level of a village), there should be specified a time period in which a dialect was spoken, social layer which was speaking it, and perhaps even a particular person or entity using it. OK, I got carried away a bit but such things might be important :) Though some of this should perhaps be (also) tied to a word and not a dialect.
Ultimate wiktionary is about words (written, spoken or signed) that is the starting point. There will be a need for some formality; once a dialect is recognised, it will be hard to take it away. Therefore in my opinion it will be after some discussion. They will be added as such by an admin. I would think that we need to consider what it takes before we add a dialect. Tentatively I would go for at least 100 words defined as such. With a dialect I would assume that words that are not defined are those of the higher level language.
When I was referring to "dialect", I did not have in mind a dialect that is officially recognised, but simply a set of words which could be identified as belonging to a certain group. So if you want to say that this word was part of London dockworkers' slang in 1800s, you should be able to do so, and not just stamp it with "British English".
As a simple example, in Serbia, there are several publishing houses that were publishing Asterix, and in some translations "Idefix" is named "Garoviks" and "Panoramix" is named "Aspiriniks" while in others "Idefix" is named "Idefiks" and "Panoramix" is named "Panoramiks"; and this is consistent. If you are going to translate something about Asterix to Serbian, you should pick one of the translations, but you should be consistent in using only the words from the translation which you have picked, and they should somehow be marked as belonging to the same translation. There surely are more important things than Asterix where similar might apply.
Nikola Smolenski wrote:
On Friday 22 July 2005 13:25, Gerard Meijssen wrote:
Nikola Smolenski wrote:
On Wednesday 20 July 2005 21:29, Gerard Meijssen wrote: Related to this, I'd suggest to add "ISO15924" column to "characterset" (or future "script") table. This way a script can be formally specified and looked up, regardless of its name.
Done, however the name of the script is a record in the database in its own right so it may have as many translations as we care to enter. The code is just to anchor it. I understand from Erik's notes that a language can be indicated as the default value. The default value will be English. So, add a translation to the English word and from then on the User Interface will show it localised.
I'll talk about this below.
Why is there column "gender" in table "word"? If a word can exist in multiple genders, shouldn't that rather be represented in "inflection" table? If a word has a gender on its own, wouldn't that rather be represented in WordType table? If not, there are other properties of words (for example, number) which can also be represented in "word" table, why is gender singled out?
When a word is inflected to a particular form, that word is a word in its own right and consequently will be found in the UW. The inflection is there because it does provide information and this information is
Now I'm not so sure that I understand which table is for what. Could you give an example? For example, the word "white" is a base word and the word "whiter" is its inflection. How would these two words fit into the database?
Both words will exist as a Spelling, as a Word and they may share a Meaning. When the inflections are added, in the Inflection-Word, all the missing words will be created and they will all be related to each other through this table. Contrary to a paper dictionary we want them all.
relevant for the inflections and the headword. A Wordtype indicates a noun, a verb, an adjective, etc.
I still don't understand why is gender singled out of all properties a word could have. For example, a verb could be transitive or intransitive, and this information is important. To give an example:
word: horse gender: male partofspeech: noun
word: ship gender: female partofspeech: noun
word: to drive gender: none partofspeech: transitive verb
word: to swim gender: none partofspeech: intransitive verb
See what I mean? If you have to specify transitivity of a verb in "partofspeech" table, you may as well specify gender of a noun in that table. It would be consistent either to remove "gender" column from "word" table:
word: horse partofspeech: male noun
word: ship partofspeech: female noun
word: to drive partofspeech: transitive verb
word: to swim partofspeech: intransitive verb
or to rename it to, for example, "subtype":
word: horse subtype: male partofspeech: noun
word: ship subtype: female partofspeech: noun
word: to drive subtype: transitive partofspeech: verb
word: to swim subtype: intransitive partofspeech: verb
If you are going to change this, I'd suggest the first solution. Firstly, because there may be words which would have more than one subtype; secondly, because it eliminates the possibility of having invalid mix of subtypes (horse: intransitive noun...).
At this moment in time I would not have intransitive verbs or transitive verbs at all. To me they are verbs. When they are transitive, they have a different meaning from when they are intransitive, so to me the distinction is in the meaning.
I'd strongly suggest to add a column "inflection", to either "wordtype" or "word" table; this would specify which inflection does a word use, and whether it is regular or irregular. If it is known which inflexion a word uses and if it is regular, then all its inflected forms could be generated automatically.
I see that there is column "languageid" in "meaningtext" table. If I understand this, it means that meaning of a word could be written down in various languages, and I second this. But I wonder how are you going to do the same for other data which might need to be translated (for example, column "characterset" of "characterset" table - I understand that this is pure text)? Were you thinking about this?
The fields "Sign" :) "Gender" "WordType" all relate to meaning; Ultimate Wiktionary will eat its own dogfood: when a translation of a word like "noun" is added, as I did for Afrikaans recently, that translation is the one that will be used in the User Interface.
OK, but what if you have a longer phrase as a table field? For example, an "inflection" in table "inflection" might be "male genitive superlative" or "3rd person plural female past". I don't think it makes sense to add such phrases to the dictionary as proper entries, only so that the dictionary would have translations of them.
Are some table fields inherently translatable? Is this what you had in mind above?
Most if not all text fields will be inherently translatable, this is what I have very much in mind. The name of a font will not be translated but that is the only one at this point in time. It makes perfect sense to have this in the UW as it allows us to have a self learning User Interface. The thing is; it has function.
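A minimal Python sketch of how such a self-learning interface could look up its labels, falling back to the English default when no translation has been entered yet. The in-memory dictionary, the function name `ui_label` and the sample entries are assumptions for illustration, not the UW implementation.

```python
# Hypothetical store of interface terms: (term, language code) -> label.
# In the UW these would be ordinary dictionary entries; here it is sample data.
TRANSLATIONS = {
    ("noun", "en"): "noun",
    ("noun", "nl"): "zelfstandig naamwoord",
    ("verb", "en"): "verb",  # no Dutch translation entered yet
}


def ui_label(term, lang, default="en"):
    """Return the label for `term` in `lang`, or the default-language label."""
    return TRANSLATIONS.get((term, lang), TRANSLATIONS[(term, default)])


print(ui_label("noun", "nl"))  # localised, because a translation exists
print(ui_label("verb", "nl"))  # falls back to the English default
```

As soon as someone adds the missing translation as a normal entry, the same lookup starts returning the localised label.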
Were you thinking about a way to register examples of use, similar to meaning? Or would examples of use be simply a raw text in "meaningtext" table?
Idioms and proverbs will be a "wordtype" as much as noun is. Therefore a proverb will relate to a keyword through WordRelation. I have updated the table Relation with a new field "SameLanguageOnly"; this ensures that the relation is applicable within the same language, so the relation would be "proverb" and it would combine "apple" with "an apple a day keeps the doctor away".
MeaningText would be just the definition of a meaning in a given language.
I was thinking about something else; for example, on http://en.wiktionary.org/wiki/account there is this example: "A beggarly account of empty boxes. - Shakespeare, Romeo and Juliet, V-i"; but I understand now it is going to be just a part of "meaningtext". I'm not so certain, but maybe it would be good to create a separate table for examples, because same examples could (and probably will) be used in "meaningtext"s in different languages. It would also make it easier to automatically add new examples (for example, by grepping Project Gutenberg ;)
"A beggarly account of empty boxes" is a quote and why not have it as a seperate Word and marked as such ?? It would be a idiom for "account" and this is linked through Relation. Many famous quotes have been translated and we could have them all. (Een paard , een paard, een koninkrijk voor een paard)
I do not think grepping Project Gutenberg makes much sense. If anything it helps you find occurrences of the word but you have to be selective of what to include. That is an editorial process and just the fact that a word is used does not make for a good idiom in the UW.
One reason why it is not as much documented as I would like is, because I am still working on the structure. At this moment I am thinking hard on how to include signed languages and the spoken dialects of the Chinese and Arabic written language.
The last one is easy. Add ISO3166_1 and ISO3166_2 columns to language table. That way, you could formally represent a dialect within a certain country (as for Arabic) or even a region (as for Chinese). Perhaps even better solution would be to include a RegionID column which would point to table "regions" (relation 1-many), which would have RegionID, ISO3166_1 and ISO3166_2 columns; that way you could specify wider regions in which a dialect is spoken, even if they go over country boundaries.
The country code is irrelevant as far as this database is concerned. This database is about words in languages and dialects.
I still think that this would be a useful way of formally specifying a dialect. For example, British English would have ISO639_2 code "en" and ISO3166_1 code "uk" while Australian English would have ISO639_2 code "en" but ISO3166_1 code "au".
Even the ISO-639 codes in the table are there to connect what we are doing in the Wikipedias and other projects. As it is a standard I added it but in the database the ISO 639 fields are not compulsory, the "WMF key" is. If we "need" these ISO639_2 codes, then we would adhere to the principle that a language is a dialect with an army. Have a look at http://nl.wiktionary.org/wiki/WikiWoordenboek:Lijst_van_messages#Schrijfwijz... and you will see how we do some of the uk and au stuff for you. This is however not a great example because it is a mix of different spelling but also vocabulary and scripts. As I was not content with this I came up with the current ERD.
Were you thinking about a way to formally define a dialect? Ideally, beside a region (by the way, ISO3166-2 is not granular enough, and there should be a better way of expressing region in which a dialect is spoken, perhaps even to the level of a village), there should be specified a time period in which a dialect was spoken, social layer which was speaking it, and perhaps even a particular person or entity using it. OK, I got carried away a bit but such things might be important :) Though some of this should perhaps be (also) tied to a word and not a dialect.
Ultimate wiktionary is about words (written, spoken or signed) that is the starting point. There will be a need for some formality; once a dialect is recognised, it will be hard to take it away. Therefore in my opinion it will be after some discussion. They will be added as such by an admin. I would think that we need to consider what it takes before we add a dialect. Tentatively I would go for at least 100 words defined as such. With a dialect I would assume that words that are not defined are those of the higher level language.
When I was referring to "dialect", I did not have in mind a dialect that is officially recognised, but simply a set of words which could be identified as belonging to a certain group. So if you want to say that this word was part of London dockworkers' slang in 1800s, you should be able to do so, and not just stamp it with "British English".
When there are words that are specific to London dockworkers in the 1800s, I would not call it a dialect because like many professions they have their own vocabulary. These I would mark within a collection as the bulk of what they say would be London English of the 1800s. Now there is one thing that is relevant, the UW wants all words of all languages but its primary purpose is to have the current vocabulary. So yes, these words exist and have their place but when they are not used anymore they should be marked as such.
As a simple example, in Serbia, there are several publishing houses that were publishing Asterix, and in some translations "Idefix" is named "Garoviks" and "Panoramix" is named "Aspiriniks" while in others "Idefix" is named "Idefiks" and "Panoramix" is named "Panoramiks"; and this is consistent. If you are going to translate something about Asterix to Serbian, you should pick one of the translations, but you should be consistent in using only the words from the translation which you have picked, and they should somehow be marked as belonging to the same translation. There surely are more important things than Asterix where similar might apply.
Garoviks and Idefiks are for the Serbian language synonyms and as such I do not have to choose because they are both correct. As a matter of interest you could explain things either in the etymology or in the meaning of the word.
I still think that this would be a useful way of formally specifying a dialect. For example, British English would have ISO639_2 code "en" and ISO3166_1 code "uk" while Australian English would have ISO639_2 code "en" but ISO3166_1 code "au".
Even the ISO-639 codes in the table are there to connect what we are doing in the Wikipedias and other projects. As it is a standard I added it but in the database the ISO 639 fields are not compulsory, the "WMF key" is. If we "need" these ISO639_2 codes, then we would adhere to the principle that a language is a dialect with an army. Have a look at http://nl.wiktionary.org/wiki/WikiWoordenboek:Lijst_van_messages#Schrijfwijz... and you will see how we do some of the uk and au stuff for you. This is however not a great example because it is a mix of different spelling but also vocabulary and scripts. As I was not content with this I came up with the current ERD.
Nikola, if you'll look up information on English dialects, you'll find that the division between British, Australian, American, etc. is all very exaggerated. The British, Australian, and American standard languages are all based on the same dialect.
For example, "to starve" is the same in British, Australian, and American standard English, while in Yorkshire dialect it's "to clem".
Similarly, "mouth", which is the same in British, Australian, and American English (as far as the standard languages go), is "flep" in Yorkshire. Incidentally, "flep" also refers to the lips.
This could perhaps be compared to Serbo-Croatian: Serbian, Croatian, and Bosnian are all based on Stovakian, and there's not much variance between them; the real variance in Southwest Slavic is between Stovakian, Cakavian, and Kajkavian. In this example, Australian, British, and American correspond to Serbian, Croatian, and Bosnian, while true dialects such as those of Yorkshire, Northumbria, or Liverpool correspond to Stovakian, Cakavian, and Kajkavian.
Mark
On Saturday 23 July 2005 11:42, Mark Williamson wrote:
I still think that this would be a useful way of formally specifying a dialect. For example, British English would have ISO639_2 code "en" and ISO3166_1 code "uk" while Australian English would have ISO639_2 code "en" but ISO3166_1 code "au".
Even the ISO-639 codes in the table are there to connect what we are doing in the Wikipedias and other projects. As it is a standard I added it but in the database the ISO 639 fields are not compulsory, the "WMF key" is. If we "need" these ISO639_2 codes, then we would adhere to the principle that a language is a dialect with an army. Have a look at http://nl.wiktionary.org/wiki/WikiWoordenboek:Lijst_van_messages#Schrijfwijzen_binnen_een_taal_2 and you will see how we do some of the uk and au stuff for you. This is however not a great example because it is a mix of different spelling but also vocabulary and scripts. As I was not content with this I came up with the current ERD.
Nikola, if you'll look up information on English dialects, you'll find that the division between British, Australian, American, etc. is all very exaggerated. The British, Australian, and American standard languages are all based on the same dialect.
I knew this (it's been told to me that there is much more difference in pronunciation than there is in different words), however, if there are words specific to one dialect, they should be marked as belonging to it.
For example, "to starve" is the same in British, Australian, and American standard English, while in Yorkshire dialect it's "to clem".
Similarly, "mouth", which is the same in British, Australian, and American English (as far as the standard languages go), is "flep" in Yorkshire. Incidentally, "flep" also refers to the lips.
I didn't know that, these are interesting examples :)
This could perhaps be compared to Serbo-Croatian: Serbian, Croatian, and Bosnian are all based on Stovakian, and there's not much variance
Stokavian.
between them; the real variance in Southwest Slavic is between Stovakian, Cakavian, and Kajkavian. In this example, Australian, British, and American correspond to Serbian, Croatian, and Bosnian, while true dialects such as those of Yorkshire, Northumbria, or Liverpool correspond to Stovakian, Cakavian, and Kajkavian.
Congratulations! :) You just made a huge cultural faux-pas, claiming that Serbo-Croatian is in fact Croatian :)
Better comparison would be: Australian/British/American correspond to Serbian/Croatian/Bosnian, while Yorkshirian/Northumbrian/Liverpoolian correspond to Vojvodinian/Slavonian/Herzegovinian (the latter are dialects of Stokavian).
On Saturday 23 July 2005 10:13, Gerard Meijssen wrote:
Nikola Smolenski wrote:
On Friday 22 July 2005 13:25, Gerard Meijssen wrote:
Nikola Smolenski wrote:
Why is there column "gender" in table "word"? If a word can exist in multiple genders, shouldn't that rather be represented in "inflection" table? If a word has a gender on its own, wouldn't that rather be represented in WordType table? If not, there are other properties of words (for example, number) which can also be represented in "word" table, why is gender singled out?
When a word is inflected to a particular form, that word is a word in its own right and consequently will be found in the UW. The inflection is there because it does provide information and this information is
Now I'm not so sure that I understand which table is for what. Could you give an example? For example, the word "white" is a base word and the word "whiter" is its inflection. How would these two words fit into the database?
Both words will exist as a Spelling, as a Word and they may share a Meaning. When the inflections are added, in the Inflection-Word, all the missing words will be created and they will all be related to each other through this table. Contrary to a paper dictionary we want them all.
Then I have misunderstood the database design :( I believed at first that inflections would be stored in "inflection" table. Now that I understand the design better, I don't think that it is a good idea to have separate "word" for each inflection because it brings a lot of unnecessary redundancy, and much room for error. For example, it would be possible to mark "whiter" as an adverb and "white" as a verb! And then, imagine the horror which would ensue if someone would use wrong PartOfSpeech for base word and now it has to be changed for 100 inflections...
Though this would be a crucial change, please think about it. I think that "word" table should contain only lemmas.
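To show what I mean, here is a minimal sketch using Python's built-in sqlite3 module. The table layout is only my assumption of how a lemma-only "word" table could look, and the column names are hypothetical. The part of speech is stored once on the lemma, so a wrong value is fixed with a single update no matter how many inflected forms exist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE word (
    WordID       INTEGER PRIMARY KEY,
    Lemma        TEXT NOT NULL,
    PartOfSpeech TEXT NOT NULL      -- stored once, on the lemma only
);
CREATE TABLE inflection (
    WordID INTEGER NOT NULL REFERENCES word(WordID),
    Label  TEXT NOT NULL,           -- e.g. "comparative"
    Form   TEXT NOT NULL            -- e.g. "whiter"
);
""")

# The lemma is entered with the wrong part of speech on purpose.
conn.execute("INSERT INTO word VALUES (1, 'white', 'verb')")
conn.executemany(
    "INSERT INTO inflection VALUES (1, ?, ?)",
    [("comparative", "whiter"), ("superlative", "whitest")],
)

# One update on the lemma corrects every inflected form at once.
conn.execute("UPDATE word SET PartOfSpeech = 'adjective' WHERE WordID = 1")

rows = conn.execute(
    """SELECT i.Form, w.PartOfSpeech
       FROM inflection i JOIN word w USING (WordID)
       ORDER BY i.Form"""
).fetchall()
print(rows)
```

With one "word" row per inflected form, the same correction would have to be repeated for every form.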
relevant for the inflections and the headword. A Wordtype indicates a noun, a verb, an adjective, etc.
I still don't understand why is gender singled out of all properties a word could have. For example, a verb could be transitive or intransitive, and this information is important. To give an example:
[...]
At this moment in time I would not have intransitive verbs or transitive verbs at all. To me they are verbs. When they are transitive, they have a different meaning from when they are intransitive, so to me the distinction is in the meaning.
OK, for a better example, why not number? Perhaps transitivity doesn't, but number also affects inflection, much as gender does.
OK, but what if you have a longer phrase as a table field? For example, an "inflection" in table "inflection" might be "male genitive superlative" or "3rd person plural female past". I don't think it makes sense to add such phrases to the dictionary as proper entries, only so that the dictionary would have translations of them.
Are some table fields inherently translatable? Is this what you had in mind above?
Most if not all text fields will be inherently translatable, this is what I have very much in mind. The name of a font will not be translated but that is the only one at this point in time. It makes perfect sense to have this in the UW as it allows us to have a self learning User Interface. The thing is; it has function.
OK, so this solves it :)
I was thinking about something else; for example, on http://en.wiktionary.org/wiki/account there is this example: "A beggarly account of empty boxes. - Shakespeare, Romeo and Juliet, V-i"; but I understand now it is going to be just a part of "meaningtext". I'm not so certain, but maybe it would be good to create a separate table for examples, because same examples could (and probably will) be used in "meaningtext"s in different languages. It would also make it easier to automatically add new examples (for example, by grepping Project Gutenberg ;)
"A beggarly account of empty boxes" is a quote and why not have it as a seperate Word and marked as such ?? It would be a idiom for "account" and this is linked through Relation. Many famous quotes have been translated and we could have them all. (Een paard , een paard, een koninkrijk voor een paard)
Because, ideally each word (in each language) should have an example or two, and so the number of examples would approach the number of words; and, it would become impossible to distinguish between notable quotes (Kingdom for a horse!), which occur frequently, need a description, and need to be canonically translated, and non-notable quotes, which are in the wiktionary only to be used as examples of use for other words, need not have a description, and translators won't encounter them at all.
I do not think grepping Project Gutenberg makes much sense. If anything it helps you find occurrences of the word but you have to be selective of what to include. That is an editorial process and just the fact that a word is used does not make for a good idiom in the UW.
I think it would make sense for rarer words, which might occur once or a few times in entire Gutenberg's corpus. Of course, at the end a human editor has to decide whether a quote is really relevant.
Related to grepping Project Gutenberg, have you considered adding information on word frequency? Only a single new table is needed, "frequency", with fields "spellingID", "corpus" and "frequency"; eventually "corpus" could become "corpusID".
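A minimal sketch of such a table, using Python's built-in sqlite3 module. The three "frequency" fields are the ones named above; the "spelling" table layout, the corpus name "sample" and the one-line text standing in for a real corpus are assumptions for illustration.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE spelling (
    spellingID INTEGER PRIMARY KEY,
    spelling   TEXT NOT NULL
);
CREATE TABLE frequency (
    spellingID INTEGER NOT NULL REFERENCES spelling(spellingID),
    corpus     TEXT NOT NULL,       -- could later become corpusID
    frequency  INTEGER NOT NULL,
    PRIMARY KEY (spellingID, corpus)
);
""")

# A tiny stand-in for a real corpus such as a Project Gutenberg text.
text = "a horse a horse my kingdom for a horse"
counts = Counter(text.split())

for spelling_id, (spelling, n) in enumerate(sorted(counts.items()), start=1):
    conn.execute("INSERT INTO spelling VALUES (?, ?)", (spelling_id, spelling))
    conn.execute("INSERT INTO frequency VALUES (?, 'sample', ?)", (spelling_id, n))

# Words occurring exactly once are the candidates an editor would review.
rare = [
    row[0]
    for row in conn.execute(
        """SELECT s.spelling
           FROM frequency f JOIN spelling s USING (spellingID)
           WHERE f.corpus = 'sample' AND f.frequency = 1
           ORDER BY s.spelling"""
    )
]
print(rare)
```

The query at the end only lists candidates; a human editor would still decide which quotes are worth including.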
Once the UW is up and running, how hard would it be to make such changes?
When I was referring to "dialect", I did not have in mind a dialect that is officially recognised, but simply a set of words which could be identified as belonging to a certain group. So if you want to say that this word was part of London dockworkers' slang in 1800s, you should be able to do so, and not just stamp it with "British English".
When there are words that are specific to London dockworkers in the 1800s, I would not call it a dialect because like many professions they have their own vocabulary. These I would mark within a collection as the bulk of what they say would be London English of the 1800s. Now there is
I agree, it is not a dialect, but if some words are recognisable as belonging to a distinctive group of words, they should somehow be marked as belonging to it, and I was suggesting that they are marked in a same way they would be marked as belonging to a certain dialect. Another solution would be to use "wordrelation" table instead, even though it isn't meant to be used in that way :)
one thing that is relevant, the UW wants all words of all languages but its primary purpose is to have the current vocabulary. So yes, these words exist and have their place but when they are not used anymore they should be marked as such.
Well, just replace 1800s with 2000s and you still have the same problem :)
As a simple example, in Serbia, there are several publishing houses that were publishing Asterix, and in some translations "Idefix" is named "Garoviks" and "Panoramix" is named "Aspiriniks" while in others "Idefix" is named "Idefiks" and "Panoramix" is named "Panoramiks"; and this is consistent. If you are going to translate something about Asterix to Serbian, you should pick one of the translations, but you should be consistent in using only the words from the translation which you have picked, and they should somehow be marked as belonging to the same translation. There surely are more important things than Asterix where similar might apply.
Garoviks and Idefiks are for the Serbian language synonyms and as such I do not have to choose because they are both correct. As a matter of interest you could explain things either in the etymology or in the meaning of the word.
They are synonyms, but they are stylistically marked: it would be wrong to translate Idefix first as Garoviks and later as Idefiks, or to consistently translate Idefix with Idefiks but Panoramix with Aspiriniks, much as it would be wrong to write "I recognise you recognized me"; a translator has to choose and make the choice consistent.
Unrelated to any of the above, could you move "word" table a bit to the right, because currently it is hard to see what is relation between "word", "spelling" and "etymology" tables, the lines overlap.
Nikola Smolenski wrote:
On Saturday 23 July 2005 10:13, Gerard Meijssen wrote:
Nikola Smolenski wrote:
On Friday 22 July 2005 13:25, Gerard Meijssen wrote:
Nikola Smolenski wrote:
Why is there column "gender" in table "word"? If a word can exist in multiple genders, shouldn't that rather be represented in "inflection" table? If a word has a gender on its own, wouldn't that rather be represented in WordType table? If not, there are other properties of words (for example, number) which can also be represented in "word" table, why is gender singled out?
When a word is inflected to a particular form, that word is a word in its own right and consequently will be found in the UW. The inflection is there because it does provide information and this information is
Now I'm not so sure that I understand which table is for what. Could you give an example? For example, the word "white" is a base word and the word "whiter" is its inflection. How would these two words fit into the database?
Both words will exist as a Spelling, as a Word and they may share a Meaning. When the inflections are added, in the Inflection-Word, all the missing words will be created and they will all be related to each other through this table. Contrary to a paper dictionary we want them all.
Then I have misunderstood the database design :( I believed at first that inflections would be stored in "inflection" table. Now when I understand the design better, I don't think that it is a good idea to have separate "word" for each inflection because it brings a lot of unnecessary redundancy, and much room for error. For example, it would be possible to mark "whiter" as an adverb and "white" as a verb! And then, imagine the horror which would ensue if someone would use wrong PartOfSpeech for base word and now it has to be changed for 100 inflections...
Though this would be a crucial change, please think about it. I think that "word" table should contain only lemmas.
Right, well this is very much a design decision. The inflections will have to be entered by hand. And if some poor sod does enter all these inflections and they are wrong, there will be the need for another poor sod to remove them.
relevant for the inflections and the headword.. A Wordtype indicates a noun a verb an adjective etc.
I still don't understand why is gender singled out of all properties a word could have. For example, a verb could be transitive or intransitive, and this information is important. To give an example:
[...]
At this moment in time I would not have intransitive verbs or transitive verbs at all. To me they are verbs. When they are transitive, they have a different meaning from when they are intransitive so to me the distinction is in the meaning.
OK, for a better example, why not number? Perhaps transitivity doesn't, but number also affects inflection, much as gender does.
When it comes to meaning, all the inflections can share the same meaning. The number (first, second, third person) will be implied by the Inflection in the table Inflection-Word. (at this moment it still says Conjugation in this table)
OK, but what if you have a longer phrase as a table field? For example, an "inflection" in table "inflection" might be "male genitive superlative" or "3rd person plural female past". I don't think it makes sense to add such phrases to the dictionary as proper entries, only so that the dictionary would have translations of them.
Are some table fields inherently translatable? Is this what you had in mind above?
Most if not all text fields will be inherently translatable, this is what I have very much in mind. The name of a font will not be translated but that is the only one at this point in time. It makes perfect sense to have this in the UW as it allows us to have a self learning User Interface. The thing is; it has function.
OK, so this solves it :)
I was thinking about something else; for example, on http://en.wiktionary.org/wiki/account there is this example: "A beggarly account of empty boxes. - Shakespeare, Romeo and Juliet, V-i"; but I understand now it is going to be just a part of "meaningtext". I'm not so certain, but maybe it would be good to create a separate table for examples, because same examples could (and probably will) be used in "meaningtext"s in different languages. It would also make it easier to automatically add new examples (for example, by grepping Project Gutenberg ;)
"A beggarly account of empty boxes" is a quote and why not have it as a separate Word and marked as such? It would be an idiom for "account" and this is linked through Relation. Many famous quotes have been translated and we could have them all. (Een paard, een paard, een koninkrijk voor een paard)
Because, ideally each word (in each language) should have an example or two, and so the number of examples would approach the number of words; and, it would become impossible to distinguish between notable quotes (Kingdom for a horse!), which occur frequently, need a description, and need to be canonically translated, and non-notable quotes, which are in the wiktionary only to be used as examples of use for other words, need not have a description, and translators won't encounter them at all.
The idioms, proverbs and quotes will be "Word" records in their own right. So we have to be selective in the idiom that we choose. What is new? That is what the editorial process is for. For instance for the Dutch French speaking people the phrase "Papa fume une pipe" is famous and as such it is noteworthy but its significance will bewilder the French. :)
I do not think grepping Project Gutenberg makes much sense. If anything it helps you find occurrences of the word but you have to be selective of what to include. That is an editorial process and just the fact that a word is used does not make for a good idiom in the UW.
I think it would make sense for rarer words, which might occur once or a few times in entire Gutenberg's corpus. Of course, at the end a human editor has to decide whether a quote is really relevant.
Related to grepping Project Gutenberg, have you considered adding information on word frequency? Only a single new table is needed, "frequency", with fields "spellingID", "corpus" and "frequency"; eventually "corpus" could become "corpusID".
There is more to frequency than that. If anything grep may find it but you still need to know the meaning of the word in that text. When the word gets a new meaning, that is what you want to know. I will speak to the people of Rotterdam CS (developers of Lucene) about just these kinds of issues.
Once the UW is up and running, how hard would it be to make such changes?
This is the time when it is easy to make fundamental changes to the design of the UW; it is still also the time to come with an alternative to the design I propose and, as you have noticed, I do change things when there is a good argument to do so. When the UW is live, changing the software will be more difficult.
When I was referring to "dialect", I did not have in mind a dialect that is officially recognised, but simply a set of words which could be identified as belonging to a certain group. So if you want to say that this word was part of London dockworkers' slang in 1800s, you should be able to do so, and not just stamp it with "British English".
When there are words that are specific to London dockworkers in the 1800s, I would not call it a dialect because like many professions they have their own vocabulary. These I would mark within a collection as the bulk of what they say would be London English of the 1800s. Now there is
I agree, it is not a dialect, but if some words are recognisable as belonging to a distinctive group of words, they should somehow be marked as belonging to it, and I was suggesting that they are marked in a same way they would be marked as belonging to a certain dialect. Another solution would be to use "wordrelation" table instead, even though it isn't meant to be used in that way :)
Collection is the mechanism of choice for this. Relation is to indicate thesaurus-like structures including antonyms.
one thing that is relevant, the UW wants all words of all languages but its primary purpose is to have the current vocabulary. So yes, these words exist and have their place but when they are not used anymore they should be marked as such.
Well, just replace 1800s with 2000s and you still have the same problem :)
These words are still welcome and the Collection is there for it.
As a simple example, in Serbia, there are several publishing houses that were publishing Asterix, and in some translations "Idefix" is named "Garoviks" and "Panoramix" is named "Aspiriniks" while in others "Idefix" is named "Idefiks" and "Panoramix" is named "Panoramiks"; and this is consistent. If you are going to translate something about Asterix to Serbian, you should pick one of the translations, but you should be consistent in using only the words from the translation which you have picked, and they should somehow be marked as belonging to the same translation. There surely are more important things than Asterix where similar might apply.
For the Serbian language, Garoviks and Idefiks are synonyms, and as such I do not have to choose because they are both correct. As a matter of interest you could explain things either in the etymology or in the meaning of the word.
They are synonyms, but they are stylistically marked: it would be wrong to translate Idefix first as Garoviks and later as Idefiks, or to consistently translate Idefix with Idefiks but Panoramix with Aspiriniks, much as it would be wrong to write "I recognise you recognized me"; a translator has to choose and make the choice consistent.
A translator has to make a consistent choice; Collections of translated names of Asterix characters can be used for that. We have the technology. :)
Unrelated to any of the above, could you move "word" table a bit to the right, because currently it is hard to see what is relation between "word", "spelling" and "etymology" tables, the lines overlap.
I did put the table Word out of whack to show its importance. I put Collection on the same level as Meaning because that one too is very important for several applications. Table is technically challenging and that is why it is also given some prominence.
Thanks, GerardM
On Monday 25 July 2005 11:04, Gerard Meijssen wrote:
Nikola Smolenski wrote:
On Saturday 23 July 2005 10:13, Gerard Meijssen wrote:
Nikola Smolenski wrote:
On Friday 22 July 2005 13:25, Gerard Meijssen wrote:
Nikola Smolenski wrote:
Then I have misunderstood the database design :( I believed at first that inflections would be stored in "inflection" table. Now when I understand the design better, I don't think that it is a good idea to have separate "word" for each inflection because it brings a lot of unnecessary redundancy, and much room for error. For example, it would be possible to mark "whiter" as an adverb and "white" as a verb! And then, imagine the horror which would ensue if someone would use wrong PartOfSpeech for base word and now it has to be changed for 100 inflections...
Though this would be a crucial change, please think about it. I think that "word" table should contain only lemmas.
Right, well this is very much a design decision. The inflections will have to be entered by hand. And if some poor sod does enter all these inflections and they are wrong, there will be the need for another poor sod to remove them.
Well, I see it as a bad design decision.
First, the inflections don't have to be entered by hand. If a word is not irregular, the inflections could, and should, be entered automatically.
Second, I don't understand this boasting of a flaw. If a problem with database structure is noticed, it should be solved. At the very very least it should be concluded that the problem can't be solved. Instead you are telling me that users will have to work around the problem. I knew that already, but do you see a solution?
relevant for the inflections and the headword.. A Wordtype indicates a noun a verb an adjective etc.
I still don't understand why is gender singled out of all properties a word could have. For example, a verb could be transitive or intransitive, and this information is important. To give an example:
[...]
At this moment in time I would not have intransitive verbs or transitive verbs at all. To me they are verbs. When they are transitive, they have a different meaning from when they are intransitive so to me the distinction is in the meaning.
OK, for a better example, why not number? Perhaps transitivity doesn't, but number also affects inflection, much as gender does.
When it comes to meaning, all the inflections can share the same meaning. The number (first, second, third person) will be implied by the Inflection in the table Inflection-Word. (at this moment it still says Conjugation in this table)
By number I meant singular/plural. But regardless, why then gender wouldn't be specified in inflection-word?
I was thinking about something else; for example, on http://en.wiktionary.org/wiki/account there is this example: "A beggarly account of empty boxes. - Shakespeare, Romeo and Juliet, V-i"; but I understand now it is going to be just a part of "meaningtext". I'm not so certain, but maybe it would be good to create a separate table for examples, because same examples could (and probably will) be used in "meaningtext"s in different languages. It would also make it easier to automatically add new examples (for example, by grepping Project Gutenberg ;)
"A beggarly account of empty boxes" is a quote and why not have it as a separate Word and marked as such? It would be an idiom for "account" and this is linked through Relation. Many famous quotes have been translated and we could have them all. (Een paard, een paard, een koninkrijk voor een paard)
Because, ideally each word (in each language) should have an example or two, and so the number of examples would approach the number of words; and, it would become impossible to distinguish between notable quotes (Kingdom for a horse!), which occur frequently, need a description, and need to be canonically translated, and non-notable quotes, which are in the wiktionary only to be used as examples of use for other words, need not have a description, and translators won't encounter them at all.
The idioms, proverbs and quotes will be "Word" records in their own right. So we have to be selective in the idiom that we choose. What is new? That is what the editorial process is for. For instance for the Dutch French speaking people the phrase "Papa fume une pipe" is famous and as such it is noteworthy but its significance will bewilder the French. :)
Problem is, for the vast majority of words we will have to choose a non-notable quote as an example.
Maybe we don't understand each other: maybe this isn't the case with other languages, but in a dictionary of Serbian that I have, *EACH* word has at least one, usually two, sometimes even more examples, from common words like "what" to rare and complex words. At least for Serbian and other languages with same lexicographic tradition we will want to do the same in the Wiktionary.
I do not think grepping Project Gutenberg makes much sense. If anything it helps you find occurrences of the word but you have to be selective of what to include. That is an editorial process and just the fact that a word is used does not make for a good idiom in the UW.
I think it would make sense for rarer words, which might occur once or a few times in entire Gutenberg's corpus. Of course, at the end a human editor has to decide whether a quote is really relevant.
Related to grepping Project Gutenberg, have you considered adding information on word frequency? Only a single new table is needed, "frequency", with fields "spellingID", "corpus" and "frequency"; eventually "corpus" could become "corpusID".
There is more to frequency than that. If anything grep may find it but you still need to know the meaning of the word in that text. When the
This is why "frequency" is related to "spelling" and not to "meaning". Change of meaning is not the only useful thing which could be gathered from a frequency analysis.
word gets a new meaning, that is what you want to know. I will speak to the people of Rotterdam CS (developers of Lucene) about just these kinds of issues.
A corpus could (would) be as small as a single text, usually a book. So, you would be able to extract frequency in any desired timespan, or observe how it changes over time.
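The proposed "frequency" table could be sketched roughly as follows. This is only an illustration of the idea, not part of the UW design: the "corpus" table, its "year" column, the helper frequency_between and all the numbers are assumptions made up for the example.

```python
import sqlite3

# Hypothetical sketch: one row per (spelling, corpus), where a corpus can be
# as small as a single book. Table and column names beyond "spellingID",
# "corpusID" and "frequency" are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE corpus (id INTEGER PRIMARY KEY, title TEXT NOT NULL, year INTEGER NOT NULL);
CREATE TABLE frequency (
    spellingID INTEGER NOT NULL,
    corpusID   INTEGER NOT NULL REFERENCES corpus(id),
    frequency  INTEGER NOT NULL,
    PRIMARY KEY (spellingID, corpusID)
);
""")

# Made-up data: three books and the counts of one spelling (id 42) in each.
conn.executemany("INSERT INTO corpus VALUES (?, ?, ?)",
                 [(1, "Book A", 1850), (2, "Book B", 1895), (3, "Book C", 1950)])
conn.executemany("INSERT INTO frequency VALUES (?, ?, ?)",
                 [(42, 1, 3), (42, 2, 5), (42, 3, 1)])

def frequency_between(spelling_id, first_year, last_year):
    """Total occurrences of a spelling in all corpora dated within the span."""
    row = conn.execute("""
        SELECT COALESCE(SUM(f.frequency), 0)
        FROM frequency f JOIN corpus c ON c.id = f.corpusID
        WHERE f.spellingID = ? AND c.year BETWEEN ? AND ?
    """, (spelling_id, first_year, last_year)).fetchone()
    return row[0]

nineteenth_century = frequency_between(42, 1800, 1899)
twentieth_century = frequency_between(42, 1900, 1999)
```

Because the counts hang on the spelling and the corpus only, nothing here needs to know which meaning a given occurrence had.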
I agree, it is not a dialect, but if some words are recognisable as belonging to a distinctive group of words, they should somehow be marked as belonging to it, and I was suggesting that they are marked in a same way they would be marked as belonging to a certain dialect. Another solution would be to use "wordrelation" table instead, even though it isn't meant to be used in that way :)
Collection is the mechanism of choice for this. Relation is to indicate thesaurus-like structures including antonyms.
Wait, "collection" is related to "meaning" and not to "word". I don't see how could it be used for such things. It would be possible to have names of Asterix characters/Disney characters/whatever grouped together, and that is good. But it still isn't possible to distinguish between two groups of translations of names of Asterix characters. It would be possible to have all words related to seamanship grouped together, but it would not be possible to mark which of these are dockworkers' slang, which are sailors' slang, and which are not slang.
Nikola Smolenski wrote:
On Monday 25 July 2005 11:04, Gerard Meijssen wrote:
Nikola Smolenski wrote:
On Saturday 23 July 2005 10:13, Gerard Meijssen wrote:
Nikola Smolenski wrote:
On Friday 22 July 2005 13:25, Gerard Meijssen wrote:
Nikola Smolenski wrote:
Then I have misunderstood the database design :( I believed at first that inflections would be stored in "inflection" table. Now when I understand the design better, I don't think that it is a good idea to have separate "word" for each inflection because it brings a lot of unnecessary redundancy, and much room for error. For example, it would be possible to mark "whiter" as an adverb and "white" as a verb! And then, imagine the horror which would ensue if someone would use wrong PartOfSpeech for base word and now it has to be changed for 100 inflections...
Though this would be a crucial change, please think about it. I think that "word" table should contain only lemmas.
Right, well this is very much a design decision. The inflections will have to be entered by hand. And if some poor sod does enter all these inflections and they are wrong, there will be the need for another poor sod to remove them.
Well, I see it as a bad design decision.
First, the inflections don't have to be entered by hand. If a word is not irregular, the inflections could, and should, be entered automatically.
Second, I don't understand this boasting of a flaw. If a problem with database structure is noticed, it should be solved. At the very very least it should be concluded that the problem can't be solved. Instead you are telling me that users will have to work around the problem. I knew that already, but do you see a solution?
First of all, if creating inflections is done programmatically, it is not part of the database design. The database design says that there will be a record for each inflection. The inflections are translated as every other word is, there is Spelling for it. This means that these words have an importance in their own right and that is more than just the sharing of the meaning with a headword. So I do not share your argument at all. Yes, we can generate inflections but this WILL result in new Spelling - Word - Meaning. And as long as we do not have software to do this for us, we will have to do it by hand.
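To make the "white"/"whiter" case concrete, here is a minimal sketch of a record for each inflection. The table and column names are simplified guesses and not the actual UW schema; in particular the "word_meaning" link table, the helper add_word and the meaning text are assumptions for illustration.

```python
import sqlite3

# Hypothetical, simplified tables: every inflected form gets its own
# Spelling and Word, and the forms are tied together via inflection_word.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE spelling (id INTEGER PRIMARY KEY, text TEXT NOT NULL, language TEXT NOT NULL);
CREATE TABLE word (
    id INTEGER PRIMARY KEY,
    spelling_id INTEGER NOT NULL REFERENCES spelling(id),
    word_type TEXT NOT NULL
);
CREATE TABLE meaning (id INTEGER PRIMARY KEY, text TEXT NOT NULL);
CREATE TABLE word_meaning (
    word_id INTEGER NOT NULL REFERENCES word(id),
    meaning_id INTEGER NOT NULL REFERENCES meaning(id)
);
CREATE TABLE inflection_word (
    base_word_id INTEGER NOT NULL REFERENCES word(id),
    inflected_word_id INTEGER NOT NULL REFERENCES word(id),
    inflection TEXT NOT NULL
);
""")

def add_word(text, language, word_type):
    """Create a Spelling and a Word for it; return the id of the new Word."""
    spelling_id = conn.execute(
        "INSERT INTO spelling (text, language) VALUES (?, ?)", (text, language)
    ).lastrowid
    return conn.execute(
        "INSERT INTO word (spelling_id, word_type) VALUES (?, ?)",
        (spelling_id, word_type),
    ).lastrowid

white = add_word("white", "en", "adjective")
whiter = add_word("whiter", "en", "adjective")

# Both words share one Meaning (the text is an illustrative placeholder).
meaning_id = conn.execute(
    "INSERT INTO meaning (text) VALUES (?)", ("having the colour of snow",)
).lastrowid
conn.executemany("INSERT INTO word_meaning VALUES (?, ?)",
                 [(white, meaning_id), (whiter, meaning_id)])
conn.execute("INSERT INTO inflection_word VALUES (?, ?, ?)",
             (white, whiter, "comparative"))

# All inflected forms recorded for "white".
inflections = conn.execute("""
    SELECT s.text, iw.inflection
    FROM inflection_word iw
    JOIN word w ON w.id = iw.inflected_word_id
    JOIN spelling s ON s.id = w.spelling_id
    WHERE iw.base_word_id = ?
""", (white,)).fetchall()
```

Note that in this sketch "word_type" is stored once per Word, so nothing stops the two rows from disagreeing; that is the redundancy objected to above.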
relevant for the inflections and the headword.. A Wordtype indicates a noun a verb an adjective etc.
I still don't understand why is gender singled out of all properties a word could have. For example, a verb could be transitive or intransitive, and this information is important. To give an example:
[...]
At this moment in time I would not have intransitive verbs or transitive verbs at all. To me they are verbs. When they are transitive, they have a different meaning from when they are intransitive so to me the distinction is in the meaning.
OK, for a better example, why not number? Perhaps transitivity doesn't, but number also affects inflection, much as gender does.
When it comes to meaning, all the inflections can share the same meaning. The number (first, second, third person) will be implied by the Inflection in the table Inflection-Word. (at this moment it still says Conjugation in this table)
By number I meant singular/plural. But regardless, why then gender wouldn't be specified in inflection-word?
Because it is important to know for a noun what its gender is. When you know "probleem" (neuter) you know by inference that the idiom "het probleem is groter dan ik dacht" ("the problem is bigger than I thought") is correct because the neuter implies "het". This is the base knowledge that may be expected or when we go the extra mile and you do not know about genders, you may be led to an article about a gender in a particular language.
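The inference from gender to article can be shown in a few lines. This is only a toy sketch: the two dictionaries and the function with_article are made up for the example, and real Dutch has more to it (plurals and diminutives, for instance).

```python
# Dutch definite article by grammatical gender (singular nouns only).
DEFINITE_ARTICLE = {"masculine": "de", "feminine": "de", "neuter": "het"}

# Hypothetical stand-in for the gender stored with a noun in the database.
NOUN_GENDER = {"probleem": "neuter"}

def with_article(noun):
    """Return the noun preceded by the definite article its gender implies."""
    return f"{DEFINITE_ARTICLE[NOUN_GENDER[noun]]} {noun}"

phrase = with_article("probleem")
```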
I was thinking about something else; for example, on http://en.wiktionary.org/wiki/account there is this example: "A beggarly account of empty boxes. - Shakespeare, Romeo and Juliet, V-i"; but I understand now it is going to be just a part of "meaningtext". I'm not so certain, but maybe it would be good to create a separate table for examples, because same examples could (and probably will) be used in "meaningtext"s in different languages. It would also make it easier to automatically add new examples (for example, by grepping Project Gutenberg ;)
"A beggarly account of empty boxes" is a quote and why not have it as a separate Word and marked as such? It would be an idiom for "account" and this is linked through Relation. Many famous quotes have been translated and we could have them all. (Een paard, een paard, een koninkrijk voor een paard)
Because, ideally each word (in each language) should have an example or two, and so the number of examples would approach the number of words; and, it would become impossible to distinguish between notable quotes (Kingdom for a horse!), which occur frequently, need a description, and need to be canonically translated, and non-notable quotes, which are in the wiktionary only to be used as examples of use for other words, need not have a description, and translators won't encounter them at all.
The idioms, proverbs and quotes will be "Word" records in their own right. So we have to be selective in the idiom that we choose. What is new? That is what the editorial process is for. For instance for the Dutch French speaking people the phrase "Papa fume une pipe" is famous and as such it is noteworthy but its significance will bewilder the French. :)
Problem is, for the vast majority of words we will have to choose a non-notable quote as an example.
Maybe we don't understand each other: maybe this isn't the case with other languages, but in a dictionary of Serbian that I have, *EACH* word has at least one, usually two, sometimes even more examples, from common words like "what" to rare and complex words. At least for Serbian and other languages with same lexicographic tradition we will want to do the same in the Wiktionary.
That is fine with me. Use notable quotes if possible and when you do not have them use non notable quotes.
I do not think grepping Project Gutenberg makes much sense. If anything it helps you find occurrences of the word but you have to be selective of what to include. That is an editorial process and just the fact that a word is used does not make for a good idiom in the UW.
I think it would make sense for rarer words, which might occur once or a few times in entire Gutenberg's corpus. Of course, at the end a human editor has to decide whether a quote is really relevant.
Related to grepping Project Gutenberg, have you considered adding information on word frequency? Only a single new table is needed, "frequency", with fields "spellingID", "corpus" and "frequency"; eventually "corpus" could become "corpusID".
There is more to frequency than that. If anything grep may find it but you still need to know the meaning of the word in that text. When the
This is why "frequency" is related to "spelling" and not to "meaning". Change of meaning is not the only useful thing which could be gathered from a frequency analysis.
word gets a new meaning, that is what you want to know. I will speak to the people of Rotterdam CS (developers of Lucene) about just these kinds of issues.
A corpus could (would) be as small as a single text, usually a book. So, you would be able to extract frequency in any desired timespan, or observe how it changes over time.
I agree, it is not a dialect, but if some words are recognisable as belonging to a distinctive group of words, they should somehow be marked as belonging to it, and I was suggesting that they are marked in a same way they would be marked as belonging to a certain dialect. Another solution would be to use "wordrelation" table instead, even though it isn't meant to be used in that way :)
Collection is the mechanism of choice for this. Relation is to indicate thesaurus-like structures including antonyms.
Wait, "collection" is related to "meaning" and not to "word". I don't see how could it be used for such things. It would be possible to have names of Asterix characters/Disney characters/whatever grouped together, and that is good. But it still isn't possible to distinguish between two groups of translations of names of Asterix characters. It would be possible to have all words related to seamanship grouped together, but it would not be possible to mark which of these are dockworkers' slang, which are sailors' slang, and which are not slang.
You can have multiple collections; the names of one translation tradition can be one collection, those of the other tradition another. The two words for Asterix can be used as synonyms.
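As a rough illustration of one collection per translation tradition, using the Serbian Asterix names mentioned earlier in the thread. The collection names and both helper functions are made up, and the sketch assumes a collection can hold words directly, which the design discussed here leaves open.

```python
# Hypothetical sketch: each collection maps the original names to the names
# used in one Serbian translation tradition.
collections = {
    "Asterix (sr), translation A": {"Idefix": "Garoviks", "Panoramix": "Aspiriniks"},
    "Asterix (sr), translation B": {"Idefix": "Idefiks", "Panoramix": "Panoramiks"},
}

def translate_names(names, collection_name):
    """Translate every name using one single collection."""
    mapping = collections[collection_name]
    return [mapping[name] for name in names]

def is_consistent(translated):
    """True if all translated names can be found in one and the same collection."""
    return any(set(translated) <= set(mapping.values())
               for mapping in collections.values())

tradition_a = translate_names(["Idefix", "Panoramix"], "Asterix (sr), translation A")
mixed_ok = is_consistent(["Idefiks", "Aspiriniks"])
```

Here "Garoviks" and "Idefiks" stay synonyms, while membership in a collection records which tradition each one belongs to.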
Thanks, GerardM
Gerard Meijssen wrote:
Because it is important to know for a noun what its gender is.
Sorry if I've not been following UW discussions, but the gender of words can change and has changed over time. How will UW address this? Can I, for example, indicate that a certain word had a male gender in the 17th, 18th, 19th century, but neuter in the 20th? And that in 1890-1920 the percentage of people who used male or neuter gradually shifted?
In a Wikipedia or Wiktionary article, these complex relations and exceptional cases can be described in plain text, as I did in the previous paragraph. But how do you express them in a relational database schema?
And what if you discover such complex relations as the project develops, what is the UW strategy for modifying the schema over time? Right now you seem to be designing an "ultimate" schema that will then be frozen and kept static for all time. The very name of the UW project suggests this kind of thinking, and to me that is about as foreign as marxism.
On 29/07/05, Lars Aronsson lars@aronsson.se wrote:
Gerard Meijssen wrote:
Because it is important to know for a noun what its gender is.
Sorry if I've not been following UW discussions, but the gender of words can change and has changed over time. How will UW address this? Can I, for example, indicate that a certain word had a male gender in the 17th, 18th, 19th century, but neuter in the 20th? And that in 1890-1920 the percentage of people who used male or neuter gradually shifted?
In a Wikipedia or Wiktionary article, these complex relations and exceptional cases can be described in plain text, as I did in the previous paragraph. But how do you express them in a relational database schema?
And what if you discover such complex relations as the project develops, what is the UW strategy for modifying the schema over time? Right now you seem to be designing an "ultimate" schema that will then be frozen and kept static for all time. The very name of the UW project suggests this kind of thinking, and to me that is about as foreign as marxism.
As was done with MediaWiki 1.4 to 1.5, some very complex queries - and a lot of processing time :) - would be needed to change the database design. Therefore it would be undertaken with care and, well, it can't really be planned for.
Tomer Chachamu wrote:
On 29/07/05, Lars Aronsson lars@aronsson.se wrote:
And what if you discover such complex relations as the project develops, what is the UW strategy for modifying the schema over time? Right now you seem to be designing an "ultimate" schema that will then be frozen and kept static for all time. The very name of the UW project suggests this kind of thinking, and to me that is about as foreign as marxism.
As was done with MediaWiki 1.4 to 1.5, some very complex queries - and a lot of processing time :) - would be needed to change the database design. Therefore it would be undertaken with care and, well, it can't really be planned for.
Before MediaWiki 1.5 the same database schema had been used for a long time with only minor additions that didn't require conversion. But with UW it seems that bigger changes will be required more often, because so much detail about the structure of every human language needs to be encoded in the database schema. Oops, Hungarian has two kinds of plural, let's change the schema. Oops, Thai has genders for verbs, let's change the schema. Oops...
The axiom that it "can't be planned for" sounds like a recipe for failure.
Are there indeed any non-free (commercial or research) projects that have attempted anything like this? Wikipedia has many precursors such as Britannica, Brockhaus, etc. And so does Wiktionary, of course. But which precursors does UW have? What kind of data model or database schema do they apply? Or is UW a piece of original research in computational linguistics?
On 30/07/05, Lars Aronsson lars@aronsson.se wrote:
Oops, Hungarian has two kinds of plural, let's change the schema. Oops, Thai has genders for verbs, let's change the schema. Oops...
The axiom that it "can't be planned for" sounds like a recipe for failure.
Those things will not need a change in the schema.
Are there indeed any non-free (commercial or research) projects that have attempted anything like this? Wikipedia has many precursors such as Britannica, Brockhaus, etc. And so does Wiktionary, of course. But which precursors does UW have? What kind of data model or database schema do they apply? Or is UW a piece of original research in computational linguistics?
Something was mentioned on Slashdot which just wanted to be a cross-language dictionary (i.e. not providing definitions as such) but I doubt they got off the ground. UW puts a lot more in the database anyway.
Lars Aronsson wrote:
Gerard Meijssen wrote:
Because it is important to know for a noun what its gender is.
Sorry if I've not been following UW discussions, but the gender of words can change and has changed over time. How will UW address this? Can I, for example, indicate that a certain word had a male gender in the 17th, 18th, 19th century, but neuter in the 20th? And that in 1890-1920 the percentage of people who used male or neuter gradually shifted?
In a Wikipedia or Wiktionary article, these complex relations and exceptional cases can be described in plain text, as I did in the previous paragraph. But how do you express them in a relational database schema?
And what if you discover such complex relations as the project develops, what is the UW strategy for modifying the schema over time? Right now you seem to be designing an "ultimate" schema that will then be frozen and kept static for all time. The very name of the UW project suggests this kind of thinking, and to me that is about as foreign as marxism.
Hi, at this moment in time there is no room defined for free format text. It is fairly easy to allow for some free text fields at some level. It will be fairly straightforward to add a field, certainly when it has no relation to anywhere else. The requirements for information are fairly minimal. A Spelling can only be added if you know what language it is in. A Word needs a Spelling and a Meaning needs one Word. Translations require this Meaning and so do synonyms. After this, extra information can be added.
Changes over time for genders are not in the database and percentages of people that use a particular form are not in there either. When the Ultimate Wiktionary is live it will be possible to add this kind of data to the database. It will require planning, testing and probably a conversion. It may be added when we want to include this.
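Purely as a thought experiment, one way such data might be added later is a separate table with a validity period per gender. Nothing like this exists in the design under discussion; the table "word_gender", its columns, the placeholder word "example" and the function gender_in are all assumptions.

```python
import sqlite3

# Hypothetical sketch of time-bounded gender records; until_year NULL means
# the gender is still current.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE word_gender (
    word       TEXT NOT NULL,
    gender     TEXT NOT NULL,
    from_year  INTEGER NOT NULL,
    until_year INTEGER
)
""")

# Made-up word: masculine in the 17th to 19th centuries, neuter from the 20th.
conn.executemany("INSERT INTO word_gender VALUES (?, ?, ?, ?)",
                 [("example", "masculine", 1601, 1901),
                  ("example", "neuter", 1901, None)])

def gender_in(word, year):
    """Return the gender recorded for a word in a given year, or None."""
    row = conn.execute("""
        SELECT gender FROM word_gender
        WHERE word = ? AND from_year <= ?
          AND (until_year IS NULL OR ? < until_year)
    """, (word, year, year)).fetchone()
    return row[0] if row else None

gender_1750 = gender_in("example", 1750)
gender_1950 = gender_in("example", 1950)
```

A gradual shift in usage between 1890 and 1920 would not fit this shape; it would need something more, such as overlapping periods with a share of speakers.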
The idea of Ultimate Wiktionary grew out of the frustration of all these wiktionaries all wanting the same thing, all having to add the same things without the benefit of the efforts of the other wiktionaries. First we experimented with templates; technically the most interesting stuff can be seen on the la.wiktionary. Then we decided that a database would allow us to share the benefits of the work done on wiktionary content. As the Dutch spelling will change in August 2006, we added things like spelling authorities. We have not considered including changes of gender but I do know this happens in the Dutch language as well. Statistical numbers are not in there either; they have always been outside my scope. Initially it is very much intended as a dictionary for contemporary languages. Historical Spelling, Word and Meaning can be included to some extent but this is more an extra than something that was planned.
So the design will be set for some time allowing us to learn what it is we have. It will be changed when we know what to change and why. The requirements for content will be minimal and improvements will happen because of a collective effort.
Thanks, GerardM
Gerard Meijssen wrote:
As the Dutch spelling will change in August 2006, we added things like spelling authorities.
Great, then you can introduce a new authority for every century of a language, so the 19th century Swedish can be told apart from 20th century Swedish. But how are spelling authorities different from languages? Couldn't "19th century Swedish" and "Dutch after 2006" be treated as languages of their own?
So the design will be set for some time allowing us to learn what it is we have. It will be changed when we know what to change and why. The requirements for content will be minimal and improvements will happen because of a collective effort.
Have you started to populate the database yet? What cycle span do you plan for between evaluations and redesigns?
Lars Aronsson wrote:
Gerard Meijssen wrote:
As the Dutch spelling will change in August 2006, we added things like spelling authorities.
Great, then you can introduce a new authority for every century of a language, so the 19th century Swedish can be told apart from 20th century Swedish. But how are spelling authorities different from languages? Couldn't "19th century Swedish" and "Dutch after 2006" be treated as languages of their own?
No, an authority is just that. When a Spelling is marked as deprecated, you change the record in ValidSpelling. The reason for this extra table is that the word "paardenbloem" will be deprecated and the older form "paardebloem" will become valid again. The Spelling does carry a date, but that one records when it was introduced. A spelling authority is an organisation that decides on a specific spelling. The NTU is such an organisation for the Dutch language; the change in 2006 follows the change of 1996.
As I said in a previous mail, this was created because I will have to accommodate this change in a contemporary language. When 19th century Swedish changed to a more modern version, the old spelling can be deprecated by adding the ValidUntil in a ValidSpelling record. The old Spelling and the new Spelling are related by a Relation record.
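To make the ValidSpelling, ValidUntil and Relation idea concrete, here is a rough sketch in Python. Only those three names come from the description above; the other field names, the inclusive date check and the cut-off date are assumptions for illustration, not the real schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ValidSpelling:
    """One authority's verdict on one spelling (field names are hypothetical)."""
    spelling: str
    authority: str
    valid_until: Optional[date] = None  # None means "still valid"

    def is_valid_on(self, day: date) -> bool:
        # Assumption: the ValidUntil day itself still counts as valid.
        return self.valid_until is None or day <= self.valid_until


@dataclass
class Relation:
    """Links a spelling that is being retired to the one that takes over."""
    old_spelling: str
    new_spelling: str


# Placeholder date: the thread only says the change happens in August 2006.
CUTOFF = date(2006, 7, 31)

outgoing = ValidSpelling("paardenbloem", authority="NTU")
reinstated = ValidSpelling("paardebloem", authority="NTU")

# Retiring a spelling is a change to its ValidSpelling record, not a new language.
outgoing.valid_until = CUTOFF
link = Relation(old_spelling=outgoing.spelling, new_spelling=reinstated.spelling)
```

The spelling itself is never deleted; only its validity window changes, so older texts can still be matched against it.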
So the design will be set for some time allowing us to learn what it is we have. It will be changed when we know what to change and why. The requirements for content will be minimal and improvements will happen because of a collective effort.
Have you started to populate the database yet? What cycle span do you plan for between evaluations and redesigns?
At this moment I am working on the database design; Erik is working on Wikidata, and when he has finished his part we will work on the implementation as described on Meta. As to data, there are several sources that will be included in the Ultimate Wiktionary. The content of the Wiktionaries is important among these. We have a wordlist in Stellingwerfs of 18,865 waiting in the wings, enough to rank it as the sixth biggest Wiktionary. I have been informed that another version will have some 20K+ articles. We have some 222,930 correctly spelled Dutch words that we may use (spelling with hyphenation only). We have the content of several glossaries and thesauri.
So at this stage questions on the database design are very welcome. We do not expect to start including data until sometime in September. But when we do, it will be available for everybody to play with. Thanks, Gerard
What is the solution for varied spellings which do not depend on an authority such as color vs. colour in English? What about when certain aspects of spelling are independent? colourise, colourize, colorize all exist in English, colorise seems not to exist.
What about the other orthographical variations I mentioned in an earlier post such as ASCII vs curved apostrophe in most languages or Hebrew and Arabic with varying degrees of pointing?
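To illustrate the kind of variation meant here, a small Python sketch that folds curly apostrophes to ASCII and strips Hebrew and Arabic pointing so that variant forms share one lookup key. This is only one possible approach and not something the UW design is known to do; the function name and the exact set of characters handled are assumptions.

```python
import unicodedata

# Curly quotes that are commonly typed as a plain ASCII apostrophe.
APOSTROPHES = {"\u2018", "\u2019"}


def lookup_key(text: str) -> str:
    """Return a pointing-free, ASCII-apostrophe form of `text` for matching."""
    out = []
    for ch in unicodedata.normalize("NFD", text):
        if ch in APOSTROPHES:
            out.append("'")
        elif unicodedata.category(ch) == "Mn" and "\u0591" <= ch <= "\u065F":
            # Non-spacing marks in the Hebrew and Arabic blocks:
            # vowel points, cantillation marks, harakat.
            continue
        else:
            out.append(ch)
    return unicodedata.normalize("NFC", "".join(out))


# Pointed and unpointed Hebrew "shalom" end up with the same key.
pointed = "\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD"
unpointed = "\u05E9\u05DC\u05D5\u05DD"
```

A key like this would only be used for finding related entries; each variant would presumably still be stored as its own Spelling.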
Andrew Dunbar (hippietrail)
On 7/30/05, Gerard Meijssen gerard.meijssen@gmail.com wrote:
Lars Aronsson wrote:
Gerard Meijssen wrote:
As the Dutch spelling will change in August 2006, we added things like spelling authorities.
Great, then you can introduce a new authority for every century of a language, so the 19th century Swedish can be told apart from 20th century Swedish. But how are spelling authorities different from languages? Couldn't "19th century Swedish" and "Dutch after 2006" be treated as languages of their own?
No, an authority is just that. When a Spelling is marked as deprecated, you change the record in ValidSpelling. The reason for this extra table is that the word "paardenbloem" will be deprecated and the older form "paardebloem" will become valid again. The Spelling does carry a date, but that one records when it was introduced. A spelling authority is an organisation that decides on a specific spelling. The NTU is such an organisation for the Dutch language; the change in 2006 follows the change of 1996.
As I said in a previous mail, this was created because I will have to accommodate this change in a contemporary language. When 19th century Swedish changed to a more modern version, the old spelling can be deprecated by adding the ValidUntil in a ValidSpelling record. The old Spelling and the new Spelling are related by a Relation record.
So the design will be set for some time allowing us to learn what it is we have. It will be changed when we know what to change and why. The requirements for content will be minimal and improvements will happen because of a collective effort.
Have you started to populate the database yet? What cycle span do you plan for between evaluations and redesigns?
At this moment I am working on the database design; Erik is working on Wikidata, and when he has finished his part we will work on the implementation as described on Meta. As to data, there are several sources that will be included in the Ultimate Wiktionary. The content of the Wiktionaries is important among these. We have a wordlist in Stellingwerfs of 18,865 waiting in the wings, enough to rank it as the sixth biggest Wiktionary. I have been informed that another version will have some 20K+ articles. We have some 222,930 correctly spelled Dutch words that we may use (spelling with hyphenation only). We have the content of several glossaries and thesauri.
So at this stage questions on the database design are very welcome. We do not expect to start including data until sometime in September. But when we do, it will be available for everybody to play with. Thanks, Gerard
On 27/07/05, Gerard Meijssen gerard.meijssen@gmail.com wrote:
First of all, if creating inflections is done programmatically, it is not part of the database design. The database design says that there will be a record for each inflection. The inflections are translated as every other word is; there is a Spelling for each. This means that these words have an importance in their own right, and that is more than just the sharing of the meaning with a headword. So I do not share your argument at all. Yes, we can generate inflections, but this WILL result in new Spelling - Word - Meaning. And as long as we do not have software to do this for us, we will have to do it by hand.
Perhaps this software can simply be awk, sed or one of those things?
And perhaps we can call for people to start writing them?
:)
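As a rough idea of how small such a script could be, here is a sketch in Python rather than awk or sed. The plural rule is a deliberately naive English one and the record layout is hypothetical; the point is only that each generated inflection comes out as its own Spelling - Word - Meaning record, as Gerard describes.

```python
from typing import Dict, List


def naive_plural(noun: str) -> str:
    """Very rough English plural rule; irregular nouns are not handled."""
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    if len(noun) > 1 and noun.endswith("y") and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"
    return noun + "s"


def inflection_records(headword: str, language: str = "en") -> List[Dict[str, str]]:
    """Emit one record per generated inflection (hypothetical layout)."""
    plural = naive_plural(headword)
    return [
        {
            "spelling": plural,
            "language": language,
            "word": plural,
            "meaning": f"plural of {headword}",
        }
    ]


if __name__ == "__main__":
    for noun in ["word", "box", "city"]:
        print(inflection_records(noun))
```

Output from a rule like this would still need checking by hand, since irregular forms slip straight through.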
wikitech-l@lists.wikimedia.org