Nikola Smolenski wrote:
On Friday 22 July 2005 13:25, Gerard Meijssen wrote:
Nikola Smolenski wrote:
On Wednesday 20 July 2005 21:29, Gerard Meijssen wrote: Related to this, I'd suggest adding an "ISO15924" column to the "characterset" (or future "script") table. This way a script can be formally specified and looked up, regardless of its name.
Done. However, the name of the script is a record in the database in its own right, so it may have as many translations as we care to enter. The code is just there to anchor it. I understand from Erik's notes that a language can be indicated as the default value. The default value will be English. So, add a translation to the English word and from then on the User Interface will show it localised.
I'll talk about this below.
Why is there a "gender" column in the "word" table? If a word can exist in multiple genders, shouldn't that rather be represented in the "inflection" table? If a word has a gender of its own, wouldn't that rather be represented in the WordType table? If not, there are other properties of words (for example, number) which could also be represented in the "word" table, so why is gender singled out?
When a word is inflected to a particular form, that word is a word in its own right and consequently will be found in the UW. The inflection is there because it does provide information and this information is
Now I'm not so sure that I understand which table is for what. Could you give an example? For example, the word "white" is a base word and the word "whiter" is its inflection. How would these two words fit into the database?
Both words will exist as a Spelling and as a Word, and they may share a Meaning. When the inflections are added in the Inflection-Word table, all the missing words will be created and they will all be related to each other through this table. Unlike a paper dictionary, we want them all.
relevant for the inflections and the headword. A WordType indicates a noun, a verb, an adjective, etc.
I still don't understand why gender is singled out of all the properties a word could have. For example, a verb could be transitive or intransitive, and this information is important. To give an example:
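As a minimal sketch of how "white" and "whiter" could sit in such a structure, the following uses an in-memory SQLite database. The table and column names are hypothetical, and the separate Spelling and Word tables from the discussion are collapsed into one "word" table for brevity; this is an illustration under those assumptions, not the actual Ultimate Wiktionary schema.

```python
import sqlite3

# Hypothetical, simplified schema: names are illustrative only, and
# Spelling and Word are merged into a single table for brevity.
SCHEMA = """
CREATE TABLE word (
    word_id  INTEGER PRIMARY KEY,
    spelling TEXT NOT NULL UNIQUE
);
CREATE TABLE inflection_word (
    headword_id  INTEGER NOT NULL REFERENCES word(word_id),
    inflected_id INTEGER NOT NULL REFERENCES word(word_id),
    inflection   TEXT NOT NULL,
    PRIMARY KEY (headword_id, inflected_id, inflection)
);
"""


def get_or_create_word(conn, spelling):
    """Return the id of the word with this spelling, creating it if missing."""
    conn.execute("INSERT OR IGNORE INTO word (spelling) VALUES (?)", (spelling,))
    row = conn.execute(
        "SELECT word_id FROM word WHERE spelling = ?", (spelling,)
    ).fetchone()
    return row[0]


def add_inflection(conn, headword, inflected, inflection):
    """Link an inflected form to its headword; missing words are created."""
    head_id = get_or_create_word(conn, headword)
    infl_id = get_or_create_word(conn, inflected)
    conn.execute(
        "INSERT OR IGNORE INTO inflection_word VALUES (?, ?, ?)",
        (head_id, infl_id, inflection),
    )


def inflections_of(conn, headword):
    """List (spelling, inflection) pairs recorded for a headword."""
    return conn.execute(
        """SELECT w.spelling, iw.inflection
           FROM inflection_word iw
           JOIN word h ON h.word_id = iw.headword_id
           JOIN word w ON w.word_id = iw.inflected_id
           WHERE h.spelling = ?
           ORDER BY w.spelling""",
        (headword,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
add_inflection(conn, "white", "whiter", "comparative")
add_inflection(conn, "white", "whitest", "superlative")
```

Every inflected form ends up as a row in "word" in its own right, and the link table only records how the forms relate to the headword.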
word: horse gender: male partofspeech: noun
word: ship gender: female partofspeech: noun
word: to drive gender: none partofspeech: transitive verb
word: to swim gender: none partofspeech: intransitive verb
See what I mean? If you have to specify transitivity of a verb in "partofspeech" table, you may as well specify gender of a noun in that table. It would be consistent either to remove "gender" column from "word" table:
word: horse partofspeech: male noun
word: ship partofspeech: female noun
word: to drive partofspeech: transitive verb
word: to swim partofspeech: intransitive verb
or to rename it to, for example, "subtype":
word: horse subtype: male partofspeech: noun
word: ship subtype: female partofspeech: noun
word: to drive subtype: transitive partofspeech: verb
word: to swim subtype: intransitive partofspeech: verb
If you are going to change this, I'd suggest the first solution. Firstly, because there may be words which would have more than one subtype; secondly, because it eliminates the possibility of having an invalid mix of subtypes (horse: intransitive noun...).
At this moment in time I would not have intransitive verbs or transitive verbs at all. To me they are verbs. When they are transitive, they have a different meaning from when they are intransitive, so to me the distinction is in the meaning.
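The difference between the two alternatives can be sketched roughly as follows. The set of part-of-speech values and both helper functions are hypothetical and exist only to illustrate the argument about invalid mixes.

```python
# Hypothetical values; the thread does not list the real contents of the table.
PARTS_OF_SPEECH = {
    "male noun",
    "female noun",
    "transitive verb",
    "intransitive verb",
}


def make_word(word, partofspeech):
    """First alternative: one combined value, checked against a closed list."""
    if partofspeech not in PARTS_OF_SPEECH:
        raise ValueError(f"unknown part of speech: {partofspeech!r}")
    return {"word": word, "partofspeech": partofspeech}


def make_word_with_subtype(word, subtype, partofspeech):
    """Second alternative: two independent columns, so nothing stops a bad mix."""
    return {"word": word, "subtype": subtype, "partofspeech": partofspeech}


horse = make_word("horse", "male noun")

# Accepted by the two-column variant, although the combination is meaningless.
bad = make_word_with_subtype("horse", "intransitive", "noun")
```

With the combined value, "intransitive noun" is rejected simply because it is not in the list; with two free columns, an extra validation rule would be needed.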
I'd strongly suggest adding an "inflection" column to either the "wordtype" or the "word" table; this would specify which inflection a word uses, and whether it is regular or irregular. If it is known which inflection a word uses and that it is regular, then all its inflected forms could be generated automatically.
I see that there is a "languageid" column in the "meaningtext" table. If I understand this correctly, it means that the meaning of a word could be written down in various languages, and I second this. But I wonder how you are going to do the same for other data which might need to be translated (for example, the "characterset" column of the "characterset" table; I understand that this is pure text?). Were you thinking about this?
The fields "Sign" :) "Gender" "WordType" all relate to meaning. Ultimate Wiktionary will eat its own dog food: when a translation of a word like "noun" is added, as I did for Afrikaans recently, that translation is the one that will be used in the User Interface.
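A rough sketch of what automatic generation could look like, assuming a regular inflection is nothing more than a set of suffixes. The pattern names, the suffix tables and the function are all hypothetical, and real inflection classes would need richer rules than plain suffixing.

```python
# Hypothetical inflection patterns: each maps a form name to a suffix.
INFLECTION_PATTERNS = {
    "adj-e": {"comparative": "r", "superlative": "st"},    # white -> whiter
    "adj": {"comparative": "er", "superlative": "est"},    # tall -> taller
}


def generate_forms(word, pattern, regular=True):
    """Generate inflected forms for a regular word; irregular words get none."""
    if not regular:
        # Irregular forms cannot be derived and would have to be entered by hand.
        return {}
    suffixes = INFLECTION_PATTERNS[pattern]
    return {name: word + suffix for name, suffix in suffixes.items()}


white_forms = generate_forms("white", "adj-e")
good_forms = generate_forms("good", "adj", regular=False)
```

The "regular" flag is what makes the column useful: it tells the generator when it may act and when a human editor has to supply the forms.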
OK, but what if you have a longer phrase as a table field? For example, an "inflection" in table "inflection" might be "male genitive superlative" or "3rd person plural female past". I don't think it makes sense to add such phrases to the dictionary as proper entries, only so that the dictionary would have translations of them.
Are some table fields inherently translatable? Is this what you had in mind above?
Most if not all text fields will be inherently translatable; this is what I have very much in mind. The name of a font will not be translated, but that is the only one at this point in time. It makes perfect sense to have this in the UW as it allows us to have a self-learning User Interface. The thing is, it has a function.
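The idea of a User Interface that picks up translations from the dictionary itself, falling back to the English default, could be sketched like this. The data, the language codes and the function name are assumptions made for illustration only.

```python
# Hypothetical translation data; in the UW this would come from the
# dictionary entries themselves rather than from a hard-coded table.
TRANSLATIONS = {
    "noun": {"en": "noun", "nl": "zelfstandig naamwoord"},
    "verb": {"en": "verb"},
}


def ui_label(term, lang, default_lang="en"):
    """Return the label for a term in the given language.

    Falls back to the default language, and finally to the term itself,
    when no translation has been entered yet.
    """
    translations = TRANSLATIONS.get(term, {})
    return translations.get(lang, translations.get(default_lang, term))
```

As soon as someone adds a translation for "verb" in a language, the same lookup starts returning it, with no change to the interface code.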
Were you thinking about a way to register examples of use, similar to meaning? Or would examples of use be simply a raw text in "meaningtext" table?
Idioms and proverbs will be a "wordtype" as much as noun is. Therefore a proverb will relate to a keyword through WordRelation. I have updated the table Relation with a new field, "SameLanguageOnly"; this ensures that the relation is applicable only within the same language. So the relation would be "proverb" and it would combine "apple" with "an apple a day keeps the doctor away".
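A minimal sketch of how a "SameLanguageOnly" flag might be enforced when two words are related. The class and function names are hypothetical; only the flag itself and the "proverb" example come from the discussion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    language: str


@dataclass(frozen=True)
class Relation:
    name: str
    same_language_only: bool  # mirrors the "SameLanguageOnly" field


def relate(relation, source, target):
    """Create a word relation, rejecting cross-language pairs when flagged."""
    if relation.same_language_only and source.language != target.language:
        raise ValueError(
            f"relation {relation.name!r} only applies within one language"
        )
    return (relation.name, source, target)


PROVERB = Relation("proverb", same_language_only=True)
link = relate(
    PROVERB,
    Word("apple", "en"),
    Word("an apple a day keeps the doctor away", "en"),
)
```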
MeaningText would be just the definition of a meaning in a given language.
I was thinking about something else; for example, on http://en.wiktionary.org/wiki/account there is this example: "A beggarly account of empty boxes. - Shakespeare, Romeo and Juliet, V-i"; but I understand now it is going to be just a part of "meaningtext". I'm not so certain, but maybe it would be good to create a separate table for examples, because the same examples could (and probably will) be used in "meaningtext"s in different languages. It would also make it easier to automatically add new examples (for example, by grepping Project Gutenberg ;)
"A beggarly account of empty boxes" is a quote, so why not have it as a separate Word and mark it as such? It would be an idiom for "account" and this is linked through Relation. Many famous quotes have been translated and we could have them all. (Een paard, een paard, een koninkrijk voor een paard)
I do not think grepping Project Gutenberg makes much sense. If anything it helps you find occurrences of the word, but you have to be selective about what to include. That is an editorial process, and just the fact that a word is used does not make for a good idiom in the UW.
One reason why it is not as well documented as I would like is that I am still working on the structure. At this moment I am thinking hard about how to include signed languages and the spoken dialects of the Chinese and Arabic written languages.
The last one is easy. Add ISO3166_1 and ISO3166_2 columns to the language table. That way, you could formally represent a dialect within a certain country (as for Arabic) or even a region (as for Chinese). Perhaps an even better solution would be to include a RegionID column which would point to a "regions" table (relation 1-many), which would have RegionID, ISO3166_1 and ISO3166_2 columns; that way you could specify wider regions in which a dialect is spoken, even if they cross country boundaries.
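The RegionID proposal could be sketched with SQLite as follows. The column names follow the suggestion above, but the exact schema, the helper function and the sample rows (a dialect spanning three countries) are assumptions for illustration.

```python
import sqlite3

# Hypothetical schema following the RegionID suggestion: one region_id can
# have many rows in "regions", so a region may span several countries.
SCHEMA = """
CREATE TABLE regions (
    region_id INTEGER NOT NULL,
    iso3166_1 TEXT NOT NULL,
    iso3166_2 TEXT
);
CREATE TABLE language (
    language_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    region_id   INTEGER
);
"""


def countries_for(conn, language_name):
    """Return the sorted country codes of the region tied to a language."""
    rows = conn.execute(
        """SELECT DISTINCT r.iso3166_1
           FROM language l
           JOIN regions r ON r.region_id = l.region_id
           WHERE l.name = ?
           ORDER BY r.iso3166_1""",
        (language_name,),
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Illustrative data only: one region covering three countries.
conn.executemany(
    "INSERT INTO regions VALUES (?, ?, ?)",
    [(1, "SY", None), (1, "LB", None), (1, "JO", None)],
)
conn.execute(
    "INSERT INTO language (name, region_id) VALUES (?, ?)",
    ("Levantine Arabic", 1),
)
```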
The country code is irrelevant as far as this database is concerned. This database is about words in languages and dialects.
I still think that this would be a useful way of formally specifying a dialect. For example, British English would have ISO639_2 code "en" and ISO3166_1 code "uk" while Australian English would have ISO639_2 code "en" but ISO3166_1 code "au".
Even the ISO-639 codes in the table are there to connect what we are doing in the Wikipedias and other projects. As it is a standard I added it but in the database the ISO 639 fields are not compulsory, the "WMF key" is. If we "need" these ISO639_2 codes, then we would adhere to the principle that a language is a dialect with an army. Have a look at http://nl.wiktionary.org/wiki/WikiWoordenboek:Lijst_van_messages#Schrijfwijz... and you will see how we do some of the uk and au stuff for you. This is however not a great example because it is a mix of different spelling but also vocabulary and scripts. As I was not content with this I came up with the current ERD.
Were you thinking about a way to formally define a dialect? Ideally, besides a region (by the way, ISO3166-2 is not granular enough, and there should be a better way of expressing the region in which a dialect is spoken, perhaps even down to the level of a village), a definition should specify the time period in which the dialect was spoken, the social layer which was speaking it, and perhaps even a particular person or entity using it. OK, I got carried away a bit, but such things might be important :) Though some of this should perhaps be (also) tied to a word and not a dialect.
Ultimate Wiktionary is about words (written, spoken or signed); that is the starting point. There will be a need for some formality; once a dialect is recognised, it will be hard to take it away. Therefore, in my opinion, dialects will be added only after some discussion, and by an admin. I would think that we need to consider what it takes before we add a dialect. Tentatively I would go for at least 100 words defined as such. With a dialect I would assume that words that are not defined are those of the higher-level language.
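The assumption that undefined words in a dialect are those of the higher-level language amounts to a fallback lookup, which could be sketched as below. The variety keys, the sample vocabulary and the function are hypothetical.

```python
# Hypothetical data: each variety may name a higher-level language.
PARENT = {"en-AU": "en"}

VOCABULARY = {
    "en": {"afternoon": "the part of the day between noon and evening"},
    "en-AU": {"arvo": "afternoon (informal)"},
}


def lookup(word, variety):
    """Find a word in a variety, walking up to higher-level languages.

    Returns (variety_where_found, definition), or None if nothing matches.
    """
    current = variety
    while current is not None:
        definition = VOCABULARY.get(current, {}).get(word)
        if definition is not None:
            return current, definition
        current = PARENT.get(current)
    return None
```

A dialect then only needs entries for the words that actually differ; everything else resolves to the parent language.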
When I was referring to "dialect", I did not have in mind a dialect that is officially recognised, but simply a set of words which could be identified as belonging to a certain group. So if you want to say that this word was part of London dockworkers' slang in 1800s, you should be able to do so, and not just stamp it with "British English".
When there are words that are specific to London dockworkers in the 1800s, I would not call it a dialect, because like many professions they have their own vocabulary. These I would mark within a collection, as the bulk of what they say would be London English of the 1800s. Now there is one thing that is relevant: the UW wants all words of all languages, but its primary purpose is to have the current vocabulary. So yes, these words exist and have their place, but when they are not used anymore they should be marked as such.
As a simple example, in Serbia there are several publishing houses that were publishing Asterix, and in some translations "Idefix" is named "Garoviks" and "Panoramix" is named "Aspiriniks", while in others "Idefix" is named "Idefiks" and "Panoramix" is named "Panoramiks"; and this is consistent. If you are going to translate something about Asterix to Serbian, you should pick one of the translations, but you should be consistent in using only the words from the translation which you have picked, and they should somehow be marked as belonging to the same translation. There surely are more important things than Asterix where something similar might apply.
For the Serbian language, Garoviks and Idefiks are synonyms, and as such I do not have to choose because they are both correct. As a matter of interest you could explain things either in the etymology or in the meaning of the word.