On Wednesday 20 July 2005 21:29, Gerard Meijssen wrote:
Timwi wrote:
Gerard Meijssen wrote:
I would welcome your comments about the ERD that I posted here http://commons.wikimedia.org/wiki/Image:ERD.jpg
I haven't seen it before, and I was thinking about it on my own, so I'd like to comment on it :)
Looks interesting, but is extremely bare. It would do well with a bit of documentation. For much of it, the purpose isn't entirely clear. I'm particularly confused as to why "Language", "Word" and "Meaning" are each duplicated.
Hoi, The pages http://meta.wikimedia.org/wiki/Ultimate_Wiktionary_data_design and http://meta.wikimedia.org/wiki/Ultimate_Wiktionary_decisions_on_its_usage have some documentation. The duplication reflects that at least one table has two relations to the same table. Language refers to itself for dialects; Word refers through Conju/Decli (conjugation or declension) to a headword and its derived words; Meaning is related to itself through "Relations", which allows for thesaurus-like structures.
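To make the "table drawn twice" point concrete, here is a minimal sketch of a self-referencing language table, using Python's sqlite3 module. The column name "parentid" and the sample rows are assumptions for illustration only; they are not taken from the ERD.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE language (
    languageid INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    -- hypothetical column: points at the language this row is a dialect of
    parentid   INTEGER REFERENCES language(languageid)
);
INSERT INTO language VALUES (1, 'Language A', NULL);
INSERT INTO language VALUES (2, 'Dialect A1', 1);
""")

# The same table is joined to itself, once in each role.
# This is why a diagram may show it as two boxes.
rows = conn.execute("""
    SELECT dialect.name, parent.name
    FROM language AS dialect
    JOIN language AS parent ON dialect.parentid = parent.languageid
""").fetchall()
print(rows)
```

The query lists each dialect next to the language it belongs to. Word and Meaning would follow the same pattern, only through a linking table rather than a single column.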
First, I'd suggest a cosmetic change: use "script" instead of "charset" in table and column names. When I first saw it, I thought you were referring to computer charsets (ISO 8859-1, Windows-1251...) and wondered why you wouldn't simply use UTF-8 :)
Related to this, I'd suggest adding an "ISO15924" column to the "characterset" (or future "script") table. That way a script can be formally specified and looked up regardless of its name.
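A rough sketch of what I mean, again with Python's sqlite3. The table layout and the helper script_by_code are my own assumptions; only the four-letter codes (Latn, Cyrl, Arab) come from ISO 15924 itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE script (
    scriptid INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,
    -- four-letter ISO 15924 code; unique so it can serve as a lookup key
    ISO15924 TEXT NOT NULL UNIQUE CHECK (length(ISO15924) = 4)
);
INSERT INTO script (name, ISO15924) VALUES ('Latin', 'Latn');
INSERT INTO script (name, ISO15924) VALUES ('Cyrillic', 'Cyrl');
INSERT INTO script (name, ISO15924) VALUES ('Arabic', 'Arab');
""")

def script_by_code(code):
    """Return the script name for an ISO 15924 code, or None if unknown."""
    row = conn.execute(
        "SELECT name FROM script WHERE ISO15924 = ?", (code,)
    ).fetchone()
    return row[0] if row else None

print(script_by_code("Cyrl"))
```

The name column can then be translated or renamed freely, because lookups go through the code and not through the name.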
Why is there a "gender" column in the "word" table? If a word can exist in multiple genders, shouldn't that be represented in the "inflection" table instead? If a word has a gender of its own, wouldn't that rather belong in the WordType table? If not, there are other properties of words (for example, number) that could equally be represented in the "word" table, so why is gender singled out?
I'd strongly suggest adding an "inflection" column to either the "wordtype" or the "word" table; it would specify which inflection a word uses, and whether the word is regular or irregular. If it is known which inflection a word uses and that it is regular, then all its inflected forms could be generated automatically.
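As a toy illustration of the automatic generation, here is a sketch in Python. Everything in it is hypothetical: the PARADIGMS mapping, the "en-noun-s" paradigm name, and the Word fields are made up for the example, and a real inflection would need far more than suffix rules.

```python
from dataclasses import dataclass, field

# Hypothetical paradigm table: inflection name -> {form name: suffix}.
PARADIGMS = {
    "en-noun-s": {"singular": "", "plural": "s"},
}

@dataclass
class Word:
    spelling: str
    inflection: str                  # key into PARADIGMS
    regular: bool = True
    # Explicitly stored forms, consulted only when the word is irregular.
    stored_forms: dict = field(default_factory=dict)

def inflected_forms(word):
    """Generate forms for a regular word, or return the stored ones."""
    if not word.regular:
        return dict(word.stored_forms)
    suffixes = PARADIGMS[word.inflection]
    return {form: word.spelling + suffix for form, suffix in suffixes.items()}

cat = Word("cat", "en-noun-s")
mouse = Word("mouse", "en-noun-s", regular=False,
             stored_forms={"singular": "mouse", "plural": "mice"})
print(inflected_forms(cat))
print(inflected_forms(mouse))
```

The point is only that regular words need no stored forms at all, while irregular ones fall back to what was entered by hand.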
I see that there is a "languageid" column in the "meaningtext" table. If I understand this correctly, it means that the meaning of a word could be written down in various languages, and I second this. But I wonder how you are going to do the same for other data that might need to be translated (for example, the "characterset" column of the "characterset" table, which I understand is pure text). Were you thinking about this?
Were you thinking about a way to register examples of use, similar to meanings? Or would examples of use simply be raw text in the "meaningtext" table?
One reason why it is not as well documented as I would like is that I am still working on the structure. At this moment I am thinking hard about how to include signed languages and the spoken dialects of the Chinese and Arabic written languages.
The last one is easy. Add ISO3166_1 and ISO3166_2 columns to the language table. That way, you could formally represent a dialect within a certain country (as for Arabic) or even a region (as for Chinese). Perhaps an even better solution would be to include a RegionID column pointing to a "regions" table (relation 1-many), which would have RegionID, ISO3166_1 and ISO3166_2 columns; that way you could specify wider regions in which a dialect is spoken, even if they cross country boundaries.
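The RegionID variant could look roughly like this, sketched with Python's sqlite3. The sample rows and the helper dialects_in_country are assumptions for illustration; EG and MA are the ISO 3166-1 codes for Egypt and Morocco. Note that this simple version ties each region to a single country, so a region spanning several countries would need more than one row or a further linking table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE regions (
    RegionID  INTEGER PRIMARY KEY,
    ISO3166_1 TEXT NOT NULL,   -- country code
    ISO3166_2 TEXT             -- subdivision code, NULL for a whole country
);
CREATE TABLE language (
    languageid INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    RegionID   INTEGER REFERENCES regions(RegionID)
);
INSERT INTO regions VALUES (1, 'EG', NULL);
INSERT INTO regions VALUES (2, 'MA', NULL);
INSERT INTO language VALUES (1, 'Arabic as spoken in Egypt', 1);
INSERT INTO language VALUES (2, 'Arabic as spoken in Morocco', 2);
""")

def dialects_in_country(country_code):
    """List the names of all language rows tied to the given country."""
    rows = conn.execute("""
        SELECT language.name
        FROM language
        JOIN regions ON language.RegionID = regions.RegionID
        WHERE regions.ISO3166_1 = ?
        ORDER BY language.name
    """, (country_code,)).fetchall()
    return [name for (name,) in rows]

print(dialects_in_country("EG"))
```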
As a side note, I'd also suggest renaming the ISO639-2 and ISO639-3 columns to ISO639_2 and ISO639_3, respectively. ISO639-2 might be interpreted as "ISO639 minus 2", which can lead to all sorts of confusion.
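A quick way to see the problem, shown here with SQLite through Python's sqlite3 (other database engines may behave differently, and quoting the identifier would also work, at the cost of having to quote it everywhere). The table and sample row are made up for the example; 'nld' is the ISO 639-2 code for Dutch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Unquoted, the hyphen is read as an operator, so SQLite rejects the statement.
try:
    conn.execute("CREATE TABLE language (name TEXT, ISO639-2 TEXT)")
    hyphen_accepted = True
except sqlite3.OperationalError:
    hyphen_accepted = False

# With an underscore the column name needs no quoting anywhere.
conn.execute("CREATE TABLE language (name TEXT, ISO639_2 TEXT)")
conn.execute("INSERT INTO language VALUES ('Dutch', 'nld')")
code = conn.execute(
    "SELECT ISO639_2 FROM language WHERE name = 'Dutch'"
).fetchone()[0]

print(hyphen_accepted, code)
```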
Were you thinking about a way to formally define a dialect? Ideally, besides a region (by the way, ISO3166-2 is not granular enough, and there should be a better way of expressing the region in which a dialect is spoken, perhaps even down to the level of a village), one should be able to specify the time period in which a dialect was spoken, the social layer that spoke it, and perhaps even a particular person or entity using it. OK, I got carried away a bit, but such things might be important :) Though some of this should perhaps (also) be tied to a word and not a dialect.