Re: [Wikitech-l] Re: Case sensitivity on Afrikaans Wiktionary

22 Jul 2005


Nikola Smolenski wrote:
...
On Wednesday 20 July 2005 21:29, Gerard Meijssen wrote:
...
Timwi wrote:
...
Gerard Meijssen wrote:
...
I would welcome your comments about the ERD that I posted here
http://commons.wikimedia.org/wiki/Image:ERD.jpg
I haven't seen it before, and I was thinking about it on my own, so I'd like 
to comment on it :)
...
...
Looks interesting, but is extremely bare. It would do well with a bit
of documentation. For much of it, the purpose isn't entirely clear.
I'm particularly confused as to why "Language", "Word" and "Meaning"
are each duplicated.
Hoi,
There is some documentation here:
http://meta.wikimedia.org/wiki/Ultimate_Wiktionary_data_design and
http://meta.wikimedia.org/wiki/Ultimate_Wiktionary_decisions_on_its_usage
The duplication reflects that there is at least one table that has two
relations to the same table. Language refers to itself for dialects;
Word refers through Conju/Decli (conjugation or declension) to a
headword and its derived words; Meaning is related to itself through
"Relations", which allows for thesaurus-like structures.
First, I'd suggest a cosmetic change: use "script" instead of "charset" in
table and column names. When I first saw it, I thought you were referring
to computer charsets (ISO 8859-1, Windows-1251...) and was wondering why
you wouldn't simply use UTF-8 :)
Done
...
Related to this, I'd suggest adding an "ISO15924" column to the
"characterset" (or future "script") table. This way a script can be
formally specified and looked up, regardless of its name.
Done. However, the name of the script is a record in the database in its
own right, so it may have as many translations as we care to enter. The
code is just there to anchor it. I understand from Erik's notes that a
language can be indicated as the default value. The default value will
be English. So, add a translation for the English word and from then on
the User Interface will show it localised.
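As a rough sketch of that fallback, assuming the translations of a
script name are available as a simple mapping from language code to
text (the function name, the record layout and the sample data are
hypothetical, not part of the design):

```python
def localized_name(translations, ui_language, default_language="en"):
    """Return the name in ui_language, or the default-language name
    when no translation has been entered yet."""
    if ui_language in translations:
        return translations[ui_language]
    return translations[default_language]


# Hypothetical record for one script, anchored by its ISO 15924 code.
latin = {"iso15924": "Latn", "names": {"en": "Latin", "nl": "Latijn"}}

dutch_label = localized_name(latin["names"], "nl")
serbian_label = localized_name(latin["names"], "sr")  # falls back to English
```

Once a translation is added to the mapping for a language, the same
call returns the localised text without any other change.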
...
Why is there a "gender" column in the "word" table? If a word can exist in
multiple genders, shouldn't that rather be represented in the "inflection"
table? If a word has a gender of its own, wouldn't that rather be
represented in the WordType table? If not, there are other properties of
words (for example, number) which could also be represented in the "word"
table, so why is gender singled out?
When a word is inflected to a particular form, that form is a word in
its own right and consequently will be found in the UW. The inflection
is there because it does provide information, and this information is
relevant for the inflections and the headword. A WordType indicates a
noun, a verb, an adjective, etc.
...
I'd strongly suggest adding an "inflection" column to either the
"wordtype" or the "word" table; this would specify which inflection a
word uses, and whether it is regular or irregular. If it is known which
inflection a word uses and that it is regular, then all its inflected
forms could be generated automatically.
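A minimal sketch of that idea, assuming an inflection class can be
reduced to a set of suffixes. The class names, form labels and the
function below are invented for illustration; real paradigms are
rarely this simple, and irregular words would still need their forms
stored explicitly.

```python
# Hypothetical inflection classes: each maps a form label to a suffix.
INFLECTIONS = {
    "en-noun-s": {"singular": "", "plural": "s"},
    "en-verb-ed": {
        "infinitive": "",
        "third person singular": "s",
        "past": "ed",
        "present participle": "ing",
    },
}


def inflected_forms(stem, inflection, regular, stored_forms=None):
    """Generate forms for a regular word; for an irregular word return
    only the forms that were entered by hand."""
    if regular:
        suffixes = INFLECTIONS[inflection]
        return {label: stem + suffix for label, suffix in suffixes.items()}
    return dict(stored_forms or {})


walk = inflected_forms("walk", "en-verb-ed", regular=True)
go = inflected_forms("go", "en-verb-ed", regular=False,
                     stored_forms={"past": "went"})
```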
I see that there is a "languageid" column in the "meaningtext" table. If I
understand this correctly, it means that the meaning of a word could be
written down in various languages, and I second this. But I wonder how you
are going to do the same for other data which might need to be translated
(for example, the "characterset" column of the "characterset" table; I
understand that this is pure text?). Were you thinking about this?
The fields "Sign" :), "Gender" and "WordType" all relate to meaning.
Ultimate Wiktionary will eat its own dogfood: when a translation of a
word like "noun" is added, as I did for Afrikaans recently, that
translation is the one that will be used in the User Interface.
...
Were you thinking about a way to register examples of use, similar to meaning? 
Or would examples of use be simply a raw text in "meaningtext" table?
Idioms and proverbs will be a "wordtype" as much as noun is. Therefore a
proverb will relate to a keyword through WordRelation. I have updated
the table Relation with a new field, "SameLanguageOnly"; this ensures
that the relation is applicable only within the same language. So the
relation would be "proverb" and it would combine "apple" with "an apple
a day keeps the doctor away".
MeaningText would be just the definition of a meaning in a given language.
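The effect of a "SameLanguageOnly" flag on a relation could be checked
roughly as follows. This is only a sketch: the class and function
names, and the use of language codes on a word, are assumptions made
for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    spelling: str
    language: str  # hypothetical: a language code stored with the word


@dataclass(frozen=True)
class Relation:
    name: str
    same_language_only: bool


def relation_allowed(relation, source, target):
    """A SameLanguageOnly relation may only link words of one language."""
    if relation.same_language_only:
        return source.language == target.language
    return True


proverb = Relation("proverb", same_language_only=True)
translation = Relation("translation", same_language_only=False)

apple = Word("apple", "en")
saying = Word("an apple a day keeps the doctor away", "en")
appel = Word("appel", "nl")
```

With this data, "apple" can be linked to the English proverb, while the
Dutch "appel" can only be linked to "apple" through a relation that is
not restricted to one language.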
...
...
One reason why it is not as well documented as I would like is that
I am still working on the structure. At this moment I am thinking hard
about how to include signed languages and the spoken dialects of the
Chinese and Arabic written languages.
The last one is easy. Add ISO3166_1 and ISO3166_2 columns to the language
table. That way, you could formally represent a dialect within a certain
country (as for Arabic) or even a region (as for Chinese). Perhaps an even
better solution would be to include a RegionID column which would point to
a "regions" table (relation 1-many), which would have RegionID, ISO3166_1
and ISO3166_2 columns; that way you could specify wider regions in which a
dialect is spoken, even if they cross country boundaries.
The country code is irrelevant as far as this database is concerned. 
This database is about words in languages and dialects.
...
As a side note, I'd also suggest renaming the ISO639-2 and ISO639-3
columns to ISO639_2 and ISO639_3, respectively. ISO639-2 might be
interpreted as ISO639 minus 2, which can lead to all sorts of confusion.
Done
...
Were you thinking about a way to formally define a dialect? Ideally,
besides a region (by the way, ISO3166-2 is not granular enough, and there
should be a better way of expressing the region in which a dialect is
spoken, perhaps even down to the level of a village), one should be able
to specify the time period in which a dialect was spoken, the social layer
which was speaking it, and perhaps even a particular person or entity
using it. OK, I got carried away a bit, but such things might be
important :) Though some of this should perhaps be (also) tied to a word
and not a dialect.
Ultimate Wiktionary is about words (written, spoken or signed); that is
the starting point. There will be a need for some formality; once a
dialect is recognised, it will be hard to take it away. Therefore, in my
opinion, a dialect will be added only after some discussion, and it will
be added as such by an admin. I would think that we need to consider what
it takes before we add a dialect. Tentatively I would go for at least 100
words defined as such. With a dialect I would assume that words that are
not defined are those of the higher level language.
Things like where it is spoken and in which time periods sound to me like
etymological content. And that is where I would have it.
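The assumption that undefined dialect words are those of the higher
level language amounts to a lookup that walks up the dialect's parent
chain. A sketch, with made-up language codes and entries (none of the
names or data below come from the actual design):

```python
# Hypothetical data: "xx-north" is a dialect of the language "xx".
PARENT = {"xx-north": "xx"}

WORDS = {
    ("xx", "water"): "entry for 'water' in xx",
    ("xx", "bread"): "entry for 'bread' in xx",
    ("xx-north", "bread"): "entry for 'bread' in the northern dialect",
}


def lookup(word, language):
    """Return (language, entry) from the dialect itself if the word is
    defined there, otherwise from the nearest higher level language."""
    while language is not None:
        entry = WORDS.get((language, word))
        if entry is not None:
            return language, entry
        language = PARENT.get(language)  # None once the top is reached
    return None
```

Here "bread" resolves to the dialect's own entry, while "water" falls
back to the entry of the higher level language.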
Thanks,
    GerardM
