There is a need for using XML with wiktionary.
The definitions of a word in Wiktionary can be structured in a fixed way. For each word/phrase you have:
* Indication of what language it is in
* Name of the word/phrase
* Definition of the word/phrase
* Translations
* Pronunciation
* Synonyms
* Antonyms
etc.
I do not try to be complete here, but my point is that the data is structured. Other organisations that work with words already structure their data using XML, for instance GEMET. When Wiktionary is structured explicitly, import into and export from Wiktionary become possible, and other dictionary/glossary projects will be able to get changed content in an XML format that is specific to working with words. This will enhance the importance of Wiktionary and it will help achieve our aim, which is openly accessible dictionary content.
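To make the idea of a fixed structure concrete, here is a minimal sketch of what an XML export of a single entry could look like. The element names (`entry`, `term`, `definition`, `translation`, and so on) and the function `entry_to_xml` are assumptions made for illustration only; they do not follow GEMET or any existing dictionary standard.

```python
import xml.etree.ElementTree as ET


def entry_to_xml(entry):
    """Serialise one dictionary entry (a plain dict) to an XML string.

    All element and attribute names are hypothetical, chosen only to
    mirror the fields listed in the message above.
    """
    root = ET.Element("entry", {"lang": entry["lang"]})
    ET.SubElement(root, "term").text = entry["term"]
    ET.SubElement(root, "definition").text = entry["definition"]
    if "pronunciation" in entry:
        ET.SubElement(root, "pronunciation").text = entry["pronunciation"]
    # Each translation carries its own language code.
    for lang, word in entry.get("translations", []):
        ET.SubElement(root, "translation", {"lang": lang}).text = word
    # Synonyms and antonyms are simple repeated elements.
    for tag in ("synonym", "antonym"):
        for word in entry.get(tag + "s", []):
            ET.SubElement(root, tag).text = word
    return ET.tostring(root, encoding="unicode")


example = {
    "lang": "en",
    "term": "monkey",
    "definition": "a primate, typically with a tail",
    "translations": [("nl", "aap"), ("de", "Affe")],
}
print(entry_to_xml(example))
```

Because every field has its own element, another project can read the export back with any XML parser instead of scraping free-form page text.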
The flip side of the coin is that we can get (changed) content from other dictionary/glossary projects for inclusion in Wiktionary.
The GEMET data is available as XML, and it would be great to import it straight in from XML.
* Issues:
# Using XML standards for dictionary content.
# Importing data / Exporting data using the current Wiktionary content.
# Structuring Wiktionary using MySQL tables.
# Importing data / Exporting data using the future Wiktionary structure.
NB I have posted this on META as well
Thanks, GerardM
Geerd, please have a look at this:
*********************************************************
<glossword>
  <line>
    <term t1="A" t2="AT" id="4">attuatore</term>
    <defn>
      <trns lang="106">Trieb</trns>
      <trns lang="106">Arbeitszylinder</trns>
      Trieb ->azionatore (comandi idraulici); Arbeitshylinder -> ölhydraulisch
    </defn>
  </line>
  <line>
    <term t1="B" t2="BA" id="62">batteriostatico</term>
    <defn>
      <abbr lang="027"/>
      <trns lang="106">bakteriostatisch</trns>
    </defn>
  </line>
  <line>
    <term t1="B" t2="BO" id="54">bovino</term>
    <defn>
      <abbr lang="027"/>
      <trns lang="106">bovin</trns>
    </defn>
  </line>
  <line>
    <term t1="C" t2="CE" id="33">cellule ematiche</term>
    <defn>
      <abbr lang="027"/>
      <trns lang="106">Blutzellen</trns>
    </defn>
  </line>
  <line>
    <term t1="C" t2="CO" id="37">concentrazione serica</term>
    <defn>
      <abbr lang="027"/>
      <trns lang="106">Serumkonzentration</trns>
    </defn>
  </line>
  <line>
    <term t1="E" t2="EM" id="27">ematico</term>
    <defn>
      <abbr lang="027"/>
      <trns lang="106">Hämato-</trns>
      <trns lang="106">Blut-</trns>
    </defn>
  </line>
  <line>
    <term t1="E" t2="EN" id="8">endovenoso</term>
    <defn>
      <abbr lang="027"/>
      <trns lang="106">intravenös</trns>
    </defn>
  </line>
  <line>
    <term t1="E" t2="ET" id="5">etambutolo</term>
    <defn>
      <trns lang="106">Ethambutol</trns>
      <trns lang="103">ethambutol</trns>
      <src>http://www.gesundheit.de/roche/ro10000/r10785.html</src>
    </defn>
  </line>
  <line>
    <term t1="F" t2="FL" id="24">fleboclisi</term>
    <defn>
      <abbr lang="027"/>
      <trns lang="106">intravenöse Infusion</trns>
    </defn>
  </line>
  <line>
    <term t1="I" t2="IN" id="10">iniezione endovenosa</term>
    <defn>
      <abbr lang="027"/>
      <trns lang="106">intravenöse Injektion</trns>
    </defn>
  </line>
  <line>
    <term t1="L" t2="LE" id="77">legno impiallacciato</term>
    <defn>
      <abbr lang="025"/>
      <trns lang="106">furniertes Holz</trns>
      <src>eurodicautom</src>
    </defn>
  </line>
  <line>
    <term t1="L" t2="LE" id="71">legno tamburato</term>
    <defn>
      <abbr lang="025"/>
      <trns lang="106">furniertes Holz</trns>
      <src>eurodicautom</src>
    </defn>
  </line>
  <line>
    <term t1="M" t2="MA" id="44">mass media</term>
    <defn>
      <trns lang="106">Massenmedien</trns>
    </defn>
  </line>
  <line>
    <term t1="O" t2="OC" id="51">ocratossina</term>
    <defn>
      <abbr lang="054"/>
      <trns lang="106">Ochratoxin</trns>
      <src>http://www.verbraucherministerium.de/forschungsreport/rep2-99/ochra.htm</src>
      <src>Schimmelpilze sind in der Lage, unter sehr unterschiedlichen Bedingungen giftige Substanzen (Mykotoxine) zu bilden, die in der gesamten Nahrungskette vorkommen können. Zu den bekanntesten Vertretern zählt das Ochratoxin A, das von bestimmten Penicillium- und Aspergillus-Arten gebildet wird. Dieses Toxin kann die Nieren und das Immunsystem schädigen und zeigt im Tierversuch eine kanzerogene Wirkung. Der Aufgabe des Gesetzgebers, einen umfassenden Schutz der Verbraucher zu gewährleisten, ist das Bundesministerium für Gesundheit (BMG) nachgekommen und hat eine wissenschaftliche Studie über Ochratoxin A initiiert und gefördert, die 1996 begonnen und 1999 abgeschlossen wurde. Das Projekt hatte zum Ziel, Verzehrsdaten auf epidemiologischer Grundlage zu ermitteln und Ochratoxin A in relevanten Lebensmitteln sowie im Blutserum von Probanden zu bestimmen. Aus der Verknüpfung der Ergebnisse von Verzehrsdaten, Lebensmitteluntersuchungen und Blutserum-Analysen können eine fundierte Beurteilung der tatsächlichen Exposition der Bevölkerung in Deutschland erstellt und Empfehlungen für eine Höchstmengenregelung abgeleitet werden. (weiteres auf der Website)</src>
    </defn>
  </line>
  <line>
    <term t1="P" t2="PE" id="17">per via endovenosa</term>
    <defn>
      <abbr lang="027"/>
      <trns lang="106">intravenös</trns>
    </defn>
  </line>
  <line>
    <term t1="S" t2="SO" id="68">soluzione fisiologica</term>
    <defn>
      <abbr lang="027"/>
      <trns lang="106">Ringer Lösung</trns>
    </defn>
  </line>
</glossword>
****************************************************
This was the original glossary project where I tried to ask people for co-operation, but online I had just two members trying to help ... that's why I moved to .csv tables, which can then easily be converted into XML in a second stage (most people don't like working online even if they are connected 24 hrs a day ...). Glossword (www.glossword.info) is an open-source project - SourceForge page: http://sourceforge.net/projects/glossword/
Instead of ISO language codes, Glossword uses numbers to identify languages. This is a minor problem, as it is easy to change to ISO using search and replace.
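As a rough sketch of that conversion, the numeric codes could be rewritten with an XML parser rather than a plain text search, so that a number such as 106 is only touched where it appears in a `lang` attribute and not in an `id` or in running text. The mapping below is an assumption inferred from the sample above (106 appears on German words, 103 on an English one); the real Glossword code table is not given here, and codes that are not in the mapping are left unchanged.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: inferred from the sample data, not taken from Glossword itself.
LANG_MAP = {"106": "de", "103": "en"}


def to_iso(xml_text, lang_map):
    """Replace numeric lang attributes with ISO codes where a mapping is known."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        code = elem.get("lang")
        if code in lang_map:
            elem.set("lang", lang_map[code])
    return ET.tostring(root, encoding="unicode")


sample = (
    '<defn><trns lang="106">Ethambutol</trns>'
    '<trns lang="103">ethambutol</trns>'
    '<abbr lang="027"/></defn>'
)
print(to_iso(sample, LANG_MAP))
```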
My installation of the software can be found here: www.dict.wesolveitnet.com. If you'd like to try things out there, just let me know and I'll create a user account with access to all dictionaries. Really, I am not sure if this can be useful ... maybe just for conversion issues?
Ciao, Sabine
*************
Sabine Cretella s.cretella@wordsandmore.it www.wordsandmore.it Meetingplace for translators www.wesolveitnet.com
The current format is adequate. Your proposal makes no mention of the possible sacrifices in terms of ease of editing, a key feature in all the wikis. How will flexibility of format be maintained?
Ec
XML is not to be edited by hand. You are absolutely right about that. However, the current format is not without its problems. At this moment an English word cannot easily be re-used in other languages. Things are free-format at the moment. It would be a good thing if we started thinking about creating some database structures for use within Wiktionary. It would rid us of these dratted templates like {{en}} and {{-en-}}. They work, and they are the best thing around, but they are ugly.
What I propose at this time is to get us thinking about importing and exporting in an XML format, and to consider changes that enhance the functionality within all Wiktionaries and the functionality offered to the outside world.
One of the aims of Wikimedia is to create open content. By keeping our data in our own proprietary format, we do not achieve what can be achieved.
Thanks, Gerard
Being able to extract the data from the English Wiktionary is something that has always been on my mind. It is one of the considerations. Ease of editing is another. That's why the entries in Wiktionary may not look very pretty, but they are built up in a very logical way. It is possible, although probably not trivial, to process them with a script.
If XML is not to be edited by hand, then how are we going to edit it? Is it compatible with the Wiki concept?
If you want database structures, I can give you those. I have been designing a relational database capable of storing everything that is relevant for a dictionary. The problem is in the user interface. That's where I'm stuck. Another problem is how to keep a history of what was changed. Another thing is that I don't have any idea how performant my database would be. Building a presentable report for an entry involves solving relations between many, many tables.
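To illustrate the kind of relations involved, here is a deliberately tiny sketch using SQLite from Python. It is not the design described in this message; the three tables (`word`, `meaning`, `translation`) and the function `entry_report` are hypothetical and exist only to show why a report for one entry already needs several joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE word (
    id INTEGER PRIMARY KEY,
    lang TEXT NOT NULL,
    spelling TEXT NOT NULL
);
CREATE TABLE meaning (
    id INTEGER PRIMARY KEY,
    word_id INTEGER NOT NULL REFERENCES word(id),
    definition TEXT NOT NULL
);
-- A translation links one meaning to a word in another language,
-- so the target word is stored once and can be re-used.
CREATE TABLE translation (
    meaning_id INTEGER NOT NULL REFERENCES meaning(id),
    target_word_id INTEGER NOT NULL REFERENCES word(id)
);
""")
conn.executemany(
    "INSERT INTO word VALUES (?, ?, ?)",
    [(1, "en", "monkey"), (2, "nl", "aap"), (3, "de", "Affe")],
)
conn.execute(
    "INSERT INTO meaning VALUES (1, 1, 'a primate, typically with a tail')"
)
conn.executemany("INSERT INTO translation VALUES (?, ?)", [(1, 2), (1, 3)])


def entry_report(conn, spelling, lang):
    """Return (definition, translation language, translation) rows for one word."""
    return conn.execute(
        """
        SELECT m.definition, t.lang, t.spelling
        FROM word AS w
        JOIN meaning AS m ON m.word_id = w.id
        LEFT JOIN translation AS tr ON tr.meaning_id = m.id
        LEFT JOIN word AS t ON t.id = tr.target_word_id
        WHERE w.spelling = ? AND w.lang = ?
        ORDER BY m.id, t.lang
        """,
        (spelling, lang),
    ).fetchall()


for row in entry_report(conn, "monkey", "en"):
    print(row)
```

Even this toy version needs three joins for one entry. A full design with pronunciation, synonyms, pictures and change history would need many more, which is where the performance question comes from.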
Anyway, if you're interested in having a look at it, I will gladly send you the OpenOffice.org drawing with the table structures.
Polyglot
Jo, all this is very interesting - not only to me, but to some colleagues as well.
I am copying this message to some of them.
Many of us are translators and our difficulty normally is the structure and not the content. So if there's a way to combine things, it is much needed, and the more people know about this, the more easily we can probably find a way to concentrate efforts and avoid duplicate work.
I am sure most of the things we need are already there; they just need to be adapted so that they work across different projects.
So please, if you can: let us have a look at your structure.
Ciao and thank you!
Sabine
Wiktionary-l mailing list Wiktionary-l@Wikipedia.org http://mail.wikipedia.org/mailman/listinfo/wiktionary-l
XML will be the result of an extraction process. In the same way, XML will/may be turned into entries in the Wiktionary database.
The problem with the current "structure" is that it is too fluid and is prone to produce errors. Particularly problematic are the translations, where numbers that may or may not be present indicate which meaning of a word they refer to. It will be almost impossible to export to XML because of this.
I am really happy that your database design is on META so that people can comment on it (http://meta.wikimedia.org/wiki/Tables_for_Wiktionary). As discussed with Polyglot using Skype, I would prefer to have fewer tables. However, it is really useful to have thought-out designs, and as such it is hopeful for the things that may come.
One thing that XML should be used for is to create a history that people can subscribe to. This will export the Wiktionary content and will make it more relevant to many translators, as it is then a matter of reading it into whatever format they use. When changes are only entered 24 hours after the last change, things like vandalism have less of a chance.
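A minimal sketch of that 24-hour rule, assuming each change is known by its entry title and the time of its most recent edit: only entries that have been left alone for the whole quiet period go into the exported XML. The names `stable_changes` and `changes_to_xml`, the element names, and the example timestamp are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
import xml.etree.ElementTree as ET

QUIET_PERIOD = timedelta(hours=24)


def stable_changes(changes, now):
    """Keep only entries whose most recent edit is at least QUIET_PERIOD old."""
    return [c for c in changes if now - c["last_edit"] >= QUIET_PERIOD]


def changes_to_xml(changes):
    """Serialise the selected changes as a small XML feed."""
    root = ET.Element("changes")
    for c in changes:
        ET.SubElement(
            root,
            "entry",
            {"title": c["title"], "last_edit": c["last_edit"].isoformat()},
        )
    return ET.tostring(root, encoding="unicode")


# Arbitrary reference time, chosen only for the example.
now = datetime(2004, 6, 2, 12, 0, tzinfo=timezone.utc)
changes = [
    {"title": "monkey", "last_edit": now - timedelta(hours=30)},
    {"title": "bovino", "last_edit": now - timedelta(hours=2)},
]
feed = changes_to_xml(stable_changes(changes, now))
print(feed)
```

An entry edited two hours ago stays out of the feed until it has been quiet for a full day, which gives other editors time to revert vandalism before subscribers receive it.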
Some excellent touches in the db design are the inclusion of pictures and sounds. We have SAMPA, but nothing beats hearing a native speaker say a word or a phrase. A description of a monkey is fine, but a picture paints a thousand words. One thing that could be added is something on etymology.
Thanks, Gerard
I have elaborated a bit on what the tables are meant for. Please feel free to ask questions about it. I will gladly (try to) respond.
Jo
--- Gerard Meijssen gerardm@myrealbox.com wrote:
> There is a need for using XML with wiktionary.
I agree.
> The definitions of a word in wiktionary, can be structured in a fixed way.
I disagree. But it depends on *how much* structure you want.
> For each word/phrase you have a: *Indication what language it is in *Name of the word/phrase *Definition of the word/phrase *Translations *Pronunciation *Synonyms *Antonyms etc
Actually, if we only wanted to structure these parts it would work OK. Many other properties of words and phrases are a lot more difficult, such as part of speech.
> I do not try to be complete here, but my point is, the data is structured. Other organisations that work with words already structure their data using XML for instance GEMET. When Wiktionary is structured explicitly, the result will be that the import and export from Wiktionary becomes possible and it will become possible for other dictionary/ glossary project to get a changed content in XML format that is specific for working with words. This will enhance the importance of wiktionary and it will help achieve our aim which is open accessible dictionary content.
> The flip side of the coin is that we can get (changed) content from other dictionary/ glossary projects for inclusion in wiktionary.
> The GEMET data is available as XML data and, it would be great to import it straight in from XML.
> *Issues: #Using XML standards for dictionary content. #Importing data / Exporting data using the current wiktionary content. #Structuring wiktionary using MySQL tables. #Importing data / Exporting data using the future wiktionary structure.
> NB I have posted this on META as well
I do think a dictionary requires structure which an encyclopedia does not. A very loose structure like you have described would be a benefit for Wiktionary. The problems I see are these:
1. Once we have some structure people will push for more structure such as part-of-speech, not realizing how difficult that is to get right in a multilingual dictionary.
2. To work with the wiki software we can have a tool/script/routine which maps from internal XML into wiki/HTML so it can be displayed.
3. People will have to input XML, or we need a friendly interface which can take input from non-expert users and turn it into correct XML.
Number 3 would mean a *lot* of work for developers.
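For comparison, the mapping in number 2 is the easier direction. The sketch below turns a small XML entry into wiki-style text; both the XML element names and the output layout are assumptions made for the example and are not Wiktionary's actual entry format.

```python
import xml.etree.ElementTree as ET


def xml_to_wikitext(xml_text):
    """Render one hypothetical <entry> element as wiki-style text."""
    entry = ET.fromstring(xml_text)
    term = entry.findtext("term", default="")
    lines = ["== " + term + " (" + entry.get("lang", "") + ") =="]
    definition = entry.findtext("definition")
    if definition:
        lines.append("# " + definition)
    translations = entry.findall("translation")
    if translations:
        lines.append("=== Translations ===")
        for t in translations:
            # Each translation becomes a bullet with a wiki link.
            lines.append("* " + t.get("lang", "") + ": [[" + (t.text or "") + "]]")
    return "\n".join(lines)


sample = (
    '<entry lang="en"><term>monkey</term>'
    "<definition>a primate, typically with a tail</definition>"
    '<translation lang="nl">aap</translation></entry>'
)
print(xml_to_wikitext(sample))
```

Going the other way, from whatever a non-expert types into a form to correct XML, needs validation and error handling for every field, which is why number 3 is so much more work.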
Andrew (hippietrail).
===== http://linguaphile.sf.net/cgi-bin/translator.pl http://www.abisource.com