I have been on this list a while. When I originally joined, I was interested in the possibility of exporting the Wiktionary data in .dict format. Now that the newest version of OS X (10.4) has a built-in dictionary that uses dict:// to look up words, I would like to see whether anyone on the technical side wants to explore either exporting the Wiktionary database in .dict format or running a dictionary daemon that would access the Wiktionary database server and return dict entries. It would be read-only, but it would be another interesting way to access Wiktionary besides the web interface.
Does anyone on the tech list know if this is even possible? I'm not asking you to do it (I can write the export); I was wondering whether there is some sort of database schema available for extracting the data into dict format, or whether the entries are too fragmented to even attempt an export.
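As a rough illustration of what such an export might produce: the dictd server reads a plain-text data file plus an index whose lines hold a headword, a byte offset and a byte length, with the two numbers written in dictd's base-64 digit notation. The sketch below assumes that layout and an in-memory mapping of headwords to entry text. The names `b64_number` and `build_dict_files` are hypothetical, and the case-insensitive sort only approximates dictd's own ordering, so a real export would more safely go through the `dictfmt` tool.

```python
# Digits used by dictd for offsets and lengths in its index file.
B64_DIGITS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def b64_number(value):
    """Write a non-negative integer in base 64, most significant digit first."""
    if value < 0:
        raise ValueError("offsets and lengths cannot be negative")
    digits = []
    while True:
        digits.append(B64_DIGITS[value % 64])
        value //= 64
        if value == 0:
            break
    return "".join(reversed(digits))


def build_dict_files(entries):
    """Return (data_bytes, index_text) for a mapping of headword -> entry text.

    Assumption: entries are simply concatenated in the data file, and each
    index line is "headword<TAB>offset<TAB>length".
    """
    data = bytearray()
    index_lines = []
    # Approximation of dictd's ordering: plain case-insensitive sort.
    for headword in sorted(entries, key=str.casefold):
        body = entries[headword].encode("utf-8")
        index_lines.append(
            f"{headword}\t{b64_number(len(data))}\t{b64_number(len(body))}"
        )
        data += body
    return bytes(data), "\n".join(index_lines) + "\n"
```

Writing the two return values to `something.dict` and `something.index` would give files in the assumed layout; the entry text itself can be whatever was extracted from the wiki page.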
-brian
Hoi, I read your mail with interest. It made me look into what the .dict format is. It is described in RFC 2229. It allows people to look up information from their computer, and the information is delivered from one of the many hosts that may hold the requested information. As Angela said in her reply to your post, we are working on a new iteration of Wiktionary that goes by the name of "Ultimate Wiktionary". This will have a relational database at its heart. It is intended to have content in all languages, and the first challenge is to make it work in the first place. The second challenge is to create a user interface that translates to all these languages, and the third challenge is to have an import and export mechanism, preferably using a standards-based XML schema.
We hope to show something at the Wikimania event. In your mail you want to export the Wiktionary data. The consequence is that if you choose to export, you will have to do this continually, as we hope to increase the content of Ultimate Wiktionary dramatically. As far as I understand the RFC, there is a need to respond to a request and to provide a reply in a set format. There is no need to have a database in a specific format as long as the response provided conforms to the RFC.
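To make the "reply in a set format" concrete, here is a minimal sketch of how a server might frame the answer to a DICT `DEFINE` command as laid out in RFC 2229: a 150 status line, one 151 line per definition followed by the text and a lone ".", then 250. The function name `format_define_response` and the tuple layout are assumptions for illustration, and the sketch does not escape quote characters inside the word.

```python
def format_define_response(word, definitions):
    """Frame definitions as a DICT DEFINE reply.

    definitions: list of (database_name, database_description, text) tuples.
    This tuple layout is a hypothetical choice, not something the RFC defines.
    """
    if not definitions:
        return "552 no match\r\n"
    lines = [f"150 {len(definitions)} definitions retrieved"]
    for database, description, text in definitions:
        lines.append(f'151 "{word}" {database} "{description}"')
        for line in text.splitlines():
            # A text line starting with "." is doubled so it cannot be
            # mistaken for the end-of-text marker.
            lines.append("." + line if line.startswith(".") else line)
        lines.append(".")  # end of this definition's text
    lines.append("250 ok")
    return "\r\n".join(lines) + "\r\n"
```

Where the text comes from (a database row, a dump file, a live wiki page) is invisible to the client, which is the point made above: only the framing has to conform.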
As the Ultimate Wiktionary is being designed at the moment and as Wikidata is being built, this is a time to consider what is needed to provide .dict functionality. This functionality will be included when someone does the programming or when someone finds the funds to strap the .dict functionality on top of Ultimate Wiktionary. At this time it is premature to think about exporting from UW as UW has not been built yet. It will however be possible to do so.
Some of the current Wiktionaries can be parsed into information, as they are highly structured. Others will prove an interesting challenge to convert to any other format: because of their lack of structure and consistency, they are as closed as any proprietary format would be. Even the name of an article is not necessarily the name of the associated word, as some Wiktionaries still capitalise the first character of a word.
One problem I see with exporting content from Wiktionary is the GNU-FDL requirement to maintain the history of the contributors. For the UW, I think it can be solved by adding the history information on the talk page. The necessity for UW stems from the likelihood that many Wiktionaries, if not all, may merge into the Ultimate Wiktionary and be abandoned.
Thanks, GerardM.
I was working on parsing the English Wiktionary for some months, with a long-term goal of sharing translations for my translator, Linguaphile.
The free-form nature of Wiktionary articles is the biggest hurdle. While there is a format that adds some structure to the data, it has many variations, with experimental new ones appearing all the time, and many inherent flaws which the contributors either haven't overcome or haven't been able to agree on a single way to overcome.
I would've happily shared the code I had, but due to a grey-out (so I'm told), almost all of my computer's components were destroyed and I haven't been able to recover the data from the hard drive.
While a fully structured re-design would obviously help myself and people wanting to interchange Wiktionary data with .dict data, it would definitely make it much harder for the general user to contribute.
I've been thinking about a compromise solution where slightly more structure might be added, perhaps similar to HTML/CSS styles. This coupled with some kind of very flexible parsing might get a degree of success.
I'm still very interested in this field and may try to recreate my parser, but losing that much work is very disheartening.
Andrew Dunbar (hippietrail)
On 5/17/05, Gerard Meijssen gerard.meijssen@gmail.com wrote:
Hoi, I read your mail with interest. It made me look into what the .dict format is. It is described in RFC 2229. It allows people to look up information from their computer, and the information is delivered from one of the many hosts that may hold the requested information. As Angela said in her reply to your post, we are working on a new iteration of Wiktionary that goes by the name of "Ultimate Wiktionary". This will have a relational database at its heart. It is intended to have content in all languages, and the first challenge is to make it work in the first place. The second challenge is to create a user interface that translates to all these languages, and the third challenge is to have an import and export mechanism, preferably using a standards-based XML schema.
We hope to show something at the Wikimania event. In your mail you want to export the Wiktionary data. The consequence is that if you choose to export, you will have to do this continually, as we hope to increase the content of Ultimate Wiktionary dramatically. As far as I understand the RFC, there is a need to respond to a request and to provide a reply in a set format. There is no need to have a database in a specific format as long as the response provided conforms to the RFC.
As the Ultimate Wiktionary is being designed at the moment and as Wikidata is being built, this is a time to consider what is needed to provide .dict functionality. This functionality will be included when someone does the programming or when someone finds the funds to strap the .dict functionality on top of Ultimate Wiktionary. At this time it is premature to think about exporting from UW as UW has not been built yet. It will however be possible to do so.
Some of the current Wiktionaries can be parsed into information, as they are highly structured. Others will prove an interesting challenge to convert to any other format: because of their lack of structure and consistency, they are as closed as any proprietary format would be. Even the name of an article is not necessarily the name of the associated word, as some Wiktionaries still capitalise the first character of a word.
One problem I see with exporting content from Wiktionary is the GNU-FDL requirement to maintain the history of the contributors. For the UW, I think it can be solved by adding the history information on the talk page. The necessity for UW stems from the likelihood that many Wiktionaries, if not all, may merge into the Ultimate Wiktionary and be abandoned.
Thanks, GerardM.
Wiktionary-l mailing list Wiktionary-l@Wikipedia.org http://mail.wikipedia.org/mailman/listinfo/wiktionary-l
Andrew Dunbar wrote:
While a fully structured re-design would obviously help myself and people wanting to interchange Wiktionary data with .dict data, it would definitely make it much harder for the general user to contribute.
Hoi, I am interested in knowing why it would be harder for a general user to contribute. I do agree that the current format of the Dutch Wiktionary is hard to get into. But I am really interested in learning why a new application would be harder for the general public to contribute to. The improvement of the user interface through localisation, and the improvement of quality and quantity through an integration of the Wiktionaries and their communities, will lead to substantial improvements and will make the data much more accessible. There will also be no doubt about what can be filled in where, as this will be prescribed by the user interface. And yes, there will be room to enter data that cannot be entered in the structured UW mark I.
I am sorry to learn that you lost all the hard work that you did in creating a parser for the en.wiktionary. It is truly a loss because the data that can be exported to a .dict format can also be exported to an Ultimate Wiktionary format.
Thanks, GerardM
Gerard Meijssen wrote:
Andrew Dunbar wrote:
While a fully structured re-design would obviously help myself and people wanting to interchange Wiktionary data with .dict data, it would definitely make it much harder for the general user to contribute.
I am interested in knowing why it would be harder to contribute for a general user.
I think he just made certain assumptions about the correlation between database back-end and UI front-end which simply don't hold in general.
Greetings, Timwi
... In your mail you want to export the Wiktionary data. The consequence is that if you choose to export, you will have to do this continually, as we hope to increase the content of Ultimate Wiktionary dramatically.
--- I agree. I am personally on the road a lot and only have dial-up access, so a 'local copy' that I could search without getting online would be great. I know that the online content and the local content would drift apart, but that is acceptable for the ability to search offline. It was just a thought; I realise the implications for bandwidth, etc., but it is something to consider.
As far as I understand the RFC, there is a need to respond to a request and to provide a reply in a set format. There is no need to have a database in a specific format as long as the response provided conforms to the RFC.
--- Yes, I was just asking about the database to see whether the structure was in place to make an easy transition from an RDB entry to a .dict response. I think you answered it as basically no, since each page is formatted independently, and yes, since the new Ultimate Wiktionary will have a defined structure.
As the Ultimate Wiktionary is being designed at the moment and as Wikidata is being built, this is a time to consider what is needed to provide .dict functionality. This functionality will be included when someone does the programming or when someone finds the funds to strap the .dict functionality on top of Ultimate Wiktionary.
--- I would be willing to help and donate time to work on .dict functionality. When things get to that point in the Ultimate Wiktionary development, please keep me in mind.
One problem I see with exporting content from Wiktionary is the GNU-FDL requirement to maintain the history of the contributors. For the UW, I think it can be solved by adding the history information on the talk page. The necessity for UW stems from the likelihood that many Wiktionaries, if not all, may merge into the Ultimate Wiktionary and be abandoned.
--- I agree, you don't want multiple forks of definitions floating around. I only suggested the export for offline reference, not for importing somewhere else and continuing from there.
-brian
Brian, there is already something. I have to write a mail to Kasper about this, but since it is going to be a longer mail I have waited a moment; there are too many things going on these days.
I'll see if I can do this later on tonight. Anyway, he can convert Wiktionaries into a .dict format, as far as I understand.
Ciao, Sabine
In fact, exporting the Wiktionaries is almost impossible right now. I once tried to write a Python script to convert an English Wiktionary entry into some common, logical format, but there are too many different ways an entry can be built up. Maybe I'll try again one day, but my time has become very limited lately. I'm not saying it's entirely impossible, but any solution will need some manual input; it's very hard to automate it all the way. Another thing is that the Wiktionary content is mostly not ready yet. We've done an amazing amount of work already and some entries are already quite good, but most of the content needs a few (maybe 10 or 20) more years of work before it will be usable by the public. We should be thinking about making it possible to export to various formats right now, though. If all you want to do is look up a word and present whatever is currently on its page to the user (or grab only the part describing the English word), that should be possible and would be very easy to do. Maybe I should look into this dict format to find out whether that would work. I'll go do that right now...
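The "grab only the part describing the English word" idea could look roughly like the sketch below. It assumes a page marks each language with a level-two heading such as `==English==`, which many English Wiktionary pages do but which is by no means guaranteed given the variation described above. The name `extract_language_section` is made up for this example.

```python
import re

# Matches level-two headings like "==English==" but not "===Noun===".
LEVEL2_HEADING = re.compile(r"^==[ \t]*([^=].*?)[ \t]*==[ \t]*$", re.MULTILINE)


def extract_language_section(wikitext, language="English"):
    """Return the text under the given language heading, or None if absent.

    Assumption: language sections start with a level-two heading and run
    until the next level-two heading or the end of the page.
    """
    headings = list(LEVEL2_HEADING.finditer(wikitext))
    for i, heading in enumerate(headings):
        if heading.group(1) == language:
            if i + 1 < len(headings):
                end = headings[i + 1].start()
            else:
                end = len(wikitext)
            return wikitext[heading.end():end].strip()
    return None
```

Nothing inside the section is interpreted; the text is passed along as it stands, which is what keeps this approach easy compared with a full conversion.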
Jo
Hi Brian,
I went over and read the dict RFC document. It only explains the protocol for talking to a dict server. It says nothing about what the contents should look like. Pure HTML would be OK, and it would be possible to indicate that it is HTML by specifying the MIME type.
So, this answers your question in a positive way. It would be almost trivial to provide the contents of the various Wiktionaries through the dict protocol. All of Wiktionary is stored in MySQL databases, one for each project. It is possible to download dump files of these and load them into your own local MySQL server, or it would be possible to set up a dict server that relays requests immediately to the Wikimedia servers. It is possible to retrieve the entries as XML rather than HTML (this means less overhead for the Wikimedia servers). For a way to do that, I suggest you have a look at PyWikipediaBot. It has a module that does exactly that (and more, but that's not important for a read-only dict server).
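For the XML route, a relay could fetch a page through MediaWiki's Special:Export page and pull the raw wikitext out of the result. The sketch below is only an assumption about how that might be done and is not taken from PyWikipediaBot. The helper names `export_url` and `wikitext_from_export` are hypothetical, the default site URL is just an example, and the parser ignores the XML namespace because its version string differs between MediaWiki releases.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote


def export_url(title, site="https://en.wiktionary.org"):
    """Build the Special:Export URL for one page title (site is an assumption)."""
    return f"{site}/wiki/Special:Export/{quote(title, safe='')}"


def local_name(tag):
    """Strip the "{namespace}" prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def wikitext_from_export(xml_text):
    """Map page title -> wikitext for every <page> in an export document.

    If a page carries several revisions, the last <text> element wins.
    """
    pages = {}
    for page in ET.fromstring(xml_text).iter():
        if local_name(page.tag) != "page":
            continue
        title = None
        text = None
        for element in page.iter():
            name = local_name(element.tag)
            if name == "title":
                title = element.text
            elif name == "text":
                text = element.text or ""
        if title is not None and text is not None:
            pages[title] = text
    return pages
```

A relaying dict server would download `export_url(word)`, pass the body to `wikitext_from_export`, and hand the resulting text to whatever formats the dict reply.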
Good luck and I hope you feel like implementing this.
Polyglot
wiktionary-l@lists.wikimedia.org