-------- Original Message --------
Subject: [Wikipedia] DICT protocol
Date: Wed, 07 Sep 2005 19:12:12 +0200
From: kael <kael@alussinan.org>
To: jwales@wikia.com
Dear Sir,
I would like to suggest that Wikimedia use the DICT protocol [1], which allows querying dictionaries and databases [2].
For example, thanks to the _Dict Firefox extension_ [3], it is possible to write a URL like the following: dict://dict.org/d:encyclopedia
There is a DICT server which allows querying the French Wikipedia, e.g. dict://mali.geekcorps.org/d:wikipedia [4].
It would be really great if we could query Wikimedia content with this protocol.
BTW, while searching http://google.com/search?q=wikipedia+dict I found a lot of websites reproducing Wikipedia content alongside Google Ads. I am wondering whether they comply with the GNU FDL license.
[1] http://www.ietf.org/rfc/rfc2229.txt
[2] http://www.dict.org
[3] http://dict.mozdev.org
[4] http://mali.geekcorps.org/mediawiki/index.php/Wik2dict
Cheers,
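For readers unfamiliar with the protocol: a DICT exchange is plain text over TCP port 2628, and RFC 2229 frames each definition between a `151` status line and a line containing a lone period. A minimal sketch of the client-side parsing, run against a canned (illustrative, not real) server transcript rather than a live connection:

```python
# Minimal parser for a DICT (RFC 2229) DEFINE response.
# The transcript below is canned for illustration; a real client
# would read these lines from a TCP connection to port 2628.

def parse_define_response(lines):
    """Return a list of (word, database, definition_text) tuples."""
    defs = []
    it = iter(lines)
    for line in it:
        if line.startswith("151"):
            # 151 "word" database "description" -- a definition follows,
            # terminated by a line containing a single period.
            parts = line.split('"')
            word, database = parts[1], parts[2].strip().split()[0]
            body = []
            for text in it:
                if text == ".":
                    break
                # A line beginning with ".." encodes a literal leading dot.
                body.append(text[1:] if text.startswith("..") else text)
            defs.append((word, database, "\n".join(body)))
        elif line.startswith("250"):   # 250 ok -- end of response
            break
    return defs

transcript = [
    '150 1 definitions retrieved',
    '151 "encyclopedia" wn "WordNet (r) 2.0"',
    'encyclopedia',
    '    n : a reference work containing articles on various topics',
    '.',
    '250 ok',
]
print(parse_define_response(transcript))
```

The status codes (150, 151, 250) and the dot-terminated text framing are as specified in RFC 2229; the database name `wn` and the definition text here are made up.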
On Sun, Sep 11, 2005 at 05:29:57AM -0400, Jimmy Wales wrote:
Dear Sir, I would like to suggest you at Wikimedia to use the DICT protocol [1] [...]
The idea has been proposed many times.
Attached is a toy Polish Wiktionary->dictd gateway I once coded. It's not usable in the real world, but it may give people some ideas.
I've talked to folks who have worked on it, but I am still left with the feeling that if there is to be a server/client protocol, HTTP is the best way to handle it. Of course, if you want to download the content and run your own dict server for others, that's great.. but how many are out there who would use it?
/Alterego
On 9/11/05, Tomasz Wegrzanowski taw@users.sf.net wrote:
On Sun, Sep 11, 2005 at 05:29:57AM -0400, Jimmy Wales wrote:
Dear Sir, I would like to suggest you at Wikimedia to use the DICT protocol [1] [...]
The idea has been proposed many times.
Attached is a toy Polish Wiktionary->dictd gateway I once coded. It's not usable in the real world, but it may give people some ideas.
Wikitech-l mailing list Wikitech-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l
On Sun, Sep 11, 2005 at 04:35:39PM -0600, Brian wrote:
I've talked to folks who have worked on it, but still I am left with the feeling that if there is to be a server/client protocol, http is the best way to handle it. Of course, if you want to download the content and run your own dict server for others thats great..but how many are out there that will use it?
=== Why dictd ===
For Wikipedia, dictd is not very useful. For the Wiktionaries it is - this way, content from a Wiktionary becomes just one of many dict data feeds your dict client may use.
Having dictionary data in a simple text format is usually much more useful than having it in HTML. The dictd protocol provides more features than naive HTTP: it specifies how to select from multiple dictionaries, how to specify various match strategies, etc. dict clients know about those features and may use them.
We could de facto reimplement dictd over HTTP, specifying an API for dictionary queries, but we would have to inform all the clients about it. Then either all dictionary servers would have to switch to our API and all existing dict clients would have to switch to it, or all dict clients would have to understand both dictd and our API - not good.
=== Practical solution ===
Actually, running a local dictd is a pretty usual thing under Linux. The only thing we would need to do to become dict-compatible is to provide nicely reformatted Wiktionaries in a dictd-compatible format. It should not be too hard to generate them from the XML dumps, but the generator has to be aware of the templates used in all the Wiktionaries, etc.
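As a sketch of what the output stage of such a generator might look like: a dictd database is a flat `.dict` file of concatenated definitions plus a sorted `.index` file whose byte offsets and lengths are written as base-64 numbers. The entry text and file name below are made up for illustration; a real generator would feed it words parsed from the dump.

```python
# Sketch: write a dictd-style database (.dict + .index) from a
# word -> definition mapping, e.g. one extracted from a Wiktionary dump.
# The index stores each entry's byte offset and length as base-64
# numbers over this digit alphabet (most significant digit first,
# no padding), which is the layout dictd's index files use.

B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def b64_number(n):
    """Encode a non-negative integer as a dictd base-64 number."""
    if n == 0:
        return "A"
    digits = []
    while n:
        digits.append(B64[n % 64])
        n //= 64
    return "".join(reversed(digits))

def write_dictd_db(entries, basename):
    """Write entries (word -> definition text) as basename.dict/.index."""
    offset = 0
    with open(basename + ".dict", "wb") as dic, \
         open(basename + ".index", "w", encoding="utf-8") as idx:
        # dictd expects the index sorted (here: case-insensitively).
        for word in sorted(entries, key=str.lower):
            data = (entries[word].rstrip("\n") + "\n").encode("utf-8")
            dic.write(data)
            idx.write("%s\t%s\t%s\n"
                      % (word, b64_number(offset), b64_number(len(data))))
            offset += len(data)

write_dictd_db({"wiki": "wiki: a collaboratively edited website"},
               "wiktionary-sample")
```

A template-aware extractor in front of this writer is where the real work lies, as noted above.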
Tomasz Wegrzanowski wrote:
For wikipedia, dictd is not very useful. For wiktionaries it is - this way content from the wiktionary becomes just one of many dict data feeds your dict client may use.
What are the scalability issues if wiktionary.org were to run a central dictd server for the entire world to use? How many clients can connect simultaneously and how much traffic should be expected from the average client?
On Mon, Sep 12, 2005 at 04:05:33PM +0200, Lars Aronsson wrote:
What are the scalability issues if wiktionary.org were to run a central dictd server for the entire world to use? How many clients can connect simultaneously and how much traffic should be expected from the average client?
dictd servers typically use read-only preprocessed databases, and are therefore easy to replicate. Just dump Wiktionary once a day, preprocess it into a suitable format, and copy it over NFS or something.
The computational costs and traffic overhead are low (though, unlike HTTP, there is no free gzip-on-the-fly) - it's really just a very simple protocol, with clients only reading data from servers.
The only operations that are at all expensive are the various advanced match strategies, like "match by regular expression" or "match by substring". However, the RFC requires that we implement only exact match and prefix match, if that were a problem.
Maintaining one more daemon, plus the processing scripts, would probably be the largest part of the cost.
Tomasz Wegrzanowski wrote:
dictd servers [...] The computational costs and traffic overhead are low (but unlike http no free gzip-on-fly) - it's a really just a very simple protocol with clients only reading data from servers.
Judging from the Linux dictd manpage (*), the current implementation seems to spawn one server process for each connecting client, non-threaded, and each client can then stay connected as long as it sees fit. Without having tried it, this kind of solution appears to scale to a few hundred simultaneous users (workgroup or intranet level), but not to the global web level. Old webserver programmers will smile and remember how webservers used to work in 1995, based on the late W. Richard Stevens's books.
I think it could be interesting (sometime in the future) to deploy a global central spelling dictionary server, where users can update the dictionary in real time (results showing up in Wiktionary), but the dictd server software would probably have to be completely rewritten for that level of scalability. It might make sense to replace the dict protocol and its need for special servers with some kind of XML over HTTP, if only to benefit from the scalability already designed into Apache.
(*) From http://www.die.net/doc/linux/man/man8/dictd.8.html , under OPTIONS:
    --limit children
        Specifies the number of daemons that may be running simultaneously. Each daemon services a single connection. If the limit is exceeded, a (serialized) connection will be made by the server process, and a response code 420 (server temporarily unavailable) will be sent to the client. This parameter should be adjusted to prevent the server machine from being overloaded by dict clients, but should not be set so low that many clients are denied useful connections. The default is 100, but may be changed in the dictd.h file at compile time (DICT_DAEMON_LIMIT).
On Tue, Sep 13, 2005 at 02:07:50PM +0200, Lars Aronsson wrote:
Judging from the Linux dictd manpage, the current implementation seems to spawn one server process for each connecting client [...] It might make sense to replace the dict protocol and its need for special servers with some kind of XML over HTTP [...]
We're talking Unix here. Per-process overhead is just a few kB; the TCP/IP buffers are bigger than that.
The real point is of course the ease of replication. Just put 20 machines with a dictd on each of them behind a dumb load balancer, and you can serve practically 20x more connections.
I don't think any dictd server actually serves thousands of simultaneous connections, so nobody has cared to write a more efficient server. But the protocol is so simple that it shouldn't be difficult, if it were actually needed. http://www.dict.org/links.html lists servers written in C, Perl, Java and Python.
The Perl server (jiten) has (not counting the addons/ directory) about 500 LOC and is fork-less (it uses IO::Select).
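A fork-less server in the IO::Select style is similarly short in other languages. Here is a rough Python sketch using the stdlib selectors module: one event loop multiplexes every connection in a single process, with a toy protocol (every DEFINE gets a fixed 552 "no match" reply) standing in for a real dictd back end.

```python
# Single-process, fork-less connection handling in the style of
# IO::Select, using Python's selectors module.  One event loop
# multiplexes all clients; no child process or thread per connection.

import selectors
import socket

sel = selectors.DefaultSelector()

def serve_line(conn):
    data = conn.recv(4096)
    if not data:
        sel.unregister(conn)
        conn.close()
        return
    # Toy protocol: answer any DEFINE with a fixed 552 (no match).
    if data.upper().startswith(b"DEFINE"):
        conn.sendall(b"552 no match\r\n")
    else:
        conn.sendall(b"500 syntax error\r\n")

def accept(server):
    conn, _ = server.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, serve_line)

server = socket.socket()
server.bind(("127.0.0.1", 0))       # loopback only, ephemeral port
server.listen()
server.setblocking(False)
sel.register(server, selectors.EVENT_READ, accept)
port = server.getsockname()[1]

# Drive the loop with one local client to show it works.
client = socket.create_connection(("127.0.0.1", port))
client.sendall(b"DEFINE wn encyclopedia\r\n")
for _ in range(3):                  # a few loop iterations suffice here
    for key, _ in sel.select(timeout=0.5):
        key.data(key.fileobj)       # registered callback for this socket
reply = client.recv(4096)
print(reply)
client.close()
server.close()
```

The same structure serves many concurrent clients with one registered socket per connection, which is the point being made above about per-connection cost.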