Hi All,
I created a tool to extract translations from different editions of Wiktionary. Right now it supports 39 different Wiktionaries. It only extracts translations and ignores the rest.
Supported Wiktionaries: Azerbaijani, Bulgarian, Catalan, Czech, Danish, Greek, English, Esperanto, Spanish, Estonian, Basque, Finnish, French, Galician, Hebrew, Croatian, Hungarian, Indonesian, Italian, Georgian, Latin, Lithuanian, Malagasy, Dutch, Norwegian, Occitan, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Serbian, Swedish, Swahili, Turkish, Ukrainian, Vietnamese and Chinese.
Adding a new Wiktionary is done via a configuration file.
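To give a rough idea of what extracting translations involves, here is a minimal sketch. It assumes the English Wiktionary convention of marking translations with `{{t|lang|word}}` and `{{t+|lang|word}}` templates. The regex, the function name `extract_translations`, and the sample wikitext are all made up for illustration; they are not taken from wikt2dict, and other editions use different markup, which is presumably what the per-Wiktionary configuration describes.

```python
import re

# Hypothetical pattern for English-Wiktionary-style translation templates,
# e.g. {{t|la|canis|m}} or {{t+|de|Hund|m}}. Not wikt2dict's actual rule.
T_TEMPLATE = re.compile(r"\{\{t[+-]?\|([^|}]+)\|([^|}]+)")


def extract_translations(headword, wikitext):
    """Return (headword, language code, translation) triples found in wikitext."""
    return [
        (headword, lang, word.strip())
        for lang, word in T_TEMPLATE.findall(wikitext)
    ]


# Made-up fragment of a translation section for the entry "dog".
sample = """
* German: {{t+|de|Hund|m}}
* Hungarian: {{t+|hu|kutya}}
* Latin: {{t|la|canis|m}}
"""

translations = extract_translations("dog", sample)
for triple in translations:
    print(triple)
```

Everything outside the matched templates (definitions, etymology, pronunciation) is simply ignored, which mirrors the "translations only" scope described above.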
Right now the beta version is available for download at: https://github.com/juditacs/wikt2dict
Documentation is in progress; until then, the README should be enough to get started.
Please test it and send me your feedback and bug reports.
Thanks, Judit Ács
Great,
Do you plan to add more functions, like generating miscellaneous output (ebook versions, "printable" PDF, etc.) from a dump? The main problem is probably converting all the templates…
On 2013-07-12 13:19, Judit wrote:
_______________________________________________
Wiktionary-l mailing list
Wiktionary-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiktionary-l
Hi,
I don't plan to generate other output formats, since the extracted dictionaries are better suited to automated processing than to use as an ordinary dictionary. It does sound interesting, though, so I may do it in the future.
Since the first version, I have added a triangulation function that tries to build new translation pairs from the ones extracted from the Wiktionaries. It works reasonably well (over 85% of the new pairs were correct when checked manually on a few language pairs) and yields many new pairs. I plan to improve these methods further.
BTW, the data is available on request: just send me an email.
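The idea behind triangulation can be sketched as follows. This is only a naive illustration under my own assumptions, not the method wikt2dict implements: if two words in different languages are both linked to the same pivot word, they are proposed as a new translation pair. The function name `triangulate` and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def triangulate(pairs):
    """Propose new translation pairs that share a pivot word.

    `pairs` is an iterable of ((lang, word), (lang, word)) tuples.
    Returns a set of frozensets, each holding two (lang, word) entries
    that were not directly linked in the input.
    """
    neighbours = defaultdict(set)
    known = set()
    for a, b in pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)
        known.add(frozenset((a, b)))

    candidates = set()
    for pivot, linked in neighbours.items():
        # Every two words linked to the same pivot form a candidate pair.
        for a, b in combinations(sorted(linked), 2):
            if a[0] == b[0]:
                continue  # same language: a synonym, not a translation
            pair = frozenset((a, b))
            if pair not in known:
                candidates.add(pair)
    return candidates


# Made-up input: three translations of English "dog".
pairs = [
    (("en", "dog"), ("hu", "kutya")),
    (("en", "dog"), ("de", "Hund")),
    (("en", "dog"), ("fr", "chien")),
]

new_pairs = triangulate(pairs)
for pair in new_pairs:
    print(sorted(pair))
```

A single pivot like this is error-prone when the pivot word is ambiguous (for example, a word with several unrelated senses), which is one reason the generated pairs need manual checking or further filtering.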
Judit
2013/7/12 Mathieu Stumpf psychoslave@culture-libre.org
-- Association Culture-Libre http://www.culture-libre.org/
For those who are not aware of DBpedia Wiktionary [1]: it also supports translations (among much other lexical information), e.g. http://wiktionary.dbpedia.org/page/german-English-Adjective-2en%5C
It's a little harder to fully configure a new language, but you get a lot more out of it. For now we support en, de, el, fr and ru, and we will happily accept contributions for other languages.
Best, Dimitris
[1] http://wiktionary.dbpedia.org/
On Fri, Jul 12, 2013 at 3:22 PM, Judit, Ács acs.judit@sztaki.hu wrote:
Hi,
I added support for the German Wiktionary; it is available in the newest version. There is a quick test script that should get you 300k+ translations from the German Wiktionary in less than 15 minutes.
Dictionaries in 50 languages, built using wikt2dict and other resources (parallel and comparable corpora), are available here: http://hlt.sztaki.hu/resources/index.html
Please let me know if you find parsing errors.
I understand that DBpedia Wiktionary does a lot more than wikt2dict, and I do not plan to compete with it. However, adding 35+ Wiktionaries to it would have been nearly impossible for me. wikt2dict is a quick (and dirty) way to extract the translations.
Cheers, Judit
2013/7/12 Judit, Ács acs.judit@sztaki.hu