Dear Wiktionary Community,
We have been working on a triangulation method to expand existing dictionaries in many languages. We parsed translations from 40 Wiktionary editions and, using these as seed dictionaries (approx. 3.6M translation pairs), created an additional 16M pairs in 50 languages. The number of languages can be extended.
While the automatically generated dictionary is not 100% correct, with suitable filtering over 90% correctness can be reached.
One version of the parsed Wiktionaries and the generated pairs can be found here: https://www.dropbox.com/sh/r95tdr52o5rzzrw/a54Y66YGOJ We used dumps from August to create these. The software used to build dictionaries: https://github.com/juditacs/wikt2dict
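To make the idea concrete, here is a minimal sketch of triangulation through pivot words. It is an illustration only and not necessarily how wikt2dict works: the function name `triangulate`, the `(language, word)` tuple format, and the `min_pivots` filter (keeping a candidate only if several independent pivots support it) are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def triangulate(seed_pairs, min_pivots=2):
    """Infer new translation pairs from words that share a pivot.

    seed_pairs: iterable of ((lang, word), (lang, word)) tuples,
    treated as undirected translation links (hypothetical format).
    Returns {(entry_a, entry_b): number_of_supporting_pivots}.
    """
    neighbours = defaultdict(set)
    seed = set()
    for a, b in seed_pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)
        seed.add(frozenset((a, b)))

    # Two words linked to the same pivot become a candidate pair.
    pivots = defaultdict(set)
    for pivot, linked in neighbours.items():
        for a, b in combinations(sorted(linked), 2):
            if a[0] == b[0]:
                continue  # same language, so not a translation pair
            pair = frozenset((a, b))
            if pair not in seed:
                pivots[pair].add(pivot)

    # Assumed filter: require support from at least min_pivots pivots.
    return {
        tuple(sorted(pair)): len(found)
        for pair, found in pivots.items()
        if len(found) >= min_pivots
    }
```

With `min_pivots=1` every candidate is kept, which gives more pairs but also more errors from polysemous pivots; raising the threshold trades coverage for correctness.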
Do you think there is a way to contribute this dictionary back to Wiktionary?
Best, Judit Ács
Hoi, I think it makes more sense to contribute it to the upcoming Wiktionary effort on Wikidata. Thanks, GerardM
Wiktionary-l mailing list Wiktionary-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiktionary-l
On 10/09/2013 08:16 AM, Gerard Meijssen wrote:
I think it makes more sense to contribute it to the upcoming Wiktionary effort on Wikidata.
I strongly disagree. That 'effort' is still science fiction, and suggesting that people wait for it is just procrastination.
On 8 October 2013 12:21, Judit, Ács acs.judit@sztaki.hu wrote:
Do you think there is a way to contribute this dictionary back to Wiktionary?
Wiktionary is already half-full of bot-generated articles, and adding more does no harm. However, you need to address each user community on its own.
Judit, Ács, 08/10/2013 12:21:
Do you think there is a way to contribute this dictionary back to Wiktionary?
Sure! You could first of all upload the dataset with a free license somewhere, for instance archive.org. Actually, it's probably better if you choose CC-0 as "license", otherwise – being EU-based – you could add database rights which would be a nightmare. (Or CC-0 for your work + CC-BY-SA for any copyrightable text from Wiktionary, if there is any.)
Then, you can build upon one of our web API clients to contribute it directly to Wiktionary: https://www.mediawiki.org/wiki/API:Client_code
I say "you" because you are the ones who know your own dataset best. You need local consensus of course, so you could proceed this way:
1) determine which Wiktionary editions have the biggest overlap with your entries (i.e. which would require the least page creation; adding to existing pages is less controversial than adding new ones);
2) propose it to those editions, or wait for the most interested to ask you, and get a local green light (ideally from a not-so-huge edition to start with);
3) run a bot yourself on that edition and identify the kind and amount of work needed;
4) share the code and information from (3) to let others continue on other editions.
Of course someone else could do 1-3 too, but it would be a disproportionate effort for them compared to you; peer review of the code at (3) should also help make the coding of the bot a shared effort.
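Step 1 above could be sketched roughly as follows. This is only an illustration under assumptions: `rank_editions_by_overlap` and both input dictionaries are hypothetical names, and the sets of existing page titles would have to be loaded separately, for example from each edition's title dump or via the MediaWiki API's `list=allpages` query.

```python
def rank_editions_by_overlap(candidate_words, existing_titles):
    """Rank Wiktionary editions by how many dataset headwords already have a page.

    candidate_words: {edition: set of headwords the dataset would touch}
    existing_titles: {edition: set of page titles that already exist}
    Both structures are assumptions made for this sketch.
    Returns rows of (edition, existing_pages, pages_to_create, share_existing),
    sorted so the edition needing proportionally the least page creation is first.
    """
    ranking = []
    for edition, words in candidate_words.items():
        existing = existing_titles.get(edition, set())
        overlap = len(words & existing)
        share = overlap / len(words) if words else 0.0
        ranking.append((edition, overlap, len(words) - overlap, share))
    ranking.sort(key=lambda row: row[3], reverse=True)
    return ranking
```

Sorting by the share of existing pages rather than the raw count keeps very large editions from dominating the list just because of their size.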
Nemo
Thanks for the very helpful answers.
I will look at the possibilities for uploading (and licensing) the data sets.
Meanwhile I have another question. Currently I don't parse any information other than the words or expressions themselves, meaning that gender and other language-specific information is ignored, even though it might appear in the translation tables. This is probably a huge problem for large Wiktionaries (e.g. I doubt that the enwiktionary would accept French nouns without their gender). Adding this functionality would be very tedious and probably impossible for languages I can't even read. Should I try it anyway, or can the data be useful without it?
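For one edition, keeping the gender could look roughly like the sketch below. It assumes the English Wiktionary convention of `{{t|lang|word|gender}}` (and the `t+` / `t-` variants) in translation tables; other editions use different templates, so the pattern, the `GENDER_TAGS` set, and the function name are assumptions for illustration, not part of the current parser.

```python
import re

# Assumed enwiktionary-style translation templates: {{t|fr|chien|m}}, {{t+|de|Hund|m}}
TEMPLATE_RE = re.compile(r"\{\{t[+-]?\|([^{}]*)\}\}")
# Assumed, incomplete set of gender/number tags
GENDER_TAGS = {"m", "f", "n", "c", "p", "m-p", "f-p", "n-p"}


def parse_translation_templates(wikitext):
    """Extract language code, word, and gender tags from translation templates."""
    entries = []
    for match in TEMPLATE_RE.finditer(wikitext):
        args = match.group(1).split("|")
        # Named parameters such as tr=... are skipped; note that a word
        # containing "=" would be skipped too in this simple sketch.
        positional = [a.strip() for a in args if "=" not in a]
        if len(positional) < 2:
            continue
        lang, word = positional[0], positional[1]
        genders = [a for a in positional[2:] if a in GENDER_TAGS]
        entries.append({"lang": lang, "word": word, "genders": genders})
    return entries
```

Entries without a recognised tag simply get an empty `genders` list, so the existing word-only output would stay usable while the extra information is carried along where it is available.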
2013/10/9 Federico Leva (Nemo) nemowiki@gmail.com
Judit, Ács, 08/10/2013 12:21:
Do you think there is a way to contribute this dictionary back to Wiktionary?
Sure! You could first of all upload the dataset with a free license somewhere, for instance archive.org. Actually, it's probably better if you choose CC-0 as "license", otherwise – being EU-based – you could add database rights which would be a nightmare. (Or CC-0 for your work + CC-BY-SA for any copyrightable text from Wiktionary, if there is any.)
Then, you can build upon one of our web API clients to contribute it directly to Wiktionary: https://www.mediawiki.org/wiki/API:Client_code
I say "you" because you are the ones who know your own dataset best. You need local consensus of course, so you could proceed this way:
1) determine which Wiktionary editions have the biggest overlap with your entries (i.e. which would require the least page creation; adding to existing pages is less controversial than adding new ones);
2) propose it to those editions, or wait for the most interested to ask you, and get a local green light (ideally from a not-so-huge edition to start with);
3) run a bot yourself on that edition and identify the kind and amount of work needed;
4) share the code and information from (3) to let others continue on other editions.
Of course someone else could do 1-3 too, but it would be a disproportionate effort for them compared to you; peer review of the code at (3) should also help make the coding of the bot a shared effort.
Nemo