Hello everyone,
First of all I would like to apologise for the cross-post; finding the correct place to send this is somewhat difficult.
I'd like to present a legal scenario (disclaimer: IANAL, although I'm sure that will become painfully clear) that I am hoping to get resolved. I will try to present it in the shortest and clearest way possible.
I work on machine translation software,¹ focussing on lesser-used and under-resourced languages.² One of the things our software needs is bilingual dictionaries, and a practical way of building them is to harvest Wikipedia interwiki links.³
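To make the harvesting step concrete, here is a minimal sketch of how interwiki (language) links can be turned into term pairs. It assumes the MediaWiki API's `action=query&prop=langlinks` JSON response shape; the function name and the sample data are purely illustrative, not our actual harvesting code.

```python
# Hypothetical sketch: extract bilingual term pairs from a MediaWiki
# "langlinks" API response. In the classic JSON format, each langlink
# entry carries the target-language title under the "*" key.

def extract_pairs(api_response, target_lang):
    """Return (source_title, target_title) pairs for one target language
    from an action=query&prop=langlinks JSON response."""
    pairs = []
    for page in api_response.get("query", {}).get("pages", {}).values():
        source_title = page.get("title")
        for link in page.get("langlinks", []):
            if link.get("lang") == target_lang:
                pairs.append((source_title, link.get("*")))
    return pairs

# Abridged example of the response shape for the Spanish article "Casa",
# as returned by a request like:
#   https://es.wikipedia.org/w/api.php?action=query&prop=langlinks&titles=Casa&format=json
sample = {
    "query": {
        "pages": {
            "1234": {
                "title": "Casa",
                "langlinks": [
                    {"lang": "ca", "*": "Casa"},
                    {"lang": "oc", "*": "Ostal"},
                ],
            }
        }
    }
}

print(extract_pairs(sample, "oc"))  # [('Casa', 'Ostal')]
```

Running this over a dump (or over batched API queries) for every article in the source-language Wikipedia yields a raw bilingual wordlist, which is exactly the kind of resource whose licensing is at issue below.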
This much is straightforward. The legal scenario comes with the licensing issues involved.
Our software, composed of an engine and language-pair packages, is under the GPL. Our language pairs contain both programmatic elements (rules, scripts, etc.) and non-programmatic elements (tagged wordlists, etc.). These two kinds of elements are tightly coupled, and it is _not_ practical to distribute them separately. Furthermore, many of the linguistic sub-resources we come across (spellcheckers, dictionaries, etc.) are released under the GPL, which would make decoupling the two parts unachievable, or at the very least unmaintainable.
Wikipedia is under the GFDL, which covers everything that is user-contributed. GFDL content cannot be included in GPL programs, and therein lies my problem.
Now, I've been told that interwiki links do not have the level of originality required for copyright, many of them being created by bots. I'm not sure that this is the case, as some of them are made by people, and choosing the correct article involves at least some level of work. Besides, this argument would only be a cop-out: if, for example, we wanted to sense-disambiguate the extracted terms using the first paragraph of each article, that would still be a licence violation.
So, is there any way to resolve this? I understand that it is probably not high on anyone's list of priorities. On the other hand, I understand that the FSF is considering updating the GFDL to make it compatible with the Creative Commons CC-BY-SA licence.
Would it also be possible at the same time to add some kind of clause making GFDL content usable in GPL licensed linguistic data for machine translation systems?
Many thanks for your time, and I'm sorry if this problem has been brought up before and I've missed the discussion. Any questions you have can be directed to me, or to our mailing list: apertium-stuff@lists.sourceforge.net
Fran
¹ http://www.apertium.org
² For example, we have systems to translate between Spanish-Occitan and Spanish-Catalan. These systems generate pretty good translations (needing only superficial post-editing) and have been used on the two Wikipedias in question. See: http://xixona.dlsi.ua.es/wiki/index.php/Evaluating_with_Wikipedia
³ This would probably also apply to data extracted from Wiktionary, but for the moment let's concentrate on Wikipedia, as that is what I have been working with.