Hello everyone,
First of all I would like to apologise for the cross-post; finding the correct place to send this is somewhat difficult.
I'd like to present a legal scenario (disclaimer: IANAL, although I'm sure that will become painfully clear) that I am hoping to get resolved. I will try to present it in the shortest and clearest way possible.
I work on machine translation software,¹ focussing on lesser-used and under-resourced languages.² One of the things our software needs is bilingual dictionaries, and a practical way of building them is to harvest Wikipedia interwiki links.³
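To make the harvesting step concrete, here is a minimal sketch of how interwiki (language) links can be turned into term pairs. It assumes the MediaWiki API's `action=query&prop=langlinks` JSON response shape; the function name and the sample data are purely illustrative, not our actual harvesting code.

```python
# Hypothetical sketch: extract bilingual term pairs from a MediaWiki
# "langlinks" API response. In the classic JSON format, each langlink
# entry carries the target-language title under the "*" key.

def extract_pairs(api_response, target_lang):
    """Return (source_title, target_title) pairs for one target language
    from an action=query&prop=langlinks JSON response."""
    pairs = []
    for page in api_response.get("query", {}).get("pages", {}).values():
        source_title = page.get("title")
        for link in page.get("langlinks", []):
            if link.get("lang") == target_lang:
                pairs.append((source_title, link.get("*")))
    return pairs

# Abridged example of the response shape for the Spanish article "Casa",
# as returned by a request like:
#   https://es.wikipedia.org/w/api.php?action=query&prop=langlinks&titles=Casa&format=json
sample = {
    "query": {
        "pages": {
            "1234": {
                "title": "Casa",
                "langlinks": [
                    {"lang": "ca", "*": "Casa"},
                    {"lang": "oc", "*": "Ostal"},
                ],
            }
        }
    }
}

print(extract_pairs(sample, "oc"))  # [('Casa', 'Ostal')]
```

Running this over a dump (or over batched API queries) for every article in the source-language Wikipedia yields a raw bilingual wordlist, which is exactly the kind of resource whose licensing is at issue below.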
This much is straightforward. The legal scenario comes with the licensing issues involved.
Our software, composed of an engine and language-pair packages, is under the GPL. Our language pairs contain both programmatic elements (rules, scripts, etc.) and non-programmatic elements (tagged wordlists, etc.). These two kinds of elements are tightly coupled, and it is _not_ practical to distribute them separately. Furthermore, many of the linguistic sub-resources we come across (spellcheckers, dictionaries, etc.) are released under the GPL, which would make decoupling the two parts unachievable, or at the very least unmaintainable.
Wikipedia is under the GFDL, which covers everything that is user-contributed. GFDL content cannot be included in GPL programs, and therein lies my problem.
Now, I've been told that interwiki links do not have the level of originality required for copyright, many of them being created by bots. I'm not sure that this is the case, as some of them are made by people, and choosing the correct article involves at least some level of work. Besides, this argument would only be a cop-out: if, for example, we wanted to sense-disambiguate the extracted terms using the first paragraph of each article, that would still be a licence violation.
So, is there any way to resolve this? I understand that it is probably not high on anyone's list of priorities. On the other hand, I understand that the FSF is considering updating the GFDL to make it compatible with the Creative Commons CC-BY-SA licence.
Would it also be possible at the same time to add some kind of clause making GFDL content usable in GPL licensed linguistic data for machine translation systems?
Many thanks for your time, and I'm sorry if this problem has been brought up before and I've missed the discussion. Any questions you have can be directed to me, or to our mailing list: apertium-stuff@lists.sourceforge.net
Fran
¹ http://www.apertium.org
² For example, we have systems to translate between Spanish-Occitan and Spanish-Catalan. These systems generate pretty good translations (needing only superficial post-editing) and have been used on the two Wikipedias in question. See: http://xixona.dlsi.ua.es/wiki/index.php/Evaluating_with_Wikipedia
³ This would probably also apply to data extracted from Wiktionary, but for the moment let's concentrate on Wikipedia, as that is what I have been working with.