Francis Tyers wrote:
I work on machine translation software,¹ focussing on lesser-used and under-resourced languages.² One of the things our software needs is bilingual dictionaries. One practical way to obtain bilingual dictionaries is to harvest Wikipedia interwiki links.³
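For concreteness, harvesting interwiki links can be sketched as a small parser over raw wikitext, where a link such as [[fr:Maison]] pairs the source article's title with the corresponding title on another language's wiki. This is a minimal sketch only: the regex, function name, and example titles are illustrative assumptions, and a real harvester would work from database dumps or the MediaWiki API rather than ad-hoc pattern matching.

```python
import re

# In wikitext, interwiki links take the form [[fr:Chat]]: a lowercase
# language code, a colon, and the title of the corresponding article.
# The 2-3 letter code with optional hyphenated subtags (e.g. zh-min-nan)
# is a rough heuristic; it will not cover every edge case.
INTERWIKI_RE = re.compile(r"\[\[([a-z]{2,3}(?:-[a-z]+)*):([^\]\|]+)\]\]")

def extract_interwiki(source_title, wikitext):
    """Return (lang, source_title, target_title) tuples found in wikitext."""
    pairs = []
    for match in INTERWIKI_RE.finditer(wikitext):
        lang, target = match.group(1), match.group(2).strip()
        pairs.append((lang, source_title, target))
    return pairs

# Hypothetical example: an English article "House" linking to es and fr.
text = "'''House''' is a building...\n[[es:Casa]]\n[[fr:Maison]]"
print(extract_interwiki("House", text))
# -> [('es', 'House', 'Casa'), ('fr', 'House', 'Maison')]
```

Each extracted tuple is effectively one candidate bilingual dictionary entry, which is exactly why the licensing status of the links themselves matters.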
While they are helpful, it would be a mistake to consider them fully reliable. The disambiguation policies of the separate projects are another factor to consider.
Now, I've been told that interwiki links do not have the level of originality required for copyright, many of them being created by bots. I'm not sure that this is the case: some of them are made by people, and choosing the correct article involves at least some level of work. Besides, this argument would be a cop-out. If, for example, we wanted to sense-disambiguate the extracted terms using the first paragraph of each article, that would still be a licence violation.
I would question the copyrightability of any dictionary entry on the basis of the merger principle. Copyright protects forms of expression rather than ideas; if the idea is indistinguishable from its form, there is a strong likelihood that it is not copyrightable. A dictionary is not reliable if it seeks to inject originality into its definitions: seeking novel ways to define words encourages definitions that may deviate from the original meaning of the words. What is copyrightable in a dictionary, then, lies more at the level of overall selection and presentation.
So, is there any way to resolve this? I understand that it is probably high on no one's list of priorities. On the other hand, I understand that the FSF is considering updating the GFDL to make it compatible with the Creative Commons CC-BY-SA licence.
Would it also be possible, at the same time, to add some kind of clause making GFDL content usable as GPL-licensed linguistic data for machine translation systems?
What either of those licences says is not within the control of any Wikimedia project. Perhaps you should be discussing this with the FSF.
Ec