Re: [Wikipedia-l] The use of Wikipedia extracted wordlists in GPL machine translation systems

2 Feb 2008


On Sat, 02-02-2008 at 22:13 +0100, Gerard Meijssen wrote:
...
> Hoi,
> The Apertium software needs information in an unambiguous way. This is
> to ensure that the software is able to run with the data. The notion
> that the information needed by Apertium is not of relevance in other
> environments is simply wrong. The information is of use outside of
> Apertium and as a consequence the choice of the GPL license is
> unfortunate. You concentrate for now on Wikipedia but you indicate
> that you are considering using the Wiktionary data as well.

The choice of the GPL licence is perfect for including machine
translation in other free software, the overwhelming majority of which
is licensed under the GPL.
The fact that our linguistic data can be used separately is an aside.
And as a note, it can be re-used for software like grammar checkers,
spell-checkers, etc. which are under the GPL. The question is really why
_not_ use the GPL.
...
> Where you state that Apertium needs information in a very tightly
> controlled way, is this what you copyright? Or in other words, do you
> copyright the information in order to control this specific type of
> application? If not, what is the objective of choosing the GPL for
> data?

To the other list members: yes this is off-topic, so I'll try and keep
it short.
The objectives of choosing the GPL for the data are:
* to make it compatible with the engine/other tools in case anything
needs to be moved between the packages, 
* to make it unambiguously able to be included in Debian,
* to make it compatible with other lexical resources that are GPL (of
which there are many),
* because the transfer rules and scripts are copyrightable works, as are
the rules for morphological analysis. As I mentioned in the previous
email it is not possible to decouple the two. If you want further
information as to the originality and copyright status of the data,
please consider looking at one of the packages,
* to ensure that if people take one of our original language pairs the
community has the guarantees of the GPL that changes and improvements
will be released under the same licence, whether this be increased
vocabulary, better transfer rules, a special program to deal with a
language feature etc.
Fran
...
> Thanks,
>      GerardM
On Feb 2, 2008 9:38 PM, Francis Tyers <spectre@ivixor.net> wrote:
        On Sat, 02-02-2008 at 12:10 -0800, Ray Saintonge wrote:
        > Francis Tyers wrote:
        > > I work on machine translation software,¹ focussing on lesser-used and
        > > under-resourced languages.² One of the things that is needed for our
        > > software is bilingual dictionaries. A usable way of getting bilingual
        > > dictionaries is to harvest Wikipedia interwiki links.³
        > >
        > While they are helpful, it would be a mistake to consider these as fully
        > reliable.  The disambiguation policies of the separate projects are also
        > a factor to consider.
    Needless to say I've done an analysis of how useful this is before
    mentioning it. I can send you the results if you would be interested.

    > > Now, I've been told that interwiki links do not have the level of
    > > originality required for copyright, many of them being created by bot.
    > > I'm not sure that this is the case, as some of them are done by people
    > > and choosing the correct article has at least some level of work.
    > > Besides, this would be a cop-out, if we for example wanted to sense
    > > disambiguate the terms extracted using the first paragraph of the
    > > article, this would still be a licence violation.
    > >
    > I would question the copyrightability of any dictionary entry on the
    > basis of the merger principle.  We copyright forms of expression rather
    > than ideas.  If the idea is indistinguishable from the form there is a
    > strong likelihood that it is not copyrightable.  A dictionary is not
    > reliable if it seeks to inject originality in its definition.  Seeking
    > new ways to define words means that we encourage definitions that may
    > deviate from the original intention of the words.  What is copyrightable
    > in a dictionary then is more in the level of global selection and
    > presentation.


    This is what I have also been led to believe. But when you're in the
    habit of commercially distributing stuff -- especially free software
    that everyone can see inside -- you like to be sure :)

    > > So, is there any way to resolve this? I understand that probably it is
    > > on no one's list of high priorities. On the other hand, I understand that
    > > the FSF is considering updating the GFDL to make it compatible with the
    > > Creative Commons CC-BY-SA licence.
    > >
    > > Would it also be possible at the same time to add some kind of clause
    > > making GFDL content usable in GPL licensed linguistic data for machine
    > > translation systems?
    > >
    > What either of those licences says is not within the control of any
    > Wikimedia project. Perhaps you should be discussing this with FSF.


    I was intending to do that after I received replies back from here. I
    understand that the WMF/Wikipedia has some clout with respect to
    licensing at the FSF, for example:

    http://wikimediafoundation.org/wiki/Resolution:License_update

    Of course moving to CC-BY-SA won't solve the GPL compatibility problem.

    Fran


    _______________________________________________
    Wikipedia-l mailing list
    Wikipedia-l@lists.wikimedia.org
    http://lists.wikimedia.org/mailman/listinfo/wikipedia-l
