The use of Wikipedia extracted wordlists in GPL machine translation systems

Francis Tyers

2 Feb 2008

3:27 p.m.

Hello everyone,

First of all I would like to apologise for the cross-post; finding the correct place to send this is somewhat difficult. I'd like to present a legal scenario (disclaimer: IANAL, although I'm sure that will become painfully clear) that I am hoping to get resolved. I will try to present it in the shortest and clearest way possible.

I work on machine translation software,¹ focussing on lesser-used and under-resourced languages.² One of the things that is needed for our software is bilingual dictionaries. A usable way of getting bilingual dictionaries is to harvest Wikipedia interwiki links.³ This much is straightforward. The legal scenario comes with the licensing issues involved.

Our software, composed of an engine and language-pair packages, is under the GPL. Our language pairs contain both programmatic elements (rules, scripts, etc.) and non-programmatic elements (tagged wordlists, etc.). Both of these elements are tightly coupled. It is _not_ practical to distribute them separately. Furthermore, many of the linguistic sub-resources we come across (spellcheckers, dictionaries, etc.) are released under the GPL, which would make decoupling the two parts unachievable, or at the very least unmaintainable.

Wikipedia is under the GFDL. This covers everything that is user-contributed. GFDL content cannot be included in GPL programs. Here is my problem.

Now, I've been told that interwiki links do not have the level of originality required for copyright, many of them being created by bot. I'm not sure that this is the case, as some of them are done by people, and choosing the correct article involves at least some level of work. Besides, this would be a cop-out: if we for example wanted to sense-disambiguate the extracted terms using the first paragraph of the article, this would still be a licence violation.

So, is there any way to resolve this? I understand that it is probably not high on anyone's list of priorities. On the other hand, I understand that the FSF is considering updating the GFDL to make it compatible with the Creative Commons CC-BY-SA licence. Would it also be possible at the same time to add some kind of clause making GFDL content usable in GPL-licensed linguistic data for machine translation systems?

Many thanks for your time, and I'm sorry if this problem has been brought up before and I've missed the discussion. Any questions you have can be directed to me, or to our mailing list: apertium-stuff(a)lists.sourceforge.net

Fran

¹ http://www.apertium.org
² For example, we have systems to translate between Spanish-Occitan and Spanish-Catalan. These systems generate pretty good translations (needing only superficial post-editing) and have been used on the two Wikipedias in question. See: http://xixona.dlsi.ua.es/wiki/index.php/Evaluating_with_Wikipedia
³ This would probably also apply to data extracted from Wiktionary, but for the moment let's concentrate on Wikipedia, as that is what I have been doing.
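For readers unfamiliar with the harvesting step mentioned in the post above, here is a minimal sketch of how interwiki links might be turned into candidate bilingual entries. It is an illustration only, not Apertium's actual tooling: it assumes the inline `[[xx:Title]]` interwiki syntax stored in article wikitext, and the names `INTERWIKI`, `extract_pairs` and `strip_disambiguator`, as well as the sample text, are hypothetical.

```python
import re

# Naive pattern for inline interwiki links such as [[es:Gato]].
# Assumption: language prefixes are 2-3 lowercase letters, optionally
# followed by hyphenated parts (e.g. "zh-min-nan"); other short
# lowercase prefixes would be matched too, so real data needs filtering.
INTERWIKI = re.compile(r"\[\[([a-z]{2,3}(?:-[a-z]+)*):([^\]|#]+)\]\]")


def extract_pairs(source_title, wikitext, target_lang):
    """Return (source_title, target_title) pairs for one target language."""
    pairs = []
    for lang, title in INTERWIKI.findall(wikitext):
        if lang == target_lang:
            pairs.append((source_title, title.strip()))
    return pairs


def strip_disambiguator(title):
    """Drop a trailing parenthetical, e.g. 'Mercury (planet)' -> 'Mercury'."""
    return re.sub(r"\s*\([^)]*\)$", "", title)


# Hypothetical wikitext of a Catalan article titled "Gat".
sample = "El gat és un mamífer.\n[[en:Cat]]\n[[es:Gato]]\n[[oc:Cat]]"
print(extract_pairs("Gat", sample, "es"))
```

Titles harvested this way still need manual checking, since article titles are not always dictionary headwords and each project disambiguates differently.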

Ray Saintonge

2 Feb

10:10 p.m.

Francis Tyers wrote:

...

> I work on machine translation software,¹ focussing on lesser-used and under-resourced languages.² One of the things that is needed for our software is bilingual dictionaries. A usable way of getting bilingual dictionaries is to harvest Wikipedia interwiki links.³

While they are helpful, it would be a mistake to consider these as fully reliable. The disambiguation policies of the separate projects are also a factor to consider.

...

> Now, I've been told that interwiki links do not have the level of originality required for copyright, many of them being created by bot. I'm not sure that this is the case, as some of them are done by people and choosing the correct article has at least some level of work. Besides, this would be a cop-out, if we for example wanted to sense disambiguate the terms extracted using the first paragraph of the article, this would still be a licence violation.

I would question the copyrightability of any dictionary entry on the basis of the merger principle. We copyright forms of expression rather than ideas. If the idea is indistinguishable from the form, there is a strong likelihood that it is not copyrightable. A dictionary is not reliable if it seeks to inject originality into its definitions. Seeking new ways to define words means that we encourage definitions that may deviate from the original intention of the words. What is copyrightable in a dictionary, then, is more at the level of overall selection and presentation.

...

> So, is there any way to resolve this? I understand that probably it is on no-ones high list of priorities. On the other hand, I understand that the FSF is considering to update the GFDL to make it compatible with the Creative Commons CC-BY-SA licence. Would it also be possible at the same time to add some kind of clause making GFDL content usable in GPL licensed linguistic data for machine translation systems?

What either of those licences says is not within the control of any Wikimedia project. Perhaps you should be discussing this with the FSF.

Ec

Francis Tyers

10:38 p.m.

On Sat, 02-02-2008 at 12:10 -0800, Ray Saintonge wrote:

...

> While they are helpful, it would be a mistake to consider these as fully reliable. The disambiguation policies of the separate projects are also a factor to consider.

Needless to say I've done an analysis of how useful this is before mentioning it. I can send you the results if you would be interested.

...

This is what I also have been led to believe. But when you're in the habit of commercially distributing stuff -- especially free software that everyone can see inside -- you like to be sure :)

...

> What either of those licences say is not within the control of any Wikimedia project. Perhaps you should be discussing this with FSF.

I was intending to do that after I received replies back from here. I understand that the WMF/Wikipedia has some clout with respect to licensing at the FSF, for example: http://wikimediafoundation.org/wiki/Resolution:License_update

Of course moving to CC-BY-SA won't solve the GPL compatibility problem.

Fran

Gerard Meijssen

11:13 p.m.

Hoi,

The Apertium software needs information in an unambiguous way. This is to ensure that the software is able to run with the data. The notion that the information needed by Apertium is not of relevance in other environments is simply wrong. The information is of use outside of Apertium and as a consequence the choice of the GPL license is unfortunate.

You concentrate for now on Wikipedia, but you indicate that you are considering using the Wiktionary data as well. Where you state that Apertium needs information in a very tightly controlled way, is this what you copyright? Or in other words, do you copyright the information in order to control this specific type of application? If not, what is the objective of choosing the GPL for data?

Thanks,
GerardM

On Feb 2, 2008 9:38 PM, Francis Tyers <spectre(a)ivixor.net> wrote:

...

> > What either of those licences say is not within the control of any Wikimedia project. Perhaps you should be discussing this with FSF.
>
> I was intending to do that after I received replies back from here. I understand that the WMF/Wikipedia has some clout with respect to licensing at the FSF, for example: http://wikimediafoundation.org/wiki/Resolution:License_update
>
> Of course moving to CC-BY-SA won't solve the GPL compatibility problem.
>
> Fran

_______________________________________________
Wikipedia-l mailing list
Wikipedia-l(a)lists.wikimedia.org
http://lists.wikimedia.org/mailman/listinfo/wikipedia-l

Francis Tyers

11:28 p.m.

On Sat, 02-02-2008 at 22:13 +0100, Gerard Meijssen wrote:

...

The choice of the GPL licence is perfect for including machine translation in other free software, the overwhelming majority of which is licensed under the GPL. The fact that our linguistic data can be used separately is an aside. And as a note, it can be re-used for software like grammar checkers, spell-checkers, etc. which are under the GPL. The question is really why _not_ use the GPL.

...

> Where you state that Apertium needs information in a very tightly controlled way, is this what you copyright? Or in other words, do you copyright the information in order to control this specific type of application? If not, what is the objective of choosing the GPL for data?

To the other list members: yes this is off-topic, so I'll try and keep it short. The objective of choosing GPL for the data is:

* to make it compatible with the engine/other tools in case anything needs to be moved between the packages,
* to make it unambiguously able to be included in Debian,
* to make it compatible with other lexical resources that are GPL (of which there are many),
* because the transfer rules and scripts are copyrightable works, as are the rules for morphological analysis. As I mentioned in the previous email it is not possible to decouple the two. If you want further information as to the originality and copyright status of the data, please consider looking at one of the packages,
* to ensure that if people take one of our original language pairs the community has the guarantees of the GPL that changes and improvements will be released under the same licence, whether this be increased vocabulary, better transfer rules, a special program to deal with a language feature etc.

Fran

Mark Williamson

11:31 p.m.

Fran, very intriguing! I'd actually been thinking of something along these lines for the past couple of weeks.

Would it be possible to do a statistical analysis of articles, as well? I would imagine that in longer articles, there would be certain words that would have similar frequencies, whether or not they are direct translations.

On 02/02/2008, Francis Tyers <spectre(a)ivixor.net> wrote:

...

-- Refije dirije lanmè yo paske nou posede pwòp bato.

Francis Tyers

11:38 p.m.

On Sat, 02-02-2008 at 14:31 -0700, Mark Williamson wrote:

...

In short, check out this link, which gives a nice overview of various techniques: http://citeseer.ist.psu.edu/509449.html

If you'd like further details, please feel free to contact me off-list so we don't clog up the works.

Fran

Gerard Meijssen

11:53 p.m.

Hoi,

The GPL is perfectly OK with using and operating on data that is not licensed under the GPL. The GPL is about software, not data. When your point of view is only on programming and on using your data exclusively for machine consumption, it makes perfect sense to use the GPL. If your data is usable for human consumption, having a license for the functional data like the GFDL, the CC-by or CC-by-sa is still perfectly legal for use with the Apertium engine.

In essence you have the same problem with the Wikipedia / Wiktionary data. It is licensed with a license that is incompatible for you, while in reality the license of the data is irrelevant to the status of the GPL software.

With the program algorithms copyrighted and licensed, you have all the protection that I expect you need. By being able to use data of whatever license, you improve the reach of your software dramatically and as a consequence will be better able to provide services for less and least resourced languages.

Thanks,
GerardM

On Feb 2, 2008 10:28 PM, Francis Tyers <spectre(a)ivixor.net> wrote:

...

Francis Tyers

11:59 p.m.

On Sat, 02-02-2008 at 22:53 +0100, Gerard Meijssen wrote:

...

This shows a misunderstanding of what is meant by data, as was outlined in my first post. Indeed, you seem to be trying to separate the "machine parts" from the "human parts", something I stated is not possible. I would welcome you to try and show otherwise [patches welcome].

Fran

Thomas Dalton

3 Feb

12:03 a.m.

...

If you put together a wordlist based on interwiki links, that will be separable from the software that uses the list (the easiest way to do it would involve a separate file containing the list). You can release the software under GPL and the wordlist (if it's even copyrightable) under GFDL.
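To make the suggestion above concrete, here is a minimal sketch of a program that keeps its wordlist in a separate file and reads it at runtime. It assumes a hypothetical tab-separated format with one `source<TAB>target` pair per line, which is not Apertium's real dictionary format; `load_wordlist` and `translate_word` are illustrative names. As the rest of the thread discusses, it does not address language pairs whose rules and lexicon are interleaved and compiled together.

```python
import csv
import io


def load_wordlist(lines):
    """Parse tab-separated 'source<TAB>target' lines into a dict of lists."""
    table = {}
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) == 2:  # skip blank or malformed lines
            table.setdefault(row[0], []).append(row[1])
    return table


def translate_word(word, table):
    """Return the first known translation, or the word itself if unknown."""
    return table.get(word, [word])[0]


# In practice `lines` would be an open file that is distributed
# separately from the program; a StringIO stands in for it here.
demo = io.StringIO("gat\tgato\ncasa\tcasa\n")
table = load_wordlist(demo)
print(translate_word("gat", table))
```

Because the list is only read at runtime, the file and the program could in principle be shipped as separate packages.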

Francis Tyers

12:07 a.m.

On Sat, 02-02-2008 at 22:03 +0000, Thomas Dalton wrote:

...

That is not possible due to the way in which the software works. As I mentioned in my first email, it is not possible to decouple the "wordlist" part from the non-wordlist part and distribute them as separate packages. Believe me, I had thought of that.

Fran

Thomas Dalton

12:10 a.m.

On 02/02/2008, Francis Tyers <spectre(a)ivixor.net> wrote:

...

I don't believe you. Unless you are going through the list manually writing appropriate code for each word (which would take years), you must have some form of automated process which goes through the wordlist. There is nothing stopping you running that automated code on an external file.

Francis Tyers

12:17 a.m.

On Sat, 02-02-2008 at 22:10 +0000, Thomas Dalton wrote:

...

You're right, thinking more it could be done with a diff that each user individually patches.

Of course this would make distributing binary packages difficult (the language data is compiled into a binary representation before being used). Although a binary diff could be done.

But then, when was the last time an end-user had to apply a binary diff to their free software? Personally I don't consider this reasonable or maintainable for a large number of language pairs. Which was my original point.

Fran

Thomas Dalton

12:21 a.m.

...

You can distribute the binary representation under GFDL. You would need a plain text version as well, to satisfy the license, but there's nothing stopping you having two versions of the text, one for the benefit of the user (and the lawyers) and one for the benefit of the software.

Francis Tyers

12:24 a.m.

On Sat, 02-02-2008 at 22:21 +0000, Thomas Dalton wrote:

...

It would still require the users to perform a binary patch. Something which is not particularly within the realm of normal or reasonable end-user experience.

Fran

Thomas Dalton

12:26 a.m.

...

> It would still require the users to perform a binary patch. Something which is not particularly within the realm of normal or reasonable end-user experience.

I don't see why; you can import the file at runtime.

Francis Tyers

12:31 a.m.

On Sat, 02-02-2008 at 22:26 +0000, Thomas Dalton wrote:

...

> > It would still require the users to perform a binary patch. Something which is not particularly within the realm of normal or reasonable end-user experience.
>
> I don't see why, you can import the file at runtime.

It isn't object code. This paper gives details of the format: http://www.sepln.org/revistaSEPLN/revista/35/07.pdf

If, after further investigation, you think you have a solution, please by all means post it to our mailing list: apertium-stuff(a)lists.sourceforge.net

This is rapidly becoming more off-topic :)

Fran

Gerard Meijssen

12:43 a.m.

Hoi,

Under the GPL you are expected to provide source code. When you create a binary file by compiling language information, the language information IS effectively source code. This means that you have to provide this data in a human readable form as well as a machine readable form. Consequently the data should be made available to the end user in a human readable format anyway under the GPL. The notion that a typical end user will not have a look at the source data and code is irrelevant for the requirements of the GPL.

It makes no difference for you if you provide the data in multiple formats or licenses. It will allow people to use the data for other purposes when they choose to do so.

The point of all this: when the data is treated separately from the program, you can have the Wikipedia, the Wiktionary, the OmegaWiki data and use it. It is not affected by the license of the software, it just needs to fit the required format. Given your starting question this is really relevant.

PS you can use the OmegaWiki data anyway, its license is liberal enough for that :)

Thanks,
GerardM

On Feb 2, 2008 11:24 PM, Francis Tyers <spectre(a)ivixor.net> wrote:

...

Francis Tyers

12:49 a.m.

On Sat, 02-02-2008 at 23:43 +0100, Gerard Meijssen wrote:

...

> Hoi,
>
> <snip>
>
> The point of all this; when the data is treated separately from the program, you can have the Wikipedia, the Wiktionary, the OmegaWiki data and use it. It is not affected by the license of the software, it just needs to fit the required format. Giving your starting question this is really relevant.

We have some fine documentation describing the layout of the language data, which should make it easier to understand: http://xixona.dlsi.ua.es/~fran/apertium2-documentation.pdf

I thought I'd explained it sufficiently well in the first post, but it appears not. Perhaps you could work out how to re-organise the data so that the programmatic parts are entirely separate from the non-programmatic parts, and present your idea to our mailing list.

...

> PS you can use the OmegaWiki data anyway, its license is liberal enough for that :)

When it works ;)

Fran

Lars Aronsson

4 Feb

4:04 a.m.

Francis Tyers wrote:

...

> Now, I've been told that interwiki links do not have the level of originality required for copyright, many of them being created by bot.

Bot or not, it is a widely held view that (tabular) data extracted from Wikipedia is in the public domain. At least that's what I believe. I'm sorry that I have no sources to cite. You might want to look at other projects that reuse data extracted from Wikipedia dumps, such as dbpedia.org.

Traditional copyright doesn't apply to extracted data, so the GFDL is not applicable. In various countries, "catalog rights" or "database rights" might be applicable to such data, but that right then belongs to those who compiled the table of data (catalog, database), not to the original authors of articles*. I strongly doubt that you could claim such rights if you just extracted interwiki links from the XML dumps published by the Wikimedia Foundation.

*Similar examples are Bible concordances or scientific cross indexes, for which catalog rights belong to the indexers, rather than to the original authors of the texts.

I am not a lawyer. You might get better answers from Mike Godwin, who can speak for the Wikimedia Foundation.

--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se
