Fran, very intriguing! I'd actually been thinking of something along these lines for the past couple of weeks.
Would it be possible to do a statistical analysis of articles as well? I would imagine that in longer articles, certain words would have similar frequencies across the language versions, even when the articles are not direct translations of each other.
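To make this concrete, here is a rough sketch of the kind of comparison I have in mind, assuming the two articles are already tokenised into word lists. The tolerance value and all the names are just placeholders, not tested code:

    from collections import Counter

    def relative_freqs(tokens):
        """Map each word to its share of the article's tokens."""
        counts = Counter(tokens)
        total = len(tokens)
        return {word: n / total for word, n in counts.items()}

    def candidate_pairs(tokens_l1, tokens_l2, tolerance=0.001):
        """Pair each L1 word with the L2 word of nearest frequency."""
        freqs_l1 = relative_freqs(tokens_l1)
        freqs_l2 = relative_freqs(tokens_l2)
        pairs = []
        for w1, f1 in freqs_l1.items():
            # take the L2 word whose relative frequency is closest to w1's
            w2, f2 = min(freqs_l2.items(), key=lambda kv: abs(kv[1] - f1))
            if abs(f2 - f1) <= tolerance:
                pairs.append((w1, w2))
        return pairs

Of course, frequency alone would pair up plenty of false friends, so this could only ever produce candidates for human review.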
On 02/02/2008, Francis Tyers <spectre@ivixor.net> wrote:
On Sat, 2008-02-02 at 12:10 -0800, Ray Saintonge wrote:
Francis Tyers wrote:
I work on machine translation software,¹ focussing on lesser-used and under-resourced languages.² One of the things our software needs is bilingual dictionaries, and a practical way of obtaining them is to harvest Wikipedia interwiki links.³
While they are helpful, it would be a mistake to consider them fully reliable. The disambiguation policies of the separate projects are also a factor to consider.
Needless to say, I've done an analysis of how useful this is before mentioning it. I can send you the results if you're interested.
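For the curious, the harvesting itself boils down to something like the sketch below. This is a simplification which assumes article titles and raw wikitext have already been extracted from a dump; the regex and the pages iterable are illustrative, not our actual code:

    import re

    # Matches interwiki links of the form [[es:Casa]] in raw wikitext.
    INTERWIKI = re.compile(r'\[\[([a-z]{2,3}(?:-[a-z]+)?):([^\]|]+)\]\]')

    def harvest(pages, target_lang):
        """Yield (source title, target title) pairs for one language."""
        for title, wikitext in pages:
            for lang, linked_title in INTERWIKI.findall(wikitext):
                if lang == target_lang:
                    yield title, linked_title.strip()

Each pair of titles then becomes a candidate entry for the bilingual dictionary.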
Now, I've been told that interwiki links do not have the level of originality required for copyright, since many of them are created by bots. I'm not sure that this is the case: some of them are made by people, and choosing the correct article involves at least some level of work. Besides, this would be a cop-out. If, for example, we wanted to sense-disambiguate the extracted terms using the first paragraph of each article, that would still be a licence violation.
I would question the copyrightability of any dictionary entry on the basis of the merger principle. Copyright protects forms of expression rather than ideas, and if the idea is indistinguishable from the form, there is a strong likelihood that it is not copyrightable. A dictionary is not reliable if it seeks to inject originality into its definitions; seeking new ways to define words encourages definitions that may deviate from the original intention of the words. What is copyrightable in a dictionary, then, lies more at the level of overall selection and presentation.
This is what I also have been led to believe. But when you're in the habit of commercially distributing stuff -- especially free software that everyone can see inside -- you like to be sure :)
So, is there any way to resolve this? I understand that it is probably not high on anyone's list of priorities. On the other hand, I understand that the FSF is considering updating the GFDL to make it compatible with the Creative Commons CC-BY-SA licence.
Would it also be possible, at the same time, to add some kind of clause making GFDL content usable in GPL-licensed linguistic data for machine translation systems?
What either of those licences says is not within the control of any Wikimedia project. Perhaps you should be discussing this with the FSF.
I was intending to do that after I received replies from here. I understand that the WMF/Wikipedia has some clout with respect to licensing at the FSF, for example:
http://wikimediafoundation.org/wiki/Resolution:License_update
Of course moving to CC-BY-SA won't solve the GPL compatibility problem.
Fran