Question on Unicode/Unilex license compatibility for MediaWiki/Wikimedia - Mediawiki-i18n

30 Dec 2023


I was pinged on a Unicode repository 
https://github.com/unicode-org/unilex/issues/10#issuecomment-1872496490 
asking for a WMF perspective on license compatibility. I gave my 
personal answer, but I'm notifying the list in case someone wants to 
answer on behalf of WMF as a Unicode member/user. (Also cc'ing Hugo, 
Stephen and Richard as I mentioned them.)
As my answer turned out to be rather long I'll copy it here for the 
archives' benefit.
----
@srl295 Thanks for the ping. I wasn't aware of this issue but I'll give 
a quick reply. I've only read the discussion above and the README. I 
can't speak for WMF, let alone Unicode (I don't remember whether WMF is 
even a member now), but I can speak to the usage of Unicode components 
in MediaWiki software and Wikimedia wikis.
The issue description highlights some confusion on the licensing of this 
project. Meanwhile the LICENSE has been updated to the Unicode license 
v3, which was approved by OSI on 2023-11-17: 
https://opensource.org/license/unicode-license-v3/ . So there's no doubt 
this repository is open source. Maybe this can be stated explicitly in 
the README, as not everyone will recognize the license text as the 
OSI-approved Unicode license v3.
MediaWiki can and does use software under the Unicode license all the 
time, for example in the [CLDR 
extension](https://www.mediawiki.org/wiki/Extension:CLDR), which is 
primarily GPLv2, on the understanding that the CLDR data inside is 
under a BSD-like license. (Apertium linguistic data is also 
[usually](https://wiki.apertium.org/wiki/Contributing_to_an_existing_pair#Consider_con...) 
under GPL.) As long as Unilex can be used in GPL software, there are 
probably ways it can benefit all Wikimedia wikis through MediaWiki.
However @hugolpz seems most concerned about usage in Wikidata and other 
Wikimedia wikis _content_. From the README it sounds like this 
repository mostly wants to collect uncopyrightable factual information. 
In the EU, there might still be problems with database rights. A general 
opinion from the WMF on how to handle these is at 
https://meta.wikimedia.org/wiki/Wikilegal/Database_Rights . In short, 
it's complicated, and it's easier to incorporate a dataset into Wikidata 
when it's already under CC-0. If there's some doubt on whether the data/ 
directory here as a whole is a dataset
If you want to cooperate with Wikidata lexemes in the future, it's worth 
considering how to make it easier. As for LinguaLibre, as far as I 
understand it helps produce recordings which might be considered 
copyrightable, and it wants its outputs to be available under CC BY-SA, 
so it benefits from its sources being as permissive as possible.
Finally, I see that [many 
files](https://github.com/search?q=repo%3Aunicode-org%2Funilex+SPDX-License-Identif...) 
carry an `SPDX-License-Identifier: Unicode-DFS-2016` header, which makes 
it easier to follow the [REUSE](https://reuse.software/) guidelines. 
Note Richard Fontana's suggestion for trivial files at 
https://github.com/fsfe/reuse-docs/issues/62#issuecomment-1200305896 
(and my personal opinion below it).
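For a quick look at which files already carry such a header, something 
like the following rough Python sketch could work. It is only an 
illustration: the function names `spdx_identifier` and `survey`, and the 
choice to look at just the first 10 lines of each file, are my own 
assumptions and not part of the REUSE specification (the `reuse lint` 
tool does the real, thorough check).

```python
import re
from pathlib import Path

# Matches a header such as "SPDX-License-Identifier: Unicode-DFS-2016".
# Only the first whitespace-delimited token is captured, so compound
# expressions ("A OR B") are deliberately not handled in this sketch.
SPDX_RE = re.compile(r"SPDX-License-Identifier:\s*(\S+)")


def spdx_identifier(path, max_lines=10):
    """Return the first SPDX identifier in the leading lines, or None."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        # zip() stops after max_lines lines or at end of file.
        for _, line in zip(range(max_lines), handle):
            match = SPDX_RE.search(line)
            if match:
                return match.group(1)
    return None


def survey(root):
    """Map each file under root (relative path) to its identifier or None."""
    root = Path(root)
    result = {}
    for path in sorted(root.rglob("*")):
        relative = path.relative_to(root)
        # Skip directories and anything inside the .git metadata folder.
        if path.is_file() and ".git" not in relative.parts:
            result[relative.as_posix()] = spdx_identifier(path)
    return result
```

Calling `survey("data")` in a checkout would then return a dictionary in 
which files without a header map to `None`, which gives a simple list of 
what is still unmarked.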
So in conclusion my personal suggestions are:
* mark the repository even more clearly as being under the OSI-approved 
Unicode license v3;
* keep marking the copyright status of individual files, and consider 
even more permissive licenses like MIT-0 (or 0BSD or CC-0) when adding 
uncopyrightable files;
* keep in mind possible copyright needs for Wikidata and Wikimedia 
Commons in the future, and ask WMF Legal (legal@wikimedia.org) for help 
with any clarifications needed on CC-0 and CC BY-SA compatibility 
(fyi @slaporte).
----
Cheers,
    Federico