New subject: [Wikimedia-l] An answer to Lydia Pintscher regarding its considerations on Wikidata and CC-0

29 Nov 2017

Hello everyone,

I am forwarding here the message I initially posted on the Meta
Tremendous Wiktionary User Group talk page
<https://meta.wikimedia.org/wiki/Talk:Wiktionary/Tremendous_Wiktionary_User_Group#An_answer_to_Lydia_general_thinking_about_Wikidata_and_CC-0>,
because I would like wider feedback from the community on this
point. Whether you think that my view is completely misguided or that I
might have a few relevant points, I'm extremely interested to know,
so please be bold.

Before you read any further, keep in mind that I remain convinced that
Wikidata is a wonderful project, and I wish it a bright future full of
even more amazing things than it has already brought so far. My sole
concern is really a license issue.

Below is a copy/paste of the above linked message:

Thank you Lydia Pintscher
<https://meta.wikimedia.org/wiki/User:Lydia_Pintscher_%28WMDE%29> for
taking the time to answer. Unfortunately this answer
<https://www.wikidata.org/wiki/User:Lydia_Pintscher_%28WMDE%29/CC-0>
misses too many important points to resolve all the concerns that have
been raised.

Notably, it still gives no hint of where the decision to use CC0
exclusively for Wikidata came from. But as this inquiry on the topic
<https://en.wikiversity.org/wiki/fr:Recherche:La_licence_CC-0_de_Wikidata,_origine_du_choix,_enjeux,_et_prospections_sur_les_aspects_de_gouvernance_communautaire_et_d%E2%80%99%C3%A9quit%C3%A9_contributive>
advances, an answer is emerging from it. It seems that Wikidata's choice
of CC0 was heavily influenced by Denny Vrandečić, who, to make it
short, is now working in the Google Knowledge Graph team. It is also
worth noting that Google funded a quarter of the initial development
work. Another quarter came from the Gordon and Betty Moore Foundation,
established by Intel's co-founder. And half the money came from Microsoft
co-founder Paul Allen's Institute for Artificial Intelligence (AI2)[1]
<https://meta.wikimedia.org/wiki/Talk:Wiktionary/Tremendous_Wiktionary_User_Group#cite_note-1>.
To put it briefly, in a conspiratorial fashion: Wikidata is the puppet
Trojan horse of hegemonic big tech companies in the realm of
Wikimedia. For a less dramatic, more carefully argued version, please
see the research project (work in progress; only chapter 1 is in good
enough shape, and it is only available in French so far). Proof that
this claim is completely wrong is welcome, as it would be great to learn
that the community was in fact the driving force behind this
single-license choice and that it is the best choice for its future, not
for the future of giant tech companies. Shedding such a happy light on
this subject would be a great contribution, so that we can all leave
this issue alone and go back to contributing on more interesting topics.

Now let's examine the thoughts proposed by Lydia.

Wikidata is here to give more people more access to more knowledge.
    So far, this matches the Wikimedia movement's stated goal. 
This means we want our data to be used as widely as possible.
    Sure, as long as it rhymes with equity. As in /Our strategic
    direction: Service and //*Equity*/
    <https://meta.wikimedia.org/wiki/Strategy/Wikimedia_movement/2017/Direction/Endorsement#Our_strategic_direction:_Service_and_Equity>.
    Just like we want freedom for everybody as widely as possible. That
    is, starting from the point where it affirms each other's freedom,
    because below that level the freedom of one means the murder and
    enslavement of others. 
CC-0 is one step towards that.
    That's a thesis; you can offer to defend it, but no one has to
    agree without some convincing proof. 
Data is different from many other things we produce in Wikimedia in that
it is aggregated, combined, mashed-up, filtered, and so on much more
extensively.
    No, it's not. From a data processing point of view, everything is
    data. Whether it's stored in wiki syntax, in a relational database
    or engraved in stone only affects convenience. Whether it's
    a random stream of bits generated by a dumb chipset or some encoded
    prose of Shakespeare makes no difference. So from this point of view,
    no, what Wikidata stores is not different from what is produced
    anywhere else in Wikimedia projects. 
    Sure, the way it's structured does make many things far easier. But
    this is not because it's data, whereas elsewhere there would be no
    data. It's because it forces data to be stored in a way that eases
    aggregation, combination, mashing-up, filtering and so on. 

Our data lives from being able to write queries over millions of
statements, putting it into a mobile app, visualizing parts of it on a
map and much more.
    Sure. It also lives from being curated by millions[2]
    <https://meta.wikimedia.org/wiki/Talk:Wiktionary/Tremendous_Wiktionary_User_Group#cite_note-2>
    of volunteer contributors, or it would be just a useless pile of
    random bytes. 
This means, if we require attribution, in a huge number of cases
attribution would need to go back to potentially millions of editors and
sources (even if that data is not visible in the end result but only
helped to get the result).
    No, it doesn't mean that. 
    First let's recall a few basics, as the whole answer seems to
    confuse attribution with distribution of contributions under the
    same license as the original. Attribution is crucial for
    traceability, and so for the reliable and trusted knowledge that we
    are aiming at within the Wikimedia movement. The "same license"
    requirement is the sole legal guarantee of equity that contributors
    have. In short, trusted knowledge and equity are requirements for
    the Wikimedia movement's goals. That means withdrawing these
    requirements is withdrawing these goals. 
    Now, what would be the additional cost of storing sources in
    Wikidata? Well, zero cost. Actually, it's already here, as the
    "reference" attribute is part of the Wikibase item structure. So
    attribution is not a problem: you don't have to put it in front of
    your derived work. Just look at a Wikipedia article: until you go to
    the history, you have zero attribution visible, and it's OK. It also
    probably has zero or negligible computing cost, as it doesn't have
    to be included in all computations; it just needs to be retrievable
    on demand. 
    What would be the additional cost of storing licenses for each item
    based on its source? Well, adding a license attribute might help,
    but actually if your reference is a work item, I guess it might
    come with a "license" statement, so zero additional cost. Now,
    letting users specify under which free licenses they publish their
    work would just require an additional attribute, a ridiculously
    small weight when balanced against the equity concerns it resolves. 
    Could that prevent some uses for some actors? Yes, that's actually
    the point: preventing abuse by those who don't want to act
    equitably. For all other actors a "distribute under same condition"
    is fine. 
This is potentially computationally hard to do and depending on
where the data is used very inconvenient (think of a map with hundreds
of data points in a mobile app).
    OpenStreetMap, which uses ODbL, a copyleft attribution license, does
    exactly that too, doesn't it? By the way, allowing a license per item
    would make it possible to include OpenStreetMap data in Wikidata,
    which is currently impossible due to the CC0 single license policy
    of the project. Too bad, it could be so useful to have this data
    accessible for Wikimedia projects, but who cares? 
This is a burden on our re-users that I do not want to impose on them.
    Wait, which re-users? Surely one might expect that Wikidata would
    care first about re-users who are in line with the Wikimedia goal,
    so surely the needs of the Wikimedia community in particular and of
    Free/Libre Culture in general should be considered. Would these
    re-users be penalized by a copyleft license? Surely not, or they
    wouldn't use it as extensively as they do. So who are these re-users
    whom it is thought preferable, without consulting the community, not
    to bother with questions of equity and traceability? 
It would make it significantly harder to re-use our data and be in
direct conflict with our goal of spreading knowledge.
    No, technically it would be just as easy as pressing one button on a
    computer rather than another. What is in direct conflict
    with our clearly stated goals emerging from the 2017 community
    consultation is going against equity and traceability. You propose
    to discard both to satisfy exogenous demands, which should have next
    to no weight in decisions impacting so deeply the future of our
    community. 
Whether data can be protected in this way at all or not depends on the
jurisdiction we are talking about. See this Wikilegal on database
rights <https://meta.wikimedia.org/wiki/Wikilegal/Database_Rights> for
more details.
    It basically says that it's applicable in the United States and
    Europe, on different legal bases and to different extents. And for
    the rest of the world, it doesn't say that nothing can apply; it
    states nothing. 
So even if we would have decided to require attribution it would only be
enforceable in some jurisdictions.
    What kind of logic is that? It might not be applicable in some
    countries, so let's withdraw the few rights we have. 
Ambiguity, when it comes to legal matters, also unfortunately often
means that people refrain from what they want to do for fear of legal
repercussions. This is directly in conflict with our goal of spreading
knowledge.
    Economic inequality, social inequity and legal imbalance might also
    keep people from doing what they want, as they fear practical
    repercussions. CC0 strengthens these discrimination factors by
    forcing people to give up the few rights they have to counterbalance
    the growing asymmetry that social structures are
    concomitantly building. So CC0 as the only license choice is in
    direct conflict with our goal of *equitably* spreading knowledge. 
    Also, this statement seems to suggest that releasing our
    contributions only under CC0 is the sole solution to diminish legal
    doubts. Actually any well-written license would do an equal job
    on this point, including many copyleft licenses out there. So
    while associating a clear license with each data item might indeed
    diminish legal uncertainty, it's not an argument at all for
    enforcing CC0 as the sole license available to contributors. 
    Moreover, just putting a license side by side with a work does not
    ensure that the person who made the association was legally allowed
    to do so. To have better confidence in the legitimacy of a
    statement that a work is covered by a certain license, there is once
    again a traceability requirement. For example, Wikidata currently
    includes many items which were imported from various Wikipedia
    versions, and claims that the derived work obtained, a set of items
    and statements, is under CC0. That is a hugely doubtful statement
    and it looks alarmingly like license laundering
    <https://en.wikipedia.org/wiki/license_laundering>. This is true for
    Wikipedia, but it's also true for any source on which large-scale
    extraction and import are operated, whether through bots or
    crowdsourcing. 
    So the Wikidata project is currently in an extremely poor position
    to give lessons on legal ambiguity, as it heavily plays with legal
    blur and the hope that its shady practices won't fall under too much
    scrutiny. 
Licenses that require attribution are often used as a way to try to make
it harder for big companies to profit from openly available resources.
    No, they are not. They are used as /a way to try to make it harder
    for big companies to profit from openly available resources/ *in
    inequitable manners*. That's completely different. Copyleft licenses
    give the same rights to big companies and individuals, in a manner
    that lowers the socio-economic inequalities which disproportionately
    advantage the former. 
The thing is there seems to be no indication of this working.
    Because it's not trying to enforce what you claim, so of course
    it's not working for this goal. But for the goal that copyleft
    licenses aim at, there is clear evidence that yes, it works. 
Big companies have the legal and engineering resources to handle both
the legal minefield and the technical hurdles easily.
    There is no pitfall in copyleft licenses. Using a war-material
    analogy is disrespectful. It's true that copyleft licenses might
    come with some constraints that non-copyleft free licenses don't
    have, but that is the price of fostering equity. And it's a low
    price that even individuals can manage: it might require a little
    extra time on legal considerations, but on the other hand using the
    free work is an immense gain that is worth it. In Why you shouldn't
    use the Lesser GPL for your next library
    <https://www.gnu.org/licenses/why-not-lgpl.html> it is stated that
    /proprietary software developers have the advantage of money; free
    software developers need to make advantages for each other/. This
    might be generalised as /big companies have the advantage of money;
    free/libre culture contributors need to make advantages for each
    other/. So, contrary to what these fallacious claims against
    copyleft licenses assert, they are not a "minefield and the technical
    hurdles" that only big companies can handle. What's more, let's
    recall who financed the initial development of Wikidata: only actors
    which are related to big companies. 
Who it is really hurting is the smaller start-up, institution or hacker
who can not deal with it.
    If this statement is about copyleft licenses, then it is
    plainly false. Smaller actors have more to gain from preserving the
    mutual benefit of the common ecosystem that a copyleft license
    fosters. 
With Wikidata we are making structured data about the world available
for everyone.
    And that's great. But achieving that doesn't require CC0 as the
    sole license. 
We are leveling the playing field to give those who currently don’t have
access to the knowledge graphs of the big companies a chance to build
something amazing.
    And that's great. But that doesn't require CC0 as the sole license.
    Actually CC0 makes it a less sustainable project on this point, as
    it allows unfair actors to take it all, add some interesting added
    value that our community cannot afford, and reach or reinforce a
    hegemonic position in the ecosystem with their own closed solution.
    And, ta-da, Wikidata can be discontinued quietly, just like Google
    did with the defunct Freebase, which was CC-BY-SA before they bought
    the company that was running it, and which they then imported under
    CC0 into Wikidata as a new attempt to gather a larger community of
    free curators. And when it has performed license laundering of
    all Wikimedia projects' works with shady mass extraction and import,
    Wikimedia can disappear as well. Of course big companies benefit
    more from these possibilities than actors with smaller financial
    support and no hegemonic position. 
Thereby we are helping more people get access to knowledge from more
places than just the few big ones.
    No, with CC0 you are certainly helping big companies to reinforce
    their position, in which they can distribute information manipulated
    as they wish, without regard for traceability and equity.
    Allowing contributors to also use copyleft licenses
    would be far more effective to /collect and use different forms of
    free, trusted knowledge/ that /focus efforts on the knowledge and
    communities that have been left out by structures of power and
    privilege/, as stated in /Our strategic direction: Service and Equity/. 

CC-0 is becoming more and more common.
    Just like economic inequality
    <https://en.wikipedia.org/wiki/economic_inequality>. But that is not
    what we are aiming to foster in the Wikimedia movement. 
Many organisations are releasing their data under CC-0 and are happy
with the experience. Among them are the European Union, Europeana, the
National Library of Sweden and the Metropolitan Museum of Modern Arts.
    Good for them. But they are not the Wikimedia community; they have
    their own goals and plans for sustainability, which do not
    necessarily match what our community can follow. Different contexts
    require different means. States and their institutions can count on
    tax revenue, and if taxpayers' money ends up in public domain works,
    that's great and seems fair. States are rarely threatened by
    companies; they have legal levers to pressure that kind of entity,
    although conflicts of interest and lobbying can of course temper
    this statement. 
    Importing that kind of data with proper attribution and license is
    fine, be it CC0 or any other free license. But that's not an
    argument in favour of imposing on volunteers a systematic waiver
    of all their rights as the single option to contribute. 
All this being said we do encourage all re-users of our data to give
attribution to Wikidata because we believe it is in the interest of all
parties involved.
    That's it, zero legal hope of equity. 
And our experience shows that many of our re-users do give credit to
Wikidata even if they are not forced to.
    Experience also shows that some prominent actors like Google won't
    credit the Wikimedia community anymore when directly generating
    answers based on, inter alia, information coming from Wikidata, which
    is itself performing license laundering of Wikipedia data. 
Are there no downsides to this? No, of course not. Some people chose not
to participate, some data can't be imported and some re-users do not
attribute us. But the benefits I have seen over the years for Wikidata
and the larger open knowledge ecosystem far outweigh them.
    This should at least be backed by some solid statistics showing that
    it had a positive impact in terms of audience and contribution in
    Wikimedia projects as a whole. Maybe the introduction of Wikidata
    did have a positive effect on the evolution of the total number of
    contributors, or maybe so far it has no significant correlation with
    it, or maybe it correlates with a decrease in the total number of
    active contributors. Some plots would be interesting here. Mere
    personal feelings about benefits and hindrances mean nothing here,
    mine included of course. 
    Plus, there is not even the beginning of an attempt to A/B test with
    a second Wikibase instance that allows users to select which licenses
    their contributions are released under, so there is no possible way
    to state anything backed by a relevant comparison. The fact that
    there are some people satisfied with the current state of things
    doesn't mean they would not be even more satisfied with a more
    equitable solution that allows contributors to choose a free license
    set for their publications. What's more, this is all about the
    sustainability and fostering of our community and reaching its
    goals, not an immediate feeling of satisfaction for some people. 

[1] Wikipedia Signpost, 2 December 2015
    <https://en.wikipedia.org/wiki/en:Wikipedia:Wikipedia_Signpost/2015-12-02/Op-ed>

[2] According to the next statement of Lydia

Once again, let me recall that this is not a manifesto against Wikidata.
The motivation behind this message is the hope that one day one might
participate in Wikidata with the same respect for equity and
traceability that is granted in other Wikimedia projects.

With much wiki-love,
mathieu