Re: [Wikimedia-l] Quality issues

29 Nov 2015

Hoi,
It would be a gross violation of trust to bring Wikidata under a different
license. When an external source is willing to share its data, it can do
so. With explicit agreement we can copy data in from them in this way. Even
when this is not possible for whatever reason, we can still contribute
because we can compare data and on the basis of differences in existing
data curate our data and enable them to share our findings.

I am amused by your fear of manipulation. Yes, data can be manipulated but
once we see it happen, we can take measures when it affects the data we
hold. Provenance of data is at this stage something we at Wikidata wish
for. Arguably it does not make sense to make it a priority for all of our
data because it would stifle Wikidata and it is utterly against the wiki
spirit.

The best way to guard against manipulation is to cooperate widely and take
any difference in data seriously. It is in the differences that we want to
know why they exist. Focussing on known issues helps us identify systemic
issues and when we do we can expose such manipulation with proof. In this
way we are using a SMART methodology. No, I would never use the license as
a weapon; it is how manipulation is justified.

Importing data from Wikipedia is a sensible thing to do. Its data is
relatively well known for its quality. It has its issues but its basis is
NPOV. When people are alarmed about importing from Wikipedia, it tells us
more of what they think of the quality of Wikipedia than of the quality of
Wikidata. When people are alarmed because they cannot control it, ask
yourself what is their problem and how do their arguments enable the notion
of Wikidata as a wiki? When imported data is wrong, there are tools to
remove content quite delicately. So identify an issue and it can be dealt
with.

When you argue that Wikidata cannot be used as central storage: fine, do
not use it. In the meantime, specific sets of data are of higher quality
than in any Wikipedia. This is a proven fact. The question whether Wikidata
is useful as a central data repository at this time can only be answered
with NO when it is about all of Wikidata. When it is about specific
subsets of data the answer is clearly yes. It is also obvious that as time
goes on more subsets of data will be of a higher quality than in any
Wikipedia (when thinking in terms of sets of data - there will always be
items where a Wikipedia has an edge).

FYI I am in contact with a German university that is likely to use Wikidata
internally for its research data. It needs Reasonator type of functionality
to make it useful. It wants to share its data with Wikidata and wants
two-way RSS feeds in order to include new information.

When we set up cooperation with statistical offices, we CAN attribute
easily by having bots import data on their behalf using THEIR user id and
adding sources to the new data. We can also provide data from their website
in applications. It is not the license that matters; it is what we agree
to do. When we have sourced data in this way, you are silly to change it.
False attributions are not permitted under any license.

When we are afraid of a Seigenthaler type of event based on Wikidata,
rest assured there is plenty wrong in either Wikipedia or Wikidata that
makes it possible for it to happen. The most important thing is to deal
with it responsibly. Just being afraid will not help us in any way. Yes we
need quality and quantity. As long as we make a best effort to improve our
data, we will do well.

As to the Wikipedian in residence, that is his opinion. At the same time
the article on ebola has been very important. It may not be science but it
is certainly encyclopaedic. At the same time this Wikipedian in residence is
involved, makes a positive contribution and while he may make mistakes he
is part of the solution.

I am happy that you propose that work is to be done. What have you done but
more importantly what are you going to do? For me there is "Number of edits:
2,088,923" <https://www.wikidata.org/wiki/Special:Contributions/GerardM>
Thanks,
     GerardM

On 29 November 2015 at 15:10, Andreas Kolbe <jayen466(a)gmail.com> wrote:

...
  Gergo,

 On Sun, Nov 29, 2015 at 12:36 AM, Gergo Tisza <gtisza(a)wikimedia.org>
 wrote:

  By the same logic, to the extent Wikipedia takes its facts from non-free
  external sources, its free license would be a copyright violation. Luckily
  for us, that's not how copyright works.

 I'm aware that facts are not copyrightable. By the same logic, Wikidata
 being offered under a CC BY-SA license, say, would not prevent anyone from
 extracting facts -- knowledge -- from it, and it would enable Wikidata to
 import a lot of data it presently cannot, because of licence
 incompatibilities.

  Statements of facts can not be copyrighted; large-scale arrangements of
  facts (ie. a full database) probably can, but CC does not prevent others
  from using them without attribution, just distributing them (again, it's
  like the GPL/Affero difference);

 Distribution is the issue here – large-scale distribution and viral
 propagation of data with a well-documented potential for manipulation and
 error, in a way that makes the provenance of these data a closed book to
 the end user.

 Do you accept that this is a potential problem, and if so, how would you
 guard against it, if not through the licence?

  there are sui generis database rights in some countries but not in the
  USA where both Wikipedia and most proprietary reusers/competitors are
  located, so relying on neighbouring rights would not help there but cause
  legal uncertainty for reusers (e.g. OSM which has lots of legal trouble
  importing coordinates due to being EU-based).

 It seems noteworthy that Freebase specifically said, with regard to loading
 structured data, "If a data source is under CC-BY, you can load it into
 Freebase as long as you provide attribution."[1]

 Wikidata practice seems to have taken a different path regarding licence
 compatibility, given its systematic imports from Wikipedia.

 Interestingly enough, it's been pointed out to me that Denny said in
 2012,[2]

 ---o0o---

 Alexrk2, it is true that Wikidata under CC0 would not be allowed to import
 content from a Share-Alike data source. Wikidata does not plan to extract
 content out of Wikipedia at all. Wikidata will *provide* data that can be
 reused in the Wikipedias. And a CC0 source can be used by a Share-Alike
 project, be it either Wikipedia or OSM. But not the other way around. Do we
 agree on this understanding? --Denny Vrandečić (WMDE)
 <https://meta.wikimedia.org/wiki/User:Denny_Vrande%C4%8Di%C4%87_(WMDE)> (talk
 <https://meta.wikimedia.org/wiki/User_talk:Denny_Vrande%C4%8Di%C4%87_(WMDE)>)
 12:39, 4 July 2012 (UTC)

 ---o0o---

 The key sentence here is "Wikidata does not plan to extract content out of
 Wikipedia at all."

 That doesn't seem to be how things have turned out, because today we have
 people on Wikidata raising alarms about mass imports from Wikipedia:[3]

 ---o0o---

 Reliable Bot imports from wikipedias?

 In a Wikipedia discussion I came by chance across a link to the following
 discussion:

    - Wikidata:Project_chat/Archive/2015/10#STOP_with_bot_import
    <https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2015/10#STOP_wi…>

 [...] To provide an outside perspective as Wikipedian (and a potential
 use[r] of WD in the future). I wholeheartedly agree with Snipre, in fact
 "bots [ar]e running wild" and the uncontrolled import of data/information
 from Wikipedias is one of the main reasons for some Wikipedias developing
 an increasingly hostile attitude towards WD and its usage in Wikipedias.
 *If* WD is ever to function as a central data storage for various Wikimedia
 projects and in particular Wikipedia as well (in analogy to Commons),
 *then* quality has to take the driver's seat over quantity. A central
 storage needs a much better data integrity than the projects using it,
 because one mistake in its data will multiply throughout the projects
 relying on WD, which may cause all sorts of problems. For crude comparison
 think of a virus placed on a central server than on a single client. The
 consequences are much more severe and nobody in their right mind would run
 the server with even less protection/restrictions than the client.

 Another thin[g] is, that if you envision users of other Wikimedia projects
 such as Wikipedia or even 3rd party external projects to eventually help
 with data maintenance when they start using WD, then you might find them
 rather unwilling to do so, if not enough attention is paid to quality,
 instead they probably just dump WD from their projects.

 In general all the advantages of the central data storage depend on the
 quality (reliability) of data. If that is not given to reasonable high
 degree, there is no point to have central data storage at all. All the
 great application become useless if they operate on false data.--Kmhkmh
 <https://www.wikidata.org/w/index.php?title=User:Kmhkmh&action=edit&…> (talk
 <https://www.wikidata.org/wiki/User_talk:Kmhkmh>) 12:00, 19 November 2015
 (UTC)

 ---o0o---

 (I was unaware of that post by Kmhkmh when I started contributing to this
 discussion, but it obviously echoes some of my own concerns.)

 I've been told on the German Wikipedia that the Wikidata CC0 licence has
 long been a controversial issue, subject to recurrent discussion,
 especially with regard to official population statistics in Europe, whose
 publishers often require attribution, making their wholesale import in
 Wikidata's CC0 environment problematic.[4]

 In reviewing these discussions, I couldn't help but be reminded of
 Flickrwashing schemes by some contributors' lines of thought: how -- via
 which intermediary steps -- can we get the info into our CC0 project
 without being seen to fall foul of the original publishers' licenses?

 As I understand it, the intent is to bully other data publishers into
 making their data available under CC0 as well. I understand this from an
 open-content perspective, and I can see how it might benefit Google's and
 other information platforms' bottom line, but I reiterate -- there are
 very, very significant downsides to having a central database subject to
 anonymous manipulation by all comers whose data is automatically propagated
 by major search engines. There are many autocratic regimes in the world
 today who spend a lot of money and effort to achieve this kind of uniform
 media response in their countries.

 In my opinion, it creates a significant vulnerability in the global
 information infrastructure. If, in more troubled times ahead, people are
 fed the same unattributed lie by all major online outlets, because they are
 all automatically propagating the content of Wikimedia's CC0 database, then
 this could potentially alter the course of history, and not in a good way.

 I am happy to hear ideas about how to address this that do not involve
 licensing. We need more transparency about data provenance.

 You may argue that Wikidata is still in its early days, and has nowhere
 near the amount of data, nowhere near the reach and impact today to justify
 such an effort. Maybe it never will, and I'm worrying for nothing.

 But we thought much the same about Wikipedia around the time of the
 Seigenthaler incident. Before we knew it, Wikipedia had become the world's
 dominant information resource, with increasing numbers of government
 officials, judges, journalists and academics happy to accept its word
 uncritically – in a way that horrifies most Wikipedians, who are well aware
 of the system's weaknesses.

 Last month for example the Wikipedian in Residence at NIOSH (National
 Institute for Occupational Safety and Health) said on Wikidata that he
 would "cringe" at the thought of using Wikipedia as a source and personally
 refrained from it:[5]

 ---o0o---

    - As a note, I do semi-automated edits on my work account
    <https://www.wikidata.org/wiki/User:James_Hare_(NIOSH)>, and I plan on
    doing some as a volunteer as well. I don't use Wikipedia as a source (as
    a Wikipedian of 11 years, I cringe at the thought ;), but if any batch
    edits I do manage to screw something up despite my meticulous planning,
    please let me know immediately. I will take responsibility for my own
    messes. Harej <https://www.wikidata.org/wiki/User:Harej> (talk
    <https://www.wikidata.org/wiki/User_talk:Harej>) 17:38, 27 October 2015
    (UTC)

 ---o0o---

 If Wikidata were to acquire the global reach its makers and sponsors hope
 for, then we would have done well to build a robust system that minimises
 harm, and cannot become a victim of its own success. I propose that there
 is work to be done here.

 Coming back briefly to the legal licensing situation, it seems to be fairly
 complex even in the US, according to the relevant Wikilegal page on
 Meta[6], with much depending on the amount of material extracted, as you
 pointed out above.

 Things are more complicated still in the EU, given that European law
 protects databases created by EU citizens or residents (which includes a
 good number of Wikimedians), with that protection extending to "sweat of
 the brow" (unprotected in the US). EU law even prohibits the "repeated and
 systematic extraction" of "insubstantial parts of the contents" of a
 database (where the term "database" is defined broadly enough to include a
 Wikipedia).

 There's not much point in my saying more about the legal aspects of
 licensing; even the advice from the Foundation's legal professionals says
 it's rarely easy to predict how a court might rule under either EU or US
 law.[6]

 Andreas

 [1] http://wiki.freebase.com/wiki/License_compatibility
 [2] https://meta.wikimedia.org/wiki/Talk:Wikidata#Is_CC_the_right_license_for_d…
 [3] https://www.wikidata.org/wiki/Wikidata:Project_chat#Reliable_Bot_imports_fr…
 [4] https://www.mail-archive.com/wikidata-l@lists.wikimedia.org/msg00178.html
     https://www.wikidata.org/w/index.php?title=Wikidata:General_disclaimer&…
     http://osdir.com/ml/general/2012-11/msg31088.html
     http://www.mail-archive.com/wikidata-l@lists.wikimedia.org/msg03088.html
     https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2014/04#Modifyi…
     https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2014/04#Data_re…
     https://www.wikidata.org/wiki/Wikidata:Contact_the_development_team/Archive…
     http://www.gossamer-threads.com/lists/wiki/foundation/450291#450291
     https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2014/05#Populat…
     https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2014/04#Data_ow…
 [5] https://www.wikidata.org/w/index.php?title=Wikidata:Project_chat&diff=p…
 [6] https://meta.wikimedia.org/wiki/Wikilegal/Database_Rights
 _______________________________________________
 Wikimedia-l mailing list, guidelines at:
 https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines
 Wikimedia-l(a)lists.wikimedia.org
 Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l,
 <mailto:wikimedia-l-request@lists.wikimedia.org?subject=unsubscribe>
