Re: [Wikimedia-l] Quality issues

27 Nov 2015

Hoi,
When a benefit is "Wikimedia specific" and thereby dismissed, you miss much
of what is going on. Exactly because of this link most items are well
defined as to what they are about. It is not perfect but it is good.
Consequently Wikidata is able to link Wikipedia in any language to sources
external to Wikipedia. This is a big improvement over linking external
sources to a Wikipedia. The disambiguation of subjects is done at the
Wikidata end.

You make Wikidata out to be a "default reference source". Given its current
state, that is a bit much. Wikidata does not have the maturity to function as
such. The best pointer to this fact is that 50% of all items have two or
fewer statements.

When you compare the quality of Wikipedias with what en.wp used to be you
are comparing apples and oranges. The Myanmar Wikipedia is better informed
on Myanmar than en.wp etc.

When you qualify a Wikipedia as fascist, it does not follow that the data
is suspect. Certainly when data in a source that you so easily dismiss is
typically the same, there is not much meaning in what you say from a
Wikidata point of view.

I am thrilled that sources are so important to the Wikimedia movement and
again, I am wondering what you hope to achieve by this pronouncement. Be
realistic: what is it that you want to achieve? Is quality important to you,
how do you define it and, more importantly, how do you want to achieve
it? Have you seen the statistics on sources [1]? Then have a better look
and you will find that real sources are mostly absent. Adding sources one
statement at a time will not significantly improve quality because that is
a numbers game, and it is easier to achieve quality in a different way.

When a librarian says that many sources copy each other's data and that this
is a problem, the bigger problem is missed. The bigger problem is not where
they agree but where they disagree. Arguably those are the statements where
quality is more likely an issue. Now ask your librarian what is likely to
improve Wikidata more: finding sources for the statements that differ or
finding sources where the statements agree. Wikidata is not authoritative but
when our community starts researching such issues both Wikidata and other
sources will rapidly improve in quality. This is not to say that in the
end you do not want sources both where sources agree and where they disagree.

Then ask your librarian if there is a problem with missing data. We can
import data from sources and consequently be more informative, or we do not
import more data and people have to magically combine information that
exists in many sources to get a composite view. We could see Wikidata as a
place where data is combined and compared with other sources. Do tell your
librarian that the process mentioned above should be iterative and it will
be easily understood that comparing with just one additional source will
improve the focus on likely issues even more.
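
To make the comparison concrete, here is a minimal sketch in Python. It is
an illustration only: the item identifiers, the values and the function
names are hypothetical, and each source is reduced to a simple mapping from
an item to one value (say, a year of birth). It is not how any existing
tool works.

```python
def compare_sources(ours, theirs):
    """Split two sources into the sets that matter for curation.

    Both arguments are dicts mapping a (hypothetical) item id to a value.
    """
    shared = ours.keys() & theirs.keys()          # items both sources know
    agree = {k for k in shared if ours[k] == theirs[k]}
    differ = shared - agree                       # likely quality issues
    missing = theirs.keys() - ours.keys()         # candidates for import
    return agree, differ, missing


def votes(item, sources):
    """Count how many sources support each value for one item."""
    counts = {}
    for source in sources:
        if item in source:
            value = source[item]
            counts[value] = counts.get(value, 0) + 1
    return counts


# Made-up example data, not real Wikidata content.
wikidata = {"item-a": "1950", "item-b": "1960", "item-c": "1970"}
library = {"item-a": "1950", "item-b": "1961", "item-d": "1980"}
third = {"item-b": "1961", "item-c": "1970"}

agree, differ, missing = compare_sources(wikidata, library)
tally = votes("item-b", [wikidata, library, third])
```

Here `differ` is the small set people would be asked to research first and
`missing` is what could be imported. The `votes` step shows the iteration:
one additional source already indicates which side of a disagreement is
more likely to need fixing, though a person still has to check it.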

PS What does your librarian think when she knows that the Dutch National
Library is inclined to provide us with software so that books can be
ordered at Dutch libraries from Wikidata data (and by inference from
Wikipedias)?

When some see Wikidata as a source of reference, they will increasingly be
served a better product. At this moment it is not good at all.

When German Wikimedians have concerns about quality, WONDERFUL, but what have
they done to improve things? Do they apply Wikipedia standards and how does
that help?

You wonder why we have "bad" data in the first place... Our data IS bad and
there is not enough of it for it to be really useful. We can easily add
more data and have a more useful result. We can easily compare sources and
ask people to concentrate on differences. However, you cannot tell me to
add sources to the data that I add. I will tell you to do it yourself. I am
happy to improve on quality but on my terms, not yours.

You mention the propagation of errors. How would that work? You indicate
that there are not enough people to fix all the issues. With bots like
Kian, we have a probability for the data that is added. We have people add
data where the software is not certain. You doubt technology but you do not
know where we are or what has already been done.
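
As a rough illustration of that division of labour, the sketch below routes
suggested statements by probability. The thresholds, the names and the
example values are assumptions made for this illustration; they are not
taken from Kian or from any other bot.

```python
def triage(predictions, auto_threshold=0.9, reject_threshold=0.1):
    """Route suggested statements by the probability a model gives them.

    predictions is a list of (item, probability) pairs. Both thresholds
    are hypothetical values chosen only for this example.
    """
    auto_add, needs_review = [], []
    for item, probability in predictions:
        if probability >= auto_threshold:
            auto_add.append(item)        # software is confident: add it
        elif probability > reject_threshold:
            needs_review.append(item)    # uncertain: leave it to a person
        # at or below reject_threshold the suggestion is dropped
    return auto_add, needs_review


# Made-up predictions for three hypothetical items.
auto_add, needs_review = triage(
    [("item-a", 0.97), ("item-b", 0.55), ("item-c", 0.03)]
)
```

Only the middle band ends up in front of people, which is why the number
of volunteers needed does not grow with the total amount of data added.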

In short, my feeling is that you do not know what you are talking about.
There is real scholarship in the approach that I described. My take is in
applying set theory. Kian is AI. For all I care yours is FUD.

Your notion of accountability is that of a consumer; it is not the
accountability needed for a project that is immature and is not at all at a
stage where you should imply that it is good enough and that quality is
assured. There are domains in Wikidata that I will not touch because in my
opinion they are wrong in their principles. At the same time I know that it
can be fixed in time and leave it at that.

I disagree with Heather Ford and Mark Graham. As long as Wikidata does not
have the power of a Reasonator, the data is just that. It does not turn
itself into information and consequently it is awful. If there is one thing
the Wikidata engineers do not do, it is considering the use of the data and
the workflows to improve the data and the quality.

The data needs to be CC-0 because it is how we ensure that everybody will
be happy and willing to participate. As more participation happens and more
collaboration occurs we will see Wikidata increase in the amount of data
that it holds and at the same time we will see quality improve.

Yes, Wikidata could do more in the way of adding sources to data. As long
as the "primary sources tool" does not add the sources it knows, what do
you expect from anybody else?
Thanks,
     GerardM

[1] https://tools.wmflabs.org/wikidata-todo/stats.php?reverse

On 27 November 2015 at 12:08, Andreas Kolbe <jayen466(a)gmail.com> wrote:

...
  Gerard,

 On Tue, Nov 24, 2015 at 7:15 AM, Gerard Meijssen <gerard.meijssen(a)gmail.com>
 wrote:

 Hoi,
 To start off, results from the past are no indication of results in the
 future. It is the disclaimer insurance companies have to state in all their
 adverts in the Netherlands. When you continue and make it a "theological"
 issue, you lose me because I am not of this faith, far from it. Wikidata is
 its own project and it is utterly dissimilar from Wikipedia. To start off,
 Wikidata has been a certified success from the start. The improvement it
 brought by bringing all interwiki links together is enormous. That alone
 should be a pointer that Wikipedia think is not realistic.

 These benefits are internal to Wikimedia and a completely separate issue
 from third-party re-use of Wikidata content as a default reference source,
 which is the issue of concern here.

 To continue, people have been importing data into Wikidata from the start.
 They are the statements you know and it was possible to import them from
 Wikipedia because of these interwiki links. So when you call for sources,
 it is fairly safe to assume that those imports are supported by the quality
 of the statements of the Wikipedias

 The quality of three-quarters of the 280+ Wikipedia language versions is
 about at the level the English Wikipedia had reached in 2002.

 Even some of the larger Wikipedias have significant problems. The Kazakh
 Wikipedia for example is controlled by functionaries of an oppressive
 regime[1], and the Croatian one is reportedly[2] controlled by fascists
 rewriting history (unless things have improved markedly in the Croatian
 Wikipedia since that report, which would be news to me). The Azerbaijani
 Wikipedia seems to have problems as well.

 The Wikimedia movement has always had an important principle: that all
 content should be traceable to a "reliable source". Throughout the first
 decade of this movement and beyond, Wikimedia content has never been
 considered a reliable source. For example, you can't use a Wikipedia
 article as a reference in another Wikipedia article.

 Another important principle has been the disclaimer: pointing out to people
 that the data is anonymously crowdsourced, and that there is no guarantee
 of reliability or fitness for use.

 Both of these principles are now being jettisoned.

 Wikipedia content is considered a reliable source in Wikidata, and Wikidata
 content is used as a reliable source by Google, where it appears without
 any indication of its provenance. This is a reflection of the fact that
 Wikidata, unlike Wikipedia, comes with a CC0 licence. That decision was, I
 understand, made by Denny, who is both a Google employee and a WMF board
 member.

 The benefit to Google is very clear: this free, unattributed content adds
 value to Google's search engine result pages, and improves Google's revenue
 (currently running at about $10 million an hour, much of it from ads).

 But what is the benefit to the end user? The end user gets information of
 undisclosed provenance, which is presented to them as authoritative, even
 though it may be compromised. In what sense is that an improvement for
 society?

 To me, the ongoing information revolution is like the 19th century
 industrial revolution done over. It created whole new categories of abuse,
 which it took a century to (partly) eliminate. But first, capitalists had a
 field day, and the people who were screwed were the common folk. Could we
 not try to learn from history?

  and if anything, that is also where
 they typically fail because many assumptions at Wikipedia are plain wrong
 at Wikidata. For instance a listed building is not the organisation the
 building is known for. At Wikidata they each need their own item and
 associated statements.

 Wikidata is already a success for other reasons. VIAF no longer links to
 Wikipedia but to Wikidata. The biggest benefit of this move is for people
 who are not interested in English. Because of this change VIAF links
 through Wikidata to all Wikipedias, not only en.wp. Consequently people may
 find through VIAF Wikipedia articles in their own language through their
 library systems.

 At the recent Wikiconference USA, a Wikimedia veteran and professional
 librarian expressed the view to me that

 * circular referencing between VIAF and Wikidata will create a humongous
 muddle that nobody will be able to sort out again afterwards, because –
 unlike wiki mishaps in other topic areas – here it's the most authoritative
 sources that are being corrupted by circular referencing;

 * third parties are using Wikimedia content as a *reference standard* when
 that was never the intention (see above).

 I've seen German Wikimedians express concerns that quality assurance
 standards have dropped alarmingly since the project began, with bot users
 mass-importing unreliable data.

 So do not forget about Wikipedia and the lessons learned. These lessons are
 important to Wikipedia. However, they do not necessarily apply to Wikidata,
 particularly when you approach Wikidata as an opportunity to do things in a
 different way. Set theory, a branch of mathematics, is exactly what we
 need. When we have data at Wikidata of a given quality, e.g. 90%, and we have
 data at another source with a given quality, e.g. 90%, we can compare the two
 and find a subset where the two sources do not match. When we curate the
 differences, it is highly likely that we improve quality at Wikidata or at
 the other source.

 This sounds like "Let's do it quick and dirty and worry about the problems
 later".

 I sometimes get the feeling software engineers just love a programming
 challenge, because that's where they can hone and display their skills.
 Dirty data is one of those challenges: all the clever things one can do to
 clean up the data! There is tremendous optimism about what can be done. But
 why have bad data in the first place, starting with rubbish and then
 proving that it can be cleaned up a bit using clever software?

 The effort will make the engineer look good, sure, but there will always be
 collateral damage as errors propagate before they are fixed. The engineer's
 eyes are not typically on the content, but on their software. The content
 their bots and programs manipulate at times seems almost incidental,
 something for "others" to worry about – "others" who don't necessarily
 exist in sufficient numbers to ensure quality.

 In short, my feeling is that the engineering enthusiasm and expertise
 applied to Wikidata aren't balanced by a similar level of commitment to
 scholarship in generating the data, and getting them right first time.

 We've seen where that approach can lead with Wikipedia. Wikipedia hoaxes
 and falsehoods find their way into the blogosphere, the media, even the
 academic literature. The stakes with Wikidata are potentially much higher,
 because I fear errors in Wikidata stand a good chance of being massively
 propagated by Google's present and future automated information delivery
 mechanisms, which are completely opaque. Most internet users aren't even
 aware to what extent the Google Knowledge Graph relies on anonymously
 compiled, crowdsourced data; they will just assume that if Google says it,
 it must be true.

 In addition to honest mistakes, transcription errors, outdated info etc.,
 the whole thing is a propagandist's wet dream. Anonymous accounts!
 Guaranteed identity protection! Plausible deniability! No legal liability!
 Automated import and dissemination without human oversight! Massive impact
 on public opinion![3]

 If information is power, then this provides the best chance of a power grab
 humanity has seen since the invention of the newspaper. In the media
 landscape, you at least have right-wing, centrist and left-wing
 publications each presenting their version of the truth, and you know who's
 publishing what and what agenda they follow. You can pick and choose,
 compare and contrast, read between the lines. We won't have that online.
 Wikimedia-fuelled search engines like Google and Bing dominate the
 information supply.

 The right to enjoy a pluralist media landscape, populated by players who
 are accountable to the public, was hard won in centuries past. Some
 countries still don't enjoy that luxury today. Are we now blithely giving
 it away, in the name of progress, and for the greater glory of technocrats?

 I don't trust the way this is going. I see a distinct possibility that
 we'll end up with false information in Wikidata (or, rather, the Google
 Knowledge Graph) being used to "correct" accurate information in other
 sources, just because the Google/Wikidata content is ubiquitous. If you
 build circular referencing loops fuelled by spurious data, you don't
 provide access to knowledge, you destroy it. A lie told often enough etc.

 To quote Heather Ford and Mark Graham, "We know that the engineers and
 developers, volunteers and passionate technologists are often trying to do
 their best in difficult circumstances. But there need to be better attempts
 by people working on these platforms to explain how decisions are made
 about what is represented. These may just look like unimportant lines of
 code in some system somewhere, but they have a very real impact on the
 identities and futures of people who are often far removed from the
 conversations happening among engineers."

 I agree with that. The "what" should be more important than the "how", and
 at present it doesn't seem to be.

 It's well worth thinking about, and having a debate about what can be done
 to prevent the worst from happening.

 In particular, I would like to see the decision to publish Wikidata under a
 CC0 licence revisited. The public should know where the data it gets comes
 from; that's a basic issue of transparency.

 Andreas

 [1] https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2015-10-07/Op-ed
 [2] http://www.dailydot.com/politics/croatian-wikipedia-fascist-takeover-contro…
 [3] http://www.politico.com/magazine/story/2015/08/how-google-could-rig-the-201…
 _______________________________________________
 Wikimedia-l mailing list, guidelines at:
 https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines
 Wikimedia-l(a)lists.wikimedia.org
 Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l,
 <mailto:wikimedia-l-request@lists.wikimedia.org?subject=unsubscribe>
