Re: [Wikimedia-l] Quality issues

27 Nov 2015

Hoi,

Sources are important. When we do not have data at Wikidata and we add it
from anywhere, we have the basis to do some good. At this time we do not
really add source information. It is too cumbersome and as long as the
"primary sources tool", an "official" tool, does not do it, why
bother?

My point about sources is very much that when one source does not agree,
there is a likely quality issue. When five sources agree, there is nothing
that marks them as suspect and there is no reason why I would look for a
Source for that statement anytime soon. When you work on quality, you do
not care about sources that agree, you care about those that do not. When
multiple sources copied each other's data, it means that they all provide
information. That is superior to waving hands, not being informative and not
including missing data because of a lack of a Source.
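
The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the item ids, source names and dates are invented, and `find_disagreements` is a hypothetical helper, not an existing Wikidata tool or API. It groups statements by item and property and keeps only those where the sources give different values.

```python
from collections import defaultdict


def find_disagreements(sources):
    """Group statements by (item, property) and keep those where sources differ.

    `sources` maps a source name to a dict of {(item, property): value}.
    Returns {(item, property): {value: [source names]}} for conflicts only.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for source_name, statements in sources.items():
        for key, value in statements.items():
            seen[key][value].append(source_name)
    # A statement deserves attention when more than one distinct value exists.
    return {key: dict(values) for key, values in seen.items() if len(values) > 1}


# Hypothetical data: ids, source names and dates are invented for illustration.
SOURCES = {
    "wikidata": {
        ("item-A", "date of birth"): "1950-01-01",
        ("item-B", "date of birth"): "1961-05-17",
    },
    "library-catalogue": {
        ("item-A", "date of birth"): "1950-01-01",
        ("item-B", "date of birth"): "1961-05-27",
    },
    "wikipedia-import": {
        ("item-A", "date of birth"): "1950-01-01",
    },
}

DISAGREEMENTS = find_disagreements(SOURCES)
for (item, prop), values in sorted(DISAGREEMENTS.items()):
    print(item, prop, values)
```

Here the three sources agree on item-A, so it is not reported; only item-B, where two sources differ, ends up on the list for people to check. Adding one more source to `SOURCES` and running the comparison again narrows the focus further.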

We have to ask ourselves what our aim is: to share in the sum of all
knowledge and occasionally be wrong, or to keep a lot of knowledge from the
people and pretend that it is all correct?
Thanks,
      GerardM

PS we are wiki based.

On 27 November 2015 at 20:14, Lila Tretikov <lila(a)wikimedia.org> wrote:

...
  Hoi Gerard,

 What I hear in email from Andreas and Liam is not as much the propagation
 of the error (which I am sure happens with some % of the cases), but the
 fact that the original source is obscured and therefore it is hard to
 identify and correct errors, biases, etc. Because if the source of error is
 obscured, that error is that much harder to find and to correct. In fact,
 we see this even on Wikipedia articles today (wrong dates of birth sourced
 from publications that don't do enough fact checking are something I came
 across personally). Traceability to the source is a powerful and important
 principle on Wikipedia,
 but with content re-use it gets lost. Public domain/CC0 in combination with
 AI lands our content for slicing and dicing and re-arranging by others,
 making it something entirely new, but also detached from our process of
 validation and verification. I am curious to hear if people think it is a
 problem. It definitely worries me.

 We have been looking very closely at Wikidata and the possibilities it
 offers. I am curious to understand more about your note on Reasonator:

 "As long as Wikidata does not
 have the power of a Reasonator, the data is just that. It does not make
 itself in information and consequently it is awful. When there is one thing
 the Wikidata engineers do not do, it is considering the use of the data and
 the workflows to improve the data and the quality."

 Am I right in understanding you to say that until the data sees the light
 of day it will not become of high quality?

 Thanks,
 Lila

 On Fri, Nov 27, 2015 at 10:26 AM, Gerard Meijssen <gerard.meijssen(a)gmail.com> wrote:

 Hoi,
 When a benefit is "Wikimedia specific" and thereby dismissed, you miss much
 of what is going on. Exactly because of this link most items are well
 defined as to what they are about. It is not perfect but it is good.
 Consequently Wikidata is able to link Wikipedia in any language to sources
 external to Wikipedia. This is a big improvement over linking external
 sources to a Wikipedia. The disambiguation of subjects is done at the
 Wikidata end.

 You make Wikidata to be a "default reference source". Given its current
 state, it is a bit much. Wikidata does not have the maturity to function as
 such. The best pointer to this fact is that 50% of all items have two or
 fewer statements.

 When you compare the quality of Wikipedias with what en.wp used to be you
 are comparing apples and oranges. The Myanmar Wikipedia is better informed
 on Myanmar than en.wp etc.

 When you qualify a Wikipedia as fascist, it does not follow that the data
 is suspect. Certainly when data in a source that you so easily dismiss is
 typically the same, there is not much meaning in what you say from a
 Wikidata point of view.

 I am thrilled that sources are so important to the Wikimedia movement and
 again, I am wondering what you hope to achieve by this pronouncement. Be
 realistic: what is it that you want to achieve? Is quality important to you,
 how do you define it and, more importantly, how do you want to achieve
 it? Have you seen the statistics on sources [1]? Then have a better look
 and you will find that real sources are mostly absent. Adding sources one
 statement at a time will not significantly improve quality because that is
 a numbers game and it is easier to achieve quality in a different way.

 When a librarian says that many sources copy each other's data and that this
 is a problem, the bigger problem is missed. The bigger problem is not where
 they agree but where they disagree. Arguably those are the statements where
 quality is more likely an issue. Now ask your librarian what is likely to
 improve Wikidata more: finding Sources for the statements that differ or
 finding Sources where the statements agree. Wikidata is not authoritative but
 when our community starts researching such issues both Wikidata and other
 sources will rapidly improve their quality. This is not to say that in the
 end you do not want Sources both where sources agree and where they disagree.

 Then ask your librarian if there is a problem with missing data. We can
 import data from sources and consequently be more informative or we do not
 import more data and people have to magically combine information that
 exists in many sources to get a composite view. We could see Wikidata as a
 place where data is combined and compared with other sources. Do tell your
 librarian that the process mentioned above should be iterative and it will
 be easily understood that comparing with just one additional source will
 improve the focus on likely issues even more.

 PS What does your librarian think when she knows that the Dutch National
 Library is inclined to provide us with software so that books can be
 ordered at Dutch libraries from Wikidata data (and by inference from
 Wikipedias)?

 When some see Wikidata as a source of reference, they will increasingly be
 served a better product. At this moment it is not good at all.

 When German Wikimedians have concerns about quality, WONDERFUL, but what have
 they done to improve things? Do they apply Wikipedia standards and how does
 that help?

 You wonder why have "bad" data in the first place... Our data IS bad and
 there is not enough of it for it to be really useful. We can easily add
 more data and have a more useful result. We can easily compare sources and
 ask people to concentrate on differences. However, you cannot tell me to
 add Sources to the data that I add. I will tell you to do it yourself. I am
 happy to improve on quality but on my terms, not yours.

 You mention the propagation of errors. How would that work? You indicate
 that there are not enough people to fix all the issues. With bots like
 Kian, we have probability in adding data. We have people add data where the
 software is not certain. You doubt technology but you do not know where we
 are, what is already done.

 In short my feeling is that you do not know what you are talking about.
 There is real scholarship in the approach that I described. My take is in
 applying set theory. Kian is AI. For all I care yours is FUD.

 Your notion of accountability is one of a consumer, it is not the
 accountability needed for a project that is immature and is not at all at a
 stage where you should imply that it is good enough and that quality is
 assured. There are domains in Wikidata that I will not touch because in my
 opinion it is wrong in its principles. At the same time I know that it can
 be fixed in time and leave it at that.

 I disagree with Heather Ford and Mark Graham. As long as Wikidata does not
 have the power of a Reasonator, the data is just that. It does not make
 itself in information and consequently it is awful. When there is one thing
 the Wikidata engineers do not do, it is considering the use of the data and
 the workflows to improve the data and the quality.

 The data needs to be CC-0 because it is how we ensure that everybody will
 be happy and willing to participate. As more participation happens and more
 collaboration occurs we will see Wikidata increase in the amount of data
 that it holds and at the same time we will see quality improve.

 Yes, Wikidata could do more in the way of adding sources to data. As long
 as the "primary sources tool" does not add the sources it knows, what do
 you expect from anybody else?
 Thanks,
      GerardM

 [1] https://tools.wmflabs.org/wikidata-todo/stats.php?reverse

 On 27 November 2015 at 12:08, Andreas Kolbe <jayen466(a)gmail.com> wrote:
 >
 > Gerard,
 >
 > On Tue, Nov 24, 2015 at 7:15 AM, Gerard Meijssen <gerard.meijssen(a)gmail.com> wrote:
 >
 > > Hoi,
 > > To start off, results from the past are no indication of results in the
 > > future. It is the disclaimer insurance companies have to state in all
 > > their adverts in the Netherlands. When you continue and make it a
 > > "theological" issue, you lose me because I am not of this faith, far
 > > from it. Wikidata is its own project and it is utterly dissimilar from
 > > Wikipedia. To start off, Wikidata has been a certified success from the
 > > start. The improvement it brought by bringing all interwiki links
 > > together is enormous. That alone should be a pointer that Wikipedia
 > > think is not realistic.
 > >
 >
 >
 > These benefits are internal to Wikimedia and a completely separate issue
 > from third-party re-use of Wikidata content as a default reference source,
 > which is the issue of concern here.

 > > To continue, people have been importing data into Wikidata from the
 > > start. They are the statements you know, and it was possible to import
 > > them from Wikipedia because of these interwiki links. So when you call
 > > for sources, it is fairly safe to assume that those imports are
 > > supported by the quality of the statements of the Wikipedias
 >
 >
 >
 > The quality of three-quarters of the 280+ Wikipedia language versions is
 > about at the level the English Wikipedia had reached in 2002.
 >
 > Even some of the larger Wikipedias have significant problems. The Kazakh
 > Wikipedia for example is controlled by functionaries of an oppressive
 > regime[1], and the Croatian one is reportedly[2] controlled by fascists
 > rewriting history (unless things have improved markedly in the Croatian
 > Wikipedia since that report, which would be news to me). The Azerbaijani
 > Wikipedia seems to have problems as well.
 >
 > The Wikimedia movement has always had an important principle: that all
 > content should be traceable to a "reliable source". Throughout the first
 > decade of this movement and beyond, Wikimedia content has never been
 > considered a reliable source. For example, you can't use a Wikipedia
 > article as a reference in another Wikipedia article.

 > Another important principle has been the disclaimer: pointing out to people
 > that the data is anonymously crowdsourced, and that there is no guarantee
 > of reliability or fitness for use.

 > Both of these principles are now being jettisoned.
 >
 > Wikipedia content is considered a reliable source in Wikidata, and Wikidata
 > content is used as a reliable source by Google, where it appears without
 > any indication of its provenance. This is a reflection of the fact that
 > Wikidata, unlike Wikipedia, comes with a CC0 licence. That decision was, I
 > understand, made by Denny, who is both a Google employee and a WMF board
 > member.
 >
 > The benefit to Google is very clear: this free, unattributed content adds
 > value to Google's search engine result pages, and improves Google's revenue
 > (currently running at about $10 million an hour, much of it from ads).
 >
 > But what is the benefit to the end user? The end user gets information of
 > undisclosed provenance, which is presented to them as authoritative, even
 > though it may be compromised. In what sense is that an improvement for
 > society?

 > To me, the ongoing information revolution is like the 19th century
 > industrial revolution done over. It created whole new categories of abuse,
 > which it took a century to (partly) eliminate. But first, capitalists had a
 > field day, and the people who were screwed were the common folk. Could we
 > not try to learn from history?

 > > and if anything, that is also where
 > > they typically fail because many assumptions at Wikipedia are plain wrong
 > > at Wikidata. For instance a listed building is not the organisation the
 > > building is known for. At Wikidata they each need their own item and
 > > associated statements.
 >
 > > Wikidata is already a success for other reasons. VIAF no longer links to
 > > Wikipedia but to Wikidata. The biggest benefit of this move is for people
 > > who are not interested in English. Because of this change VIAF links
 > > through Wikidata to all Wikipedias, not only en.wp. Consequently people
 > > may find through VIAF Wikipedia articles in their own language through
 > > their library systems.
 > >
 >
 >
 > At the recent Wikiconference USA, a Wikimedia veteran and professional
 > librarian expressed the view to me that
 >
 > * circular referencing between VIAF and Wikidata will create a humongous
 > muddle that nobody will be able to sort out again afterwards, because –
 > unlike wiki mishaps in other topic areas – here it's the most authoritative
 > sources that are being corrupted by circular referencing;
 >
 > * third parties are using Wikimedia content as a *reference standard* when
 > that was never the intention (see above).
 >
 > I've seen German Wikimedians express concerns that quality assurance
 > standards have dropped alarmingly since the project began, with bot  users
  > mass-importing unreliable data.
 >
 >
 >
 > > So do not forget about Wikipedia and the lessons learned. These lessons
 > > are important to Wikipedia. However, they do not necessarily apply to
 > > Wikidata, particularly when you approach Wikidata as an opportunity to do
 > > things in a different way. Set theory, a branch of mathematics, is
 > > exactly what we need. When we have data at Wikidata of a given quality,
 > > e.g. 90%, and we have data at another source with a given quality, e.g.
 > > 90%, we can compare the two and find a subset where the two sources do
 > > not match. When we curate the differences, it is highly likely that we
 > > improve quality at Wikidata or at the other source.

 > This sounds like "Let's do it quick and dirty and worry about the problems
 > later".

 > I sometimes get the feeling software engineers just love a programming
 > challenge, because that's where they can hone and display their skills.
 > Dirty data is one of those challenges: all the clever things one can do to
 > clean up the data! There is tremendous optimism about what can be done. But
 > why have bad data in the first place, starting with rubbish and then
 > proving that it can be cleaned up a bit using clever software?
 >
 > The effort will make the engineer look good, sure, but there will always be
 > collateral damage as errors propagate before they are fixed. The engineer's
 > eyes are not typically on the content, but on their software. The content
 > their bots and programs manipulate at times seems almost incidental,
 > something for "others" to worry about – "others" who don't necessarily
 > exist in sufficient numbers to ensure quality.
 >
 > In short, my feeling is that the engineering enthusiasm and expertise
 > applied to Wikidata aren't balanced by a similar level of commitment to
 > scholarship in generating the data, and getting them right first time.
 >
 > We've seen where that approach can lead with Wikipedia. Wikipedia hoaxes
 > and falsehoods find their way into the blogosphere, the media, even the
 > academic literature. The stakes with Wikidata are potentially much higher,
 > because I fear errors in Wikidata stand a good chance of being massively
 > propagated by Google's present and future automated information delivery
 > mechanisms, which are completely opaque. Most internet users aren't even
 > aware to what extent the Google Knowledge Graph relies on anonymously
 > compiled, crowdsourced data; they will just assume that if Google says it,
 > it must be true.
 >
 > In addition to honest mistakes, transcription errors, outdated info etc.,
 > the whole thing is a propagandist's wet dream. Anonymous accounts!
 > Guaranteed identity protection! Plausible deniability! No legal liability!
 > Automated import and dissemination without human oversight! Massive impact
 > on public opinion![3]

 > If information is power, then this provides the best chance of a power grab
 > humanity has seen since the invention of the newspaper. In the media
 > landscape, you at least have right-wing, centrist and left-wing
 > publications each presenting their version of the truth, and you know who's
 > publishing what and what agenda they follow. You can pick and choose,
 > compare and contrast, read between the lines. We won't have that online.
 > Wikimedia-fuelled search engines like Google and Bing dominate the
 > information supply.
 >
 > The right to enjoy a pluralist media landscape, populated by players who
 > are accountable to the public, was hard won in centuries past. Some
 > countries still don't enjoy that luxury today. Are we now blithely giving
 > it away, in the name of progress, and for the greater glory of technocrats?
 >
 > I don't trust the way this is going. I see a distinct possibility that
 > we'll end up with false information in Wikidata (or, rather, the Google
 > Knowledge Graph) being used to "correct" accurate information in other
 > sources, just because the Google/Wikidata content is ubiquitous. If you
 > build circular referencing loops fuelled by spurious data, you don't
 > provide access to knowledge, you destroy it. A lie told often enough  etc.

 > To quote Heather Ford and Mark Graham, "We know that the engineers and
 > developers, volunteers and passionate technologists are often trying to do
 > their best in difficult circumstances. But there need to be better attempts
 > by people working on these platforms to explain how decisions are made
 > about what is represented. These may just look like unimportant lines of
 > code in some system somewhere, but they have a very real impact on the
 > identities and futures of people who are often far removed from the
 > conversations happening among engineers."

 > I agree with that. The "what" should be more important than the "how", and
 > at present it doesn't seem to be.

 > It's well worth thinking about, and having a debate about what can be done
 > to prevent the worst from happening.

 > In particular, I would like to see the decision to publish Wikidata under a
 > CC0 licence revisited. The public should know where the data it gets comes
 > from; that's a basic issue of transparency.

 > Andreas
 >
 > [1] https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2015-10-07/Op-ed
 > [2] http://www.dailydot.com/politics/croatian-wikipedia-fascist-takeover-contro…
 > [3] http://www.politico.com/magazine/story/2015/08/how-google-could-rig-the-201…

_______________________________________________
 Wikimedia-l mailing list, guidelines at:
 https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines
 Wikimedia-l(a)lists.wikimedia.org
 Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l,
 <mailto:wikimedia-l-request@lists.wikimedia.org?subject=unsubscribe>