Gerard,
On Tue, Nov 24, 2015 at 7:15 AM, Gerard Meijssen <gerard.meijssen(a)gmail.com>
wrote:
> Hoi,
> To start off, results from the past are no indication of results in the
> future. It is the disclaimer insurance companies have to state in all
> their adverts in the Netherlands. When you continue and make it a
> "theological" issue, you lose me, because I am not of this faith, far from
> it. Wikidata is its own project, and it is utterly dissimilar from
> Wikipedia. To start off, Wikidata has been a certified success from the
> start. The improvement it brought by bringing all interwiki links together
> is enormous. That alone should be a pointer that Wikipedia-think is not
> realistic.
>
These benefits are internal to Wikimedia and a completely separate issue
from third-party re-use of Wikidata content as a default reference source,
which is the issue of concern here.
> To continue, people have been importing data into Wikidata from the start.
> They are the statements you know, and it was possible to import them from
> Wikipedia because of these interwiki links. So when you call for sources,
> it is fairly safe to assume that those imports are supported by the quality
> of the statements of the Wikipedias.
The quality of three-quarters of the 280+ Wikipedia language versions is
roughly at the level the English Wikipedia had reached in 2002.
Even some of the larger Wikipedias have significant problems. The Kazakh
Wikipedia for example is controlled by functionaries of an oppressive
regime[1], and the Croatian one is reportedly[2] controlled by fascists
rewriting history (unless things have improved markedly in the Croatian
Wikipedia since that report, which would be news to me). The Azerbaijani
Wikipedia seems to have problems as well.
The Wikimedia movement has always had an important principle: that all
content should be traceable to a "reliable source". Throughout the first
decade of this movement and beyond, Wikimedia content has never been
considered a reliable source. For example, you can't use a Wikipedia
article as a reference in another Wikipedia article.
Another important principle has been the disclaimer: pointing out to people
that the data is anonymously crowdsourced, and that there is no guarantee
of reliability or fitness for use.
Both of these principles are now being jettisoned.
Wikipedia content is considered a reliable source in Wikidata, and Wikidata
content is used as a reliable source by Google, where it appears without
any indication of its provenance. This is a reflection of the fact that
Wikidata, unlike Wikipedia, comes with a CC0 licence. That decision was, I
understand, made by Denny, who is both a Google employee and a WMF board
member.
The benefit to Google is very clear: this free, unattributed content adds
value to Google's search engine result pages, and improves Google's revenue
(currently running at about $10 million an hour, much of it from ads).
But what is the benefit to the end user? The end user gets information of
undisclosed provenance, which is presented to them as authoritative, even
though it may be compromised. In what sense is that an improvement for
society?
To me, the ongoing information revolution is like the 19th century
industrial revolution done over. It created whole new categories of abuse,
which it took a century to (partly) eliminate. But first, capitalists had a
field day, and the people who were screwed were the common folk. Could we
not try to learn from history?
> and if anything, that is also where
> they typically fail, because many assumptions at Wikipedia are plain wrong
> at Wikidata. For instance, a listed building is not the organisation the
> building is known for. At Wikidata they each need their own item and
> associated statements.
>
> Wikidata is already a success for other reasons. VIAF no longer links to
> Wikipedia but to Wikidata. The biggest benefit of this move is for people
> who are not interested in English. Because of this change, VIAF links
> through Wikidata to all Wikipedias, not only en.wp. Consequently, people
> may find Wikipedia articles in their own language through VIAF via their
> library systems.
>
At the recent Wikiconference USA, a Wikimedia veteran and professional
librarian expressed the view to me that
* circular referencing between VIAF and Wikidata will create a humongous
muddle that nobody will be able to sort out again afterwards, because –
unlike wiki mishaps in other topic areas – here it's the most authoritative
sources that are being corrupted by circular referencing;
* third parties are using Wikimedia content as a *reference standard* when
that was never the intention (see above).
I've seen German Wikimedians express concerns that quality assurance
standards have dropped alarmingly since the project began, with bot users
mass-importing unreliable data.
> So do not forget about Wikipedia and the lessons learned. These lessons are
> important to Wikipedia. However, they do not necessarily apply to Wikidata,
> particularly when you approach Wikidata as an opportunity to do things in a
> different way. Set theory, a branch of mathematics, is exactly what we
> need. When we have data at Wikidata of a given quality, e.g. 90%, and we
> have data at another source with a given quality, e.g. 90%, we can compare
> the two and find the subset where the two sources do not match. When we
> curate the differences, it is highly likely that we improve quality at
> Wikidata or at the other source.
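Mechanically, what is proposed here amounts to a set difference over
statements. A toy sketch in Python, with made-up values, just to make the
idea concrete:

# Each source viewed as a set of (item, property, value) statements.
wikidata = {("Q64", "P1082", "3500000"), ("Q64", "P17", "Q183")}
other_source = {("Q64", "P1082", "3469849"), ("Q64", "P17", "Q183")}

# Statements on which the two sources disagree: the candidates that,
# in this view, a human curator would then resolve.
mismatches = wikidata.symmetric_difference(other_source)
print(mismatches)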
This sounds like "Let's do it quick and dirty and worry about the problems
later".
I sometimes get the feeling software engineers just love a programming
challenge, because that's where they can hone and display their skills.
Dirty data is one of those challenges: all the clever things one can do to
clean up the data! There is tremendous optimism about what can be done. But
why have bad data in the first place, starting with rubbish and then
proving that it can be cleaned up a bit using clever software?
The effort will make the engineer look good, sure, but there will always be
collateral damage as errors propagate before they are fixed. The engineer's
eyes are not typically on the content, but on their software. The content
their bots and programs manipulate at times seems almost incidental,
something for "others" to worry about – "others" who don't necessarily
exist in sufficient numbers to ensure quality.
In short, my feeling is that the engineering enthusiasm and expertise
applied to Wikidata aren't balanced by a similar level of commitment to
scholarship in generating the data, and getting them right first time.
We've seen where that approach can lead with Wikipedia. Wikipedia hoaxes
and falsehoods find their way into the blogosphere, the media, even the
academic literature. The stakes with Wikidata are potentially much higher,
because I fear errors in Wikidata stand a good chance of being massively
propagated by Google's present and future automated information delivery
mechanisms, which are completely opaque. Most internet users aren't even
aware to what extent the Google Knowledge Graph relies on anonymously
compiled, crowdsourced data; they will just assume that if Google says it,
it must be true.
In addition to honest mistakes, transcription errors, outdated info etc.,
the whole thing is a propagandist's wet dream. Anonymous accounts!
Guaranteed identity protection! Plausible deniability! No legal liability!
Automated import and dissemination without human oversight! Massive impact
on public opinion![3]
If information is power, then this provides the best chance of a power grab
humanity has seen since the invention of the newspaper. In the media
landscape, you at least have right-wing, centrist and left-wing
publications each presenting their version of the truth, and you know who's
publishing what and what agenda they follow. You can pick and choose,
compare and contrast, read between the lines. We won't have that online.
Wikimedia-fuelled search engines like Google and Bing dominate the
information supply.
The right to enjoy a pluralist media landscape, populated by players who
are accountable to the public, was hard won in centuries past. Some
countries still don't enjoy that luxury today. Are we now blithely giving
it away, in the name of progress, and for the greater glory of technocrats?
I don't trust the way this is going. I see a distinct possibility that
we'll end up with false information in Wikidata (or, rather, the Google
Knowledge Graph) being used to "correct" accurate information in other
sources, just because the Google/Wikidata content is ubiquitous. If you
build circular referencing loops fuelled by spurious data, you don't
provide access to knowledge, you destroy it. A lie told often enough etc.
To quote Heather Ford and Mark Graham, "We know that the engineers and
developers, volunteers and passionate technologists are often trying to do
their best in difficult circumstances. But there need to be better attempts
by people working on these platforms to explain how decisions are made
about what is represented. These may just look like unimportant lines of
code in some system somewhere, but they have a very real impact on the
identities and futures of people who are often far removed from the
conversations happening among engineers."
I agree with that. The "what" should be more important than the "how", and
at present it doesn't seem to be.
It's well worth thinking about, and having a debate about what can be done
to prevent the worst from happening.
In particular, I would like to see the decision to publish Wikidata under a
CC0 licence revisited. The public should know where the data it gets comes
from; that's a basic issue of transparency.
Andreas
[1]
https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2015-10-07/Op-ed
[2]
http://www.dailydot.com/politics/croatian-wikipedia-fascist-takeover-contro…
[3]
http://www.politico.com/magazine/story/2015/08/how-google-could-rig-the-201…
In another thread, we are discussing the preponderance of problematic
merges of gene/protein items. One of the hypotheses raised to explain the
volume and nature of these merges (which are often by fairly inexperienced
editors and/or people who seem to only do merges) was that they were
coming from the Wikidata Game. It seems to me that anything like the
Wikidata Game that has the potential to generate a very large volume of
edits - especially from new editors - ought to tag its contributions so
that they can easily be tracked by the system. It should be easy to answer
the question of whether an edit came from that game (or any of what I hope
to be many of its descendants). This would make it possible to debug what
could potentially be large swathes of problems, and would make it
straightforward to 'reward' game and tool developers with system-level
information about the volume of edits they have enabled (as opposed to
their own tracking data).
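For illustration, the standard MediaWiki APIs already support this via the
'tags' parameter on edit modules. A minimal sketch in Python - the tag name
'wikidata-game' is hypothetical and would first need to be registered as a
change tag on the wiki:

import requests

API = "https://www.wikidata.org/w/api.php"
session = requests.Session()

def tagged_edit(qid, data_json, csrf_token):
    """Make an entity edit carrying a change tag, so every edit the game
    generates can be filtered in recent changes and user contributions.

    Assumes the hypothetical 'wikidata-game' change tag is registered,
    and that the edit module accepts the standard 'tags' parameter.
    """
    resp = session.post(API, data={
        "action": "wbeditentity",
        "id": qid,
        "data": data_json,            # JSON string with the changes
        "tags": "wikidata-game",      # the trackable label
        "summary": "edit made via the game",
        "token": csrf_token,
        "format": "json",
    })
    return resp.json()

With edits labelled like this, "did this merge come from the game?" becomes
a simple recent-changes filter, and per-tool edit volumes fall out of the
tag statistics for free.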
Please don't misunderstand me. I am a big fan of the Wikidata Game, and I
am actually pushing for our group to make a bio-specific version of it that
will build on that code. I see a great potential here - but because of the
potential scale of edits this could quickly generate, we (the whole
wikidata community) need ways to keep an eye on what is going on.
-Ben
I'm creating an app that provides Wikidata info in slide-out menus on
top of Wikipedia pages. Here's a video of a prototype:
https://vimeo.com/146061825
Much of the app will be implemented as REST services in the cloud, and
one item of functionality required will be a REST service that returns
the Q id given a Wikipedia URL (in any language). Another REST service
required will return a Wikipedia URL given a Wikidata Q id and language
code (e.g. "en" or "pt-br").
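To make the two lookups concrete, here is a minimal sketch of both built
directly on the standard MediaWiki and Wikibase APIs (error handling and
the mapping from language code to site id, e.g. "en" -> "enwiki", are
omitted):

import requests
from urllib.parse import urlparse, unquote

def qid_from_wikipedia_url(url):
    """Resolve a Wikipedia article URL to its Wikidata Q id via pageprops."""
    parsed = urlparse(url)
    title = unquote(parsed.path.split("/wiki/", 1)[1])
    api = f"https://{parsed.netloc}/w/api.php"
    pages = requests.get(api, params={
        "action": "query", "prop": "pageprops",
        "ppprop": "wikibase_item", "titles": title, "format": "json",
    }).json()["query"]["pages"]
    page = next(iter(pages.values()))
    return page.get("pageprops", {}).get("wikibase_item")

def wikipedia_url_from_qid(qid, site="enwiki"):
    """Resolve a Q id to the article URL on one wiki via its sitelinks."""
    entity = requests.get("https://www.wikidata.org/w/api.php", params={
        "action": "wbgetentities", "ids": qid,
        "props": "sitelinks/urls", "sitefilter": site, "format": "json",
    }).json()["entities"][qid]
    link = entity.get("sitelinks", {}).get(site)
    return link["url"] if link else None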
Does anything like this currently exist?
Regards,
James Weaver
http://JavaFXpert.com
http://CulturedEar.com
The Gene Wiki people are hosting a tutorial on Wikidata in Cambridge, UK
next Monday [1]. In the interest of making the best tutorial in the least
amount of preparation time, I was wondering if anyone on the list had
content (slides, handouts, cheat sheets) that they had already used
successfully and might want to share? We are assembling the structure of
the 90-minute session in a Google doc [2]; feel free to chime in there!
And of course everything we generate for that will be available online as
soon as it exists.
cheers
-Ben
[1] http://www.swat4ls.org/workshops/cambridge2015/programme/tutorials/
[2]
https://docs.google.com/document/d/1dSgm90SbQBpHqEMa17t5zQL0PB2waIKD3LKTPPk…
Dear all,
I create Wikipedia articles using Wikidata properties. I have not had this
problem before. Today, when I created an article, I got this error:
"Lua error in Module:WikidataCoords at line 44: attempt to call field
'formatProperty' (a nil value)." (https://az.wikipedia.org/wiki/Aalen)
I checked my old articles, and this error appears on them as well
(https://az.wikipedia.org/wiki/Boxum). I did not insert a property number
for the values on which Lua gave the error. Can you please help me solve
this issue? Otherwise, my community will ban me from our Wikipedia for
creating a mess.
Thanks a lot.
--
Best regards,
Ali Ismayilov
Hello,
You may know ORES <https://www.wikidata.org/wiki/Wikidata:ORES>. We use
ORES to build anti-vandalism tools (learn more
<https://meta.wikimedia.org/wiki/ORES/What>). Based on automatic revert
detection we were able to build an MVP, and we have some high-quality
classifiers online that you can use (WD:ORES
<https://www.wikidata.org/wiki/Wikidata:ORES>).
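For example, scores can be fetched over plain HTTP. A minimal sketch - the
endpoint path and model name follow the current ORES conventions, so please
double-check them against the documentation linked above:

import requests

ORES = "https://ores.wikimedia.org/v3/scores"

def damaging_probability(rev_id, wiki="wikidatawiki"):
    """Return ORES's estimate that a revision is damaging (0.0 to 1.0)."""
    resp = requests.get(f"{ORES}/{wiki}/{rev_id}", params={"models": "damaging"})
    resp.raise_for_status()
    score = resp.json()[wiki]["scores"][str(rev_id)]["damaging"]["score"]
    return score["probability"]["true"]

# e.g. surface edits for human review when the model is suspicious
if damaging_probability(123456789) > 0.8:   # arbitrary example revision id
    print("flag for review")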
In order to improve the anti-vandalism classifier, we need you to go
through some edits and determine whether they are damaging to Wikidata,
and whether they are ill-intended edits or just newbie/honest mistakes.
This will help us distinguish between newbies and vandals, and will also
improve our data, making for a more precise and adequate vandalism-detection
classifier. Please go to Wikidata:Edit labels
<https://www.wikidata.org/wiki/Wikidata:Edit_labels>, install the gadget,
and do a workset.
Thanks
Best
Hey all,
I've created a very rough REST API for Wikidata and am looking for your
feedback.
* About this API: http://queryr.wmflabs.org
* Documentation: http://queryr.wmflabs.org/about/docs
* API root: http://queryr.wmflabs.org/api
At present this is purely a demo. The data it serves is stale and
potentially incomplete, the endpoints and formats they use are very much
liable to change, the server setup is not reliable and I'm not 100% sure
I'll continue with this little project.
The main thing I'm going for with this API, compared to the existing one, is
greater ease of use for common use cases. Several factors make this a lot
easier to do in a new API than in the existing one: no need to serve all
use cases, no need to retain compatibility with existing users, and no
framework-imposed restrictions. You can read more about the differences on
the website.
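As a taste of the intended ease of use, fetching an item might look like
this (the /items/{id} route is purely illustrative - see the documentation
above for the real endpoints, which, as said, are liable to change):

import requests

# Illustrative route shape only; consult http://queryr.wmflabs.org/about/docs
# for the actual endpoints.
resp = requests.get("http://queryr.wmflabs.org/api/items/Q1")
resp.raise_for_status()
print(resp.json())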
You are invited to comment on the concept and on the open questions
mentioned on the website.
Cheers
--
Jeroen De Dauw - http://www.bn2vs.com
Software craftsmanship advocate
~=[,,_,,]:3