Hi Erik,
Really meaty post. Great stuff. Comments below.
On Tue, Oct 10, 2017 at 1:44 AM, Erik Moeller <eloquence(a)gmail.com> wrote:
On Sat, Oct 7, 2017 at 1:00 PM, Andreas Kolbe
<jayen466(a)gmail.com> wrote:
... and it will all become one free mush everyone copies to make a buck. We
are already in a situation today where anyone asking Siri, the Amazon Echo,
Google or Bing about a topic is likely to get the same answer from all of
them, because they all import Wikimedia content, which comes free of charge.
I wouldn't call information from Wikimedia projects a "mush", but I
think it's a good term for the proprietary amalgamation of information
and data from many sources, often without any regard for the
reliability of the source.
In my view, whether it's a mush or not largely depends on how it is used,
and to what extent it mixes solid and flaky, verifiable and non-verifiable
content.
Wikidata has its own problems in that regard that have triggered ongoing
discussions and concerns on the English Wikipedia.[1] Wikidata does not
require users to cite sources. It contains millions of statements sourced
only to some Wikipedia language version, without identification of the
article, article version, or source originally cited in that Wikipedia (if
any) at the time of import. It lacks effective Verifiability and BLP
policies.
Google is the king of such gooey
amalgamation. Its home assistant has been known to give answers like
this, sourced to "secretsofthefed.com":
"According to details exposed in Western Center for Journalism's
exclusive video, not only could Obama be in be in bed with the
communist Chinese, but Obama may in fact be planning a
communist coup d'état at the end of his term in 2016."
See, e.g., this article
https://theoutline.com/post/1192/google-s-featured-snippets-are-worse-than-fake-news
for other egregious examples specifically from Google's featured responses.
Thanks for the link to that article. Really important. I'm in complete
agreement with you on that.
For Google, I suggest a query like "when was slavery abolished?"
followed by exploring the auto-suggested questions. In my case, the
first 10 questions point to snippets from:
- pbs.org (twice)
- USA Today
- Reuters
- archives.gov
- Wikipedia (twice)
- infoplease.com
- ourdocuments.gov
- nationalarchives.gov.uk
Being on the other side of the pond, I got slightly different results. Here
they are, just for fun: Wikipedia is in the answer box, and 4 of the first
10 suggested questions link to Wikipedia:
– makewav.es
– Reuters
– archives.gov
– Wikipedia
– nationalarchives.gov.uk
– Wikipedia
– abolition.e2bn.org
– Wikipedia
– USA Today
– Wikipedia
(The 11th linked to Wikibooks.)
It's the universe of linked open data
(Wikipedia/Wikidata,
OpenStreetMap, and other open datasets) that keeps the space at least
somewhat competitive, by giving players without much of a foothold a
starting point from which to build. If Wikimedia did not exist, a
smaller number of commercial players would wield greater power, due to
the higher relative payoff of large investments in data mining and AI.
Yes, arguably so, although there remain various ways in which Wikimedia might
become a victim of its own success, depending on how ubiquitous its content
becomes. The more ubiquitous it is, the higher the stakes, and the greater
the pressure on volunteers will become.
I find that
worrying, because as an information delivery system,
it’s not robust. You change one source, and all the other sources
change as well.
As noted above, this is not actually what is happening. Commercial
players don't want to limit themselves to free/open data; they want to
use AI to extract as much information about the world as possible so
they can answer as many queries as possible.
To the far-from-negligible extent that they all do and will regurgitate
Wikimedia content, it will happen.
By the same token, their drawing on alternative sources as well as
Wikimedia content, even proprietary ones, is also potentially a good thing.
It increases diversity.
And for most of the sources amalgamated in this
manner, if provenance
is indicated at all, we don't find any of the safeguards we have for
Wikimedia content (revisioning, participatory decision-making,
transparent policies, etc.). Editability, while opening the floodgate
to a category of problems other sources don't have, is in fact also a
safeguard: making it possible to fix mistakes instead of going through
a "feedback" form that ends up who knows where.
Indeed, but it helps if re-users indicate provenance. If a digital voice
assistant propagates a Wikimedia mistake without telling users where it got
its information from, then there is not even a feedback form. Editability
is of no help at all if people can't find the source.
This is similar to the problem of vandalised Wikidata descriptions being
displayed in Wikipedia mobile views: people can't figure out where the
nonsense comes from, and where to change it.
With an eye to 2030 and WMF's long-term direction,
I do think it's
worth thinking about Wikidata's centrality, and I would agree with you
at least that the phrase "the essential infrastructure of the
ecosystem" does overstate what I think WMF should aspire to (the
"essential infrastructure" should consist of many open components
maintained by different groups). But beyond that I think you're
reading stuff into the statement that isn't there.
I'm not sure I read much more into it than that – you've summarised my main
concern.
Wikidata in particular is best seen not as the
singular source of
truth, but as an important hub in a network of open data providers --
primarily governments, public institutions, nonprofits. This is
consistent with recent developments around Wikidata such as query
federation.
As Wikidata imports all that material, however, there is a risk that
re-users (answer engines, digital assistants) will simply focus on mining
Wikidata.
There is also the risk of circular relationships – Wikidata importing
content from databases that in turn import some of their own content from
Wikidata. You can end up with databases all agreeing with each other, and
all being wrong. :/
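For anyone unfamiliar with the federation Erik mentions: conceptually, it
means a single query that joins Wikidata's data with data held at an external
SPARQL endpoint. A minimal sketch in Python, assuming a hypothetical external
endpoint and property URI (the real query service, as far as I know, only
federates with allow-listed endpoints, so this exact query would not run):

    # Sketch only: joining Wikidata with a (hypothetical) external SPARQL
    # endpoint via a federated SERVICE clause, then reading the JSON results.
    import requests

    QUERY = """
    SELECT ?country ?countryLabel ?externalPopulation WHERE {
      ?country wdt:P31 wd:Q6256 .                  # instance of: country
      SERVICE <https://example.org/sparql> {       # placeholder endpoint
        ?country <https://example.org/prop/population> ?externalPopulation .
      }
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 10
    """

    r = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "federation-sketch/0.1 (mailing list example)"},
    )
    for row in r.json()["results"]["bindings"]:
        print(row["countryLabel"]["value"],
              row.get("externalPopulation", {}).get("value"))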
Wikidata will often provide a shallow first level of
information about
a subject, while other linked sources provide deeper information. The
more structured the information, the easier it becomes to validate in
an automatic fashion that, for example, the subset of country
population time series data represented in Wikidata is an accurate
representation of the source material. Even when a large source
dataset is mirrored by Wikimedia (for low-latency visualization, say),
you can hash it, digitally sign it, and restrict modifiability of
copies.
Interesting, though I'm not aware of that being done at present.
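The mechanics would be simple enough, though. A minimal sketch, assuming a
hypothetical mirrored CSV and a digest the source institution has published
(and, ideally, signed):

    # Sketch: verify that a mirrored dataset still matches the source's
    # published SHA-256 digest. File name and digest are hypothetical.
    import hashlib

    def sha256_of_file(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                h.update(chunk)
        return h.hexdigest()

    published_digest = "0123abcd..."  # digest the source would publish/sign
    mirror_digest = sha256_of_file("country_populations_mirror.csv")
    print("mirror matches source" if mirror_digest == published_digest
          else "mirror has drifted from source")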
If we expose the history, provenance and structure of
information, and
the connections between sources, we can actually make the information
more resilient against manipulation than if it is merely a piece of
text in an article, some number in an {{infobox}} template or some
"factoid" in a proprietary knowledge graph.
Yes, provenance – traceability – is key. But as things stand, I have seen
no evidence that WMF has a strong desire to encourage or force re-users to
provide it.
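The raw material for such traceability does already exist on the Wikidata
side; what is missing is re-users surfacing it. A small sketch of pulling the
references attached to a statement from the public API (Q42 / P569, Douglas
Adams' date of birth, is just an arbitrary example):

    # Sketch: list the references attached to a Wikidata statement -- the
    # kind of provenance an answer engine could show next to an instant answer.
    import requests

    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={"action": "wbgetclaims", "entity": "Q42",
                "property": "P569", "format": "json"},
        headers={"User-Agent": "provenance-sketch/0.1 (mailing list example)"},
    )
    for claim in resp.json().get("claims", {}).get("P569", []):
        refs = claim.get("references", [])
        print("statement", claim["id"], "has", len(refs), "reference(s)")
        for ref in refs:
            # each reference is a group of snaks, e.g. P248 = "stated in"
            print("  reference uses properties:", list(ref.get("snaks", {}).keys()))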
is it just that some of the world's most profitable companies earn billions
from volunteers' work, gaining political power in the process, while
volunteers actually pay to go online and access or purchase the sources
they need to do their work? Yes or no?
I don't accept your framing. Search the way it used to be (with
algorithms primarily tuned for relevance of results) was a fair deal
for everyone involved: you put stuff on the web, it gets indexed and
people are able to find it; the search engines make money by putting
ads on the search result page. The amalgamation of information into
knowledge graphs that deliver concise answers directly (however
inadequate) changes the dynamic significantly.
It accords ever greater power to the maintainers of these proprietary
graphs which, I hasten to repeat, incorporate information well beyond
just Wikimedia's, and which frequently fail to indicate provenance in
an adequate manner. And, as the example at the beginning of this
message shows, it leads to "information pollution", with fake news,
conspiracy theories and pseudoscience leaking into semi-authoritative
instant answers.
I don't think the social justice problem here is that these companies
make a profit, but that they function more and more as gatekeepers and
curators of knowledge, a role for which they're ill-equipped and which
civil society should be reluctant to give them.
I'm in violent agreement with you on that one. :)
But the proprietary knowledge graphs are valuable to
users in ways
that the previous generation of search engines was not. Interacting
with a device like you would with a human being ("Alexa/Google/Siri,
is yarrow edible?") makes knowledge more accessible and usable,
including to people who have difficulty reading long texts, or who are
not literate at all. In this sense I don't think WMF should ever find
itself in the position to argue _against_ inclusion of information
from Wikimedia projects in these applications.
There is a distinct likelihood that they will make reading Wikipedia articles
progressively obsolete, just as the ease of Googling has dissuaded many people
from sitting down and reading a book. All the more so if these applications
fail to make their users aware that the information comes from Wikimedia
projects.
The applications themselves are not the problem; the
centralized
gatekeeper control is. Knowledge as an open service (and network) is
actually the solution to that root problem. It's how we weaken and
perhaps even break the control of the gatekeepers. Your critique seems
to boil down to "Let's ask Google for more crumbs". In spite of all
your anti-corporate social justice rhetoric, that seems to be the path
to developing a one-sided dependency relationship.
I considered that, but in the end felt that given the extent to which
Google profited from volunteers' work, it wasn't an unfair ask.
To be clear, I'm in favor of corporations giving
more to the commons,
though in my ideal world, that would happen through aggressive
taxation and greater public investment (especially in schools,
universities and GLAMs). I have every confidence that WMF does in fact
ask for as much as it can be expected to in conversations with
corporations, but it's not clear what you're suggesting should happen
if the corporations say no.
Publicise the fact that Google and others profit from volunteers' work while
giving very little back. The world could do with more articles like this:
https://www.washingtonpost.com/news/the-intersect/wp/2015/07/22/you-dont-kn…
I once did a very rough back-of-an-envelope calculation based on Google's
staggering quarterly profits and its heavy reliance on Wikimedia content for
many of the search features that drive its ad revenue. I estimated that the
average Wikipedia edit (in any namespace) brings Google something in the
order of 10 cents of revenue.[2]
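The shape of that calculation was nothing more elaborate than a division;
the placeholder figures in the comment below are chosen purely to illustrate
how a result of around 10 cents can arise, and are not the numbers used
in [2]:

    # Purely illustrative; all three inputs are hypothetical placeholders.
    def revenue_per_edit(revenue_usd, share_attributable_to_wikimedia, total_edits):
        return revenue_usd * share_attributable_to_wikimedia / total_edits

    # e.g. revenue_per_edit(1e11, 0.001, 1e9) == 0.10 (USD per edit)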
Again, thanks for an engaging and thought-provoking post.
Best,
Andreas
[1]
https://en.wikipedia.org/wiki/Wikipedia_talk:Wikidata/2017_State_of_affairs
(including its copious archives)
[2]
https://en.wikipedia.org/wiki/Wikipedia_talk:Wikipedia_Signpost/2015-07-22/…