Hi Erik,
Really meaty post. Great stuff. Comments below.
On Tue, Oct 10, 2017 at 1:44 AM, Erik Moeller <eloquence(a)gmail.com> wrote:
On Sat, Oct 7, 2017 at 1:00 PM, Andreas Kolbe
<jayen466(a)gmail.com> wrote:
... and it will all become one free mush everyone copies to make a buck. We
are already in a situation today where anyone asking Siri, the Amazon Echo,
Google or Bing about a topic is likely to get the same answer from all of
them, because they all import Wikimedia content, which comes free of charge.
I wouldn't call information from Wikimedia projects a "mush", but I
think it's a good term for the proprietary amalgamation of information
and data from many sources, often without any regard for the
reliability of the source.
In my view, whether it's a mush or not largely depends on how it is used,
and to what extent it mixes solid and flaky, verifiable and non-verifiable
content.
Wikidata has its own problems in that regard that have triggered ongoing
discussions and concerns on the English Wikipedia.[1] Wikidata does not
require users to cite sources. It contains millions of statements sourced
only to some Wikipedia language version, without identification of the
article, article version, or source originally cited in that Wikipedia (if
any) at the time of import. It lacks effective Verifiability and BLP
policies.
Google is the king of such gooey
amalgamation. Its home assistant has been known to give answers like
this, sourced to "secretsofthefed.com":
"According to details exposed in Western Center for Journalism's
exclusive video, not only could Obama be in be in bed with the
communist Chinese, but Obama may in fact be planning a
communist coup d'état at the end of his term in 2016."
See, e.g., this article
https://theoutline.com/post/1192/google-s-featured-snippets-are-worse-than-fake-news
for other egregious examples specifically from Google's featured responses.
Thanks for the link to that article. Really important. I'm in complete
agreement with you on that.
For Google, I suggest a query like "when was slavery abolished?"
followed by exploring the auto-suggested questions. In my case, the
first 10 questions point to snippets from:
- pbs.org (twice)
- USA Today
- Reuters
- archives.gov
- Wikipedia (twice)
- infoplease.com
- ourdocuments.gov
- nationalarchives.gov.uk
Being on the other side of the pond, I got slightly different results. Here
they are, just for fun: Wikipedia is in the answer box, and 4 of the first
10 suggested questions link to Wikipedia:
– makewav.es
– Reuters
– archives.gov
– Wikipedia
– nationalarchives.gov.uk
– Wikipedia
– abolition.e2bn.org
– Wikipedia
– USA Today
– Wikipedia
(The 11th linked to Wikibooks.)
It's the universe of linked open data
(Wikipedia/Wikidata,
OpenStreetMap, and other open datasets) that keeps the space at least
somewhat competitive, by giving players without much of a foothold a
starting point from which to build. If Wikimedia did not exist, a
smaller number of commercial players would wield greater power, due to
the higher relative payoff of large investments in data mining and AI.
Yes, arguably so, although there remain various ways in which Wikimedia might
become a victim of its own success, depending on how ubiquitous its content
becomes. The more ubiquitous it is, the higher the stakes, and the greater
the pressure on volunteers will become.
I find that
worrying, because as an information delivery system,
it’s not robust. You change one source, and all the other sources
change as well.
As noted above, this is not actually what is happening. Commercial
players don't want to limit themselves to free/open data; they want to
use AI to extract as much information about the world as possible so
they can answer as many queries as possible.
To the far-from-negligible extent that they all do and will regurgitate
Wikimedia content, it will happen.
By the same token, their drawing on alternative sources as well as
Wikimedia content, even proprietary ones, is also potentially a good thing.
It increases diversity.
And for most of the sources amalgamated in this
manner, if provenance
is indicated at all, we don't find any of the safeguards we have for
Wikimedia content (revisioning, participatory decision-making,
transparent policies, etc.). Editability, while opening the floodgate
to a category of problems other sources don't have, is in fact also a
safeguard: making it possible to fix mistakes instead of going through
a "feedback" form that ends up who knows where.
Indeed, but it helps if re-users indicate provenance. If a digital voice
assistant propagates a Wikimedia mistake without telling users where it got
its information from, then there is not even a feedback form. Editability
is of no help at all if people can't find the source.
This is similar to the problem of vandalised Wikidata descriptions being
displayed in Wikipedia mobile views: people can't figure out where the
nonsense comes from, and where to change it.
With an eye to 2030 and WMF's long-term direction,
I do think it's
worth thinking about Wikidata's centrality, and I would agree with you
at least that the phrase "the essential infrastructure of the
ecosystem" does overstate what I think WMF should aspire to (the
"essential infrastructure" should consist of many open components
maintained by different groups). But beyond that I think you're
reading stuff into the statement that isn't there.
I'm not sure I read much more into it than that – you've summarised my main
concern.
Wikidata in particular is best seen not as the
singular source of
truth, but as an important hub in a network of open data providers --
primarily governments, public institutions, nonprofits. This is
consistent with recent developments around Wikidata such as query
federation.
As Wikidata imports all that material, however, there is a risk that
re-users (answer engines, digital assistants) will simply focus on mining
Wikidata.
There is also the risk of circular relationships – Wikidata importing
content from databases that in turn import some of their own content from
Wikidata. You can end up with databases all agreeing with each other, and
all being wrong. :/
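For anyone unfamiliar with the federation Erik mentions: conceptually, it
means a single query that joins Wikidata's data with data held at an external
SPARQL endpoint. A minimal sketch in Python, assuming a hypothetical external
endpoint and property URI (the real query service, as far as I know, only
federates with allow-listed endpoints, so this exact query would not run):

    # Sketch only: joining Wikidata with a (hypothetical) external SPARQL
    # endpoint via a federated SERVICE clause, then reading the JSON results.
    import requests

    QUERY = """
    SELECT ?country ?countryLabel ?externalPopulation WHERE {
      ?country wdt:P31 wd:Q6256 .                  # instance of: country
      SERVICE <https://example.org/sparql> {       # placeholder endpoint
        ?country <https://example.org/prop/population> ?externalPopulation .
      }
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 10
    """

    r = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "federation-sketch/0.1 (mailing list example)"},
    )
    for row in r.json()["results"]["bindings"]:
        print(row["countryLabel"]["value"],
              row.get("externalPopulation", {}).get("value"))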
Wikidata will often provide a shallow first level of
information about
a subject, while other linked sources provide deeper information. The
more structured the information, the easier it becomes to validate in
an automatic fashion that, for example, the subset of country
population time series data represented in Wikidata is an accurate
representation of the source material. Even when a large source
dataset is mirrored by Wikimedia (for low-latency visualization, say),
you can hash it, digitally sign it, and restrict modifiability of
copies.
Interesting, though I'm not aware of that being done at present.
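The mechanics would be simple enough, though. A minimal sketch, assuming a
hypothetical mirrored CSV and a digest the source institution has published
(and, ideally, signed):

    # Sketch: verify that a mirrored dataset still matches the source's
    # published SHA-256 digest. File name and digest are hypothetical.
    import hashlib

    def sha256_of_file(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                h.update(chunk)
        return h.hexdigest()

    published_digest = "0123abcd..."  # digest the source would publish/sign
    mirror_digest = sha256_of_file("country_populations_mirror.csv")
    print("mirror matches source" if mirror_digest == published_digest
          else "mirror has drifted from source")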
If we expose the history, provenance and structure of
information, and
the connections between sources, we can actually make the information
more resilient against manipulation than if it is merely a piece of
text in an article, some number in an {{infobox}} template or some
"factoid" in a proprietary knowledge graph.
Yes, provenance – traceability – is key. But as things stand, I have seen
no evidence that WMF has a strong desire to encourage or force re-users to
provide it.
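The raw material for such traceability does already exist on the Wikidata
side; what is missing is re-users surfacing it. A small sketch of pulling the
references attached to a statement from the public API (Q42 / P569, Douglas
Adams' date of birth, is just an arbitrary example):

    # Sketch: list the references attached to a Wikidata statement -- the
    # kind of provenance an answer engine could show next to an instant answer.
    import requests

    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={"action": "wbgetclaims", "entity": "Q42",
                "property": "P569", "format": "json"},
        headers={"User-Agent": "provenance-sketch/0.1 (mailing list example)"},
    )
    for claim in resp.json().get("claims", {}).get("P569", []):
        refs = claim.get("references", [])
        print("statement", claim["id"], "has", len(refs), "reference(s)")
        for ref in refs:
            # each reference is a group of snaks, e.g. P248 = "stated in"
            print("  reference uses properties:", list(ref.get("snaks", {}).keys()))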
is it just that some of the world's most profitable companies earn billions
from volunteers' work, gaining political power in the process, while
volunteers actually pay to go online and access or purchase the sources
they need to do their work? Yes or no?
I don't accept your framing. Search the way it used to be (with
algorithms primarily tuned for relevance of results) was a fair deal
for everyone involved: you put stuff on the web, it gets indexed and
people are able to find it; the search engines make money by putting
ads on the search result page. The amalgamation of information into
knowledge graphs that deliver concise answers directly (however
inadequate) changes the dynamic significantly.
It accords ever greater power to the maintainers of these proprietary
graphs which, I hasten to repeat, incorporate information well beyond
just Wikimedia's, and which frequently fail to indicate provenance in
an adequate manner. And, as the example at the beginning of this
message shows, it leads to "information pollution", with fake news,
conspiracy theories and pseudoscience leaking into semi-authoritative
instant answers.
I don't think the social justice problem here is that these companies
make a profit, but that they function more and more as gatekeepers and
curators of knowledge, a role for which they're ill-equipped and which
civil society should be reluctant to give them.
I'm in violent agreement with you on that one. :)
But the proprietary knowledge graphs are valuable to
users in ways
that the previous generation of search engines was not. Interacting
with a device like you would with a human being ("Alexa/Google/Siri,
is yarrow edible?") makes knowledge more accessible and usable,
including to people who have difficulty reading long texts, or who are
not literate at all. In this sense I don't think WMF should ever find
itself in the position to argue _against_ inclusion of information
from Wikimedia projects in these applications.
There is a distinct likelihood that they will make reading Wikipedia articles
progressively obsolete, just as the ease of Googling has dissuaded many people
from sitting down and reading a book. All the more so if these applications
fail to make their users aware that the information comes from Wikimedia
projects.
The applications themselves are not the problem; the
centralized
gatekeeper control is. Knowledge as an open service (and network) is
actually the solution to that root problem. It's how we weaken and
perhaps even break the control of the gatekeepers. Your critique seems
to boil down to "Let's ask Google for more crumbs". In spite of all
your anti-corporate social justice rhetoric, that seems to be the path
to developing a one-sided dependency relationship.
I considered that, but in the end felt that given the extent to which
Google profited from volunteers' work, it wasn't an unfair ask.
To be clear, I'm in favor of corporations giving
more to the commons,
though in my ideal world, that would happen through aggressive
taxation and greater public investment (especially in schools,
universities and GLAMs). I have every confidence that WMF does in fact
ask for as much as it can be expected to in conversations with
corporations, but it's not clear what you're suggesting should happen
if the corporations say no.
Publicise the fact that Google and others profit from volunteers' work while
giving very little back. The world could do with more articles like this:
https://www.washingtonpost.com/news/the-intersect/wp/2015/07/22/you-dont-kn…
I once did a very rough back-of-an-envelope calculation based on Google's
staggering quarterly profits and its heavy reliance on Wikimedia content for
many of the search features that drive its ad revenue. I estimated that the
average Wikipedia edit (in any namespace) brings Google something in the
order of 10 cents of revenue.[2]
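The shape of that calculation was nothing more elaborate than a division;
the placeholder figures in the comment below are chosen purely to illustrate
how a result of around 10 cents can arise, and are not the numbers used
in [2]:

    # Purely illustrative; all three inputs are hypothetical placeholders.
    def revenue_per_edit(revenue_usd, share_attributable_to_wikimedia, total_edits):
        return revenue_usd * share_attributable_to_wikimedia / total_edits

    # e.g. revenue_per_edit(1e11, 0.001, 1e9) == 0.10 (USD per edit)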
Again, thanks for an engaging and thought-provoking post.
Best,
Andreas
[1]
https://en.wikipedia.org/wiki/Wikipedia_talk:Wikidata/2017_State_of_affairs
(including its copious archives)
[2]
https://en.wikipedia.org/wiki/Wikipedia_talk:Wikipedia_Signpost/2015-07-22/…