Hi Erik,
Really meaty post. Great stuff. Comments below.
On Tue, Oct 10, 2017 at 1:44 AM, Erik Moeller <eloquence@gmail.com> wrote:
On Sat, Oct 7, 2017 at 1:00 PM, Andreas Kolbe <jayen466@gmail.com> wrote:
... and it will all become one free mush everyone copies to make a buck. We are already in a situation today where anyone asking Siri, the Amazon Echo, Google or Bing about a topic is likely to get the same answer from all of them, because they all import Wikimedia content, which comes free of charge.
I wouldn't call information from Wikimedia projects a "mush", but I think it's a good term for the proprietary amalgamation of information and data from many sources, often without any regard for the reliability of the source.
In my view, whether it's a mush or not largely depends on how it is used, and to what extent it mixes solid and flaky, verifiable and non-verifiable content.
Wikidata has its own problems in that regard that have triggered ongoing discussions and concerns on the English Wikipedia.[1] Wikidata does not require users to cite sources. It contains millions of statements sourced only to some Wikipedia language version, without identification of the article, article version, or source originally cited in that Wikipedia (if any) at the time of import. It lacks effective Verifiability and BLP policies.
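The scale of this is easy to probe against the live data. A minimal sketch in Python against the public Wikidata Query Service – I'm quoting the property ID P143 ("imported from Wikimedia project") from memory, so treat the details as illustrative rather than as a vetted audit:

    # List a few statements whose stated reference is merely "imported from
    # Wikimedia project" (P143), i.e. sourced to some Wikipedia rather than
    # to an external publication. Property ID quoted from memory.
    import requests

    QUERY = """
    SELECT ?statement ?wiki WHERE {
      ?statement prov:wasDerivedFrom ?ref .
      ?ref pr:P143 ?wiki .
    }
    LIMIT 10
    """

    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "sourcing-check-sketch/0.1"},
    )
    resp.raise_for_status()
    for row in resp.json()["results"]["bindings"]:
        print(row["statement"]["value"], "<-", row["wiki"]["value"])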
Google is the king of such gooey amalgamation. Its home assistant has been known to give answers like this, sourced to "secretsofthefed.com":
"According to details exposed in Western Center for Journalism's exclusive video, not only could Obama be in be in bed with the communist Chinese, but Obama may in fact be planning a communist coup d'état at the end of his term in 2016."
See, e.g., this article for other egregious examples specifically from Google's featured responses:
https://theoutline.com/post/1192/google-s-featured-snippets-are-worse-than-fake-news
Thanks for the link to that article. Really important. I'm in complete agreement with you on that.
For Google, I suggest a query like "when was slavery abolished?" followed by exploring the auto-suggested questions. In my case, the first 10 questions point to snippets from:
- pbs.org (twice)
- USA Today
- Reuters
- archives.gov
- Wikipedia (twice)
- infoplease.com
- ourdocuments.gov
- nationalarchives.gov.uk
Being on the other side of the pond, I got slightly different results. Here they are, just for fun: Wikipedia is in the answer box, and 4 of the first 10 suggested questions link to Wikipedia:
- makewav.es
- Reuters
- archives.gov
- Wikipedia
- nationalarchives.gov.uk
- Wikipedia
- abolition.e2bn.org
- Wikipedia
- USA Today
- Wikipedia
(The 11th linked to Wikibooks.)
It's the universe of linked open data (Wikipedia/Wikidata, OpenStreetMap, and other open datasets) that keeps the space at least somewhat competitive, by giving players without much of a foothold a starting point from which to build. If Wikimedia did not exist, a smaller number of commercial players would wield greater power, due to the higher relative payoff of large investments in data mining and AI.
Yes, arguably so, although various ways remain in which Wikimedia might become a victim of its own success, depending on how ubiquitous its content becomes. The more ubiquitous it is, the higher the stakes, and the greater the pressure on volunteers.
I find that worrying, because as an information delivery system, it’s not robust. You change one source, and all the other sources change as well.
As noted above, this is not actually what is happening. Commercial players don't want to limit themselves to free/open data; they want to use AI to extract as much information about the world as possible so they can answer as many queries as possible.
To the far-from-negligible extent that they all do and will regurgitate Wikimedia content, it will happen.
By the same token, their drawing on alternative sources – even proprietary ones – as well as Wikimedia content is also potentially a good thing. It increases diversity.
And for most of the sources amalgamated in this manner, if provenance is indicated at all, we don't find any of the safeguards we have for Wikimedia content (revisioning, participatory decision-making, transparent policies, etc.). Editability, while opening the floodgates to a category of problems other sources don't have, is in fact also a safeguard: making it possible to fix mistakes instead of going through a "feedback" form that ends up who knows where.
Indeed, but it helps if re-users indicate provenance. If a digital voice assistant propagates a Wikimedia mistake without telling users where it got its information from, then there is not even a feedback form. Editability is of no help at all if people can't find the source.
This is similar to the problem of vandalised Wikidata descriptions being displayed in Wikipedia mobile views: people can't figure out where the nonsense comes from, and where to change it.
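Indicating provenance need not be hard, either. Purely as a sketch of what I have in mind – hypothetical re-user code that keeps the source URL and revision attached to an answer, using the standard MediaWiki query API:

    # Sketch: a re-user fetching an answer snippet from Wikipedia while
    # keeping the source URL and revision ID attached, instead of serving
    # the text bare.
    import requests

    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "titles": "Yarrow",
            "redirects": 1,
            "prop": "extracts|info",
            "exintro": 1,
            "explaintext": 1,
            "inprop": "url",
            "format": "json",
        },
        headers={"User-Agent": "provenance-sketch/0.1"},
    ).json()

    page = next(iter(resp["query"]["pages"].values()))
    first_sentence = page["extract"].split(". ")[0]
    # An assistant could read this out together with its source:
    print(f"{first_sentence}. (Source: {page['fullurl']}, revision {page['lastrevid']})")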
With an eye to 2030 and WMF's long-term direction, I do think it's worth thinking about Wikidata's centrality, and I would agree with you at least that the phrase "the essential infrastructure of the ecosystem" does overstate what I think WMF should aspire to (the "essential infrastructure" should consist of many open components maintained by different groups). But beyond that I think you're reading stuff into the statement that isn't there.
I'm not sure I read much more into it than that – you've summarised my main concern.
Wikidata in particular is best seen not as the singular source of truth, but as an important hub in a network of open data providers – primarily governments, public institutions, nonprofits. This is consistent with recent developments around Wikidata such as query federation.
As Wikidata imports all that material, however, there is a risk that re-users (answer engines, digital assistants) will simply focus on mining Wikidata.
There is also the risk of circular relationships – Wikidata importing content from databases that in turn import some of their own content from Wikidata. You can end up with databases all agreeing with each other, and all being wrong. :/
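Incidentally, the query federation you mention makes that circle easy to demonstrate: a single query can join Wikidata with DBpedia, which is itself extracted from Wikipedia. A rough sketch in Python – I may be misremembering whether the DBpedia endpoint is on the federation whitelist, so treat it as illustrative:

    # Sketch of query federation: join Wikidata with the DBpedia endpoint
    # in one query. Note the circle: DBpedia is extracted from Wikipedia,
    # which in turn increasingly draws on Wikidata.
    import requests

    QUERY = """
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?capital ?abstract WHERE {
      wd:Q183 wdt:P36 ?capital .              # Germany -> its capital
      SERVICE <https://dbpedia.org/sparql> {  # external, federated endpoint
        ?dbp owl:sameAs ?capital ;
             dbo:abstract ?abstract .
        FILTER(lang(?abstract) = "en")
      }
    }
    """

    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "federation-sketch/0.1"},
    )
    resp.raise_for_status()
    for row in resp.json()["results"]["bindings"]:
        print(row["abstract"]["value"][:120])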
Wikidata will often provide a shallow first level of information about a subject, while other linked sources provide deeper information. The more structured the information, the easier it becomes to validate in an automatic fashion that, for example, the subset of country population time series data represented in Wikidata is an accurate representation of the source material. Even when a large source dataset is mirrored by Wikimedia (for low-latency visualization, say), you can hash it, digitally sign it, and restrict modifiability of copies.
Interesting, though I'm not aware of that being done at present.
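The mechanics, at least, look simple enough. A minimal sketch of the hash-and-sign idea, assuming the upstream publisher signs a digest of the canonical dataset (the keys and data below are purely illustrative; it uses the third-party "cryptography" package):

    # Publisher side: hash the canonical dataset and sign the digest.
    import hashlib
    from cryptography.hazmat.primitives.asymmetric import ed25519

    dataset = b"country,year,population\nFR,2016,66900000\n"  # stand-in data
    digest = hashlib.sha256(dataset).digest()
    private_key = ed25519.Ed25519PrivateKey.generate()
    signature = private_key.sign(digest)
    public_key = private_key.public_key()

    # Mirror side (e.g. a Wikimedia copy): recompute the hash and check it
    # against the publisher's signature before use; verify() raises
    # InvalidSignature if the mirrored bytes were modified.
    mirror_digest = hashlib.sha256(dataset).digest()
    public_key.verify(signature, mirror_digest)
    print("mirror matches the signed upstream dataset")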
If we expose the history, provenance and structure of information, and the connections between sources, we can actually make the information more resilient against manipulation than if it is merely a piece of text in an article, some number in an {{infobox}} template or some "factoid" in a proprietary knowledge graph.
Yes, provenance – traceability – is key. But as things stand, I have seen no evidence that WMF has a strong desire to encourage or force re-users to provide it.
is it just that some of the world's most profitable companies earn billions from volunteers' work, gaining political power in the process, while volunteers actually pay to go online and access or purchase the sources they need to do their work? Yes or no?
I don't accept your framing. Search the way it used to be (with algorithms primarily tuned for relevance of results) was a fair deal for everyone involved: you put stuff on the web, it gets indexed and people are able to find it; the search engines make money by putting ads on the search result page. The amalgamation of information into knowledge graphs that deliver concise answers directly (however inadequate) changes the dynamic significantly.
It accords ever greater power to the maintainers of these proprietary graphs which, I hasten to repeat, incorporate information well beyond just Wikimedia's, and which frequently fail to indicate provenance in an adequate manner. And, as the example at the beginning of this message shows, it leads to "information pollution", with fake news, conspiracy theories and pseudoscience leaking into semi-authoritative instant answers.
I don't think the social justice problem here is that these companies make a profit, but that they function more and more as gatekeepers and curators of knowledge, a role for which they're ill-equipped and which civil society should be reluctant to give them.
I'm in violent agreement with you on that one. :)
But the proprietary knowledge graphs are valuable to users in ways that the previous generation of search engines was not. Interacting with a device like you would with a human being ("Alexa/Google/Siri, is yarrow edible?") makes knowledge more accessible and usable, including to people who have difficulty reading long texts, or who are not literate at all. In this sense I don't think WMF should ever find itself in the position to argue _against_ inclusion of information from Wikimedia projects in these applications.
There is a distinct likelihood that they will make reading Wikipedia articles progressively obsolete, just as the availability of Googling has dissuaded many people from sitting down and reading a book. All the more so if these applications fail to make their users aware that the information comes from Wikimedia projects.
The applications themselves are not the problem; the centralized gatekeeper control is. Knowledge as an open service (and network) is actually the solution to that root problem. It's how we weaken and perhaps even break the control of the gatekeepers. Your critique seems to boil down to "Let's ask Google for more crumbs". In spite of all your anti-corporate social justice rhetoric, that seems to be the path to developing a one-sided dependency relationship.
I considered that, but in the end felt that given the extent to which Google profited from volunteers' work, it wasn't an unfair ask.
To be clear, I'm in favor of corporations giving more to the commons, though in my ideal world, that would happen through aggressive taxation and greater public investment (especially in schools, universities and GLAMs). I have every confidence that WMF does in fact ask for as much as it can be expected to in conversations with corporations, but it's not clear what you're suggesting should happen if the corporations say no.
Publicise the fact that Google and others profit from volunteer work, and give very little back. The world could do with more articles like this:
https://www.washingtonpost.com/news/the-intersect/wp/2015/07/22/you-dont-kno...
I once did a very rough back-of-an-envelope calculation based on Google's staggering quarterly profits and its heavy reliance on Wikimedia content for many of the search features that drive its ad revenue. I estimated that the average Wikipedia edit (in any namespace) brings Google something on the order of 10 cents of revenue.[2]
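The shape of that calculation was roughly as follows – the figures below are placeholder guesses rather than the actual numbers, which are in [2]:

    # Back-of-an-envelope shape of the estimate in [2]. All figures are
    # placeholder guesses, not the actual numbers used there.
    attributed_revenue = 2.0e8  # hypothetical: cumulative Google revenue (USD)
                                # attributable to Wikimedia-derived features
    total_edits = 2.0e9         # hypothetical: cumulative Wikipedia edits,
                                # all languages and namespaces
    print(f"~${attributed_revenue / total_edits:.2f} per edit")  # ~$0.10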
Again, thanks for an engaging and thought-provoking post.
Best, Andreas
[1] https://en.wikipedia.org/wiki/Wikipedia_talk:Wikidata/2017_State_of_affairs (including its copious archives)
[2] https://en.wikipedia.org/wiki/Wikipedia_talk:Wikipedia_Signpost/2015-07-22/I...