If we are running dedicated services for free on behalf of a major search engine as part of our symbiotic relationship with them, then kudos to Lila for putting it on the trustees' agenda and getting some discussion in the movement.
I can understand how we get into a situation where a known major search engine is allowed a level of web crawling that would be treated as a denial of service attack if it came from elsewhere.
Echoing Andreas and Denny, I can see the case for asking for some contribution to cost recovery when we do something extra for a major reuser of our data. But I would prefer this to be couched as part of a wider strategic dialogue with those entities.
My particular concern is with attack pages. If we are providing the service that crawls all edits, including new pages, then I think we can do what has in the past been dismissed as impossible or outside our control: shift the new page process to one where unpatrolled pages are not crawled by search engine bots until after someone has patrolled them, treating "flagged for deletion" as a third status in addition to patrolled and unpatrolled. If we do this, then when someone creates an article about their high school prom queen and her unorthodox method for getting good grades from male teachers, we should be able to delete it without it being mirrored for hours by search engines.
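To make the idea concrete, here is a minimal sketch of the three-state model and the indexing gate it implies. It is purely illustrative: the status names and the robot_policy_for helper are assumptions for the sake of the example, not an existing MediaWiki API, and the real implementation would live server-side in whatever serves pages to crawlers.

```python
from enum import Enum


class PatrolStatus(Enum):
    UNPATROLLED = "unpatrolled"
    PATROLLED = "patrolled"
    FLAGGED_FOR_DELETION = "flagged_for_deletion"  # the proposed third status


def robot_policy_for(status: PatrolStatus) -> str:
    """Return the robots meta policy a new page would be served with.

    Only pages a human patroller has approved are exposed to search
    engine crawlers; everything else carries noindex, so an attack page
    deleted within hours is never mirrored by search engines.
    """
    if status is PatrolStatus.PATROLLED:
        return "index,follow"
    # Unpatrolled and flagged-for-deletion pages stay out of search indexes.
    return "noindex,nofollow"


# Example: a freshly created, not-yet-reviewed article
print(robot_policy_for(PatrolStatus.UNPATROLLED))  # -> noindex,nofollow
```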
Others might want the dialogue to be more about how much content can be shown in an uneditable, unattributed way by being treated as simply extracted facts and thereby public domain.
I'm keen that the WMF board has oversight of these arrangements. I appreciate that some data about crawl frequencies and algorithms will be confidential to the commercial entities involved, so I could understand if some discussions or briefing papers to the board were confidential.
What I don't want is for cost recovery to be the first item on the agenda when we talk about these relationships. Less mirroring of vandalism and attack pages, better compliance with CC-BY-SA and other licenses, and more opportunities for readers to edit are more important to me, and, considering our current financial health, should be to us all.
This does of course bring us back to the discussion about conflicts of interest and the need for staff and trustees to recuse, not just when their employer's crawler is being discussed, but also when making decisions about entities in which they own any shares. I think we should also add recusal when trustees are discussing their employer's direct competitors. It might also help if more of the trustees had the detachment and neutrality of, say, a Canadian medic, as opposed to a Silicon Valley insider whose future employers could easily be other tech giants.
WereSpielChequers/Jonathan
Message: 3
Date: Sat, 16 Jan 2016 18:11:51 -0800
From: Denny Vrandecic <dvrandecic@wikimedia.org>
To: Wikimedia Mailing List <wikimedia-l@lists.wikimedia.org>
Subject: Re: [Wikimedia-l] Monetizing Wikimedia APIs
Message-ID: <CALuRxAtFXjs9a3oO-KZ_w+pRdqShgfxHYE5KQ23rGfHeTAxC9g@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
I find it rather surprising, but I very much find myself in agreement with most of what Andreas Kolbe said on this thread.
To give a bit more of my thoughts: I am not terribly worried about current crawlers. But currently, and more so in the future, I expect us to provide more complex and thus expensive APIs: a SPARQL endpoint, parsing APIs, etc. These will simply be expensive to operate. Not for infrequent users - say, to the benefit of us 70,000 editors - but for use cases that involve tens of millions of requests per day. These have the potential of burning a lot of funds to basically support the operations of commercial companies whose mission might or might not be aligned with ours.
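As a purely illustrative aside on what distinguishing light community use from bulk commercial use could look like mechanically, here is a sketch of per-client metering on an expensive endpoint. The quota figure, class, and client name are assumptions for the example, not anything the WMF actually runs.

```python
from collections import defaultdict
from datetime import date

# Illustrative threshold only: the point is that a handful of heavy callers,
# not the ~70,000 editors, dominate the cost of expensive APIs.
FREE_DAILY_QUOTA = 10_000  # requests per client per day served at no charge


class UsageMeter:
    """Hypothetical per-client request counter for an expensive endpoint
    (e.g. a SPARQL or parsing API)."""

    def __init__(self) -> None:
        self._counts: dict[tuple[str, date], int] = defaultdict(int)

    def record(self, client_id: str) -> bool:
        """Count one request; return True while the client is within the free
        quota, False once cost recovery (or throttling) would apply."""
        key = (client_id, date.today())
        self._counts[key] += 1
        return self._counts[key] <= FREE_DAILY_QUOTA


meter = UsageMeter()
print(meter.record("bulk-commercial-crawler"))  # True until the quota is exhausted
```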
Is monetizing such use cases really entirely unthinkable? Even under restrictions like the ones suggested by Andreas, or other such restrictions we should discuss?

On Jan 16, 2016 3:49 PM, "Risker" <risker.wp@gmail.com> wrote:
Hmm. The majority of those crawlers are from search engines - the very search engines that keep us in the top 10 of their results (and often in the top 3), thus leading to the usage and donations that we need to survive. If they have to pay, then they might prefer to change their algorithm, or reduce the frequency of scraping (thus also failing to catch updates to articles, including removal of vandalism in the lead paragraphs, which is historically one of the key reasons for frequently crawling the same articles). Those crawlers are what attracts people to our sites, to read, to make donations, to possibly edit. Of course there are lesser crawlers, but they're not really big players.
I'm at a loss to understand why the Wikimedia Foundation should take on the costs and indemnities associated with hiring staff to create a for-pay API that would have to meet the expectations of a customer (or more than one customer) that hasn't even agreed to pay for access. If they want a specialized API (and we've been given no evidence that they do), let THEM hire the staff, pay them, write the code in an appropriately open-source way, and donate it to the WMF with the understanding that it could be modified as required, and that it will be accessible to everyone.
It is good that the WMF has studied the usage patterns. Could a link be given to the report, please? It's public, correct? This is exactly the point of transparency. If only the WMF has the information, then it gives an excuse for the community's comments to be ignored "because they don't know the facts". So let's lay out all the facts on the table, please.
Risker/Anne