If we are running dedicated services for free on behalf of a major search
engine as part of our symbiotic relationship with them, then kudos to Lila
for putting it on the trustees' agenda and getting some discussion going in
the movement.
I can understand how we get into a situation where a known major search
engine is allowed a level of web crawling that would be treated as a denial
of service attack if it came from elsewhere.
Echoing Andreas and Denny, I can see the case for asking for some
contribution to cost recovery when we do something extra for a major reuser
of our data. But I would prefer this to be couched as part of a wider
strategic dialogue with those entities.
My particular concern is with attack pages, and if we are providing the
service that crawls all edits, including new pages, then I think we can do
what has in the past been dismissed as impossible or outside our control:
Shift the new page process to one where unpatrolled pages are not crawled
by search engine bots until after someone has patrolled them.
Treat "flagged for deletion" as a third status in addition to patrolled and
unpatrolled.
If we do this then when someone creates an article about their high school
prom queen and her unorthodox method for getting good grades from male
teachers, we should be able to delete it without it being mirrored for
hours by search engines.
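The gating idea above can be sketched in a few lines. This is purely
illustrative, not actual MediaWiki code: the status names and the helper
function are assumptions. The mechanism it leans on is real, though: major
crawlers honor an `X-Robots-Tag: noindex` response header (or the equivalent
robots meta tag), so a page could be served as non-indexable until someone
patrols it.

```python
# Illustrative sketch: gate search-engine indexing on patrol status.
# Status values and the function name are assumptions for this example.

PATROLLED = "patrolled"
UNPATROLLED = "unpatrolled"
FLAGGED_FOR_DELETION = "flagged_for_deletion"  # the proposed third status

def robots_header(patrol_status: str) -> str:
    """Return an X-Robots-Tag value for a page response.

    Only patrolled pages are made indexable; unpatrolled pages and
    pages flagged for deletion tell compliant crawlers to skip them.
    """
    if patrol_status == PATROLLED:
        return "index, follow"
    return "noindex, nofollow"

print(robots_header(PATROLLED))             # index, follow
print(robots_header(UNPATROLLED))           # noindex, nofollow
print(robots_header(FLAGGED_FOR_DELETION))  # noindex, nofollow
```

With something like this in the response path, an attack page deleted within
the patrol window would never have been indexable in the first place.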
Others might want the dialogue to be more about how much content can be
shown in an uneditable unattributed way by being treated as simply
extracted facts and thereby public domain.
I'm keen that the WMF board has oversight of these arrangements. I
appreciate that some data about crawl frequencies and algorithms will be
confidential to the commercial entities involved, so I could understand if
some discussions or briefing papers to the board were confidential.
What I don't want is for cost recovery to be the first item on the agenda
when we talk about these relationships. Less mirroring of vandalism and
attack pages, better compliance with CC-BY-SA and other licenses and more
opportunities for readers to edit are more important to me, and, considering
our current financial health, should be to us all.
This does of course bring us back to the discussion about conflicts of
interest and the need for staff and trustees to recuse, not just when their
employer's crawler is being discussed, but also when making decisions about
entities in which they own any shares. I think we should also add when the
trustees are discussing their employer's direct competitors. It might also
help if more of the trustees had the detachment and neutrality of, say, a
Canadian medic, as opposed to a Silicon Valley insider whose future
employers could easily be other tech giants.
WereSpielChequers/Jonathan
Message: 3
Date: Sat, 16 Jan 2016 18:11:51 -0800
From: Denny Vrandecic <dvrandecic(a)wikimedia.org>
To: Wikimedia Mailing List <wikimedia-l(a)lists.wikimedia.org>
Subject: Re: [Wikimedia-l] Monetizing Wikimedia APIs
Message-ID:
<CALuRxAtFXjs9a3oO-KZ_w+pRdqShgfxHYE5KQ23rGfHeTAxC9g(a)mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
I find it rather surprising, but I very much find myself in agreement with
most of what Andreas Kolbe said on this thread.
To give a bit more of my thinking: I am not terribly worried about current
crawlers. But currently, and more so in the future, I expect us to provide
more complex and thus expensive APIs: a SPARQL endpoint, parsing APIs, etc.
These will simply be expensive to operate. Not for infrequent users - say,
to the benefit of us 70,000 editors - but for use cases that involve tens
of millions of requests per day. These have the potential of burning a lot
of funds to basically support the operations of commercial companies whose
mission might or might not be aligned with ours.
Is monetizing such use cases really entirely unthinkable? Even under
restrictions like the ones suggested by Andreas, or other such restrictions
we should discuss?
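One way to separate the two usage patterns Denny describes, light editor
traffic versus bulk commercial use, is a per-client daily quota. The sketch
below is an illustration only; the class, the threshold, and the free tier
are assumptions for this example, not any actual WMF mechanism or policy.

```python
# Illustrative sketch: a per-client daily request quota under which
# light use passes for free while bulk callers hit a limit.
# All names and numbers here are assumptions, not WMF practice.

from collections import defaultdict

FREE_DAILY_QUOTA = 10_000  # assumed free tier: requests per client per day

class DailyQuota:
    def __init__(self, limit: int = FREE_DAILY_QUOTA):
        self.limit = limit
        self.counts = defaultdict(int)  # client_id -> requests so far today

    def allow(self, client_id: str) -> bool:
        """Record one request; return False once the client exceeds the free tier."""
        self.counts[client_id] += 1
        return self.counts[client_id] <= self.limit

# A tiny demo with a limit of 3:
quota = DailyQuota(limit=3)
print([quota.allow("bulk-crawler") for _ in range(4)])  # [True, True, True, False]
```

Requests beyond the quota could then be throttled, or billed under whatever
cost-recovery terms the discussion settles on, without touching ordinary
editor traffic.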
On Jan 16, 2016 3:49 PM, "Risker" <risker.wp(a)gmail.com> wrote:
Hmm. The majority of those crawlers are from search engines - the very
search engines that keep us in the top 10 of their results (and often in
the top 3), thus leading to the usage and donations that we need to
survive. If they have to pay, then they might prefer to change their
algorithm, or reduce the frequency of scraping (thus also failing to catch
updates to articles, including removal of vandalism in the lead paragraphs,
which is historically one of the key reasons for frequently crawling the
same articles). Those crawlers are what attracts people to our sites, to
read, to make donations, to possibly edit. Of course there are lesser
crawlers, but they're not really big players.
I'm at a loss to understand why the Wikimedia Foundation should take on the
costs and indemnities associated with hiring staff to create a for-pay API
that would have to meet the expectations of a customer (or more than one
customer) that hasn't even agreed to pay for access. If they want a
specialized API (and we've been given no evidence that they do), let THEM
hire the staff, pay them, write the code in an appropriately open-source
way, and donate it to the WMF with the understanding that it could be
modified as required, and that it will be accessible to everyone.
It is good that the WMF has studied the usage patterns. Could a link be
given to the report, please? It's public, correct? This is exactly the
point of transparency. If only the WMF has the information, then it gives
an excuse for the community's comments to be ignored "because they don't
know the facts". So let's lay out all the facts on the table, please.
Risker/Anne