It seems that a lot of people are excited about the concept of KDE integration (and of course integration beyond that), so I think we should start thinking now about the load on the webservers and how to manage it.
My thought is that a well behaved application should produce no more load than people surfing our site with a traditional browser. But it is possible (likely, even) that some applications will not be well-behaved.
If we have a single generic interface where everyone pulls in exactly the same way, then we have no way to block abusive applications without blocking everyone.
Something as simple as requiring a "user agent" string might be enough?
This issue is quite similar to the issue of people pulling pages from our site "live" to make a mirror. It's not a good thing to do if done abusively. But with web hosts, it is easy enough to simply block them if they misbehave. A misbehaving application will come in through many ip numbers.
I have been thinking a little bit as well about a model with full blown keys (like the Google API) -- the point could be that for free software, we can give out free keys and support those users at our own expense. But for proprietary software, we can charge money for the keys.
Anyone who doesn't want to pay can still get the database dumps from time to time.
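Purely for illustration, here is a minimal sketch of what a tiered key check on our end might look like (the key names, quota numbers, and data structures are all invented):

    # Hypothetical tiered key registry -- illustration only, not a real design.
    DAILY_QUOTA = {"free": 10000, "commercial": 1000000}   # invented numbers

    # key -> tier; free keys handed out to free-software projects,
    # commercial keys sold to proprietary ones.
    KEYS = {"kde-amarok-free-key": "free", "acme-player-paid-key": "commercial"}

    usage = {}  # key -> requests served so far today

    def allow_request(api_key):
        tier = KEYS.get(api_key)
        if tier is None:
            return False                      # unknown key: reject
        used = usage.get(api_key, 0)
        if used >= DAILY_QUOTA[tier]:
            return False                      # daily quota exhausted
        usage[api_key] = used + 1
        return True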
--Jimbo
On 25/06/05, Jimmy Wales jwales@wikia.com wrote:
I have been thinking a little bit as well about a model with full blown keys (like the Google API) -- the point could be that for free software, we can give out free keys and support those users at our own expense. But for proprietary software, we can charge money for the keys.
Could web reusers use the same interface (with appropriate caching)? Would they be able to get free keys if their sites were free?
(I have no plans to do anything along these lines, but I'd imagine that sooner or later a WWW::Wikipedia::API will pop up on the CPAN...)
Trillian currently uses a system where they distribute a list of all the article titles with their IM client, and it compares all incoming strings over IM, IRC, etc. against the titles. Matching strings are turned into links that, when hovered over, initiate a request to the Trillian servers for the entire text of the article, which is then displayed as a hover box. The article text is cached for future use.
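Roughly, the client-side logic might look something like the following sketch (guesswork for illustration, not Trillian's actual code; only single-word titles are matched, and the fetch function stands in for the HTTP call to their servers):

    # Sketch of the Trillian-style approach. "titles.txt" is assumed to ship
    # with the client; fetch() stands in for the request to their servers.
    article_titles = set(line.strip() for line in open("titles.txt"))
    cache = {}  # title -> article text already fetched

    def linkify(message):
        """Wrap any known (single-word) article title appearing in a message."""
        for word in message.split():
            if word in article_titles:
                message = message.replace(word, "[wiki:%s]" % word)
        return message

    def on_hover(title, fetch):
        """Return article text, fetching from the intermediary server only once."""
        if title not in cache:
            cache[title] = fetch(title)   # e.g. HTTP GET to Trillian's servers
        return cache[title]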
If a pay-per-use API is available, they may want to incorporate a small fee into the cost of the client, which would allow each individual user to access the Wikipedia API directly, rather than Trillian's servers. This benefits Trillian because they don't have to handle so much traffic, benefits the end-user because they get completely fresh data (not a stale copy on Trillian's servers), and (may) benefit the Foundation because it brings in money, unless the service is provided at cost.
Just wanted to present this as a possible real-world solution that is indicative of what folks are going to want to do with an API...
/Alterego
Jimmy Wales wrote:
My thought is that a well behaved application should produce no more load than people surfing our site with a traditional browser. But it is possible (likely, even) that some applications will not be well-behaved.
If we have a single generic interface where everyone pulls in exactly the same way, then we have no way to block abusive applications without blocking everyone.
Something as simple as requiring a "user agent" string might be enough?
This issue is quite similar to the issue of people pulling pages from our site "live" to make a mirror. It's not a good thing to do if done abusively. But with web hosts, it is easy enough to simply block them if they misbehave. A misbehaving application will come in through many ip numbers.
I have been thinking a little bit as well about a model with full blown keys (like the Google API) -- the point could be that for free software, we can give out free keys and support those users at our own expense. But for proprietary software, we can charge money for the keys.
For server-side systems (mirrors and other sites including our bits) something like the Google API's keys would work ok. A relatively small number of server admins will be setting things up, and the keys will be 'secret', kept on that server.
Things are a bit different with the client-side software integration envisioned by the KDE folks; like a web browser these apps will be distributed to thousands of users and should "just work". A per-user key setup would probably not be acceptable, while a non-secret key distributed with the app (like a user-agent string) can be easily ripped and faked by someone less scrupulous.
We probably want client-side apps to be able to act like a browser (easy, no manual setup), while trying to avoid server abuse by someone who decides to snarf a million pages a day; a per-IP rate limit on the 'free software key' is probably sufficient to automatically curb server-side abuse.
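As a rough illustration of that per-IP limit on a shared key (numbers and data structures are invented; fixed one-hour windows assumed):

    # Sketch of a per-IP limit on the shared 'free software key'.
    import time

    MAX_PER_IP_PER_HOUR = 1000                # invented threshold
    windows = {}                              # (api_key, ip) -> (window_start, count)

    def allowed(api_key, ip):
        now = time.time()
        start, count = windows.get((api_key, ip), (now, 0))
        if now - start >= 3600:
            start, count = now, 0             # new hour, reset the counter
        if count >= MAX_PER_IP_PER_HOUR:
            return False                      # this IP is hammering the key
        windows[(api_key, ip)] = (start, count + 1)
        return True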
A separate policy for proprietary client-side apps would have to be policed on the honor system; someone could always have their client claim to be amaroK and we'd never be the wiser. :)
Blocking a client-side app that identifies itself with a unique user-agent or key, but behaves badly, would be easy. Blocking an app that behaves badly _and_ pretends to be legitimate software is a harder problem. (Sometimes you can distinguish between the real and the fake app by its behavior or a quirk of its HTTP submission formatting; sometimes that might not be easy.)
-- brion vibber (brion @ pobox.com)
If done in a way similar to Google's API, ripping a key wouldn't matter much since the key itself is rate-limited. The client-side applications wouldn't directly access the webservice API; rather, the client-side app provider would act as an intermediary. Client apps query the provider's host, and that host queries Wikipedia's API. If the client app provider wants to support more than 1000 queries per day (Google's limit), they'd need to pay for more queries. The key can't be ripped anyway, since it lies with the provider, not the client.
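A minimal sketch of the intermediary pattern, assuming a hypothetical endpoint and key (neither exists today):

    # The provider's server holds the secret key; client apps never see it.
    from urllib.parse import urlencode
    from urllib.request import urlopen

    SECRET_KEY = "provider-secret-key"                       # lives only on the provider's server
    API_URL = "https://en.wikipedia.org/hypothetical-api"    # placeholder, no such endpoint

    def handle_client_lookup(title):
        """Run on the provider's server for each client request."""
        query = urlencode({"key": SECRET_KEY, "title": title})
        with urlopen(API_URL + "?" + query) as response:
            return response.read()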
On Wikipedia's side, the intermediaries' implementations, APIs, and architecture don't matter. Just implement a user-key web services API with query limits, and you're done.
- MHart - http://taxalmanac.org
MHart wrote:
If done in a way similar to Google's API, ripping a key wouldn't matter much since the key itself is rate-limited. The client-side applications wouldn't directly access the webservice API; rather, the client-side app provider would act as an intermediary. Client apps query the provider's host, and that host queries Wikipedia's API. If the client app provider wants to support more than 1000 queries per day (Google's limit), they'd need to pay for more queries. The key can't be ripped anyway, since it lies with the provider, not the client.
Well, let's play through the scenario:
* User with KDE desktop fires up amaroK to play some tunes
* Clicks for a Wikipedia article lookup on an artist
* amaroK contacts a server that someone(?) runs for amaroK
* amaroK server contacts Wikipedia server
* Data is sent back to amaroK server
* Data is sent back to amaroK client
* Happy!
Now, while the amaroK server <-> Wikipedia server link is locked by a secret key, the amaroK client <-> amaroK server link probably isn't. Anybody can make a request to the amaroK server, claiming to be amaroK -- an abuser can then DoS the entire amaroK user base once the maximum number of requests for the amaroK key is hit.
On Wikipedia's side, the intermediaries' implementations, APIs, and architecture don't matter. Just implement a user-key web services API with query limits, and you're done.
Well, that just pushes the problem from one server to another. It doesn't change the overall analysis I gave.
-- brion vibber (brion @ pobox.com)
On 6/27/05, Brion Vibber brion@pobox.com wrote: ...
Quite an interesting problem. And I don't think there is "The Solution" that you are all hoping for. It sounds a little bit like the problem of distributing keys in any encryption technology, but actually it is way worse than that. Any key, user-agent string, or other authentication method would have to be written down somewhere in the source code of the application that is using the API. And since one of the primary aims of the API is to support open source applications, anything "hidden" in the source code is publicly accessible. So whatever method an application uses to say "Hi, it's really me!" can be copied, and thus another application can fake it.
It would work with closed source applications, but offering the API only to closed source applications isn't really an option.
So there isn't any way to identify individual applications. But there is a way to identify the individuals who are using the application that is using the API. Why do you want to block the application? Just limit use of the API to 1000 accesses an hour per IP address (replace with different numbers as you see fit). That blocks any application that is misbehaving.
OK, it would also block any other application that runs on the same machine (or over the same proxy), but I think that is acceptable if it's what it takes to keep the whole thing running.
There could also be an option to still keep user-agent strings and limit access by application (a low number) as well as having an overall limit (a reasonably larger number). That would keep one application from stopping all other access, but also protect against any misbehaving application that changes its user-agent string.
Hmmm... that sounds too easy... what did I miss? :-)
regards Henning Jungkurth
Henning Jungkurth wrote:
So there isn't any way to identify individual applications. But there is a way to identify the individuals who are using the application that is using the API. Why do you want to block the application? Just limit use of the API to 1000 accesses an hour per IP address (replace with different numbers as you see fit). That blocks any application that is misbehaving.
What about adding user logins to the mix? Use IP-based or user-agent-based limits for anonymous access and user-based limits for those who use a login. The login should be the same as their Wikipedia web login; that way a logged-in user wouldn't be restricted by IP limits.
In other words:

    if logged in   { decrease user quota }
    else           { decrease IP/agent quota }
    if quota == 0  { deny request }
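Expanded into a minimal sketch (the quota numbers and data structures are invented for illustration):

    USER_QUOTA = 5000     # per logged-in account per day (invented)
    ANON_QUOTA = 1000     # per IP (or user agent) per day (invented)

    user_left = {}        # username -> remaining requests
    anon_left = {}        # ip -> remaining requests

    def check_quota(username, ip):
        if username:                                  # authenticated request
            table, key, start = user_left, username, USER_QUOTA
        else:                                         # anonymous request
            table, key, start = anon_left, ip, ANON_QUOTA
        remaining = table.setdefault(key, start)
        if remaining <= 0:
            return False                              # quota == 0: deny request
        table[key] = remaining - 1
        return True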
Sincerely, Jason Edgecombe - a lurker who spoke up :)
Henning Jungkurth wrote:
So there isn't any way to identify individual applications. But there is a way to identify the individuals who are using the application that is using the API. Why do you want to block the application? Just limit use of the API to 1000 accesses an hour per IP address (replace with different numbers as you see fit). That blocks any application that is misbehaving.
Amazon's webservices have 'solved' the problem by allowing no more than one request per second from any IP address.
This, however, doesn't work well, because clients can't control whether they have a lot of visitors at once and then none for a length of time. Some (not very efficient) applications also request 10 results at once and then none for a while.
What I proposed was a simple mechanism: give every IP address a credit of, say, 60 requests, decrement it by one for every request successfully handled, and increase it by one every second (up to the maximum of 60). As soon as the credit reaches zero, the system either delays the response until there is credit again (so up to one second later; the preferred method) or sends back an appropriate error message.
This system is easy to implement and gives clients a lot of freedom, but it effectively limits access from IP addresses that send too many requests per unit of time.
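A sketch of that credit mechanism, using the parameter values suggested above (an illustration only, not production code):

    import time

    MAX_CREDIT = 60       # maximum credits per IP address
    REFILL_RATE = 1.0     # credits regained per second

    buckets = {}          # ip -> (credit, last_update_time)

    def take_credit(ip):
        """Return True if the request may be served now, False if it should
        be delayed (or answered with an appropriate error message)."""
        now = time.time()
        credit, last = buckets.get(ip, (MAX_CREDIT, now))
        credit = min(MAX_CREDIT, credit + (now - last) * REFILL_RATE)
        if credit < 1:
            buckets[ip] = (credit, now)
            return False
        buckets[ip] = (credit - 1, now)
        return True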
Of course the values of the parameters are open to discussion.
Greetings, Jaap
-- My Amazon scripts: -- http://www.chipdir.nl/amazon/
Why does Wikimedia want rate limiting that's any different from what it has for normal web clients now (i.e. ban them if they're abusive)? I don't see how it makes much of a difference that the request comes in as SOAP over HTTP rather than XHTML over HTTP.
Now, while the amaroK server <-> Wikipedia server link is locked by a secret key, the amaroK client <-> amaroK server link probably isn't. Anybody can make a request to the amaroK server, claiming to be amaroK -- an abuser can then DoS the entire amaroK user base once the maximum number of requests for the amaroK key is hit.
Well, that just pushes the problem from one server to another. It doesn't change the overall analysis I gave.
I would phrase it as pushing the problem from Wikipedia to the app provider. Your statement that the amaroK client <-> amaroK server link is "probably" not secured is speculation, and ultimately, that isn't Wikipedia's problem. You obviously have no control over other people's server or key security, and trying to secure public information is generally impossible anyway. People will find a way around the limitations of a public API. Google's web services API makes it easy to query and parse the results, but Google's HTML pages are very clean and easy to parse anyway, making it simple to build a web service that just scrapes the page.
Wikipedia is also clean HTML, and a scraper is simple to make. But you have to ignore those sorts of people. Make it easy for legitimate users to access what they need and ignore the people who are going to ignore your rules anyway. Introduce stricter controls when it becomes clearly necessary, but not before.
- MHart - http://taxalmanac.org
Brion Vibber wrote:
A separate policy for proprietary client-side apps would have to be policed on the honor system; someone could always have their client claim to be amaroK and we'd never be the wiser. :)
Perhaps. But imagine a scenario: Microsoft decides to add Wikipedia content on the fly for all users of Microsoft Media Player or whatever it would be. (Or Apple and iTunes, for example). We're pretty joyous about this in general (free information for everyone!) but it seems a bit unfair for us to have to foot the bill for the servers and bandwidth.
Now, if Apple were hypothetically to include Wikipedia data by pulling it while pretending to be amaroK, presumably we would have some legal course of action, and of course the PR for them would be disastrous.
No one has said anything negative about what I'm proposing yet (that I've seen) but I should make clear that I'm not at all talking about making our free content costly for proprietary applications. It is only the hammering of our (expensive) servers that I'm worried about.
If Microsoft or Apple wants to mirror Wikipedia and hit their own servers, that's fine.
Blocking a client-side app that identifies itself with a unique user-agent or key, but behaves badly, would be easy. Blocking an app that behaves badly _and_ pretends to be legitimate software is a harder problem. (Sometimes you can distinguish between the real and the fake app by its behavior or a quirk of its HTTP submission formatting; sometimes that might not be easy.)
It seems unlikely that a popular proprietary app would do something so bad; legitimate companies would have too much to lose. It is of course possible that a spammer or smalltimer might do something malicious, but I don't suppose there is any way to prevent that.
--Jimbo
Jimmy Wales wrote:
No one has said anything negative about what I'm proposing yet (that I've seen) but I should make clear that I'm not at all talking about making our free content costly for proprietary applications. It is only the hammering of our (expensive) servers that I'm worried about.
If Microsoft or Apple wants to mirror Wikipedia and hit their own servers, that's fine.
On a technical note, Brion announced that the XML produced by Special:Export will be our new "dump" format, together with an import script. Coincidentally, I recently wrote an extension that can list the titles of articles changed since date/time X. Together, these two parts could enable mirrors to keep their databases up to date with only a few hours' or minutes' delay behind the "live" server, and without having to transfer the whole database once every two weeks (which was the release cycle, right?).
Magnus
P.S.: Please note that the extension is not fully functional yet, as it lacks information about page moves/deletion, and image upload/deletion.
Magnus Manske wrote:
On a technical note, Brion announced that the XML produced by Special:Export will be our new "dump" format, together with an import script. Coincidentally, I recently wrote an extension that can list the titles of articles changed since date/time X. Together, these two parts could enable mirrors to keep their databases up to date with only a few hours' or minutes' delay behind the "live" server, and without having to transfer the whole database once every two weeks (which was the release cycle, right?).
Magnus, we've had this for a few months actually. See extensions/OAI for the client/server database updater system, which we're currently using somewhat experimentally for a couple clients. Wikimedia plans to offer this as a value-add service to commercial mirrors generally (and I hope for non-profit mirrors, though you'd have to talk to Jimbo to know what's up).
This uses the OAI-PMH[1] wrapper protocol to pull page updates, formatted using the Special:Export XML schema.
(The client portion currently works only on a local 1.4 installation though it's independent of the server version. I'll be tuning it up for 1.5 when I have a chance in the next few weeks.)
[1] http://www.openarchives.org/OAI/openarchivesprotocol.html
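As a rough illustration, an incremental pull over OAI-PMH might look like the sketch below (the repository URL and metadataPrefix are placeholders -- check the extension for the real values):

    from urllib.parse import urlencode
    from urllib.request import urlopen

    REPOSITORY = "https://example.org/oai-repository"   # hypothetical endpoint

    def fetch_updates(since):
        """Ask the repository for records changed since the given UTC timestamp."""
        params = urlencode({
            "verb": "ListRecords",            # standard OAI-PMH verb
            "metadataPrefix": "mediawiki",    # assumed prefix for Export-style XML
            "from": since,                    # e.g. "2005-06-27T00:00:00Z"
        })
        with urlopen(REPOSITORY + "?" + params) as response:
            return response.read()            # XML; parse and apply to the local copy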
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
Magnus, we've had this for a few months actually. See extensions/OAI for the client/server database updater system, which we're currently using somewhat experimentally for a couple clients. Wikimedia plans to offer this as a value-add service to commercial mirrors generally (and I hope for non-profit mirrors, though you'd have to talk to Jimbo to know what's up).
Yes, we would love to do that. If someone is profiting, we should charge them. If it is non-profit, then we should only charge them if the load is so substantial that we feel they ought to raise their own funds to do it. Better to be helpful when we can, and only charge people money for stuff when it's painful to us to support them.
--Jimbo
Something as simple as requiring a "user agent" string might be enough?
No, see below.
A misbehaving application will come in through many ip numbers.
And a forged user agent will be easy to bounce through Tor + privoxy or similar mechanisms to "anonymize" the requests. You can't block all of the IPs (well, you could block the Tor nodes in that case, but someone will find dozens of ways around that too, by chaining proxies together).
David A. Desrosiers desrod@gnu-designs.com http://gnu-designs.com
But basically this announcement was made without anyone on "our side" having made a commitment to implement it, right? I don't know about my fellow developers, but I just found out through dot.kde.org that KDE and Wikimedia were planning some kind of cooperation.
Has anyone committed (him|her)self to implementing this?
Hi everybody,
Jimmy wrote:
I have been thinking a little bit as well about a model with full blown keys (like the Google API) -- the point could be that for free software, we can give out free keys and support those users at our own expense. But for proprietary software, we can charge money for the keys.
Brion Vibber wrote:
For server-side systems (mirrors and other sites including our bits) something like the Google API's keys would work ok. A relatively small number of server admins will be setting things up, and the keys will be 'secret', kept on that server.
Things are a bit different with the client-side software integration envisioned by the KDE folks; like a web browser these apps will be distributed to thousands of users and should "just work". A per-user key setup would probably not be acceptable, while a non-secret key distributed with the app (like a user-agent string) can be easily ripped and faked by someone less scrupulous.
We probably want client-side apps to be able to act like a browser (easy, no manual setup), while trying to avoid server abuse by someone who decides to snarf a million pages a day; a per-IP rate limit on the 'free software key' is probably sufficient to automatically curb server-side abuse.
At work we've recently been over a lot of these same issues while developing our API[0]. We've gone with a scheme that uses per-application tokens. For tasks that require a user account, there are per-user-per-app tokens. You can read more about it here[1]. Flickr recently drafted their new auth API (here[2]), which (I think) has a similar scheme.
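To give a flavour of the idea, here's a minimal sketch of how such tokens might be checked (the token strings, names, and data structures are invented, not EVDB's or Flickr's actual API):

    APP_TOKENS = {"app-token-123": "ExampleJukebox"}   # app token -> application name

    # (app_token, user_token) -> username; issued when a user authorises the app
    USER_TOKENS = {("app-token-123", "user-token-abc"): "SomeUser"}

    def identify(app_token, user_token=None):
        """Return (app_name, username); username is None for anonymous calls."""
        app = APP_TOKENS.get(app_token)
        if app is None:
            raise ValueError("unknown application token")
        user = USER_TOKENS.get((app_token, user_token)) if user_token else None
        return app, user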
Anyway, I thought looking at existing APIs which try to handle some of the same use cases as a hypothetical Wikipedia API might be helpful.
Ted
0. http://api.evdb.com/
1. http://api.evdb.com/docs/auth/
2. http://flickr.com/services/api/auth.spec.html