On 12/21/06, Neil Harris <usenet@tonal.clara.co.uk> wrote:
Just a thought: the en: Wikipedia gets about 3 edits a second. I wonder if it would be possible for us to use special pleading through the Foundation to get a dedicated search pipe into Google that would allow us to do, say, 30 searches a second, 24 hours a day (which would be only a tiny fraction of their overall capacity), in recognition of the _very_ substantial advertising revenue they must surely be receiving as a side effect of having Wikipedia's content online to draw in search queries.
(Think about it: even if only 20% of Wikimedia's 4000 or so page loads a second come from Google users who are expecting something like Wikipedia content, and Google only makes $0.25 CPM serving page ads on the searches for those pages, that comes to an income stream of $0.20 per _second_ from Wikipedia searches, or a total of over $6M a year...)
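The arithmetic above can be checked directly; at the stated assumptions (4000 page loads/s, a 20% Google share, $0.25 per thousand ad-bearing searches) the stream annualizes to roughly $6.3M:

```python
# Back-of-the-envelope check of the revenue estimate.
# All inputs are the figures assumed in the email, not measured data.

page_loads_per_sec = 4000
google_share = 0.20            # fraction of page loads arriving via Google
cpm_dollars = 0.25             # revenue per 1000 ad-bearing searches

searches_per_sec = page_loads_per_sec * google_share       # 800/s
revenue_per_sec = searches_per_sec * cpm_dollars / 1000    # $0.20/s
revenue_per_year = revenue_per_sec * 86400 * 365           # ~$6.3M

print(f"${revenue_per_sec:.2f}/s, ~${revenue_per_year / 1e6:.1f}M/year")
```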
If so, we could integrate the copyright violation bot into the toolserver, or into the MW server cluster itself.
Go ahead: write the software, make it good, make it scale, and make it robust so that you don't have to constantly twiddle with it to keep it working.
I have no doubt that Google's rate limit can be worked out. I promise you that good work done towards these ends will not be wasted. Make sure that it's sufficiently modular that we'll be able to use it to generate queries against other text sources.
The logic needed for software to do this well is not trivial, but it is certainly not impossible. Working out the right access arrangement with Google is also not impossible. Someone just needs to step up and do it.
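As a sketch of what "sufficiently modular" might look like: a pluggable search backend behind a token-bucket rate limiter, so Google could be swapped for any other text source and the agreed query rate is never exceeded. All class and method names here are hypothetical, not existing code:

```python
import time
from abc import ABC, abstractmethod

class SearchBackend(ABC):
    """Pluggable text-search source; Google, or any other corpus."""
    @abstractmethod
    def search(self, query: str) -> list[str]:
        """Return candidate source URLs/passages matching the query."""

class RateLimiter:
    """Token bucket: at most `rate` calls per second, on average."""
    def __init__(self, rate: float):
        self.rate = rate
        self.tokens = rate          # allow a one-second initial burst
        self.last = time.monotonic()

    def acquire(self) -> None:
        # Refill tokens for the time elapsed since the last call.
        now = time.monotonic()
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < 1:
            # Sleep just long enough to earn the one token we need.
            time.sleep((1 - self.tokens) / self.rate)
            self.last = time.monotonic()
            self.tokens = 0
        else:
            self.tokens -= 1

class CopyvioChecker:
    """Feed suspect passages from new edits through a rate-limited backend."""
    def __init__(self, backend: SearchBackend, searches_per_sec: float = 30):
        self.backend = backend
        self.limiter = RateLimiter(searches_per_sec)

    def check(self, passage: str) -> list[str]:
        self.limiter.acquire()
        return self.backend.search(passage)
```

The rate limiter is what makes a negotiated "30 searches a second" pipe safe to sit behind the edit stream, and the abstract backend is what keeps the copyright-violation logic reusable against other text sources later.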