Tim Starling wrote:
But it's not going to happen unless someone gets around to writing a program which:
- Accepts URLs on stdin, separated by line breaks
Seems simple.
- Identifies plain page views
I assume you mean any /wiki/XXXX URL with no '?'. Quite easy, too.
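
Roughly something like this, just as a sketch in Go (the /wiki/ prefix and the no-'?' rule are my assumptions from this thread, not a spec):

// pageview_filter.go: read URLs from stdin, pass through only plain page views.
// "Plain page view" is assumed to mean a URL containing /wiki/ and no query string.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	in := bufio.NewScanner(os.Stdin)
	out := bufio.NewWriter(os.Stdout)
	defer out.Flush()

	for in.Scan() {
		url := in.Text()
		// Anything with a query string is not a plain page view.
		if strings.Contains(url, "?") {
			continue
		}
		// Keep only /wiki/XXXX paths (a host prefix before the path is fine).
		if i := strings.Index(url, "/wiki/"); i >= 0 && len(url) > i+len("/wiki/") {
			fmt.Fprintln(out, url)
		}
	}
}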
- Breaks them down into per-page counts as described
And it has to do that really fast... If wgArticleId were also sent along with the URL, sorting and using a hashtable would be easier.
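
Without wgArticleId in the stream, the fallback is to key the hashtable on the title pulled out of the URL. A minimal sketch, again in Go, with a map standing in for the hashtable:

// Count page views per title using an in-memory hash table (a Go map here).
// Keying on the title string is an assumption; with wgArticleId in the log
// line one could key on the integer id instead, as suggested above.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	counts := make(map[string]uint64)
	in := bufio.NewScanner(os.Stdin)

	for in.Scan() {
		url := in.Text()
		i := strings.Index(url, "/wiki/")
		if i < 0 || strings.Contains(url, "?") {
			continue // not a plain page view
		}
		title := url[i+len("/wiki/"):]
		counts[title]++
	}

	// Dump the per-page counts once stdin closes.
	for title, n := range counts {
		fmt.Printf("%d %s\n", n, title)
	}
}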
- Provides a TCP query interface
I'd share the in-memory hashtable between processes and simply add a 'reader' process. We can live with the race conditions, too.
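
For illustration only, here is a single-process variant in Go: a collector goroutine keeps counting stdin while a TCP listener answers "send a title, get back its count" queries. The port number and the line-based protocol are made up; the shared-memory, separate-reader-process setup described above would look the same from a client's point of view.

// Sketch of the TCP query interface on top of the same in-memory counts.
package main

import (
	"bufio"
	"fmt"
	"net"
	"os"
	"strings"
	"sync"
)

var (
	mu     sync.RWMutex
	counts = make(map[string]uint64)
)

// collect reads plain page views from stdin and updates the shared counts.
func collect() {
	in := bufio.NewScanner(os.Stdin)
	for in.Scan() {
		url := in.Text()
		i := strings.Index(url, "/wiki/")
		if i < 0 || strings.Contains(url, "?") {
			continue
		}
		title := url[i+len("/wiki/"):]
		mu.Lock()
		counts[title]++
		mu.Unlock()
	}
}

// serve answers one connection: each line is a title, each reply is its count.
func serve(conn net.Conn) {
	defer conn.Close()
	r := bufio.NewScanner(conn)
	for r.Scan() {
		title := strings.TrimSpace(r.Text())
		mu.RLock()
		n := counts[title]
		mu.RUnlock()
		fmt.Fprintf(conn, "%d\n", n)
	}
}

func main() {
	go collect()

	ln, err := net.Listen("tcp", ":8420") // port is arbitrary
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			continue
		}
		go serve(conn)
	}
}

You would query it with something like: echo Main_Page | nc localhost 8420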
- Does all that for 30k req/s using less than 10% CPU and 2GB memory
You mean 10% of the *cluster CPU*, right? ;)
Impossible?
We could start by sampling: read data for 5 minutes, compute on it for 25. That would lower the rate to 1k req/s.
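
A rough sketch of that gating, assuming the 5-minutes-out-of-30 split above and nothing else:

// Only ingest lines during the first 5 minutes of each 30-minute window and
// drop the rest, so the steady-state processing rate falls well below the raw
// request rate. The window lengths come from the suggestion above; everything
// else is illustrative.
package main

import (
	"bufio"
	"fmt"
	"os"
	"time"
)

const (
	window = 30 * time.Minute // full cycle
	active = 5 * time.Minute  // portion of the cycle during which we read
)

func main() {
	start := time.Now()
	in := bufio.NewScanner(os.Stdin)
	kept, dropped := 0, 0

	for in.Scan() {
		// Position within the current 30-minute cycle decides the line's fate.
		if time.Since(start)%window < active {
			_ = in.Text() // in a real collector this line would go to the counting code
			kept++
		} else {
			dropped++
		}
	}
	fmt.Fprintf(os.Stderr, "kept %d lines, dropped %d\n", kept, dropped)
}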