January 20, 2007
Hello wikitech friends:
I have discovered that the following templates provide this information:
1. Current Day (Sunday to Saturday) {{CURRENTDAYNAME}}
2. Current Month (January to December) {{CURRENTMONTHNAME}}
3. Current Day (1 to 31) {{CURRENTDAY}}
4. Current Year (2007) {{CURRENTYEAR}}
5. Number of articles on your wiki {{NUMBEROFARTICLES}}
QUESTION
On Wikipedia.org, the number of page views can be found here: http://en.wikipedia.org/wiki/Special:Statistics
What is the template one would use to indicate the number of page views for a wiki?
Thanks.
David Spencer in Canada
On 1/20/07, David Spencer wikimedia.org-updates@davidspencer.ca wrote:
QUESTION
On Wikipedia.org, the number of page views can be found here: http://en.wikipedia.org/wiki/Special:Statistics
What is the template one would use to indicate the number of page views for a wiki?
1) They aren't templates, they're variables/magic words. http://meta.wikimedia.org/wiki/Help:Magic_words
2) On Wikimedia projects, page views are not tracked due to load. Probably because of this, nobody's created a {{PAGEVIEWCOUNT}} magic word or whatnot.
On 1/21/07, Simetrical Simetrical+wikitech@gmail.com wrote:
- On Wikimedia projects, page views are not tracked due to load.
Probably because of this, nobody's created a {{PAGEVIEWCOUNT}} magic word or whatnot.
I thought they kind of were now. Like every 1000th view is recorded somewhere, possibly offsite. Wikicharts, it was called. What's the status of that?
Steve
Steve Bennett wrote:
On 1/21/07, Simetrical Simetrical+wikitech@gmail.com wrote:
- On Wikimedia projects, page views are not tracked due to load.
Probably because of this, nobody's created a {{PAGEVIEWCOUNT}} magic word or whatnot.
I thought they kind of were now. Like every 1000th view is recorded somewhere, possibly offsite. Wikicharts, it was called. What's the status of that?
That is on the toolserver, but I think it would produce too much server load, both on the toolserver and on the "normal" servers, to fetch the stat info on every page view...
marco
Marco Schuster wrote:
Steve Bennett wrote:
On 1/21/07, Simetrical Simetrical+wikitech@gmail.com wrote:
- On Wikimedia projects, page views are not tracked due to load.
Probably because of this, nobody's created a {{PAGEVIEWCOUNT}} magic word or whatnot.
I thought they kind of were now. Like every 1000th view is recorded somewhere, possibly offsite. Wikicharts, it was called. What's the status of that?
That is on the toolserver, but I think it would produce too much server load, both on the toolserver and on the "normal" servers, to fetch the stat info on every page view...
Wikicharts is only accurate for about the top 50 most popular pages. There's no technical reason why we couldn't count every single page view and store the counts in an in-memory hashtable for lookup by some web-based script. There are only 25M pages, so the memory requirements wouldn't be particularly onerous. This kind of thing would be enabled by the UDP logging infrastructure I've been setting up.
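(As a rough check, with assumed numbers: at an average of perhaps 40 bytes of title per page, an 8-byte counter, and a few tens of bytes of hashtable overhead per entry, 25M pages come to something like 25M x 80 bytes, i.e. around 2GB, so the memory budget below looks tight but plausible.)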
But it's not going to happen unless someone gets around to writing a program which:
* Accepts URLs on stdin, separated by line breaks
* Identifies plain page views
* Breaks them down into per-page counts as described
* Provides a TCP query interface
* Does all that for 30k req/s using less than 10% CPU and 2GB memory
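A minimal C++ sketch of the counting core, assuming each log line is a bare URL and that a "plain page view" means a /wiki/Title path with no query string; the TCP query interface and the performance tuning are left out, with a dump to stdout standing in:

    #include <iostream>
    #include <string>
    #include <unordered_map>

    int main() {
        std::unordered_map<std::string, unsigned long> counts;
        std::string line;
        // Accepts URLs on stdin, separated by line breaks.
        while (std::getline(std::cin, line)) {
            // Identifies plain page views: a /wiki/Title path, no '?'.
            std::string::size_type p = line.find("/wiki/");
            if (p == std::string::npos)
                continue;
            std::string title = line.substr(p + 6);
            if (title.empty() || title.find('?') != std::string::npos)
                continue;
            // Breaks them down into per-page counts.
            ++counts[title];
        }
        // Stand-in for the TCP query interface: dump counts to stdout.
        for (const auto& kv : counts)
            std::cout << kv.second << '\t' << kv.first << '\n';
        return 0;
    }

Meeting the 30k req/s within 10% of one CPU would then mostly be a matter of avoiding the per-line string allocations, which is where sending wgArticleId (suggested below) would help.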
Impossible?
-- Tim Starling
Tim Starling wrote:
But it's not going to happen unless someone gets around to writing a program which:
- Accepts URLs on stdin, separated by line breaks
Seems simple.
- Identifies plain page views
I assume you mean any /wiki/XXXX URL, with no '?'. Quite easy, too.
- Breaks them down into per-page counts as described
And do it really fast... If wgArticleId were also sent, sorting and using the hashtable would be easier.
- Provides a TCP query interface
I'd share the memory hashtable between processes, and simply add a 'reader' one. We can live with race conditions, too.
- Does all that for 30k req/s using less than 10% CPU and 2GB memory
You mean 10% of the *cluster CPU*, don't you? ;)
Impossible?
We could start by profiling. Read data for 5 minutes, compute it for 25. This would lower the rate to 1k req/s.
Platonides wrote:
Tim Starling wrote:
But it's not going to happen unless someone gets around to writing a program which:
- Accepts URLs on stdin, separated by line breaks
Seems simple.
- Identifies plain page views
I assume you mean any /wiki/XXXX URL, with no '?'. Quite easy, too.
- Breaks them down into per-page counts as described
And do it really fast... If wgArticleId were also sent, sorting and using the hashtable would be easier.
- Provides a TCP query interface
I'd share the memory hashtable between processes, and simply add a 'reader' one. We can live with race conditions, too.
There's only one log stream so I would think there would be only one process. The task is log analysis, not log collection.
- Does all that for 30k req/s using less than 10% CPU and 2GB memory
You mean 10% of the *cluster CPU*, don't you? ;)
10% of one processor. Maybe we could relax it if that proves to be impossible, but there will only be one log host for now, and there might be lots of log analysis tools, so we don't want any CPU hogs. Think C++, not perl.
Impossible?
We could start by profiling. Read data for 5 minutes, compute it for 25. This would lower the rate to 1k req/s.
I could make a log snippet available for optimisation purposes, but ultimately, it will have to work on the full stream. Sampling would give you an unacceptable noise floor for the majority of those 25 million articles.
(to nobody in particular) Thinking about the pipe buffer issues we had the other day, it might make sense to recompile the kernel on henbane to have a larger pipe buffer, to cut down on context switches. At 30k req/s, it would fill every 1.2ms.
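(For reference, the 1.2ms figure is consistent with the default 64KB Linux pipe buffer, assuming that is what henbane runs: at 30k req/s, filling 64KB in 1.2ms means about 36 log lines per fill, i.e. roughly 1.8KB per line, which would fit full squid-style log lines rather than bare URLs.)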
-- Tim Starling
Tim Starling wrote:
- Provides a TCP query interface
I'd share the memory hashtable between processes, and simply add a 'reader' one. We can live with race conditions, too.
There's only one log stream so I would think there would be only one process. The task is log analysis, not log collection.
Nope, I was thinking of one writer process, and several 'reader' processes attaching (read-only) to the mapped data.
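A minimal sketch of that one-writer/many-readers layout, assuming POSIX shared memory and a toy direct-hash table of (article id, count) slots; the object name "/pageviews" and the table size are illustrative only (link with -lrt on older glibc):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    // One (id, count) slot; in this toy layout colliding ids
    // simply overwrite each other.
    struct Slot { unsigned id; unsigned long count; };
    static const size_t kSlots = 1 << 20;  // illustrative table size

    int main() {
        // The writer creates the shared object; readers would shm_open
        // it O_RDONLY and mmap with PROT_READ, tolerating slightly
        // stale counts (the race conditions we can live with).
        int fd = shm_open("/pageviews", O_CREAT | O_RDWR, 0644);
        if (fd < 0) return 1;
        if (ftruncate(fd, kSlots * sizeof(Slot)) != 0) return 1;
        Slot* table = static_cast<Slot*>(mmap(0, kSlots * sizeof(Slot),
            PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
        if (table == MAP_FAILED) return 1;
        // Toy update: bump the count for article id 42.
        unsigned id = 42;
        Slot& s = table[id % kSlots];
        s.id = id;
        ++s.count;
        munmap(table, kSlots * sizeof(Slot));
        close(fd);
        return 0;
    }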
- Does all that for 30k req/s using less than 10% CPU and 2GB memory
You mean 10% of the *cluster CPU*, don't you? ;)
10% of one processor. Maybe we could relax it if that proves to be impossible, but there will only be one log host for now, and there might be lots of log analysis tools, so we don't want any CPU hogs. Think C++, not perl.
I was thinking in pure C...
Impossible?
We could start by profiling. Read data for 5 minutes, compute it for 25. This would lower the rate to 1k req/s.
I could make a log snippet available for optimisation purposes, but ultimately, it will have to work on the full stream. Sampling would give you an unacceptable noise floor for the majority of those 25 million articles.
Making the program seems fairly easy, but ensuring it will cope with all that data is a bit scary.
(to nobody in particular) Thinking about the pipe buffer issues we had the other day, it might make sense to recompile the kernel on henbane to have a larger pipe buffer, to cut down on context switches. At 30k req/s, it would fill every 1.2ms.
-- Tim Starling
The reading process should probably have higher priority than the collector. Losing UDP packets is better than filling the pipe (/me wonders what happens when it gets full. A failed write? SIGPIPE? Nuclear meltdown?).
Platonides wrote:
The reading process should probably have higher priority than the collector. Losing UDP packets is better than filling the pipe (/me wonders what happens when it gets full. A failed write? SIGPIPE? Nuclear meltdown?).
The pipe write will block, the collection process will yield, and the analysis process will gain control and empty the buffer. Not nuclear meltdown, just some microseconds lost. Losing UDP packets is almost certainly worse than that, but luckily the UDP receive buffer is quite a bit larger, and it's configurable at runtime.
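(For what it's worth, a sketch of what "configurable at runtime" looks like: a receiver can ask for a larger UDP buffer with SO_RCVBUF; the 4MB figure here is illustrative, and the kernel caps the request at net.core.rmem_max, itself tunable via sysctl.)

    #include <cstdio>
    #include <sys/socket.h>

    int main() {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        if (s < 0) return 1;
        // Request a 4MB receive buffer (illustrative size).
        int bytes = 4 * 1024 * 1024;
        if (setsockopt(s, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) != 0)
            perror("setsockopt");
        // Read back what the kernel actually granted.
        int actual = 0;
        socklen_t len = sizeof(actual);
        getsockopt(s, SOL_SOCKET, SO_RCVBUF, &actual, &len);
        std::printf("UDP receive buffer: %d bytes\n", actual);
        return 0;
    }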
-- Tim Starling