Hi!
Currently, if you want data on, for example, every article on the English Wikipedia, you'd have to make 3.7 million individual HTTP requests to Henrik's tool. At one per second, you're looking at over a month's worth of continuous fetching. This is obviously not practical.
Or you can download raw data.
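For example, a rough Python sketch of chewing through one raw hourly file (assuming the usual space-separated "project title count bytes" line format; the local filename here is made up):

import gzip
from collections import Counter

# Made-up local filename for one raw hourly file; the real files are published
# as gzipped hourly dumps, one line per (project, title) pair.
HOURLY_FILE = "pagecounts-20101201-000000.gz"

def english_wikipedia_counts(path):
    """Sum per-title request counts for the 'en' project from one hourly file.

    Assumes the usual space-separated line format: project title count bytes.
    """
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.split(" ")
            if len(parts) != 4:
                continue  # skip malformed lines
            project, title, count, _size = parts
            if project == "en":
                counts[title] += int(count)
    return counts

if __name__ == "__main__":
    counts = english_wikipedia_counts(HOURLY_FILE)
    print(len(counts), "distinct titles requested in this hour")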
A lot of people were waiting on Wikimedia's Open Web Analytics work to come to fruition, but it seems that has been indefinitely put on hold. (Is that right?)
That project was pulsing with naivety, if it was ever meant to be applied to the wide scope of all projects ;-)
Is it worth a Toolserver user's time to try to create a database of per-project, per-page page view statistics?
Creating such a database is easy; making it efficient is a bit different :-)
And, of course, it wouldn't be a bad idea if Domas' first-pass implementation was improved on Wikimedia's side, regardless.
My implementation is for obtaining raw data from our squid tier; what is wrong with it?
Generally I had ideas of making a queryable data source - it isn't impossible given a decent mix of data structures ;-)
Thoughts and comments welcome on this. There's a lot of desire to have a usable system.
Sure, it would be interesting to hear what people think could usefully be done with the dataset - we may facilitate it.
But short of believing that in December 2010 "User Datagram Protocol" was more interesting to people than Julian Assange, you would need some other data source to make good statistics.
Yeah, "lies, damn lies and statistics". We need better statistics (adjusted by
wikipedian geekiness) than full page sample because you don't believe general purpose
wiki articles that people can use in their work can be more popular than some random guy
on the internet and trivia about him.
Dracula is also more popular than Julian Assange; so is Jenna Jameson ;-)
Unfortunately, every time you add the ability to spam something, people will spam. There's also unintentional crap that ends up in HTTP requests because of broken clients. It is easy to filter that out in postprocessing, if you want, by applying an article-exists Bloom filter ;-)
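A rough sketch of what that could look like (the title-dump filename, the filter sizing and the hash scheme are all just illustrative guesses):

import hashlib

class BloomFilter:
    """Minimal Bloom filter: may give false positives, never false negatives."""

    def __init__(self, size_bits, num_hashes):
        self.size = size_bits
        self.num_hashes = num_hashes          # must be <= 8 with this digest-slicing scheme
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions from one SHA-256 digest; illustrative,
        # not a tuned production hash family.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(digest[i * 4:(i + 1) * 4], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

# Hypothetical usage: load existing article titles once, then filter request lines.
existing = BloomFilter(size_bits=8 * 50_000_000, num_hashes=7)   # ~50 MB, sized by guesswork
with open("enwiki-all-titles", encoding="utf-8") as f:           # assumed list of article titles
    for title in f:
        existing.add(title.strip())

def keep(line):
    """Drop 'en' request lines whose title (probably) does not exist."""
    parts = line.rstrip("\n").split(" ")
    if len(parts) != 4:
        return False
    project, title = parts[0], parts[1]
    return project != "en" or title in existing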
If the stats.grok.se data actually captures nearly all requests, then I am not sure you realize how low the figures are.
Low they are; Wikipedia's content is all about a very long tail of data, besides a heavily accessed head. Just graph the top-100 or top-1000 and you will see the shape of the curve:
https://docs.google.com/spreadsheet/pub?hl=en_US&key=0AtHDNfVx0WNhdGhWV…
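E.g. something as simple as this will produce that kind of ranking (same assumed hourly-file format and made-up filename as in the earlier sketch):

import gzip
from collections import Counter

def top_titles(path, project="en", n=1000):
    """Most-requested titles for one project from a single raw hourly file."""
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.split(" ")
            if len(parts) == 4 and parts[0] == project:
                counts[parts[1]] += int(parts[2])
    return counts.most_common(n)

# Print rank vs. count; plotting these values shows the heavily accessed head
# followed by the very long tail.
for rank, (title, count) in enumerate(top_titles("pagecounts-20101201-000000.gz"), 1):
    print(rank, count, title)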
As someone with most of the skills and resources (with the exception of time, possibly) to create a page view stats database, reading something like this makes me think... Wow.
Yes, the data is susceptible to manipulation, both intentional and unintentional.
I wonder how someone with most of the skills and resources wants to solve this problem (besides the aforementioned article-exists filter, which could reduce the dataset quite a lot ;)
... you can begin doing real analysis work. Currently, this really isn't possible, and that's a Bad Thing.
Raw data allows you to do whatever analysis you want. Shove it into SPSS/R/.. ;-)
Statistics much?
The main bottleneck has been that, like MZMcBride mentions, an underlying database of page view data is unavailable.
The underlying database is available, just not in an easily queryable format. There's a distinction there, unless you all imagine a database as something you send SQL to and it gives you data back. Sorted files are databases too ;-)
Anyway, I don't say that the project is impossible or unnecessary, but there are lots of tradeoffs to be made - what kind of real-time querying workloads are to be expected, what kind of pre-filtering people expect, etc.
Of course, we could always use OWA.
Domas