Hello everyone,
I've actually been parsing the raw data from [http://dammit.lt/wikistats/] daily into a MySQL database for over a year now. I also store statistics at hour-granularity, whereas [stats.grok.se] stores them at day granularity, it seems.
I only do this for en.wiki, and it's certainly not efficient enough to open up for public use. However, I'd be willing to chat and share code with any interested developer. The strategy and schema are a bit awkward, but it works, and it takes on average ~2 hours of processing to store 24 hours' worth of statistics.
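For anyone curious, the general shape is roughly the sketch below - not my exact schema or code (the table layout and file names here are purely illustrative), assuming the standard pagecounts line format of the raw files (project code, URL-encoded title, request count, bytes):

    import gzip
    import re
    import sys

    # Illustrative target table (not the real schema):
    #   CREATE TABLE hourly_views (
    #       page_title VARBINARY(255) NOT NULL,
    #       view_hour  DATETIME       NOT NULL,
    #       views      INT UNSIGNED   NOT NULL,
    #       PRIMARY KEY (page_title, view_hour)
    #   );

    def hourly_rows(path):
        """Yield (title, hour, views) for en.wiki from one hourly dump file,
        e.g. pagecounts-20110812-120000.gz with lines like
        'en Some_Title 42 123456' (project, title, count, bytes)."""
        m = re.search(r"pagecounts-(\d{4})(\d{2})(\d{2})-(\d{2})", path)
        stamp = "%s-%s-%s %s:00:00" % m.groups()
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                fields = line.split(" ")
                if len(fields) == 4 and fields[0] == "en":
                    yield fields[1], stamp, int(fields[2])

    if __name__ == "__main__":
        # Emit TSV rows for a bulk LOAD DATA INFILE, which is far cheaper
        # than issuing row-by-row INSERTs at this volume.
        for title, stamp, views in hourly_rows(sys.argv[1]):
            sys.stdout.write("%s\t%s\t%d\n" % (title, stamp, views))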
Thanks, -AW
On 08/12/2011 04:49 AM, Domas Mituzas wrote:
Hi!
Currently, if you want data on, for example, every article on the English Wikipedia, you'd have to make 3.7 million individual HTTP requests to Henrik's tool. At one per second, you're looking at over a month's worth of continuous fetching. This is obviously not practical.
Or you can download raw data.
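For example - one hourly dump covers every page on every project in a single request, instead of millions of per-page ones (the file name below is just a guess at the pattern; check the listing for the real names):

    from urllib.request import urlretrieve

    # ~3.7 million per-page requests at one per second is about 43 days:
    print(3_700_000 / 86_400)          # ~42.8 days of continuous fetching

    # One hourly raw dump, by contrast, is a single download.
    hour = "pagecounts-20101201-000000.gz"
    urlretrieve("http://dammit.lt/wikistats/" + hour, hour)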
A lot of people were waiting on Wikimedia's Open Web Analytics work to come to fruition, but it seems that has been indefinitely put on hold. (Is that right?)
That project was pulsing with naiveté, if it ever had to be applied to the wide scope of all projects ;-)
Is it worth a Toolserver user's time to try to create a database of per-project, per-page page view statistics?
Creating such a database is easy; making it efficient is a bit different :-)
And, of course, it wouldn't be a bad idea if Domas' first-pass implementation was improved on Wikimedia's side, regardless.
My implementation is for obtaining raw data from our squid tier; what is wrong with it? Generally I had ideas of making a query-able data source - it isn't impossible given a decent mix of data structures ;-)
Thoughts and comments welcome on this. There's a lot of desire to have a usable system.
Sure, interesting what people think could be useful with the dataset - we may facilitate it.
But short of believing that in December 2010 "User Datagram Protocol" was more interesting to people than Julian Assange, you would need some other data source to make good statistics.
Yeah, "lies, damn lies and statistics". We need better statistics (adjusted for Wikipedian geekiness) than a full page-view sample, because you don't believe that general-purpose wiki articles people can use in their work can be more popular than some random guy on the internet and trivia about him. Dracula is also more popular than Julian Assange, and so is Jenna Jameson ;-)
http://stats.grok.se/de/201009/Ngai.cc would be another example.
Unfortunately, every time you add the ability to spam something, people will spam. There's also unintentional crap that ends up in HTTP requests because of broken clients. It is easy to filter that out in postprocessing, if you want, by applying an article-exists Bloom filter ;-)
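For the record, a minimal sketch of what such an article-exists filter could look like (the title-list and pagecounts file names are hypothetical, and real titles would need the same URL-encoding and underscore normalization as the dumps):

    import hashlib
    import math

    class BloomFilter:
        """Minimal Bloom filter: no false negatives, tunable false-positive rate."""

        def __init__(self, capacity, error_rate=0.01):
            # Standard sizing: m bits and k hash functions for the target error rate.
            self.m = max(8, int(-capacity * math.log(error_rate) / math.log(2) ** 2))
            self.k = max(1, round(self.m / capacity * math.log(2)))
            self.bits = bytearray((self.m + 7) // 8)

        def _positions(self, item):
            digest = hashlib.sha256(item.encode("utf-8")).digest()
            h1 = int.from_bytes(digest[:8], "big")
            h2 = int.from_bytes(digest[8:16], "big") | 1
            return [(h1 + i * h2) % self.m for i in range(self.k)]

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            return all(self.bits[pos // 8] >> (pos % 8) & 1
                       for pos in self._positions(item))

    # Build the filter from a list of existing titles, then drop pagecounts
    # lines whose title definitely is not a real article (false positives
    # only let a little extra through; nothing real is lost).
    existing = BloomFilter(capacity=4_000_000)
    with open("enwiki-all-titles", encoding="utf-8") as titles:
        for title in titles:
            existing.add(title.rstrip("\n"))

    with open("pagecounts-20110812-120000", encoding="utf-8") as counts:
        for line in counts:
            fields = line.split(" ")
            if len(fields) == 4 and fields[0] == "en" and fields[1] in existing:
                print(line, end="")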
If the stats.grok.se data actually captures nearly all requests, then I am not sure you realize how low the figures are.
Low they are; Wikipedia's content is all about a very long tail of data, besides some heavily accessed head. Just graph the top-100 or top-1000 and you will see the shape of the curve: https://docs.google.com/spreadsheet/pub?hl=en_US&key=0AtHDNfVx0WNhdGhWVl...
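If you want to reproduce that curve from one raw hourly file yourself, something like this sketch is enough (file name hypothetical):

    import heapq

    def top_n(pagecounts_path, n=1000, project="en"):
        """Return the n most-requested titles for one project from one hourly file."""
        def hits():
            with open(pagecounts_path, encoding="utf-8", errors="replace") as f:
                for line in f:
                    fields = line.split(" ")
                    if len(fields) == 4 and fields[0] == project:
                        yield int(fields[2]), fields[1]
        return heapq.nlargest(n, hits())

    # Plot rank against count (a log scale helps) to see the heavy head
    # and the very long tail.
    for rank, (count, title) in enumerate(top_n("pagecounts-20101201-000000"), 1):
        print(rank, title, count)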
As someone with most of the skills and resources (with the exception of time, possibly) to create a page view stats database, reading something like this makes me think...
Wow.
Yes, the data is susceptible to manipulation, both intentional and unintentional.
I wonder how someone with most of the skills and resources wants to solve this problem (besides the aforementioned article-exists filter, which could reduce the dataset quite a lot ;)
... you can begin doing real analysis work. Currently, this really isn't possible, and that's a Bad Thing.
Raw data allows you to do whatever analysis you want. Shove it into SPSS/R/.. ;-) Statistics much?
The main bottleneck has been that, like MZMcBride mentions, an underlying database of page view data is unavailable.
The underlying database is available, just not in an easily queryable format. There's a distinction there, unless you all imagine a database as something you send SQL to and it gives you data back. Sorted files are databases too ;-) Anyway, I'm not saying the project is impossible or unnecessary, but there are lots of trade-offs to be made: what kind of real-time querying workloads are expected, what kind of pre-filtering people expect, etc.
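To make the "sorted files are databases" point concrete, here is a minimal sketch assuming a hypothetical per-page totals file of tab-separated title/count lines, sorted by title in byte order (e.g. with sort under LC_ALL=C). The only thing kept in memory is a line-offset index; lookups are plain binary search against the file on disk:

    def build_line_index(f):
        """One pass over the sorted file, recording the byte offset of every line."""
        offsets, pos = [], 0
        for line in f:
            offsets.append(pos)
            pos += len(line)
        return offsets

    def lookup(f, offsets, title):
        """Binary search over line offsets; the data itself stays on disk."""
        target = title.encode("utf-8")
        lo, hi = 0, len(offsets)
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(offsets[mid])
            key, _, value = f.readline().rstrip(b"\n").partition(b"\t")
            if key < target:
                lo = mid + 1
            elif key > target:
                hi = mid
            else:
                return int(value)
        return None

    # File name and layout are hypothetical, just to illustrate the idea.
    with open("en-totals-sorted.tsv", "rb") as f:
        index = build_line_index(f)
        print(lookup(f, index, "Julian_Assange"))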
Of course, we could always use OWA.
Domas