[...] daily into a MySQL database for over a
year now. I also store statistics at hour-granularity, whereas
[stats.grok.se] stores them at day granularity, it seems.
I only do this for en.wiki, and it's certainly not efficient enough to
open up for public use. However, I'd be willing to chat and share code
with any interested developer. The strategy and schema are a bit
awkward, but it works, and requires on average ~2 hours of processing
to store 24 hours' worth of statistics.
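A minimal sketch of what such an hour-granularity table might look like, using SQLite for illustration — the table and column names here are hypothetical, not AW's actual schema:

```python
import sqlite3

# Hypothetical hour-granularity schema -- an illustration, not AW's actual one.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE hourly_views (
        page_title TEXT NOT NULL,
        view_hour  TEXT NOT NULL,    -- e.g. '2011-08-12T04'
        view_count INTEGER NOT NULL,
        PRIMARY KEY (page_title, view_hour)
    )
""")
conn.execute("INSERT INTO hourly_views VALUES ('Main_Page', '2011-08-12T04', 12345)")

# Day-granularity figures (what stats.grok.se shows) then become an aggregate:
row = conn.execute("""
    SELECT page_title, substr(view_hour, 1, 10) AS view_day, SUM(view_count)
    FROM hourly_views
    GROUP BY page_title, view_day
""").fetchone()
print(row)  # ('Main_Page', '2011-08-12', 12345)
```

Storing at hour granularity costs 24x the rows but lets day totals be derived, not the other way around.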
Thanks, -AW
On 08/12/2011 04:49 AM, Domas Mituzas wrote:
Hi!
> Currently, if you want data on, for example, every article on the English
> Wikipedia, you'd have to make 3.7 million individual HTTP requests to
> Henrik's tool. At one per second, you're looking at over a month's worth of
> continuous fetching. This is obviously not practical.
Or you can download raw data.
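The month-plus estimate above is simple arithmetic:

```python
# 3.7 million articles, one request per second, 86,400 seconds per day.
requests = 3_700_000
seconds_per_day = 86_400
days = requests / seconds_per_day
print(f"{days:.1f} days")  # roughly 43 days of continuous fetching
```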
> A lot of people were waiting on Wikimedia's Open Web Analytics work to come
> to fruition, but it seems that has been indefinitely put on hold. (Is that
> right?)
That project was pulsing with naivety, if it ever had to be applied to the wide scope of
all projects ;-)
> Is it worth a Toolserver user's time to try to create a database of
> per-project, per-page page view statistics?
Creating such a database is easy; making it efficient is a bit different :-)
> And, of course, it wouldn't be a bad idea if Domas' first-pass
> implementation was improved on Wikimedia's side, regardless.
My implementation is for obtaining raw data from our squid tier; what is wrong with it?
Generally, I have had ideas about making a queryable data source - it isn't impossible
given a decent mix of data structures ;-)
> Thoughts and comments welcome on this. There's a lot of desire to have a
> usable system.
Sure - it would be interesting to hear what people think could be useful to do with the
dataset; we may facilitate it.
> But short of believing that in December 2010 "User Datagram Protocol" was
> more interesting to people than Julian Assange, you would need some other
> data source to make good statistics.
Yeah - "lies, damned lies, and statistics". So we need statistics better than a full
sample of page views (adjusted for Wikipedian geekiness?) just because you don't believe
that general-purpose wiki articles people can use in their work can be more popular than
some random guy on the internet and trivia about him.
Dracula is also more popular than Julian Assange, so is Jenna Jameson ;-)
Unfortunately, every time you add the ability to spam something, people will spam.
There's also unintentional crap that ends up in HTTP requests because of broken clients.
It is easy to filter that out in postprocessing, if you want, by applying an
article-exists Bloom filter ;-)
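A sketch of that article-exists filter: a minimal Bloom filter built over the set of real titles, used to drop garbage request-log lines in postprocessing. The parameters and titles below are illustrative, not Domas' implementation:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter -- an illustrative sketch, not Domas' implementation."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

# Build the filter from titles known to exist...
existing = BloomFilter()
for title in ("Dracula", "Julian_Assange", "User_Datagram_Protocol"):
    existing.add(title)

# ...then drop request-log lines whose title cannot be a real article.
log_lines = ["Dracula 810", "XxTotallyFakePagexX 999999", "Julian_Assange 512"]
kept = [line for line in log_lines if line.split()[0] in existing]
print(kept)  # false positives are possible, but no real article is ever dropped
```

The appeal for a dataset of this scale is that the filter is a fixed-size bitmap: millions of titles fit in a few megabytes, at the cost of a small false-positive rate.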
> If the stats.grok.se data actually captures nearly all requests, then I am
> not sure you realize how low the figures are.
Low they are - Wikipedia's content is all about a very long tail of data, besides a
heavily accessed head. Just graph the top-100 or top-1000 and you will see the shape of
the curve:
https://docs.google.com/spreadsheet/pub?hl=en_US&key=0AtHDNfVx0WNhdGhWV…
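The head/tail shape Domas describes is easy to reproduce with synthetic Zipf-distributed counts (an illustration only; the linked spreadsheet has the real numbers):

```python
# Synthetic Zipf-like view counts: the page at rank r gets roughly C / r views.
N = 1_000_000                          # a made-up article count
views = [1_000_000 // r for r in range(1, N + 1)]
total = sum(views)
head = sum(views[:1000])
ratio = head / total
print(f"top 1000 of {N:,} pages draw {100 * ratio:.0f}% of all views")
```

Under this toy distribution, a thousand pages out of a million account for roughly half of all traffic, which is the "heavily accessed head" in front of the long tail.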
> As someone with most of the skills and resources (with the exception of
> time, possibly) to create a page view stats database, reading something
> like this makes me think... Wow.
> Yes, the data is susceptible to manipulation, both intentional and
> unintentional.
I wonder how someone with most of the skills and resources wants to solve this problem
(besides the aforementioned article-exists filter, which could reduce the dataset quite a
lot ;-)
> ... you can begin doing real analysis work. Currently, this really isn't
> possible, and that's a Bad Thing.
Raw data allows you to do whatever analysis you want. Shove it into SPSS/R/.. ;-)
Statistics much?
> The main bottleneck has been that, like MZMcBride mentions, an underlying
> database of page view data is unavailable.
The underlying database is available, just not in an easily queryable format. There's a
distinction there, unless you all imagine a database as something you send SQL to and it
gives you data back. Sorted files are databases too ;-)
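To make the "sorted files are databases" point concrete: a sorted run of (title, views) records answers point lookups with binary search, no SQL server involved. The titles and counts below are made up for illustration:

```python
import bisect

# A sorted run of (title, views) records -- a "database" queried by binary
# search. Titles and counts are invented for this example.
records = sorted([
    ("User_Datagram_Protocol", 431),
    ("Julian_Assange", 512),
    ("Jenna_Jameson", 755),
    ("Dracula", 810),
])
titles = [title for title, _ in records]

def lookup(title):
    i = bisect.bisect_left(titles, title)
    if i < len(titles) and titles[i] == title:
        return records[i][1]
    return None  # title not in the dataset

print(lookup("Dracula"))       # 810
print(lookup("No_Such_Page"))  # None
```

The same idea scales to on-disk sorted files: seek to the midpoint, read a record, and halve the range, so each lookup touches O(log n) blocks.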
Anyway, I don't say that the project is impossible or unnecessary, but there are lots of
tradeoffs to be made - what kind of real-time querying workloads are to be expected, what
kind of pre-filtering people expect, etc.
Of course, we could always use OWA.
Domas
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
--
Andrew G. West, Doctoral Student
Dept. of Computer and Information Science
University of Pennsylvania, Philadelphia PA
Email: westand(a)cis.upenn.edu
Website: