[...] daily into a MySQL database for over a
year now. I also store statistics at hour-granularity, whereas
[stats.grok.se] stores them at day granularity, it seems.
I only do this for en.wiki, and it's certainly not efficient enough to
open up for public use. However, I'd be willing to chat and share code
with any interested developer. The strategy and schema are a bit
awkward, but it works, and requires on average ~2 hours of processing
to store 24 hours' worth of statistics.
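A minimal sketch of what such an hour-granularity table might look like, using SQLite for illustration — the table and column names here are hypothetical, not AW's actual schema:

```python
import sqlite3

# Hypothetical hour-granularity schema -- an illustration, not AW's actual one.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE hourly_views (
        page_title TEXT NOT NULL,
        view_hour  TEXT NOT NULL,    -- e.g. '2011-08-12T04'
        view_count INTEGER NOT NULL,
        PRIMARY KEY (page_title, view_hour)
    )
""")
conn.execute("INSERT INTO hourly_views VALUES ('Main_Page', '2011-08-12T04', 12345)")

# Day-granularity figures (what stats.grok.se shows) then become an aggregate:
row = conn.execute("""
    SELECT page_title, substr(view_hour, 1, 10) AS view_day, SUM(view_count)
    FROM hourly_views
    GROUP BY page_title, view_day
""").fetchone()
print(row)  # ('Main_Page', '2011-08-12', 12345)
```

Storing at hour granularity costs 24x the rows but lets day totals be derived, not the other way around.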
Thanks, -AW
On 08/12/2011 04:49 AM, Domas Mituzas wrote:
Hi!
> Currently, if you want data on, for example, every article on the English
> Wikipedia, you'd have to make 3.7 million individual HTTP requests to
> Henrik's tool. At one per second, you're looking at over a month's worth of
> continuous fetching. This is obviously not practical.
Or you can download raw data.
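The month-plus estimate above is simple arithmetic:

```python
# 3.7 million articles, one request per second, 86,400 seconds per day.
requests = 3_700_000
seconds_per_day = 86_400
days = requests / seconds_per_day
print(f"{days:.1f} days")  # roughly 43 days of continuous fetching
```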
> A lot of people were waiting on Wikimedia's Open Web Analytics work to come
> to fruition, but it seems that has been indefinitely put on hold. (Is that
> right?)
That project was pulsing with naivety, if it ever had to be applied to the wide scope of
all projects ;-)
> Is it worth a Toolserver user's time to try to create a database of
> per-project, per-page page view statistics?
Creating such a database is easy; making it efficient is a bit different :-)
> And, of course, it wouldn't be a bad idea if Domas' first-pass
> implementation was improved on Wikimedia's side, regardless.
My implementation is for obtaining raw data from our squid tier; what is wrong with it?
Generally, I have had ideas about making a queryable data source - it isn't impossible
given a decent mix of data structures ;-)
> Thoughts and comments welcome on this. There's a lot of desire to have a
> usable system.
Sure - it would be interesting to hear what people think could be useful to do with the
dataset; we may facilitate it.
> But short of believing that in December 2010 "User Datagram Protocol" was
> more interesting to people than Julian Assange, you would need some other
> data source to make good statistics.
Yeah - "lies, damned lies, and statistics". So we need statistics better than a full
sample of page views (adjusted for Wikipedian geekiness?) just because you don't believe
that general-purpose wiki articles people can use in their work can be more popular than
some random guy on the internet and trivia about him.
Dracula is also more popular than Julian Assange, so is Jenna Jameson ;-)
Unfortunately, every time you add the ability to spam something, people will spam.
There's also unintentional crap that ends up in HTTP requests because of broken clients.
It is easy to filter that out in postprocessing, if you want, by applying an
article-exists Bloom filter ;-)
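A sketch of that article-exists filter: a minimal Bloom filter built over the set of real titles, used to drop garbage request-log lines in postprocessing. The parameters and titles below are illustrative, not Domas' implementation:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter -- an illustrative sketch, not Domas' implementation."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

# Build the filter from titles known to exist...
existing = BloomFilter()
for title in ("Dracula", "Julian_Assange", "User_Datagram_Protocol"):
    existing.add(title)

# ...then drop request-log lines whose title cannot be a real article.
log_lines = ["Dracula 810", "XxTotallyFakePagexX 999999", "Julian_Assange 512"]
kept = [line for line in log_lines if line.split()[0] in existing]
print(kept)  # false positives are possible, but no real article is ever dropped
```

The appeal for a dataset of this scale is that the filter is a fixed-size bitmap: millions of titles fit in a few megabytes, at the cost of a small false-positive rate.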
> If the stats.grok.se data actually captures nearly all requests, then I am
> not sure you realize how low the figures are.
Low they are - Wikipedia's content is all about a very long tail of data, besides a
heavily accessed head. Just graph the top-100 or top-1000 and you will see the shape of
the curve:
https://docs.google.com/spreadsheet/pub?hl=en_US&key=0AtHDNfVx0WNhdGhWV…
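The head/tail shape Domas describes is easy to reproduce with synthetic Zipf-distributed counts (an illustration only; the linked spreadsheet has the real numbers):

```python
# Synthetic Zipf-like view counts: the page at rank r gets roughly C / r views.
N = 1_000_000                          # a made-up article count
views = [1_000_000 // r for r in range(1, N + 1)]
total = sum(views)
head = sum(views[:1000])
ratio = head / total
print(f"top 1000 of {N:,} pages draw {100 * ratio:.0f}% of all views")
```

Under this toy distribution, a thousand pages out of a million account for roughly half of all traffic, which is the "heavily accessed head" in front of the long tail.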
> As someone with most of the skills and resources (with the exception of
> time, possibly) to create a page view stats database, reading something
> like this makes me think... Wow.
> Yes, the data is susceptible to manipulation, both intentional and
> unintentional.
I wonder how someone with most of the skills and resources wants to solve this problem
(besides the aforementioned article-exists filter, which could reduce the dataset quite a
lot ;-)
> ... you can begin doing real analysis work. Currently, this really isn't
> possible, and that's a Bad Thing.
Raw data allows you to do whatever analysis you want. Shove it into SPSS/R/.. ;-)
Statistics much?
> The main bottleneck has been that, like MZMcBride mentions, an underlying
> database of page view data is unavailable.
The underlying database is available, just not in an easily queryable format. There's a
distinction there, unless you all imagine a database as something you send SQL to and it
gives you data back. Sorted files are databases too ;-)
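To make the "sorted files are databases" point concrete: a sorted run of (title, views) records answers point lookups with binary search, no SQL server involved. The titles and counts below are made up for illustration:

```python
import bisect

# A sorted run of (title, views) records -- a "database" queried by binary
# search. Titles and counts are invented for this example.
records = sorted([
    ("User_Datagram_Protocol", 431),
    ("Julian_Assange", 512),
    ("Jenna_Jameson", 755),
    ("Dracula", 810),
])
titles = [title for title, _ in records]

def lookup(title):
    i = bisect.bisect_left(titles, title)
    if i < len(titles) and titles[i] == title:
        return records[i][1]
    return None  # title not in the dataset

print(lookup("Dracula"))       # 810
print(lookup("No_Such_Page"))  # None
```

The same idea scales to on-disk sorted files: seek to the midpoint, read a record, and halve the range, so each lookup touches O(log n) blocks.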
Anyway, I don't say that the project is impossible or unnecessary, but there are lots of
tradeoffs to be made - what kind of real-time querying workloads are to be expected, what
kind of pre-filtering people expect, etc.
Of course, we could always use OWA.
Domas
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
--
Andrew G. West, Doctoral Student
Dept. of Computer and Information Science
University of Pennsylvania, Philadelphia PA
Email: westand(a)cis.upenn.edu
Website: