Domas Mituzas wrote:
Hi!
Hi!
Currently, if you want data on, for example, every article on the English Wikipedia, you'd have to make 3.7 million individual HTTP requests to Henrik's tool. At one per second, you're looking at over a month's worth of continuous fetching. This is obviously not practical.
Or you can download raw data.
Downloading gigs and gigs of raw data and then processing it yourself is generally even less practical for end users.
Is it worth a Toolserver user's time to try to create a database of per-project, per-page page view statistics?
Creating such a database is easy; making it efficient is a bit different. :-)
Any tips? :-) My thoughts were that the schema used by the GlobalUsage extension might be reusable here (storing wiki, page namespace ID, page namespace name, and page title).
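Something along these lines is what I'm picturing, with one row per page per day (or per hour). The table and column names here are just placeholders, not anything GlobalUsage actually defines, and SQLite is only used for illustration:

# Placeholder schema sketch; a Toolserver database would presumably be
# MySQL, but the shape of the table is the same.
import sqlite3

conn = sqlite3.connect("pageviews.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS page_views (
    pv_wiki           TEXT    NOT NULL,  -- e.g. 'enwiki'
    pv_namespace_id   INTEGER NOT NULL,  -- numeric namespace ID, 0 for articles
    pv_namespace_name TEXT    NOT NULL,  -- namespace name, '' for the main namespace
    pv_title          TEXT    NOT NULL,  -- page title without the namespace prefix
    pv_date           TEXT    NOT NULL,  -- day (or hour) the counts cover
    pv_views          INTEGER NOT NULL   -- number of requests seen
);

-- Most lookups will be by wiki + title (+ date range), so index on that.
CREATE INDEX IF NOT EXISTS pv_wiki_title_date
    ON page_views (pv_wiki, pv_namespace_id, pv_title, pv_date);
""")
conn.commit()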
And, of course, it wouldn't be a bad idea if Domas' first-pass implementation was improved on Wikimedia's side, regardless.
My implementation is for obtaining raw data from our Squid tier; what is wrong with it? Generally I've had ideas about making a queryable data source - it isn't impossible given a decent mix of data structures ;-)
Well, more documentation is always a good thing. I'd start there.
As I recall, the system for determining which domain a request went to is a bit esoteric, and it might be worth the cost to store the whole domain name in order to cover edge cases (labs wikis, wikimediafoundation.org, *.wikimedia.org, etc.).
There's some sort of distinction between the projectcounts and pagecounts files (again, one that needs documentation) that could probably stand to be eliminated or simplified.
But the biggest improvement would be post-processing (cleaning up) the source files. Right now if there are anomalies in the data, every re-user is expected to find and fix these on their own. It's _incredibly_ inefficient for everyone to adjust the data (for encoding strangeness, for bad clients, for data manipulation, for page existence possibly, etc.) rather than having the source files come out cleaner.
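To make that concrete, here's the kind of per-line cleanup that every re-user currently ends up writing for themselves. It assumes the usual space-separated "project title count bytes" lines and is only a rough sketch:

# Rough sketch of the per-line cleanup every consumer has to reinvent.
from urllib.parse import unquote

def clean_line(line):
    """Return (project, title, views) for a usable line, or None to drop it."""
    parts = line.rstrip("\n").split(" ")
    if len(parts) != 4:
        return None                      # malformed or truncated line
    project, raw_title, count, _bytes = parts
    if not count.isdigit():
        return None                      # garbage in the counter field
    # Titles come percent-encoded, and some are mangled or double-encoded,
    # so decode and sanity-check the result.
    title = unquote(raw_title).replace("_", " ")
    if not title or "\n" in title or "\t" in title:
        return None
    return project, title, int(count)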
I think your first-pass was great. But I also think it could be improved. :-)
As someone with most of the skills and resources (with the exception of time, possibly) to create a page view stats database, reading something like this makes me think...
Wow.
I meant that it wouldn't be very difficult to write a script to take the raw data and put it into a public database on the Toolserver (which probably has enough hardware resources for this project currently). It's maintainability and sustainability that are the bigger concerns. Once you create a public database for something like this, people will want it to stick around indefinitely. That's quite a load to take on.
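For what it's worth, the script itself really is the easy part. Here's a rough sketch of the sort of loader I mean, reusing the placeholder schema and clean_line() sketches above, and assuming the files are the usual gzipped hourly dumps:

# Rough loader sketch: stream one raw gzipped dump into the placeholder
# page_views table; clean_line() is the cleanup helper sketched earlier.
import gzip

def load_dump(conn, dump_path, date):
    rows = []
    with gzip.open(dump_path, "rt", encoding="utf-8", errors="replace") as dump:
        for line in dump:
            cleaned = clean_line(line)
            if cleaned is None:
                continue                 # drop anomalies at load time, once
            project, title, views = cleaned
            # Namespace splitting is skipped here; everything is loaded as
            # main-namespace for the sake of the sketch.
            rows.append((project, 0, "", title, date, views))
            if len(rows) >= 10000:       # keep the inserts batched
                conn.executemany(
                    "INSERT INTO page_views VALUES (?, ?, ?, ?, ?, ?)", rows)
                rows = []
    if rows:
        conn.executemany("INSERT INTO page_views VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()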
I'm also likely being incredibly naïve, though I did note somewhere that it wouldn't be a particularly small undertaking to do this project well.
Yes, the data is susceptible to manipulation, both intentional and unintentional.
I wonder how someone with most of the skills and resources wants to solve this problem (besides the aforementioned article-exists filter, which could reduce the dataset quite a lot ;)
I'd actually say that having data for non-existent pages is a feature, not a bug. There's potential there to catch future redirects and new pages, I imagine.
... you can begin doing real analysis work. Currently, this really isn't possible, and that's a Bad Thing.
Raw data allows you to do whatever analysis you want. Shove it into SPSS/R/.. ;-) Statistics much?
A user wants to analyze a category with 100 members for the page view data of each category member. You think it's a Good Thing that the user has to first spend countless hours processing gigabytes of raw data in order to do that analysis? It's a Very Bad Thing. And the people who are capable of doing analysis aren't always the ones capable of writing the scripts and the schemas necessary to get the data into a usable form.
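With a queryable database, that whole analysis collapses into a hundred indexed lookups, something like this (again using the placeholder page_views table from earlier; category membership could come from the API or a replicated categorylinks table):

# Hypothetical example: per-page view totals for the members of a category,
# against the placeholder page_views table sketched earlier.
def views_for_category(conn, wiki, member_titles, start, end):
    """Return {title: total views between start and end} for each member."""
    totals = {}
    for title in member_titles:
        row = conn.execute(
            """SELECT COALESCE(SUM(pv_views), 0)
                 FROM page_views
                WHERE pv_wiki = ? AND pv_namespace_id = 0
                  AND pv_title = ? AND pv_date BETWEEN ? AND ?""",
            (wiki, title, start, end),
        ).fetchone()
        totals[title] = row[0]
    return totals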
The main bottleneck has been that, like MZMcBride mentions, an underlying database of page view data is unavailable.
The underlying database is available, just not in an easily queryable format. There's a distinction there, unless you all imagine a database as something you send SQL to and it gives you data back. Sorted files are databases too ;-)
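To illustrate (only a rough sketch, and it assumes one space-separated "project title count bytes" line per page, sorted bytewise): a single-title lookup against such a sorted file is just a binary search on byte offsets, no SQL required.

# Rough sketch: look one title up in a sorted per-hour file by binary
# searching on byte offsets, without loading the file into anything.
import os

def lookup(path, project, title):
    """Return the view count for (project, title), or None if absent."""
    key = f"{project} {title} ".encode("utf-8")

    def line_at(f, offset):
        # First full line starting at or after `offset`.
        f.seek(offset)
        if offset:
            f.readline()                 # skip the partial line we landed in
        return f.readline()

    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        while lo < hi:                   # find the first line sorting >= key
            mid = (lo + hi) // 2
            line = line_at(f, mid)
            if line and line < key:
                lo = mid + 1
            else:
                hi = mid
        candidate = line_at(f, lo)
        if candidate.startswith(key):
            return int(candidate.split(b" ")[2])
    return None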
The reality is that, for most users, a large pile of data that isn't easily queryable is effectively equivalent to no data at all. Echoing what I said earlier, it doesn't make much sense for people to be continually forced to reinvent the wheel (post-processing the raw data and putting it into a queryable format).
MZMcBride