Downloading gigs and gigs of raw data and then processing it is generally impractical for end users.
You were talking about 3.7M articles. :) It is way more practical than working with pointwise APIs though :-)
Any tips? :-) My thoughts were that the schema used by the GlobalUsage extension might be reusable here (storing wiki, page namespace ID, page namespace name, and page title).
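Something along these lines, maybe (table and column names are made up, and sqlite3 is here only to keep the sketch self-contained):

    # Hypothetical schema sketch, loosely mirroring GlobalUsage-style columns.
    import sqlite3

    conn = sqlite3.connect("pageviews.db")
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS page_views (
            pv_wiki           TEXT    NOT NULL,  -- e.g. 'enwiki'
            pv_namespace_id   INTEGER NOT NULL,
            pv_namespace_name TEXT    NOT NULL,  -- e.g. 'Talk', '' for mainspace
            pv_title          TEXT    NOT NULL,
            pv_day            TEXT    NOT NULL,  -- 'YYYY-MM-DD'
            pv_views          INTEGER NOT NULL,
            PRIMARY KEY (pv_wiki, pv_namespace_id, pv_title, pv_day)
        );
    """)
    conn.commit()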
I don't know what GlobalUsage does, but probably it is all wrong ;-)
As I recall, the system of determining which domain a request went to is a bit esoteric, and it might be worth the cost to store the whole domain name in order to cover edge cases (labs wikis, wikimediafoundation.org, *.wikimedia.org, etc.).
*shrug*, maybe. If I ran a second pass I'd aim for a cache-oblivious system with compressed data both on-disk and in-cache (currently it is a b-tree with standard b-tree costs). Then we could actually store more data ;-) Do note there are _lots_ of data items, and increasing the per-item cost may quadruple resource usage ;-)
Otoh, expanding project names is straightforward, if you know how.
There's some sort of distinction between projectcounts and pagecounts (again with little documentation) that could probably stand to be eliminated or simplified.
projectcounts are aggregated by project, pagecounts are aggregated by page. If you looked at the data it should be obvious ;-) And yes, probably the best documentation was in some email somewhere. I should've started a decent project with descriptions and support and whatever. Maybe that will happen once we move data distribution back into WMF proper; there's no need for it to live somewhere in Germany nowadays.
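For instance, a pagecounts line is roughly "<project> <title> <requests> <bytes>", and expanding the project code goes something like this (the suffix table below is from memory and not exhaustive, so treat it as indicative only):

    # Rough sketch: parse one pagecounts line and expand its project code.
    SUFFIXES = {
        "b": "wikibooks.org",
        "d": "wiktionary.org",
        "m": "wikimedia.org",   # e.g. 'commons.m', 'meta.m'
        "n": "wikinews.org",
        "q": "wikiquote.org",
        "s": "wikisource.org",
        "v": "wikiversity.org",
    }

    def expand_project(code):
        """Map e.g. 'en' -> 'en.wikipedia.org', 'de.b' -> 'de.wikibooks.org'."""
        lang, _, suffix = code.partition(".")
        if not suffix:
            return lang + ".wikipedia.org"
        return lang + "." + SUFFIXES.get(suffix, suffix)

    project, title, requests, size = "de.b Spezial:Suche 3 24981".split(" ")
    print(expand_project(project), title, int(requests))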
But the biggest improvement would be post-processing (cleaning up) the source files. Right now, if there are anomalies in the data, every re-user is expected to find and fix them on their own. It's _incredibly_ inefficient for everyone to adjust the data (for encoding strangeness, for bad clients, for data manipulation, possibly for page existence, etc.) rather than having the source files come out cleaner.
Raw data is fascinating in that regard though - one can see which clients are bad, what the anomalies are, how they encode titles, which titles are erroneous, etc. There are zillions of ways to do post-processing, and none of them will match the needs of every user.
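For instance, one of those zillion passes might look like this (a sketch of the kind of normalization re-users end up writing, nothing canonical):

    # Sketch: percent-decode titles and drop obviously broken pagecounts lines.
    from urllib.parse import unquote

    def clean(line):
        parts = line.rstrip("\n").split(" ")
        if len(parts) != 4 or not parts[2].isdigit():
            return None                            # malformed line
        project, title, requests, _size = parts
        title = unquote(title).replace(" ", "_")   # undo percent-encoding
        if not title or "#" in title:              # empty titles, fragments, ...
            return None
        return project, title, int(requests)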
I think your first pass was great. But I also think it could be improved. :-)
Sure, it can be improved in many ways, including more data: some people ask for (page, geography) aggregations, though with our long tail that means huuuuuge dataset growth ;-)
I meant that it wouldn't be very difficult to write a script to take the raw data and put it into a public database on the Toolserver (which probably has enough hardware resources for this project currently).
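Roughly like this sketch (the file name follows the pagecounts naming as I understand it, and sqlite3 only stands in for whatever database the Toolserver would really use):

    # Sketch of a loader: one hourly pagecounts file into one table.
    import gzip, sqlite3

    conn = sqlite3.connect("pagecounts.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS raw_counts (
        project TEXT, title TEXT, day TEXT, hour INTEGER, requests INTEGER)""")

    def load(path, day, hour):
        rows = []
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                parts = line.split(" ")
                if len(parts) != 4 or not parts[2].isdigit():
                    continue                      # skip malformed lines
                project, title, requests, _size = parts
                rows.append((project, title, day, hour, int(requests)))
        conn.executemany("INSERT INTO raw_counts VALUES (?, ?, ?, ?, ?)", rows)
        conn.commit()

    load("pagecounts-20091001-000000.gz", "2009-10-01", 0)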
I doubt the Toolserver has enough resources to have this data thrown at it and then queried on top, unless you simplify the needs a lot. There's 5G of raw uncompressed data per day in text form, and the long tail makes caching quite painful unless you go for cache-oblivious methods.
It's maintainability and sustainability that are the bigger concerns. Once you create a public database for something like this, people will want it to stick around indefinitely. That's quite a load to take on.
I'd love to see all the data preserved indefinitely. It is one of the most interesting datasets around, and its value for the future is quite incredible.
I'm also likely being incredibly naïve, though I did note somewhere that it wouldn't be a particularly small undertaking to do this project well.
Well, the initial work took a few hours ;-) I guess by spending a few more hours we could improve that, if we really knew what we want.
I'd actually say that having data for non-existent pages is a feature, not a bug. There's potential there to catch future redirects and new pages, I imagine.
That is one of the reasons we don't eliminate that data from the raw dataset now. I don't see it as a bug; I just think that for long-term aggregations that data could be omitted.
Say a user wants to analyze a category with 100 members for the page view data of each member. Do you think it's a Good Thing that the user has to first spend countless hours processing gigabytes of raw data in order to do that analysis? It's a Very Bad Thing. And the people who are capable of doing the analysis aren't always the ones capable of writing the scripts and the schemas necessary to get the data into a usable form.
No, I think we should have an API to that data, so that small sets of data can be fetched without much pain.
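Something small, say a lookup like this (just a sketch, reusing the hypothetical raw_counts table from the loader sketch above):

    # Sketch: the kind of query such an API could answer cheaply.
    import sqlite3

    def page_views(conn, project, title, start_day, end_day):
        """Return [(day, requests)] for one page over a small date range."""
        return conn.execute(
            """SELECT day, SUM(requests) FROM raw_counts
               WHERE project = ? AND title = ? AND day BETWEEN ? AND ?
               GROUP BY day ORDER BY day""",
            (project, title, start_day, end_day)).fetchall()

    conn = sqlite3.connect("pagecounts.db")
    print(page_views(conn, "en", "Main_Page", "2009-10-01", "2009-10-07"))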
The reality is that a large pile of data that's not easily queryable is directly equivalent to no data at all, for most users. Echoing what I said earlier, it doesn't make much sense for people to be continually forced to reinvent the wheel (post-processing raw data and putting it into a queryable format).
I agree. By opening up the dataset I expected others to build upon it and create services. Apparently that doesn't happen. Since lots of people use the data, I guess there is a need for it, but not enough will to build anything for others to use, so it will end up being created inside WMF proper.
Building a service where data would be shown on every article is a rather different task from just supporting analytical workloads. Building a queryable service has been on my todo list for a while, but there were too many initiatives around suggesting that someone else would do it ;-)
Domas