Downloading gigs and gigs of raw data and then processing it is generally impractical for end users.
You were talking about 3.7M articles. :) It is way more practical than working with pointwise APIs though :-)
Any tips? :-) My thoughts were that the schema used by the GlobalUsage extension might be reusable here (storing wiki, page namespace ID, page namespace name, and page title).
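Something along these lines, maybe (table and column names are made up, and sqlite3 is here only to keep the sketch self-contained):

    # Hypothetical schema sketch, loosely mirroring GlobalUsage-style columns.
    import sqlite3

    conn = sqlite3.connect("pageviews.db")
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS page_views (
            pv_wiki           TEXT    NOT NULL,  -- e.g. 'enwiki'
            pv_namespace_id   INTEGER NOT NULL,
            pv_namespace_name TEXT    NOT NULL,  -- e.g. 'Talk', '' for mainspace
            pv_title          TEXT    NOT NULL,
            pv_day            TEXT    NOT NULL,  -- 'YYYY-MM-DD'
            pv_views          INTEGER NOT NULL,
            PRIMARY KEY (pv_wiki, pv_namespace_id, pv_title, pv_day)
        );
    """)
    conn.commit()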
I don't know what GlobalUsage does, but probably it is all wrong ;-)
As I recall, the system of determining which domain a request went to is a bit esoteric, and it might be worth the cost to store the whole domain name in order to cover edge cases (labs wikis, wikimediafoundation.org, *.wikimedia.org, etc.).
*shrug*, maybe. If I ran a second pass I'd aim for a cache-oblivious system with compressed data both on-disk and in-cache (currently it is a b-tree with standard b-tree costs). Then we could actually store more data ;-) Do note there are _lots_ of data items, and increasing the per-item cost may quadruple resource usage ;-)
Otoh, expanding project names is straightforward, if you know how.
There's some sort of distinction between projectcounts and pagecounts (again with little documentation) that could probably stand to be eliminated or simplified.
projectcounts are aggregated by project, pagecounts are aggregated by page. If you looked at the data it should be obvious ;-) And yes, probably the best documentation was in some email somewhere. I should've started a decent project with descriptions and support and whatever. Maybe that will happen once we move data distribution back into WMF proper; there's no need for it to live somewhere in Germany nowadays.
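For instance, a pagecounts line is roughly "<project> <title> <requests> <bytes>", and expanding the project code goes something like this (the suffix table below is from memory and not exhaustive, so treat it as indicative only):

    # Rough sketch: parse one pagecounts line and expand its project code.
    SUFFIXES = {
        "b": "wikibooks.org",
        "d": "wiktionary.org",
        "m": "wikimedia.org",   # e.g. 'commons.m', 'meta.m'
        "n": "wikinews.org",
        "q": "wikiquote.org",
        "s": "wikisource.org",
        "v": "wikiversity.org",
    }

    def expand_project(code):
        """Map e.g. 'en' -> 'en.wikipedia.org', 'de.b' -> 'de.wikibooks.org'."""
        lang, _, suffix = code.partition(".")
        if not suffix:
            return lang + ".wikipedia.org"
        return lang + "." + SUFFIXES.get(suffix, suffix)

    project, title, requests, size = "de.b Spezial:Suche 3 24981".split(" ")
    print(expand_project(project), title, int(requests))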
But the biggest improvement would be post-processing (cleaning up) the source files. Right now, if there are anomalies in the data, every re-user is expected to find and fix them on their own. It's _incredibly_ inefficient for everyone to adjust the data (for encoding strangeness, for bad clients, for data manipulation, possibly for page existence, etc.) rather than having the source files come out cleaner.
Raw data is fascinating in that regard though - one can see which clients are bad, what the anomalies are, how they encode titles, which titles are erroneous, etc. There are zillions of ways to do post-processing, and none of them will match the needs of every user.
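For instance, one of those zillion passes might look like this (a sketch of the kind of normalization re-users end up writing, nothing canonical):

    # Sketch: percent-decode titles and drop obviously broken pagecounts lines.
    from urllib.parse import unquote

    def clean(line):
        parts = line.rstrip("\n").split(" ")
        if len(parts) != 4 or not parts[2].isdigit():
            return None                            # malformed line
        project, title, requests, _size = parts
        title = unquote(title).replace(" ", "_")   # undo percent-encoding
        if not title or "#" in title:              # empty titles, fragments, ...
            return None
        return project, title, int(requests)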
I think your first pass was great. But I also think it could be improved. :-)
Sure, it can be improved in many ways, including more data: some people ask for (page, geography) aggregations, though with our long tail that means huuuuuge dataset growth ;-)
I meant that it wouldn't be very difficult to write a script to take the raw data and put it into a public database on the Toolserver (which probably has enough hardware resources for this project currently).
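Roughly like this sketch (the file name follows the pagecounts naming as I understand it, and sqlite3 only stands in for whatever database the Toolserver would really use):

    # Sketch of a loader: one hourly pagecounts file into one table.
    import gzip, sqlite3

    conn = sqlite3.connect("pagecounts.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS raw_counts (
        project TEXT, title TEXT, day TEXT, hour INTEGER, requests INTEGER)""")

    def load(path, day, hour):
        rows = []
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                parts = line.split(" ")
                if len(parts) != 4 or not parts[2].isdigit():
                    continue                      # skip malformed lines
                project, title, requests, _size = parts
                rows.append((project, title, day, hour, int(requests)))
        conn.executemany("INSERT INTO raw_counts VALUES (?, ?, ?, ?, ?)", rows)
        conn.commit()

    load("pagecounts-20091001-000000.gz", "2009-10-01", 0)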
I doubt the Toolserver has enough resources to have this data thrown at it and then queried on top, unless you simplify the needs a lot. There's 5G of raw uncompressed data per day in text form, and the long tail makes caching quite painful unless you go for cache-oblivious methods.
It's maintainability and sustainability that are the bigger concerns. Once you create a public database for something like this, people will want it to stick around indefinitely. That's quite a load to take on.
I'd love to see all the data preserved indefinitely. It is one of the most interesting datasets around, and its value for the future is quite incredible.
I'm also likely being incredibly naïve, though I did note somewhere that it wouldn't be a particularly small undertaking to do this project well.
Well, the initial work took a few hours ;-) I guess by spending a few more hours we could improve that, if we really knew what we want.
I'd actually say that having data for non-existent pages is a feature, not a bug. There's potential there to catch future redirects and new pages, I imagine.
That is one of the reasons we don't eliminate that data from the raw dataset now. I don't see it as a bug; I just think that for long-term aggregations that data could be omitted.
Say a user wants to analyze a category with 100 members for the page view data of each member. Do you think it's a Good Thing that the user has to first spend countless hours processing gigabytes of raw data in order to do that analysis? It's a Very Bad Thing. And the people who are capable of doing the analysis aren't always the ones capable of writing the scripts and the schemas necessary to get the data into a usable form.
No, I think we should have an API to that data, so that small sets of data can be fetched without much pain.
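Something small, say a lookup like this (just a sketch, reusing the hypothetical raw_counts table from the loader sketch above):

    # Sketch: the kind of query such an API could answer cheaply.
    import sqlite3

    def page_views(conn, project, title, start_day, end_day):
        """Return [(day, requests)] for one page over a small date range."""
        return conn.execute(
            """SELECT day, SUM(requests) FROM raw_counts
               WHERE project = ? AND title = ? AND day BETWEEN ? AND ?
               GROUP BY day ORDER BY day""",
            (project, title, start_day, end_day)).fetchall()

    conn = sqlite3.connect("pagecounts.db")
    print(page_views(conn, "en", "Main_Page", "2009-10-01", "2009-10-07"))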
The reality is that a large pile of data that's not easily queryable is directly equivalent to no data at all, for most users. Echoing what I said earlier, it doesn't make much sense for people to be continually forced to reinvent the wheel (post-processing raw data and putting it into a queryable format).
I agree. By opening up the dataset I expected others to build upon it and create services. Apparently that doesn't happen. Since lots of people use the data, I guess there is a need for it, but not enough will to build anything for others to use, so it will end up being created inside WMF proper.
Building a service where data would be shown on every article is a rather different task from just supporting analytical workloads. Building a queryable service has been on my todo list for a while, but there were too many initiatives around suggesting that someone else would do it ;-)
Domas