Hey, Domas! Firstly, sorry for confusing you with Dario earlier. I am so very bad with names. :)
Secondly, thank you for putting together the data we have today. I'm not sure if anyone's mentioned it lately, but it's clearly a really useful thing. I think that's why we're having this conversation now: what's been learned about potential use cases, and how can we make this excellent resource even more valuable?
Any tips? :-) My thoughts were that the schema used by the GlobalUsage extension might be reusable here (storing wiki, page namespace ID, page namespace name, and page title).
I don't know what GlobalUsage does, but probably it is all wrong ;-)
Here's an excerpt from the README:
"When using a shared image repository, it is impossible to see within MediaWiki whether a file is used on one of the slave wikis. On Wikimedia this is handled by the CheckUsage tool on the toolserver, but it is merely a hack of function that should be built in.
"GlobalUsage creates a new table globalimagelinks, which is basically the same as imagelinks, but includes the usage of all images on all associated wikis."
The database table itself is about what you'd imagine. It's approximately the metadata we'd need to uniquely identify an article, but it seems to be solving a rather different problem. Uniquely identifying an article is certainly necessary, but I don't think it's the hard part.
I'm not sure that MySQL is the place to store this data--it's big and has few dimensions. Since we'd have to make external queries available through an API anyway, why not back it with the right storage engine?
[...]
projectcounts are aggregated by project, pagecounts are aggregated by page. If you looked at the data it should be obvious ;-) And yes, probably the best documentation was in some email somewhere. I should've started a decent project with descriptions and support and whatever. Maybe once we move data distribution back into WMF proper, there will be no need for it to live somewhere in Germany, as it does nowadays.
The documentation needed here seems pretty straightforward. Like, a file at http://dammit.lt/wikistats/README that just explains the format of the data, what's included, and what's not. We've covered most of it in this thread already. All that's left is a basic explanation of what each field means in pagecounts/projectcounts. If you tell me these things, I'll even write it. :)
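In the meantime, here's my current best guess at how to read the pagecounts lines, as a quick Python sketch; I'm assuming the four fields are project code, page title, hourly view count, and bytes transferred, so correct me if any of that is wrong:

    # Assumed line format: "<project> <page_title> <view_count> <bytes_transferred>"
    # e.g. "en Main_Page 42 1234567" -- purely my guess at the field meanings.
    def parse_pagecounts(path):
        with open(path, encoding='utf-8', errors='replace') as f:
            for line in f:
                parts = line.rstrip('\n').split(' ')
                if len(parts) != 4:
                    continue  # skip malformed lines rather than guessing
                project, title, views, byte_count = parts
                yield project, title, int(views), int(byte_count)

    # e.g. total views per project for one hourly file
    from collections import Counter
    totals = Counter()
    for project, _title, views, _bytes in parse_pagecounts('pagecounts-20091001-000000'):
        totals[project] += views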
But the biggest improvement would be post-processing (cleaning up) the source files. Right now if there are anomalies in the data, every re-user is expected to find and fix these on their own. It's _incredibly_ inefficient for everyone to adjust the data (for encoding strangeness, for bad clients, for data manipulation, for page existence possibly, etc.) rather than having the source files come out cleaner.
Raw data is fascinating in that regard though - one can see what the bad clients are, what the anomalies are, how they encode titles, what the erroneous titles are, etc. There are zillions of ways to do post-processing, and none of them will match the needs of every user.
Oh, totally! However, I think some uses are more common than others. I bet this covers them:
1. View counts for a subset of existing articles over a range of dates.
2. Sorted/limited aggregate stats (top 100, bottom 50, etc.) for a subset of articles and date range.
3. Most popular non-existing (missing) articles for a project.
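To make that concrete, here's a rough sketch of the query shapes I have in mind, assuming a hypothetical pre-aggregated table daily_views(project, title, day, views) built from the raw files (the schema and names are placeholders, not a proposal):

    import sqlite3  # stand-in for whatever storage engine we actually pick

    db = sqlite3.connect('pageviews.db')

    # 1. View counts for a subset of articles over a date range
    per_day = db.execute(
        "SELECT title, day, views FROM daily_views"
        " WHERE project = ? AND title IN (?, ?) AND day BETWEEN ? AND ?",
        ('en', 'Cat', 'Dog', '2009-09-01', '2009-09-30')).fetchall()

    # 2. Sorted/limited aggregates (e.g. top 100) over the same range
    top_100 = db.execute(
        "SELECT title, SUM(views) AS total FROM daily_views"
        " WHERE project = ? AND day BETWEEN ? AND ?"
        " GROUP BY title ORDER BY total DESC LIMIT 100",
        ('en', '2009-09-01', '2009-09-30')).fetchall()

    # 3. Most popular missing articles would need an "exists" flag (or a join
    #    against the page table) added during the post-processing step.

Number 3 is the one that really depends on cleanup at the source, since page existence isn't in the raw files at all.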
I feel like making those things easier would be awesome, and raw data would still be available for anyone who wants to build something else. I think Domas's dataset is great, and the above should be based on it.
Sure, it can be improved in many ways, including more data (some people ask for (page, geography) aggregations, though with our long tail that is huuuuuge dataset growth ;-)
Absolutely. I think it makes sense to start by making the existing data more usable, and then potentially add more to it in the future.
I meant that it wouldn't be very difficult to write a script to take the raw data and put it into a public database on the Toolserver (which probably has enough hardware resources for this project currently).
I doubt the Toolserver has enough resources to have this data thrown at it and queried on top of that, unless you simplify the needs a lot. There's 5 GB of raw uncompressed data per day in text form, and the long tail makes caching quite painful, unless you go for cache-oblivious methods.
Yeah. The folks at trendingtopics.org are processing it all on an EC2 Hadoop cluster, and throwing the results in a SQL database. They have a very specific focus, though, so their methods might not be appropriate here. They're an excellent example of someone using the existing dataset in an interesting way, but the fact that they're using EC2 is telling: many people do not have the expertise to handle that sort of thing.
I think building an efficiently queryable set of all historic data is unrealistic without a separate cluster. We're talking 100GB/year, before indexing, which is about 400GB if we go back to 2008. I can imagine a workable solution that discards resolution as time passes, which is what most web stats generation packages do anyway. Here's an example:
- Daily counts (and maybe hour-of-day averages) going back one month (~10GB)
- Weekly counts, day-of-week and hour-of-day averages going back six months (~10GB)
- Monthly stats (including averages) forever (~4GB/year)
That data could be kept in RAM, hashed across two machines, if we really wanted it to be fast. That's probably not necessary, but you get my point.
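For illustration, a minimal sketch of the roll-up/expiry pass I'm imagining, in Python; it only handles plain view counts (ignoring the hour-of-day averages), and every name in it is made up:

    from datetime import timedelta

    DAILY_RETENTION = timedelta(days=31)    # per-day rows kept ~1 month
    WEEKLY_RETENTION = timedelta(days=183)  # per-week rows kept ~6 months

    def roll_up(daily, weekly, monthly, today):
        """All three tiers are dicts of {(title, period_start_date): views}."""
        # Fold per-day rows older than a month into their week's bucket.
        for (title, day), views in list(daily.items()):
            if today - day > DAILY_RETENTION:
                week = day - timedelta(days=day.weekday())  # Monday of that week
                weekly[(title, week)] = weekly.get((title, week), 0) + views
                del daily[(title, day)]
        # Fold per-week rows older than six months into their month's bucket.
        for (title, week), views in list(weekly.items()):
            if today - week > WEEKLY_RETENTION:
                month = week.replace(day=1)
                monthly[(title, month)] = monthly.get((title, month), 0) + views
                del weekly[(title, week)]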
It's maintainability and sustainability that are the bigger concerns. Once you create a public database for something like this, people will want it to stick around indefinitely. That's quite a load to take on.
I'd love to see all the data preserved indefinitely. It is one of the most interesting datasets around, and its value for the future is quite incredible.
Agreed. 100GB a year is not a lot of data to *store* (especially if it's compressed). It's just a lot to interactively query.
I'm also likely being incredibly naïve, though I did note somewhere that it wouldn't be a particularly small undertaking to do this project well.
Well, the initial work took a few hours ;-) I guess by spending a few more hours we could improve that, if we really knew what we want.
I think we're in a position to decide what we want.
Honestly, the investigation I've done while participating in this thread suggests that I can probably get what I want from the raw data. I'll just pull each day into an in-memory hash, update a database table, and move to the next day. It'll be slower than if the data was already hanging out in some hashed format (like Berkeley DB), but whatever. However, I need data for all articles, which is different from most use cases I think.
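Something like this, roughly (the file names, table layout, and date range are just for illustration, and the real version would also need the encoding/bad-client cleanup discussed above):

    from collections import defaultdict
    import glob, gzip, sqlite3

    db = sqlite3.connect('pageviews.db')
    db.execute("CREATE TABLE IF NOT EXISTS daily_views"
               " (project TEXT, title TEXT, day TEXT, views INTEGER)")

    # One pass per day: sum the 24 hourly files into a hash, then flush to the DB.
    for day in ['20091001', '20091002']:          # whatever range I need
        counts = defaultdict(int)
        for path in sorted(glob.glob('pagecounts-%s-*.gz' % day)):
            with gzip.open(path, 'rt', encoding='utf-8', errors='replace') as f:
                for line in f:
                    parts = line.split(' ')
                    if len(parts) == 4:
                        counts[(parts[0], parts[1])] += int(parts[2])
        db.executemany(
            "INSERT INTO daily_views VALUES (?, ?, ?, ?)",
            [(proj, title, day, v) for (proj, title), v in counts.items()])
        db.commit()

Doing it a day at a time keeps the hash bounded by one day's distinct titles, which is the whole point.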
I'd like to assemble some examples of projects that need better data, so we know what it makes sense to build--what seems nice to have and what's actually useful is so often different.
I agree. By opening up the dataset I expected others to build upon it and create services. Apparently that isn't happening. As lots of people use the data, I guess there is a need for it, but not enough will to build anything for others to use, so it will end up being created in WMF proper.
Yeah. I think it's just a tough problem to solve for an outside contributor. It's hard to get around the need for hardware (which in turn must be managed and maintained).
Building a service where data would be shown on every article is a rather different task from just supporting analytical workloads.
Yep, however it depends entirely on the same data. It's really just another post-processing step.
-Ian