Hey, Domas! Firstly, sorry to confuse you with Dario earlier. I am so very
bad with names. :)
Secondly, thank you for putting together the data we have today. I'm not
sure if anyone's mentioned it lately, but it's clearly a really useful
thing. I think that's why we're having this conversation now: what's been
learned about potential use cases, and how can we make this excellent
resource even more valuable?
Any tips? :-) My thoughts were that the schema used by the GlobalUsage
extension might be reusable here (storing wiki, page namespace ID, page
namespace name, and page title).
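Roughly what I'm picturing, as a little Python sketch (the names are mine,
not actual GlobalUsage columns):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ArticleKey:
        wiki: str            # e.g. "enwiki"
        namespace_id: int    # numeric namespace ID
        namespace_name: str  # localized namespace name
        title: str           # page title without the namespace prefix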
I don't know what GlobalUsage does, but probably it is all wrong ;-)
Here's an excerpt from the readme:
"When using a shared image repository, it is impossible to see within
MediaWiki
whether a file is used on one of the slave wikis. On Wikimedia this is
handled
by the CheckUsage tool on the toolserver, but it is merely a hack of
function
that should be built in.
"GlobalUsage creates a new table globalimagelinks, which is basically the
same
as imagelinks, but includes the usage of all images on all associated
wikis."
The database table itself is about what you'd imagine. It's approximately
the metadata we'd need to uniquely identify an article, but it seems to be
solving a rather different problem. Uniquely identifying an article is
certainly necessary, but I don't think it's the hard part.
I'm not sure that MySQL is the place to store this data--it's big and has
few dimensions. Since we'd have to make external queries available through
an API anyway, why not back it with the right storage engine?
[...]
projectcounts are aggregated by project, pagecounts are aggregated by page.
If you looked at the data it should be obvious ;-)
And yes, probably the best documentation was in some email somewhere. I
should've started a decent project with descriptions and support and
whatever.
Maybe once we move data distribution back into WMF proper, there will be no
need for it to live somewhere in Germany, as it does nowadays.
The documentation needed here seems pretty straightforward. Like, a file at
http://dammit.lt/wikistats/README that just explains the format of the
data, what's included, and what's not. We've covered most of it in this
thread already. All that's left is a basic explanation of what each field
means in pagecounts/projectcounts. If you tell me these things, I'll even
write it. :)
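For instance, if each pagecounts line really is "<project> <title>
<request count> <bytes transferred>", the whole format section boils down
to something like this (my guess at the fields, to be corrected):

    def parse_pagecounts_line(line):
        # Assumed field order: project, page title, request count, bytes
        # transferred -- my reading of the format, not a confirmed spec.
        project, title, count, size = line.rstrip("\n").split(" ")
        return project, title, int(count), int(size)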
But the biggest improvement would be post-processing (cleaning up) the
source files. Right now, if there are anomalies in the data, every re-user
is expected to find and fix these on their own. It's _incredibly_
inefficient for everyone to adjust the data (for encoding strangeness, for
bad clients, for data manipulation, for page existence possibly, etc.)
rather than having the source files come out cleaner.
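To make that concrete, the sort of normalization I have in mind is roughly
this (a sketch; the exact rules would need to be agreed on):

    from urllib.parse import unquote

    def normalize_title(raw_title):
        # Undo percent-encoding and unify space/underscore handling, so the
        # same article doesn't show up under several different spellings.
        title = unquote(raw_title).replace(" ", "_")
        # Most wikis store titles with an uppercase first letter.
        return title[:1].upper() + title[1:]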
Raw data is fascinating in that regard though - one can see what the bad
clients are, what the anomalies are, how they encode titles, what the
erroneous titles are, etc.
There're zillions of ways to do post-processing, and none of these will
match all needs of every user.
Oh, totally! However, I think some uses are more common than others. I bet
this covers them:
1. View counts for a subset of existing articles over a range of dates.
2. Sorted/limited aggregate stats (top 100, bottom 50, etc) for a subset of
articles and date range.
3. Most popular non-existing (missing) articles for a project.
I feel like making those things easier would be awesome, and raw data would
still be available for anyone who wants to build something else. I think
Domas's dataset is great, and the above should be based on it.
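For example, use case 2 is basically just this (a sketch that assumes the
per-day counts have already been loaded as {title: count} dicts):

    import heapq
    from collections import Counter

    def top_articles(daily_counts, titles=None, n=100):
        # daily_counts: an iterable of {title: count} dicts, one per day.
        # titles: the subset of articles we care about (None means all).
        total = Counter()
        for day in daily_counts:
            for title, count in day.items():
                if titles is None or title in titles:
                    total[title] += count
        return heapq.nlargest(n, total.items(), key=lambda kv: kv[1])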
Sure, it can be improved in many ways, including more data (some people ask
for (page, geography) aggregations, though with our long tail that is a
huuuuuge dataset growth ;-)
Absolutely. I think it makes sense to start by making the existing data
more usable, and then potentially add more to it in the future.
I meant that it wouldn't be very difficult to write a script to take the
raw data and put it into a public database on the Toolserver (which
probably has enough hardware resources for this project currently).
I doubt Toolserver has enough resources to have this data thrown at it and
queried more, unless you simplify needs a lot.
There's 5 GB of raw uncompressed data per day in text form, and the long
tail makes caching quite painful, unless you go for cache-oblivious methods.
Yeah. The folks at trendingtopics.org are processing it all on an EC2
Hadoop cluster, and throwing the results in a SQL database. They have a
very specific focus, though, so their methods might not be appropriate here.
They're an excellent example of someone using the existing dataset in an
interesting way, but the fact that they're using EC2 is telling: many people
do not have the expertise to handle that sort of thing.
I think building an efficiently queryable set of all historic data is
unrealistic without a separate cluster. We're talking 100GB/year, before
indexing, which is about 400GB if we go back to 2008. I can imagine a
workable solution that discards resolution as time passes, which is what
most web stats generation packages do anyway. Here's an example:
Daily counts (and maybe hour of day averages) going back one month (~10GB)
Weekly counts, day of week and hour of day averages going back six months
(~10GB)
Monthly stats (including averages) forever (~4GB/year)
That data could be kept in RAM, hashed across two machines, if we really
wanted it to be fast. That's probably not necessary, but you get my point.
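The "aging out" step itself would be trivial--something like this rollup
from daily to monthly counts (a sketch; keys and retention windows as
described above):

    from collections import defaultdict

    def roll_up_to_monthly(daily_counts):
        # daily_counts: {(date, title): views} at full daily resolution.
        # Returns {(year, month, title): views} -- the part kept forever.
        monthly = defaultdict(int)
        for (date, title), views in daily_counts.items():
            monthly[(date.year, date.month, title)] += views
        return monthly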
It's maintainability and sustainability that are the bigger concerns. Once
you create a public database for something like this, people will want it
to stick around indefinitely. That's quite a load to take on.
I'd love to see all the data preserved forever. It is one of the most
interesting datasets around, and its value for the future is quite
incredible.
Agreed. 100GB a year is not a lot of data to *store* (especially if it's
compressed). It's just a lot to interactively query.
I'm also likely being incredibly naïve, though I did note somewhere that it
wouldn't be a particularly small undertaking to do this project well.
Well, the initial work took a few hours ;-) I guess by spending a few more
hours we could improve it, if we really knew what we want.
I think we're in a position to decide what we want.
Honestly, the investigation I've done while participating in this thread
suggests that I can probably get what I want from the raw data. I'll just
pull each day into an in-memory hash, update a database table, and move to
the next day. It'll be slower than if the data was already hanging out in
some hashed format (like Berkeley DB), but whatever. However, I need data
for all articles, which is different from most use cases I think.
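Concretely, I'm imagining something like this (a sketch using SQLite; the
table and column names are made up):

    import sqlite3
    from collections import Counter

    def load_day(path, conn):
        # One day of raw pagecounts -> in-memory hash -> database table.
        counts = Counter()
        with open(path, encoding="utf-8", errors="replace") as f:
            for line in f:
                parts = line.split(" ")
                if len(parts) != 4:
                    continue  # skip malformed lines rather than fixing them
                project, title, count, _size = parts
                counts[(project, title)] += int(count)
        conn.executemany(
            "INSERT INTO daily_counts (project, title, views) VALUES (?, ?, ?)",
            ((p, t, c) for (p, t), c in counts.items()),
        )
        conn.commit()

    conn = sqlite3.connect("pagecounts.db")
    conn.execute("CREATE TABLE IF NOT EXISTS daily_counts "
                 "(project TEXT, title TEXT, views INTEGER)")
    # load_day("pagecounts-20090801-000000", conn)  # hypothetical file name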
I'd like to assemble some examples of projects that need better data, so we
know what it makes sense to build--what seems nice to have and what's
actually useful are so often different.
I agree. By opening up the dataset I expected others to build upon it and
create services.
Apparently that hasn't happened. Since lots of people use the data, I guess
there is a need for it, but not enough will to build anything for others to
use, so it will end up being created in WMF proper.
Yeah. I think it's just a tough problem to solve for an outside
contributor. It's hard to get around the need for hardware (which in turn
must be managed and maintained).
Building a service where data would be shown on every article is quite a
different task from just supporting analytical workloads.
Yep, however it depends entirely on the same data. It's really just another
post-processing step.
-Ian