Domas Mituzas wrote:
Hi!
Hi!
Currently, if you want data on, for example, every article on the English Wikipedia, you'd have to make 3.7 million individual HTTP requests to Henrik's tool. At one per second, you're looking at over a month's worth of continuous fetching. This is obviously not practical.
Or you can download raw data.
Downloading gigs and gigs of raw data and then processing it yourself is generally even less practical for end users.
Is it worth a Toolserver user's time to try to create a database of per-project, per-page page view statistics?
Creating such a database is easy; making it efficient is a bit different. :-)
Any tips? :-) My thoughts were that the schema used by the GlobalUsage extension might be reusable here (storing wiki, page namespace ID, page namespace name, and page title).
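Something along these lines is what I'm picturing, with one row per page per day (or per hour). The table and column names here are just placeholders, not anything GlobalUsage actually defines, and SQLite is only used for illustration:

# Placeholder schema sketch; a Toolserver database would presumably be
# MySQL, but the shape of the table is the same.
import sqlite3

conn = sqlite3.connect("pageviews.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS page_views (
    pv_wiki           TEXT    NOT NULL,  -- e.g. 'enwiki'
    pv_namespace_id   INTEGER NOT NULL,  -- numeric namespace ID, 0 for articles
    pv_namespace_name TEXT    NOT NULL,  -- namespace name, '' for the main namespace
    pv_title          TEXT    NOT NULL,  -- page title without the namespace prefix
    pv_date           TEXT    NOT NULL,  -- day (or hour) the counts cover
    pv_views          INTEGER NOT NULL   -- number of requests seen
);

-- Most lookups will be by wiki + title (+ date range), so index on that.
CREATE INDEX IF NOT EXISTS pv_wiki_title_date
    ON page_views (pv_wiki, pv_namespace_id, pv_title, pv_date);
""")
conn.commit()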
And, of course, it wouldn't be a bad idea if Domas' first-pass implementation was improved on Wikimedia's side, regardless.
My implementation is for obtaining raw data from our Squid tier; what is wrong with it? Generally I've had ideas about making a queryable data source - it isn't impossible given a decent mix of data structures ;-)
Well, more documentation is always a good thing. I'd start there.
As I recall, the system for determining which domain a request went to is a bit esoteric, and it might be worth the cost to store the whole domain name in order to cover edge cases (labs wikis, wikimediafoundation.org, *.wikimedia.org, etc.).
There's some sort of distinction between the projectcounts and pagecounts files (again, one that needs documentation) that could probably stand to be eliminated or simplified.
But the biggest improvement would be post-processing (cleaning up) the source files. Right now if there are anomalies in the data, every re-user is expected to find and fix these on their own. It's _incredibly_ inefficient for everyone to adjust the data (for encoding strangeness, for bad clients, for data manipulation, for page existence possibly, etc.) rather than having the source files come out cleaner.
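To make that concrete, here's the kind of per-line cleanup that every re-user currently ends up writing for themselves. It assumes the usual space-separated "project title count bytes" lines and is only a rough sketch:

# Rough sketch of the per-line cleanup every consumer has to reinvent.
from urllib.parse import unquote

def clean_line(line):
    """Return (project, title, views) for a usable line, or None to drop it."""
    parts = line.rstrip("\n").split(" ")
    if len(parts) != 4:
        return None                      # malformed or truncated line
    project, raw_title, count, _bytes = parts
    if not count.isdigit():
        return None                      # garbage in the counter field
    # Titles come percent-encoded, and some are mangled or double-encoded,
    # so decode and sanity-check the result.
    title = unquote(raw_title).replace("_", " ")
    if not title or "\n" in title or "\t" in title:
        return None
    return project, title, int(count)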
I think your first-pass was great. But I also think it could be improved. :-)
As someone with most of the skills and resources (with the exception of time, possibly) to create a page view stats database, reading something like this makes me think...
Wow.
I meant that it wouldn't be very difficult to write a script to take the raw data and put it into a public database on the Toolserver (which probably has enough hardware resources for this project currently). It's maintainability and sustainability that are the bigger concerns. Once you create a public database for something like this, people will want it to stick around indefinitely. That's quite a load to take on.
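For what it's worth, the script itself really is the easy part. Here's a rough sketch of the sort of loader I mean, reusing the placeholder schema and clean_line() sketches above, and assuming the files are the usual gzipped hourly dumps:

# Rough loader sketch: stream one raw gzipped dump into the placeholder
# page_views table; clean_line() is the cleanup helper sketched earlier.
import gzip

def load_dump(conn, dump_path, date):
    rows = []
    with gzip.open(dump_path, "rt", encoding="utf-8", errors="replace") as dump:
        for line in dump:
            cleaned = clean_line(line)
            if cleaned is None:
                continue                 # drop anomalies at load time, once
            project, title, views = cleaned
            # Namespace splitting is skipped here; everything is loaded as
            # main-namespace for the sake of the sketch.
            rows.append((project, 0, "", title, date, views))
            if len(rows) >= 10000:       # keep the inserts batched
                conn.executemany(
                    "INSERT INTO page_views VALUES (?, ?, ?, ?, ?, ?)", rows)
                rows = []
    if rows:
        conn.executemany("INSERT INTO page_views VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()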
I'm also likely being incredibly naïve, though I did note somewhere that it wouldn't be a particularly small undertaking to do this project well.
Yes, the data is susceptible to manipulation, both intentional and unintentional.
I wonder how someone with most of the skills and resources wants to solve this problem (besides the aforementioned article-exists filter, which could reduce the dataset quite a lot ;)
I'd actually say that having data for non-existent pages is a feature, not a bug. There's potential there to catch future redirects and new pages, I imagine.
... you can begin doing real analysis work. Currently, this really isn't possible, and that's a Bad Thing.
Raw data allows you to do whatever analysis you want. Shove it into SPSS/R/.. ;-) Statistics much?
A user wants to analyze a category with 100 members for the page view data of each category member. You think it's a Good Thing that the user has to first spend countless hours processing gigabytes of raw data in order to do that analysis? It's a Very Bad Thing. And the people who are capable of doing analysis aren't always the ones capable of writing the scripts and the schemas necessary to get the data into a usable form.
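With a queryable database, that whole analysis collapses into a hundred indexed lookups, something like this (again using the placeholder page_views table from earlier; category membership could come from the API or a replicated categorylinks table):

# Hypothetical example: per-page view totals for the members of a category,
# against the placeholder page_views table sketched earlier.
def views_for_category(conn, wiki, member_titles, start, end):
    """Return {title: total views between start and end} for each member."""
    totals = {}
    for title in member_titles:
        row = conn.execute(
            """SELECT COALESCE(SUM(pv_views), 0)
                 FROM page_views
                WHERE pv_wiki = ? AND pv_namespace_id = 0
                  AND pv_title = ? AND pv_date BETWEEN ? AND ?""",
            (wiki, title, start, end),
        ).fetchone()
        totals[title] = row[0]
    return totals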
The main bottleneck has been that, like MZMcBride mentions, an underlying database of page view data is unavailable.
The underlying database is available, just not in an easily queryable format. There's a distinction there, unless you all imagine a database as something you send SQL to and it gives you data back. Sorted files are databases too ;-)
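To illustrate (only a rough sketch, and it assumes one space-separated "project title count bytes" line per page, sorted bytewise): a single-title lookup against such a sorted file is just a binary search on byte offsets, no SQL required.

# Rough sketch: look one title up in a sorted per-hour file by binary
# searching on byte offsets, without loading the file into anything.
import os

def lookup(path, project, title):
    """Return the view count for (project, title), or None if absent."""
    key = f"{project} {title} ".encode("utf-8")

    def line_at(f, offset):
        # First full line starting at or after `offset`.
        f.seek(offset)
        if offset:
            f.readline()                 # skip the partial line we landed in
        return f.readline()

    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        while lo < hi:                   # find the first line sorting >= key
            mid = (lo + hi) // 2
            line = line_at(f, mid)
            if line and line < key:
                lo = mid + 1
            else:
                hi = mid
        candidate = line_at(f, lo)
        if candidate.startswith(key):
            return int(candidate.split(b" ")[2])
    return None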
The reality is that, for most users, a large pile of data that isn't easily queryable is effectively equivalent to no data at all. Echoing what I said earlier, it doesn't make much sense for people to be continually forced to reinvent the wheel (post-processing the raw data and putting it into a queryable format).
MZMcBride