Hi.
I've been asked a few times recently about doing reports of the most-viewed pages per month/per day/per year/etc. A few years after Domas first started publishing this information in raw form, the current situation seems rather bleak. Henrik has a visualization tool with a very simple JSON API behind it (http://stats.grok.se), but other than that, I don't know of any efforts to put this data into a database.
Currently, if you want data on, for example, every article on the English Wikipedia, you'd have to make 3.7 million individual HTTP requests to Henrik's tool. At one per second, you're looking at over a month's worth of continuous fetching. This is obviously not practical.
A lot of people were waiting on Wikimedia's Open Web Analytics work to come to fruition, but it seems that has been indefinitely put on hold. (Is that right?)
Is it worth a Toolserver user's time to try to create a database of per-project, per-page page view statistics? Is it worth a grant from the Wikimedia Foundation to have someone work on this? Is it worth trying to convince Wikimedia Deutschland to assign resources? And, of course, it wouldn't be a bad idea if Domas' first-pass implementation was improved on Wikimedia's side, regardless.
Thoughts and comments welcome on this. There's a lot of desire to have a usable system.
MZMcBride
Thanks for bringing this up! I don't have any answers, but there's a feature I'd like to build on this dataset. I wonder if bringing this stuff into a more readily available database could be part of that project in some way.
Basically, I'd like to publish per-editor pageview stats. That is, Mediawiki would keep track of the number of times an article had been viewed since the first day you edited it, and let you know how many times your edits had been seen (approximately, depending on the resolution of the data). I think such personalized stats could really help to drive editor retention. The information is available now through Henrik's tool, but even if you know about stats.grok.se, it's hard to keep track and make the connection between the graphs there and one's own contributions.
Clearly, pageview data of at least daily resolution would be required to make such a thing work.
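To make that concrete, here's the sort of computation I have in mind, sketched against Henrik's JSON API (I'm assuming a /json/<lang>/<yyyymm>/<title> endpoint with a daily_views mapping in the response; the exact shape may differ):

    import json
    import urllib.request
    from datetime import date

    def views_since(title, first_edit, lang="en"):
        """Approximate total views of `title` since `first_edit` (a date),
        summing the per-day counts exposed by stats.grok.se. Assumes a
        /json/<lang>/<yyyymm>/<title> endpoint whose response includes a
        'daily_views' {iso-date: count} mapping."""
        total = 0
        today = date.today()
        year, month = first_edit.year, first_edit.month
        while (year, month) <= (today.year, today.month):
            url = "http://stats.grok.se/json/%s/%d%02d/%s" % (lang, year, month, title)
            data = json.load(urllib.request.urlopen(url))
            for day, count in data.get("daily_views", {}).items():
                if day >= first_edit.isoformat():  # ISO dates sort lexically
                    total += count
            year, month = (year + 1, 1) if month == 12 else (year, month + 1)
        return total

A batch job could precompute this per (editor, article) pair rather than hitting the API live, of course.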
Are there other specific projects that require this data? It will be much easier to make a case for accelerating development of the dataset if there are some clear examples of where it's needed, and especially if it can help to meet the current editor retention goals.
-Ian
* Ian Baker wrote:
Basically, I'd like to publish per-editor pageview stats. That is, Mediawiki would keep track of the number of times an article had been viewed since the first day you edited it, and let you know how many times your edits had been seen (approximately, depending on the resolution of the data). I think such personalized stats could really help to drive editor retention. The information is available now through Henrik's tool, but even if you know about stats.grok.se, it's hard to keep track and make the connection between the graphs there and one's own contributions.
If the stats.grok.se data actually captures nearly all requests, then I am not sure you realize how low the figures are. On the German Wikipedia during 20-22 December 2009, the median number of requests for articles in the category "Mann" (men) was 7, meaning half of the articles were requested at most 7 times during that three-day period (2.33 times per day). In the same period, the "Hauptseite" (Main Page) registered 900,000 requests, roughly 128,000 times the "Mann" median figure.
Ian Baker wrote:
Basically, I'd like to publish per-editor pageview stats. That is, Mediawiki would keep track of the number of times an article had been viewed since the first day you edited it, and let you know how many times your edits had been seen (approximately, depending on the resolution of the data). I think such personalized stats could really help to drive editor retention. The information is available now through Henrik's tool, but even if you know about stats.grok.se, it's hard to keep track and make the connection between the graphs there and one's own contributions.
This is a neat idea. MediaWiki has some page view count support built in, but it's been disabled on Wikimedia wikis for pretty much forever. The reality is that MediaWiki isn't launched for the vast majority of requests. A user making an edit is obviously different, though. I think a database with per-day view support would make this feature somewhat feasible, in a JavaScript gadget or in a MediaWiki extension.
Are there other specific projects that require this data? It will be much easier to make a case for accelerating development of the dataset if there are some clear examples of where it's needed, and especially if it can help to meet the current editor retention goals.
Heh. It's refreshing to hear this said aloud. Yes, if there were some way to tie page view stats to fundraising/editor retention/usability/the gender gap/the Global South, it'd be much simpler to get resources devoted to it. Without a doubt.
There are countless applications for this data, particularly as a means of measuring Wikipedia's impact. This data also provides a scale against which other articles and projects can be measured. In a vacuum, knowing that the English Wikipedia's article "John Doe" received 400 views per day on average in June means very little. When you can compare that figure to the average views per day of every other article on the English Wikipedia (or every other article on the German Wikipedia), you can begin doing real analysis work. Currently, this really isn't possible, and that's a Bad Thing.
MZMcBride
On Thu, Aug 11, 2011 at 8:23 PM, MZMcBride z@mzmcbride.com wrote:
This is a neat idea. MediaWiki has some page view count support built in, but it's been disabled on Wikimedia wikis for pretty much forever. The reality is that MediaWiki isn't launched for the vast majority of requests. A user making an edit is obviously different, though. I think a database with per-day view support would make this feature somewhat feasible, in a JavaScript gadget or in a MediaWiki extension.
Oh, totally. The only place we can get meaningful data is from the squids, which is where Dario's data comes from, yes?
Sadly, anything we build that works with that architecture won't be so useful to other mediawiki installations, at least on the backend. I can imagine an extension that displays the info to editors that can fetch its stats from a slightly abstracted datasource, with some architecture whereby the stats could come from a variety of log-processing applications via a script or plugin. Then, we could write a connector for our cache clusters, and someone else could write one for theirs, and we'd still get to share one codebase for everything else.
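To make the abstraction concrete, the interface could be as small as this (sketched in Python rather than PHP purely for brevity; all names here are made up):

    from abc import ABC, abstractmethod

    class PageViewBackend(ABC):
        """Pluggable source of page view counts. The display extension only
        talks to this interface; each site wires in whatever backend matches
        its own logging setup (squid logs, Apache logs, a hosted service...)."""

        @abstractmethod
        def daily_views(self, title, start, end):
            """Return an {iso-date: count} mapping for `title` between
            `start` and `end` (inclusive)."""

    class SquidLogBackend(PageViewBackend):
        """Hypothetical connector for a Wikimedia-style aggregated squid-log
        store; a small wiki might implement a webserver-log backend instead."""
        def __init__(self, store):
            self.store = store
        def daily_views(self, title, start, end):
            return self.store.fetch_daily(title, start, end)  # placeholder call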
Are there other specific projects that require this data? It will be much easier to make a case for accelerating development of the dataset if there are some clear examples of where it's needed, and especially if it can help to meet the current editor retention goals.
Heh. It's refreshing to hear this said aloud. Yes, if there were some way to tie page view stats to fundraising/editor retention/usability/the gender gap/the Global South, it'd be much simpler to get resources devoted to it. Without a doubt.
There are countless applications for this data, particularly as a means of measuring Wikipedia's impact. This data also provides a scale against which other articles and projects can be measured. In a vacuum, knowing that the English Wikipedia's article "John Doe" received 400 views per day on average in June means very little. When you can compare that figure to the average views per day of every other article on the English Wikipedia (or every other article on the German Wikipedia), you can begin doing real analysis work. Currently, this really isn't possible, and that's a Bad Thing.
Oh, totally. I can see a lot of really effective potential applications for this data. However, the link between view statistics and editor retention isn't necessarily immediately clear. At WMF at least, the prevailing point-of-view is that readership is doing okay, and at the moment we're focused on other areas. Personally, I think readership numbers are an important piece of the puzzle and could be a direct motivator for editing. Furthermore, increasingly nuanced readership stats might be usable for that other perennial goal, fundraising (though I don't have specific ideas for this at the moment).
I wonder if maybe we could consolidate a couple of concrete proposals for features that are dependent on this data. That would help to highlight this as a bottleneck and clearly explain how solving this problem now will help contribute to meeting current goals.
My thinking is, if it's possible to make a good case for it, it should happen now. Even if WMF has a req out for a developer to build this, there's no reason to avoid consolidating the research and ideas in one place so that person can work more effectively. Were someone in the community to start building it, even better! If we bring on a dev to collaborate and maintain it long-term, they'd just end up working together closely for a while, which would accelerate the learning process for the new developer. As someone who's still on the steep part of that learning curve, I can attest that any and all information we can provide will get this feature out the door faster. :)
-Ian
* MZMcBride wrote:
I've been asked a few times recently about doing reports of the most-viewed pages per month/per day/per year/etc. A few years after Domas first started publishing this information in raw form, the current situation seems rather bleak. Henrik has a visualization tool with a very simple JSON API behind it (http://stats.grok.se), but other than that, I don't know of any efforts to put this data into a database.
When making http://katograph.appspot.com/, which renders the German Wikipedia category system as an interactive "treemap" based on information like the number of articles in each category and requests during a three-day period, I found that the proxy logs used for stats.grok.se are rather unreliable, with many of the "top" pages being implausible (articles on not very notable subjects that have existed only for a very short time show up in the top ten, for instance). On http://stats.grok.se/en/top you can see this as well; 40 million views for `Special:Export/Robert L. Bradley, Jr` is rather implausible, as far as human users are concerned.
Is it worth a Toolserver user's time to try to create a database of per-project, per-page page view statistics? Is it worth a grant from the Wikimedia Foundation to have someone work on this? Is it worth trying to convince Wikimedia Deutschland to assign resources? And, of course, it wouldn't be a bad idea if Domas' first-pass implementation was improved on Wikimedia's side, regardless.
The data that powers stats.grok.se is available for download; it should be rather trivial to feed it into Toolserver databases and query it as desired, ignoring performance problems. But short of believing that in December 2010 "User Datagram Protocol" was more interesting to people than Julian Assange, you would need some other data source to make good statistics. http://stats.grok.se/de/201009/Ngai.cc would be another example.
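To be concrete about the "trivial" part, loading one hourly dump into a local database is only a few lines (this assumes the usual "project title count bytes" layout of the pagecounts files):

    import gzip
    import sqlite3

    def load_pagecounts(path, db_path="pagecounts.db"):
        """Load one pagecounts-YYYYMMDD-HH0000.gz file into SQLite.
        Each line is assumed to look like:
            en Main_Page 242332 4737756101
        i.e. project code, page title, request count, bytes transferred."""
        conn = sqlite3.connect(db_path)
        conn.execute("""CREATE TABLE IF NOT EXISTS views
                        (project TEXT, title TEXT, hour TEXT, count INTEGER)""")
        hour = path.rsplit("pagecounts-", 1)[-1].replace(".gz", "")
        rows = []
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                parts = line.rstrip("\n").split(" ")
                if len(parts) != 4 or not parts[2].isdigit():
                    continue  # skip malformed lines
                rows.append((parts[0], parts[1], hour, int(parts[2])))
        conn.executemany("INSERT INTO views VALUES (?, ?, ?, ?)", rows)
        conn.commit()
        conn.close()

Doing that for every hour of every day, and then querying it at interactive speed, is of course where the real work starts.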
Bjoern Hoehrmann wrote:
When making http://katograph.appspot.com/, which renders the German Wikipedia category system as an interactive "treemap" based on information like the number of articles in each category and requests during a three-day period, I found that the proxy logs used for stats.grok.se are rather unreliable, with many of the "top" pages being implausible (articles on not very notable subjects that have existed only for a very short time show up in the top ten, for instance). On http://stats.grok.se/en/top you can see this as well; 40 million views for `Special:Export/Robert L. Bradley, Jr` is rather implausible, as far as human users are concerned.
Yes, the data is susceptible to manipulation, both intentional and unintentional. As I said, this was a first-pass implementation on Domas' part. As far as I know, this hasn't been touched by anyone in years. You're absolutely correct that, at the end of the day, until the data itself is better (more reliable), the resulting tools/graphs/scripts/everything that rely on it will be bound by its limitations.
MZMcBride wrote:
Is it worth a Toolserver user's time to try to create a database of per-project, per-page page view statistics? Is it worth a grant from the Wikimedia Foundation to have someone work on this? Is it worth trying to convince Wikimedia Deutschland to assign resources? And, of course, it wouldn't be a bad idea if Domas' first-pass implementation was improved on Wikimedia's side, regardless.
The data that powers stats.grok.se is available for download; it should be rather trivial to feed it into Toolserver databases and query it as desired, ignoring performance problems.
Not simply performance. It's a lot of data and it needs to be indexed. That has a real cost. There are also edge cases and corner cases (different encodings of requests, etc.) that need to be accounted for. It's not a particularly small undertaking, if it's to be done properly.
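To give a flavour of the encoding problem: the same article can show up under several percent-encoded spellings, and something has to fold those together before the numbers mean anything. One plausible normalization pass (not MediaWiki's exact rules) looks like this:

    from collections import defaultdict
    from urllib.parse import unquote

    def normalize_title(raw):
        """Fold common title spellings together: percent-decode, use
        underscores instead of spaces, uppercase the first letter.
        This is one plausible rule set, not MediaWiki's exact logic."""
        title = unquote(raw, errors="replace").replace(" ", "_")
        return title[:1].upper() + title[1:] if title else title

    def merge_counts(pairs):
        """pairs: iterable of (raw_title, count) -> {normalized_title: total}."""
        totals = defaultdict(int)
        for raw, count in pairs:
            totals[normalize_title(raw)] += count
        return dict(totals)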
MZMcBride
On Thu, Aug 11, 2011 at 3:12 PM, MZMcBride z@mzmcbride.com wrote:
I've been asked a few times recently about doing reports of the most-viewed pages per month/per day/per year/etc. A few years after Domas first started publishing this information in raw form, the current situation seems rather bleak. Henrik has a visualization tool with a very simple JSON API behind it (http://stats.grok.se), but other than that, I don't know of any efforts to put this data into a database. [...] A lot of people were waiting on Wikimedia's Open Web Analytics work to come to fruition, but it seems that has been indefinitely put on hold. (Is that right?)
That's correct. I owe everyone a longer writeup of our change in direction on that project which I have the raw notes for.
The short answer is that we've been having a tough time hiring the people we'd have do this work. Here are the two job descriptions: http://wikimediafoundation.org/wiki/Job_openings/Software_Developer_Backend http://wikimediafoundation.org/wiki/Job_openings/Systems_Engineer_-_Data_Ana...
Please help us recruit for these roles (and apply if you believe you are a fit)!
Thanks! Rob
Rob Lanphier wrote:
On Thu, Aug 11, 2011 at 3:12 PM, MZMcBride z@mzmcbride.com wrote:
I've been asked a few times recently about doing reports of the most-viewed pages per month/per day/per year/etc. A few years after Domas first started publishing this information in raw form, the current situation seems rather bleak. Henrik has a visualization tool with a very simple JSON API behind it (http://stats.grok.se), but other than that, I don't know of any efforts to put this data into a database. [...] A lot of people were waiting on Wikimedia's Open Web Analytics work to come to fruition, but it seems that has been indefinitely put on hold. (Is that right?)
That's correct. I owe everyone a longer writeup of our change in direction on that project which I have the raw notes for.
Okay. Please be sure to copy this list on that write-up. :-)
The short answer is that we've been having a tough time hiring the people we'd have do this work. Here are the two job descriptions: http://wikimediafoundation.org/wiki/Job_openings/Software_Developer_Backend http://wikimediafoundation.org/wiki/Job_openings/Systems_Engineer_-_Data_Analytics
As someone with most of the skills and resources (with the exception of time, possibly) to create a page view stats database, reading something like this makes me think it's not worth the effort on my part, iff Wikimedia is planning on devoting actual resources to the endeavor. Is that a reasonable conclusion to draw? Is it unreasonable?
MZMcBride
I'd be willing to work on this on a volunteer basis.
I developed http://toolserver.org/~emw/wikistats/, a page view analysis tool that incorporates lots of features that have been requested of Henrik's tool. The main bottleneck has been that, as MZMcBride mentions, an underlying database of page view data is unavailable. Henrik's JSON API has limitations that are probably tied to the underlying data model. The fact that there aren't any other such APIs is arguably the bigger problem.
I wrote down some initial thoughts on how the reliability of this data, and WMF's page view data services generally, could be improved at http://en.wikipedia.org/w/index.php?title=User_talk:Emw&oldid=442596566#.... I've also drafted more specific implementation plans. These plans assume that I would be working with the basic data in Domas's archives. There is still a lot of untapped information in that data -- e.g. hourly views -- and potential for mashups with categories, automated inference of trend causes, etc. If more detailed (but still anonymized) OWA data were available, however, that would obviously open up the potential for much richer APIs and analysis.
Getting the archived page view data into a database seems very doable. This data seems like it would be useful even if there were OWA data available, since that OWA data wouldn't cover 12/2007 through 2009. As I see it, the main thing needed from WMF would be storage space on a publicly-available server. Then, optionally, maybe some funds for the cost of cloud services to process and compress the data, and put it into a database. Input and advice would be invaluable, too.
Eric
Hi!
Currently, if you want data on, for example, every article on the English Wikipedia, you'd have to make 3.7 million individual HTTP requests to Henrik's tool. At one per second, you're looking at over a month's worth of continuous fetching. This is obviously not practical.
Or you can download raw data.
A lot of people were waiting on Wikimedia's Open Web Analytics work to come to fruition, but it seems that has been indefinitely put on hold. (Is that right?)
That project was pulsing with naivety, if it ever had to be applied to the wide scope of all projects ;-)
Is it worth a Toolserver user's time to try to create a database of per-project, per-page page view statistics?
Creating such database is easy, making it efficient is a bit different :-)
And, of course, it wouldn't be a bad idea if Domas' first-pass implementation was improved on Wikimedia's side, regardless.
My implementation is for obtaining raw data from our squid tier, what is wrong with it? Generally I had ideas of making query-able data source - it isn't impossible given a decent mix of data structures ;-)
Thoughts and comments welcome on this. There's a lot of desire to have a usable system.
Sure, interesting what people think could be useful with the dataset - we may facilitate it.
But short of believing that in December 2010 "User Datagram Protocol" was more interesting to people than Julian Assange, you would need some other data source to make good statistics.
Yeah, "lies, damn lies and statistics". We need better statistics (adjusted by wikipedian geekiness) than full page sample because you don't believe general purpose wiki articles that people can use in their work can be more popular than some random guy on the internet and trivia about him. Dracula is also more popular than Julian Assange, so is Jenna Jameson ;-)
http://stats.grok.se/de/201009/Ngai.cc would be another example.
Unfortunately every time you add ability to spam something, people will spam. There's also unintentional crap that ends up in HTTP requests because of broken clients. It is easy to filter that out in postprocessing, if you want, by applying article-exists bloom filter ;-)
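A sketch of that kind of postprocessing, using a plain Python set built from the all-titles dump where you'd really want the Bloom filter for memory:

    import gzip

    def load_existing_titles(path="enwiki-latest-all-titles-in-ns0.gz"):
        """Set of existing ns0 titles, one per line in the dump file."""
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            return set(line.rstrip("\n") for line in f)

    def filter_pagecounts(lines, existing):
        """Yield only raw pagecounts lines whose title exists on the wiki."""
        for line in lines:
            parts = line.split(" ")
            if len(parts) == 4 and parts[1] in existing:
                yield line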
If the stats.grok.se data actually captures nearly all requests, then I am not sure you realize how low the figures are.
Low they are, Wikipedia's content is all about very long tail of data, besides some heavily accessed head. Just graph top-100 or top-1000 and you will see the shape of the curve: https://docs.google.com/spreadsheet/pub?hl=en_US&key=0AtHDNfVx0WNhdGhWVl...
As someone with most of the skills and resources (with the exception of time, possibly) to create a page view stats database, reading something like this makes me think...
Wow.
Yes, the data is susceptible to manipulation, both intentional and unintentional.
I wonder how someone with most of the skills and resources wants to solve this problem (besides the aforementioned article-exists filter, which could reduce the dataset quite a lot ;)
... you can begin doing real analysis work. Currently, this really isn't possible, and that's a Bad Thing.
Raw data allows you to do whatever analysis you want. Shove it into SPSS/R/.. ;-) Statistics much?
The main bottleneck has been that, like MZMcBride mentions, an underlying database of page view data is unavailable.
Underlying database is available, just not in easily queryable format. There's a distinction there, unless you all imagine database as something you send SQL to and it gives you data. Sorted files are databases too ;-) Anyway, I don't say that the project is impossible or unnecessary, but there're lots of tradeoffs to be made - what kind of real time querying workloads are to be expected, what kind of pre-filtering do people expect, etc.
Of course, we could always use OWA.
Domas
Hello everyone,
I've actually been parsing the raw data from [http://dammit.lt/wikistats/] daily into a MySQL database for over a year now. I also store statistics at hour-granularity, whereas [stats.grok.se] stores them at day granularity, it seems.
I only do this for en.wiki, and it's certainly not efficient enough to open up for public use. However, I'd be willing to chat and share code with any interested developer. The strategy and schema are a bit awkward, but it works, and requires on average ~2 hours of processing to store 24 hours' worth of statistics.
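For reference, the general shape is nothing fancier than a pages table keyed by integer ID plus an hourly counts table (this is a minimal sketch, not my actual schema, which is messier):

    import sqlite3

    conn = sqlite3.connect("enwiki_views.db")
    conn.executescript("""
        -- one row per article; keeps long titles out of the big fact table
        CREATE TABLE IF NOT EXISTS page (
            page_id INTEGER PRIMARY KEY,
            title   TEXT UNIQUE NOT NULL
        );
        -- one row per article per hour
        CREATE TABLE IF NOT EXISTS hourly_views (
            page_id INTEGER NOT NULL REFERENCES page(page_id),
            ts      TEXT    NOT NULL,  -- 'YYYY-MM-DD HH' in UTC
            views   INTEGER NOT NULL,
            PRIMARY KEY (page_id, ts)
        );
    """)
    conn.commit()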
Thanks, -AW
Andrew G. West wrote:
I've actually been parsing the raw data from [http://dammit.lt/wikistats/] daily into a MySQL database for over a year now. I also store statistics at hour-granularity, whereas [stats.grok.se] stores them at day granularity, it seems.
I only do this for en.wiki, and it's certainly not efficient enough to open up for public use. However, I'd be willing to chat and share code with any interested developer. The strategy and schema are a bit awkward, but it works, and requires on average ~2 hours of processing to store 24 hours' worth of statistics.
I'd certainly be interested in seeing the code and database schema you've written, if only as a point of reference and to learn from any bugs/issues/etc. that you've encountered along the way. Is it possible for you to post the code you're using somewhere?
MZMcBride
Note that to avoid too much traffic here, I've responded to MZMcBride privately with my code. I'd be happy to share my code with others, and include others in its discussion -- just contact me/us privately.
Thanks, -AW
On Sat, Aug 13, 2011 at 1:19 AM, Andrew G. West westand@cis.upenn.edu wrote:
Note that to avoid too much traffic here, I've responded to MZMcBride privately with my code. I'd be happy to share my code with others, and include others in its discussion -- just contact me/us privately.
Thanks, -AW
Depending on what license and type of release you want on your code, you should consider putting it up on our SVN. If you don't have commit access, you can read this page http://www.mediawiki.org/wiki/Commit_access for more information if you would like to consider that route.
Anyway, I don't say that the project is impossible or unnecessary, but there're lots of tradeoffs to be made - what kind of real time querying workloads are to be expected, what kind of pre-filtering do people expect, etc.
I could be biased here, but I think the canonical use case for someone seeking page view information would be viewing page view counts for a set of articles -- most times a single article, but also multiple articles -- over an arbitrary time range. Narrowing that down, I'm not sure whether the level of demand for real-time data (say, for the previous hour) would be higher than the demand for fast query results for more historical data. Would these two workloads imply the kind of trade-off you were referring to? If not, could you give some examples of what kind of expected workloads/use cases would entail such trade-offs?
If ordering pages by page view count for a given time period would imply such a tradeoff, then I think it'd make sense to deprioritize page ordering.
I'd be really interested to know your thoughts on an efficient schema for organizing the raw page view data in the archives at http://dammit.lt/wikistats/.
Thanks, Eric
Domas Mituzas wrote:
Hi!
Hi!
Currently, if you want data on, for example, every article on the English Wikipedia, you'd have to make 3.7 million individual HTTP requests to Henrik's tool. At one per second, you're looking at over a month's worth of continuous fetching. This is obviously not practical.
Or you can download raw data.
Downloading gigs and gigs of raw data and then processing it is generally more impractical for end-users.
Is it worth a Toolserver user's time to try to create a database of per-project, per-page page view statistics?
Creating such database is easy, making it efficient is a bit different :-)
Any tips? :-) My thoughts were that the schema used by the GlobalUsage extension might be reusable here (storing wiki, page namespace ID, page namespace name, and page title).
And, of course, it wouldn't be a bad idea if Domas' first-pass implementation was improved on Wikimedia's side, regardless.
My implementation is for obtaining raw data from our squid tier, what is wrong with it? Generally I had ideas of making query-able data source - it isn't impossible given a decent mix of data structures ;-)
Well, more documentation is always a good thing. I'd start there.
As I recall, the system of determining which domain a request went to is a bit esoteric and it might be worth the cost to store the whole domain name in order to cover edge cases (labs wikis, wikimediafoundation.org, *.wikimedia.org, etc.).
There's some sort of distinction between projectcounts and pagecounts (again with documentation) that could probably stand to be eliminated or simplified.
But the biggest improvement would be post-processing (cleaning up) the source files. Right now if there are anomalies in the data, every re-user is expected to find and fix these on their own. It's _incredibly_ inefficient for everyone to adjust the data (for encoding strangeness, for bad clients, for data manipulation, for page existence possibly, etc.) rather than having the source files come out cleaner.
I think your first-pass was great. But I also think it could be improved. :-)
As someone with most of the skills and resources (with the exception of time, possibly) to create a page view stats database, reading something like this makes me think...
Wow.
I meant that it wouldn't be very difficult to write a script to take the raw data and put it into a public database on the Toolserver (which probably has enough hardware resources for this project currently). It's maintainability and sustainability that are the bigger concerns. Once you create a public database for something like this, people will want it to stick around indefinitely. That's quite a load to take on.
I'm also likely being incredibly naïve, though I did note somewhere that it wouldn't be a particularly small undertaking to do this project well.
Yes, the data is susceptible to manipulation, both intentional and unintentional.
I wonder how someone with most of skills and resources wants to solve this problem (besides the aforementioned article-exists filter, which could reduce dataset quite a lot ;)
I'd actually say that having data for non-existent pages is a feature, not a bug. There's potential there to catch future redirects and new pages, I imagine.
... you can begin doing real analysis work. Currently, this really isn't possible, and that's a Bad Thing.
Raw data allows you to do whatever analysis you want. Shove it into SPSS/R/.. ;-) Statistics much?
A user wants to analyze a category with 100 members for the page view data of each category member. You think it's a Good Thing that the user has to first spend countless hours processing gigabytes of raw data in order to do that analysis? It's a Very Bad Thing. And the people who are capable of doing analysis aren't always the ones capable of writing the scripts and the schemas necessary to get the data into a usable form.
The main bottleneck has been that, like MZMcBride mentions, an underlying database of page view data is unavailable.
Underlying database is available, just not in easily queryable format. There's a distinction there, unless you all imagine database as something you send SQL to and it gives you data. Sorted files are databases too ;-)
The reality is that a large pile of data that's not easily queryable is directly equivalent to no data at all, for most users. Echoing what I said earlier, it doesn't make much sense for people to be continually forced to reinvent the wheel (post-processing raw data and putting it into a queryable format).
MZMcBride
Downloading gigs and gigs of raw data and then processing it is generally more impractical for end-users.
You were talking about 3.7M articles. :) It is way more practical than working with pointwise APIs though :-)
Any tips? :-) My thoughts were that the schema used by the GlobalUsage extension might be reusable here (storing wiki, page namespace ID, page namespace name, and page title).
I don't know what GlobalUsage does, but probably it is all wrong ;-)
As I recall, the system of determining which domain a request went to is a bit esoteric and it might be worth the cost to store the whole domain name in order to cover edge cases (labs wikis, wikimediafoundation.org, *.wikimedia.org, etc.).
*shrug*, maybe, if I'd run a second pass I'd aim for cache oblivious system with compressed data both on-disk and in-cache (currently it is b-tree with standard b-tree costs). Then we could actually store more data ;-) Do note, there're _lots_ of data items, and increasing per-item cost may quadruple resource usage ;-)
Otoh, expanding project names is straightforward, if you know how.
There's some sort of distinction between projectcounts and pagecounts (again with documentation) that could probably stand to be eliminated or simplified.
projectcounts are aggregated by project, pagecounts are aggregated by page. If you looked at data it should be obvious ;-) And yes, probably best documentation was in some email somewhere. I should've started a decent project with descriptions and support and whatever. Maybe once we move data distribution back into WMF proper, there's no need for it to live nowadays somewhere in Germany.
But the biggest improvement would be post-processing (cleaning up) the source files. Right now if there are anomalies in the data, every re-user is expected to find and fix these on their own. It's _incredibly_ inefficient for everyone to adjust the data (for encoding strangeness, for bad clients, for data manipulation, for page existence possibly, etc.) rather than having the source files come out cleaner.
Raw data is fascinating in that regard though - one can see what are bad clients, what are anomalies, how they encode titles, what are erroneous titles, etc. There're zillions of ways to do post-processing, and none of these will match all needs of every user.
I think your first-pass was great. But I also think it could be improved. :-)
Sure, it can be improved in many ways, including more data (some people ask (page,geography) aggregations, though with our long tail that is huuuuuge dataset growth ;-)
I meant that it wouldn't be very difficult to write a script to take the raw data and put it into a public database on the Toolserver (which probably has enough hardware resources for this project currently).
I doubt Toolserver has enough resources to have this data thrown at it and queried more, unless you simplify needs a lot. There's 5G raw uncompressed data per day in text form, and long tail makes caching quite painful, unless you go for cache oblivious methods.
It's maintainability and sustainability that are the bigger concerns. Once you create a public database for something like this, people will want it to stick around indefinitely. That's quite a load to take on.
I'd love to see that all the data is preserved infinitely. It is one of most interesting datasets around, and its value for the future is quite incredible.
I'm also likely being incredibly naïve, though I did note somewhere that it wouldn't be a particularly small undertaking to do this project well.
Well, initial work took few hours ;-) I guess by spending few more hours we could improve that, if we really knew what we want.
I'd actually say that having data for non-existent pages is a feature, not a bug. There's potential there to catch future redirects and new pages, I imagine.
That is one of reasons we don't eliminate that data now from raw dataset. I don't see it as a bug, I just see that for long-term aggregations that data could be omitted.
A user wants to analyze a category with 100 members for the page view data of each category member. You think it's a Good Thing that the user has to first spend countless hours processing gigabytes of raw data in order to do that analysis? It's a Very Bad Thing. And the people who are capable of doing analysis aren't always the ones capable of writing the scripts and the schemas necessary to get the data into a usable form.
No, I think we should have API to that data to fetch small sets of data without much pain.
The reality is that a large pile of data that's not easily queryable is directly equivalent to no data at all, for most users. Echoing what I said earlier, it doesn't make much sense for people to be continually forced to reinvent the wheel (post-processing raw data and putting it into a queryable format).
I agree. By opening up the dataset I expected others to build upon that and create services. Apparently that doesn't happen. As lots of people use the data, I guess there is need for it, but not enough will to build anything for others to use, so it will end up being created in WMF proper.
Building a service where data would be shown on every article is relatively different task from just analytical workload support. For now, building query-able service has been on my todo list, but there were too many initiatives around that suggested that someone else will do that ;-)
Domas
Hey, Domas! Firstly, sorry to confuse you with Dario earlier. I am so very bad with names. :)
Secondly, thank you for putting together the data we have today. I'm not sure if anyone's mentioned it lately, but it's clearly a really useful thing. I think that's why we're having this conversation now: what's been learned about potential use cases, and how can we make this excellent resource even more valuable?
Any tips? :-) My thoughts were that the schema used by the GlobalUsage extension might be reusable here (storing wiki, page namespace ID, page namespace name, and page title).
I don't know what GlobalUsage does, but probably it is all wrong ;-)
Here's an excerpt form the readme:
"When using a shared image repository, it is impossible to see within MediaWiki whether a file is used on one of the slave wikis. On Wikimedia this is handled by the CheckUsage tool on the toolserver, but it is merely a hack of function that should be built in.
"GlobalUsage creates a new table globalimagelinks, which is basically the same as imagelinks, but includes the usage of all images on all associated wikis."
The database table itself is about what you'd imagine. It's approximately the metadata we'd need to uniquely identify an article, but it seems to be solving a rather different problem. Uniquely identifying an article is certainly necessary, but I don't think it's the hard part.
I'm not sure that MySQL is the place to store this data--it's big and has few dimensions. Since we'd have to make external queries available through an API anyway, why not back it with the right storage engine?
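For instance, since each record is essentially (title, time bucket) -> count, even a dumb key-value file with the counts packed into fixed-width blobs would work. A toy sketch, not a proposal for any particular engine:

    import dbm
    import struct

    # key: "Article_title:YYYYMM", value: 31 packed 32-bit daily counts

    def store_month(db_path, title, yyyymm, daily_counts):
        counts = (list(daily_counts) + [0] * 31)[:31]  # pad/trim to 31 days
        with dbm.open(db_path, "c") as db:
            db[("%s:%s" % (title, yyyymm)).encode("utf-8")] = struct.pack("<31I", *counts)

    def fetch_month(db_path, title, yyyymm):
        key = ("%s:%s" % (title, yyyymm)).encode("utf-8")
        with dbm.open(db_path, "c") as db:
            try:
                return list(struct.unpack("<31I", db[key]))
            except KeyError:
                return None

The same layout maps naturally onto whatever key-value or column store we'd actually pick.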
[...]
projectcounts are aggregated by project, pagecounts are aggregated by page. If you looked at data it should be obvious ;-) And yes, probably best documentation was in some email somewhere. I should've started a decent project with descriptions and support and whatever. Maybe once we move data distribution back into WMF proper, there's no need for it to live nowadays somewhere in Germany.
The documentation needed here seems pretty straightforward. Like, a file at http://dammit.lt/wikistats/README that just explains the format of the data, what's included, and what's not. We've covered most of it in this thread already. All that's left is a basic explanation of what each field means in pagecounts/projectcounts. If you tell me these things, I'll even write it. :)
But the biggest improvement would be post-processing (cleaning up) the source files. Right now if there are anomalies in the data, every re-user is expected to find and fix these on their own. It's _incredibly_ inefficient for everyone to adjust the data (for encoding strangeness, for bad clients, for data manipulation, for page existence possibly, etc.) rather than having the source files come out cleaner.
Raw data is fascinating in that regard though - one can see what are bad clients, what are anomalies, how they encode titles, what are erroneous titles, etc. There're zillions of ways to do post-processing, and none of these will match all needs of every user.
Oh, totally! However, I think some uses are more common than others. I bet this covers them:
1. View counts for a subset of existing articles over a range of dates.
2. Sorted/limited aggregate stats (top 100, bottom 50, etc) for a subset of articles and date range.
3. Most popular non-existing (missing) articles for a project.
I feel like making those things easier would be awesome, and raw data would still be available for anyone who wants to build something else. I think Domas's dataset is great, and the above should be based on it.
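In terms of an interface, I'm picturing something only about this complicated, sketched against a hypothetical daily_views(project, title, day, views) table; all the names are made up:

    import sqlite3

    def view_counts(conn, project, titles, start, end):
        """Use case 1: per-day counts for a set of articles over a date range."""
        q = ("SELECT title, day, views FROM daily_views "
             "WHERE project = ? AND day BETWEEN ? AND ? AND title IN (%s)"
             % ",".join("?" * len(titles)))
        return conn.execute(q, [project, start, end] + list(titles)).fetchall()

    def top_articles(conn, project, start, end, limit=100):
        """Use case 2: sorted/limited aggregates for a project and date range.
        Use case 3 would be the same query run against the rows whose titles
        don't exist on the wiki."""
        return conn.execute(
            "SELECT title, SUM(views) AS total FROM daily_views "
            "WHERE project = ? AND day BETWEEN ? AND ? "
            "GROUP BY title ORDER BY total DESC LIMIT ?",
            (project, start, end, limit)).fetchall()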
Sure, it can be improved in many ways, including more data (some people ask (page,geography) aggregations, though with our long tail that is huuuuuge dataset growth ;-)
Absolutely. I think it makes sense to start by making the existing data more usable, and then potentially add more to it in the future.
I meant that it wouldn't be very difficult to write a script to take the raw data and put it into a public database on the Toolserver (which probably has enough hardware resources for this project currently).
I doubt Toolserver has enough resources to have this data thrown at it and queried more, unless you simplify needs a lot. There's 5G raw uncompressed data per day in text form, and long tail makes caching quite painful, unless you go for cache oblivious methods.
Yeah. The folks at trendingtopics.org are processing it all on an EC2 Hadoop cluster, and throwing the results in a SQL database. They have a very specific focus, though, so their methods might not be appropriate here. They're an excellent example of someone using the existing dataset in an interesting way, but the fact that they're using EC2 is telling: many people do not have the expertise to handle that sort of thing.
I think building an efficiently queryable set of all historic data is unrealistic without a separate cluster. We're talking 100GB/year, before indexing, which is about 400GB if we go back to 2008. I can imagine a workable solution that discards resolution as time passes, which is what most web stats generation packages do anyway. Here's an example:
Daily counts (and maybe hour of day averages) going back one month (~10GB)
Weekly counts, day of week and hour of day averages going back six months (~10GB)
Monthly stats (including averages) forever (~4GB/year)
That data could be kept in RAM, hashed across two machines, if we really wanted it to be fast. That's probably not necessary, but you get my point.
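That kind of decaying resolution is just a periodic rollup job. Roughly, against the same hypothetical daily_views table (and a monthly_views table to receive the sums):

    import sqlite3

    def roll_up_to_monthly(conn, cutoff_day):
        """Collapse daily rows older than cutoff_day ('YYYY-MM-DD') into
        per-month totals, then drop the daily rows. A weekly tier would
        work the same way with a different bucket expression."""
        conn.execute("""
            INSERT INTO monthly_views (project, title, month, views)
            SELECT project, title, substr(day, 1, 7), SUM(views)
            FROM daily_views
            WHERE day < ?
            GROUP BY project, title, substr(day, 1, 7)""", (cutoff_day,))
        conn.execute("DELETE FROM daily_views WHERE day < ?", (cutoff_day,))
        conn.commit()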
It's maintainability and sustainability that are the bigger concerns. Once you create a public database for something like this, people will want it to stick around indefinitely. That's quite a load to take on.
I'd love to see that all the data is preserved infinitely. It is one of most interesting datasets around, and its value for the future is quite incredible.
Agreed. 100GB a year is not a lot of data to *store* (especially if it's compressed). It's just a lot to interactively query.
I'm also likely being incredibly naïve, though I did note somewhere that it wouldn't be a particularly small undertaking to do this project well.
Well, initial work took few hours ;-) I guess by spending few more hours we could improve that, if we really knew what we want.
I think we're in a position to decide what we want.
Honestly, the investigation I've done while participating in this thread suggests that I can probably get what I want from the raw data. I'll just pull each day into an in-memory hash, update a database table, and move to the next day. It'll be slower than if the data was already hanging out in some hashed format (like Berkeley DB), but whatever. However, I need data for all articles, which is different from most use cases I think.
I'd like to assemble some examples of projects that need better data, so we know what it makes sense to build--what seems nice to have and what's actually useful is so often different.
I agree. By opening up the dataset I expected others to build upon that and create services. Apparently that doesn't happen. As lots of people use the data, I guess there is need for it, but not enough will to build anything for others to use, so it will end up being created in WMF proper.
Yeah. I think it's just a tough problem to solve for an outside contributor. It's hard to get around the need for hardware (which in turn must be managed and maintained).
Building a service where data would be shown on every article is relatively different task from just analytical workload support.
Yep, however it depends entirely on the same data. It's really just another post-processing step.
-Ian
I think building an efficiently queryable set of all historic data is unrealistic without a separate cluster. We're talking 100GB/year, before indexing, which is about 400GB if we go back to 2008.
[etc]
So, these numbers were based on my incorrect assumption that the data I was looking at was daily, but it's actually hourly. So, I guess, multiply everything by 24, and then disregard some of what I said there?
-Ian
Domas Mituzas wrote:
Any tips? :-) My thoughts were that the schema used by the GlobalUsage extension might be reusable here (storing wiki, page namespace ID, page namespace name, and page title).
I don't know what GlobalUsage does, but probably it is all wrong ;-)
GlobalUsage tracks file uses across a wiki family. Its schema is available here: http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/GlobalUsage/GlobalUsage.sql?view=log.
But the biggest improvement would be post-processing (cleaning up) the source files. Right now if there are anomalies in the data, every re-user is expected to find and fix these on their own. It's _incredibly_ inefficient for everyone to adjust the data (for encoding strangeness, for bad clients, for data manipulation, for page existence possibly, etc.) rather than having the source files come out cleaner.
Raw data is fascinating in that regard though - one can see what are bad clients, what are anomalies, how they encode titles, what are erroneous titles, etc. There're zillions of ways to do post-processing, and none of these will match all needs of every user.
Yes, so providing raw data alongside cleaner data or alongside SQL table dumps (similar to the current dumps for MediaWiki tables) might make more sense here.
I'd love to see that all the data is preserved infinitely. It is one of most interesting datasets around, and its value for the future is quite incredible.
Nemo has done some work to put the files on Internet Archive, I think.
The reality is that a large pile of data that's not easily queryable is directly equivalent to no data at all, for most users. Echoing what I said earlier, it doesn't make much sense for people to be continually forced to reinvent the wheel (post-processing raw data and putting it into a queryable format).
I agree. By opening up the dataset I expected others to build upon that and create services. Apparently that doesn't happen. As lots of people use the data, I guess there is need for it, but not enough will to build anything for others to use, so it will end up being created in WMF proper.
Building a service where data would be shown on every article is relatively different task from just analytical workload support. For now, building query-able service has been on my todo list, but there were too many initiatives around that suggested that someone else will do that ;-)
Yes, beyond Henrik's site, there really isn't much. It would probably help if Wikimedia stopped engaging in so much cookie-licking. That was part of the purpose of this thread: to clarify what Wikimedia is actually planning to invest in this endeavor.
Thank you for the detailed replies, Domas. :-)
MZMcBride
On Fri, Aug 12, 2011 at 12:21 PM, MZMcBride z@mzmcbride.com wrote:
Domas Mituzas wrote:
Building a service where data would be shown on every article is relatively different task from just analytical workload support. For now, building query-able service has been on my todo list, but there were too many initiatives around that suggested that someone else will do that ;-)
Yes, beyond Henrik's site, there really isn't much. It would probably help if Wikimedia stopped engaging in so much cookie-licking. That was part of the purpose of this thread: to clarify what Wikimedia is actually planning to invest in this endeavor.
If the question is "am I wasting my time if I work on this?", the answer is "almost certainly not", so please embark. It will almost certainly be valuable no matter what you do.
Now, the caveat on that is this: if you ask "will I feel like I've wasted my time?", the answer is more ambiguous, because I don't know what you expect. Even a proof of concept is valuable, but you probably don't want to write a mere proof of concept. So, if you want to increase the odds that your work will be more than a proof of concept, then there's more overhead.
Here's an extremely likely version of the future should you decide to do something here and you're successful in building something: you'll do something that gets a following. WMF hires a couple of engineers, and starts on working on the system. The two systems are complementary, and both end up having their own followings for different reasons. While it's likely that some future WMF system will eventually be capable of this, getting granular per-page statistics is something that hasn't been at the top of the priority list. In one "wasted time" scenario, we figure out that it wouldn't be *that* hard to do the same thing with the data we have, and we figure out how to provide an alternative. However, I suspect that day probably gets postponed because there would be some other system providing that function.
With any luck, if you build something, it will be in a state that we can actually work together on it at some point after we get the people we plan to hire hired. The more review you get from other people who understand the Wikimedia cluster, the more likely that case is.
Here's an extremely likely version of the future should you decide not to do something here: we won't build something like what you have in mind. So, the best way to guarantee what you want will exist is to build it.
Re: cookie licking. That's a side-effect of planning in the open. If we wait until we're sure a project is going to be successfully completed before we talk about it, we either won't be as open as we should be, or not taking the risks we should be, or both.
Rob