Are page view statistics (as in stats.grok.se) being imported to the toolserver and available in a database for quick reference?
I'm experimenting with using this as a tool for finding which articles need improvement: Among short stubs or articles in a watch category, I'm addressing the most popular articles first. The only thing I need is the aggregate number of page views in the previous month.
Lars Aronsson:
Are page view statistics (as in stats.grok.se) being imported to the toolserver and available in a database for quick reference?
A user has made this available in raw form at /mnt/user-store/stats. We are currently working on making this more official; that'll probably be announced in a month or two.
It's not available in the database yet, but that's something we're looking at doing. If anyone else has a particular reason to need this data, it would help if they could describe it, so we can decide how to format the data, and how detailed it needs to be.
- river.
On 3/29/2010 6:08 PM, River Tarnell wrote:
Lars Aronsson:
Are page view statistics (as in stats.grok.se) being imported to the toolserver and available in a database for quick reference?
A user has made this available in raw form at /mnt/user-store/stats. We are currently working on making this more official; that'll probably be announced in a month or two.
It's not available in the database yet, but that's something we're looking at doing. If anyone else has a particular reason to need this data, it would help if they could describe it, so we can decide how to format the data, and how detailed it needs to be.
- river.
I currently use the pageview data, for reasons similar to those Lars mentioned. I produce monthly pageview data reports for WikiProjects on enwiki.[1] Currently I only keep the data for the projects that are subscribed to the service (305 projects, ~2.3 million pages for April 2010).
If it were in the database, it could be easier, though I'd have to rewrite a lot of stuff. I only need monthly data for what I'm currently doing, but daily data could potentially be useful for other things.
The current (raw) format is fine for me, though it would be nice if the files used more consistent naming. Most are in the form pagecounts-YYYYMMDD-HH0000.gz, but every once in a while there is a pagecounts-YYYYMMDD-HH0001.gz and, very rarely, things like pagecounts-YYYYMMDD-HH2001.gz and a few other variations. Also, every once in a while, there are long delays before files are available, or files are missing entirely.
If there is interest in having this in the database somewhere, I might be able to help out in terms of coding, as much of what I have could be fairly easily adapted to extract the data for all pages on all projects, rather than just the ones on the English Wikipedia that my tool needs.
[1] http://toolserver.org/~alexz/pop/view.php
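The naming variations and missing files described in this message can be handled with a tolerant filename pattern. The following is a minimal sketch, assuming that the HH field always identifies the hour a file covers and that the trailing four digits can be ignored; the function names `parse_pagecounts_name` and `missing_hours` are hypothetical and not part of any existing tool.

```python
import re
from datetime import datetime, timedelta

# Accept any four trailing digits, since files occasionally end in
# -HH0001 or -HH2001 instead of the usual -HH0000.
PAGECOUNTS_RE = re.compile(r"^pagecounts-(\d{4})(\d{2})(\d{2})-(\d{2})\d{4}\.gz$")


def parse_pagecounts_name(filename):
    """Return the hour a pagecounts file covers, or None if the name does not fit."""
    match = PAGECOUNTS_RE.match(filename)
    if match is None:
        return None
    year, month, day, hour = (int(part) for part in match.groups())
    try:
        return datetime(year, month, day, hour)
    except ValueError:
        # Digits matched the pattern but do not form a valid date or hour.
        return None


def missing_hours(filenames, start, end):
    """List every hour in [start, end) for which no file is present."""
    present = {parse_pagecounts_name(name) for name in filenames}
    missing = []
    hour = start
    while hour < end:
        if hour not in present:
            missing.append(hour)
        hour += timedelta(hours=1)
    return missing
```

Running `missing_hours` over a directory listing for a whole month would show which hourly files are absent before any monthly totals are computed.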
On 29 Mar 2010, at 23:08, River Tarnell wrote:
Lars Aronsson:
Are page view statistics (as in stats.grok.se) being imported to the toolserver and available in a database for quick reference?
A user has made this available in raw form at /mnt/user-store/stats. We are currently working on making this more official; that'll probably be announced in a month or two.
It's not available in the database yet, but that's something we're looking at doing. If anyone else has a particular reason to need this data, it would help if they could describe it, so we can decide how to format the data, and how detailed it needs to be.
One thing I'd love to see is integration of these numbers into Magnus's GLAMorous tool [1] so that it says how many page views each page that an image is used in receives (per month, perhaps), and aggregate totals. That would be very useful when talking to GLAM institutes about future content partnerships.
Mike Peel (WMUK; personal viewpoint)
Hi,
On 30.03.2010, at 01:08, River Tarnell river.tarnell@wikimedia.de wrote:
If anyone else has a particular reason to need this data, it would help if they could describe it, so we can decide how to format the data, and how detailed it needs to be.
I think some tools could use this data as a "how popular is an article" measure for sorting results. For this, monthly data would be fine. Daily data would be great, too; maybe the sum of the last 30 days or so could be cached somewhere. More detailed data (such as hourly) seems too precise to me, but of course one could imagine tools for even per-minute data: e.g. the usage stats for articles relevant to a question on "Who Wants to Be a Millionaire" would be very interesting. But there are only a few such cases, so this could be done on the raw data if needed. So I would propose (if possible) daily data with a db view for either "last month" or, better, "last 30 days" (I don't know if something like this could be cached; maybe a cron job could update extra tables).
Sincerely, Christian Thiele / user:apper
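The "last 30 days" total proposed in this message could be computed from daily counts roughly as follows. This is only a sketch: the in-memory dictionaries stand in for whatever table or view would actually hold the daily data, and the names `last_n_days_total` and `rank_by_popularity` are hypothetical.

```python
from datetime import date, timedelta


def last_n_days_total(daily_counts, end_day, days=30):
    """Sum views over the `days` days ending on `end_day` (inclusive).

    `daily_counts` maps datetime.date -> view count; missing days count as 0.
    """
    total = 0
    for offset in range(days):
        total += daily_counts.get(end_day - timedelta(days=offset), 0)
    return total


def rank_by_popularity(per_page_daily, end_day, days=30):
    """Return (title, total) pairs sorted from most to least viewed."""
    totals = {
        title: last_n_days_total(counts, end_day, days)
        for title, counts in per_page_daily.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

A scheduled job could store the result of `last_n_days_total` per page once a day, so that tools sorting by popularity never have to touch the daily rows directly.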
On Tue, Mar 30, 2010 at 00:08, River Tarnell river.tarnell@wikimedia.de wrote:
Lars Aronsson:
Are page view statistics (as in stats.grok.se) being imported to the toolserver and available in a database for quick reference?
A user has made this available in raw form at /mnt/user-store/stats. We are currently working on making this more official; that'll probably be announced in a month or two.
It's not available in the database yet, but that's something we're looking at doing. If anyone else has a particular reason to need this data, it would help if they could describe it, so we can decide how to format the data, and how detailed it needs to be.
I use it for Wikitrends[1], which I'm in the process of moving to Toolserver. I like the current format, but the data is not perfect. It contains duplicates (probably pages that redirect) and encoding problems. Here are three examples that count as different pages in the dumps but point to the same Wikipedia page.
Same word but ISO-8859-1 vs. UTF-8:
F%F6rst%E4rkare F%C3%B6rst%C3%A4rkare
URL encoded vs. not URL encoded:
Adam_%26_Eva Adam_&_Eva
Typical redirect:
Adolf_Hitler Adolf_hitler
[1] http://users.student.lth.se/dt05jg2/wikitrends/en/24h.html
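The first two kinds of duplicate above can be merged by normalizing titles before counting. The sketch below assumes that titles are percent-encoded UTF-8 with ISO-8859-1 as the only fallback, and that the wiki capitalizes the first letter of every title (true for Wikipedia, not for every project). The names `normalize_title` and `merge_counts` are hypothetical. The third case (Adolf_hitler) is a real redirect and cannot be resolved this way; it would need a lookup in the wiki's redirect table.

```python
from collections import Counter
from urllib.parse import unquote_to_bytes


def normalize_title(raw_title):
    """Decode percent-escapes and return a canonical form of the title."""
    raw_bytes = unquote_to_bytes(raw_title)
    try:
        title = raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is ISO-8859-1.
        title = raw_bytes.decode("iso-8859-1")
    title = title.replace(" ", "_")
    # Assumption: first letter is case-insensitive, as on Wikipedia.
    return title[:1].upper() + title[1:]


def merge_counts(rows):
    """Sum view counts for (raw_title, views) pairs sharing a normalized title."""
    merged = Counter()
    for raw_title, views in rows:
        merged[normalize_title(raw_title)] += views
    return merged
```

With this, `F%F6rst%E4rkare` and `F%C3%B6rst%C3%A4rkare` end up under the same key, as do `Adam_%26_Eva` and `Adam_&_Eva`.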
- river.
Toolserver-l mailing list (Toolserver-l@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/toolserver-l Posting guidelines for this list: https://wiki.toolserver.org/view/Mailing_list_etiquette
River Tarnell wrote:
Are page view statistics (as in stats.grok.se) being imported to the toolserver and available in a database for quick reference?
A user has made this available in raw form at /mnt/user-store/stats. We are currently working on making this more official; that'll probably be announced in a month or two.
Being this user, I can only be glad if this happens :-) And I'll be glad to help set this up if I can.
Are you thinking of storing the raw data (something desperately needed, even though there are a few copies around), or just the formatted data?
It's not available in the database yet, but that's something we're looking at doing. If anyone else has a particular reason to need this data, it would help if they could describe it, so we can decide how to format the data, and how detailed it needs to be.
It seems that many people have been working with this data in completely independent ways; Erik Zachte has a set of scripts to reformat it (which I have used), but several other people have developed their own tools. It'd be nice to have a common format (better than the current "raw" data).
Interest in this goes beyond the toolserver, so I have just created a (very short, for now) page on meta.wikimedia.org to store information related to these files:
http://meta.wikimedia.org/wiki/Statistics/Consultation
If this information is already available elsewhere and I missed it, please let me know.
Frédéric
Frédéric Schütz wrote:
River Tarnell wrote:
It's not available in the database yet, but that's something we're looking at doing. If anyone else has a particular reason to need this data, it would help if they could describe it, so we can decide how to format the data, and how detailed it needs to be.
It seems that many people have been working with this data in completely independent ways; Erik Zachte has a set of scripts to reformat it (which I have used), but several other people have developed their own tools. It'd be nice to have a common format (better than the current "raw" data).
Rather than a common format, I think there should be a common API. Then it could store the data in whatever format is most efficient. We might agree on a common XML-based format which could be accessed from hundreds of languages... and be completely useless.
Platonides wrote:
It's not available in the database yet, but that's something we're looking at doing. If anyone else has a particular reason to need this data, it would help if they could describe it, so we can decide how to format the data, and how detailed it needs to be.
It seems that many people have been working with this data in completely independent ways; Erik Zachte has a set of scripts to reformat it (which I have used), but several other people have developed their own tools. It'd be nice to have a common format (better than the current "raw" data).
Rather than a common format, I think there should be a common API. Then it could store the data in whatever format is most efficient. We might agree on a common XML-based format which could be accessed from hundreds of languages... and be completely useless.
A common API would be nice, but that means having a service able to provide the data. At the moment, it is already quite difficult for people to access the files containing the raw data -- well, it is not *so* difficult, but there is no central location containing all the data, and the files are cumbersome to read and process (as already mentioned here).
So in the short run, we need an easy way to provide access to the data files. Anything above that will be icing on the cake.
Frédéric
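The two positions in this exchange (a common API versus easy access to the raw files) are compatible: a small interface can sit on top of the raw files now and be backed by a database later. The sketch below assumes each raw line has four whitespace-separated fields (project, page title, view count, bytes transferred); that layout is an assumption, not something stated in this thread, and the class names `PageViewSource` and `RawLinesSource` are hypothetical.

```python
from abc import ABC, abstractmethod
from collections import Counter


class PageViewSource(ABC):
    """What a tool needs to know; how the data is stored is hidden."""

    @abstractmethod
    def views(self, project, title):
        """Return the total number of views recorded for one page."""


class RawLinesSource(PageViewSource):
    """Backend that totals counts from an iterable of raw text lines."""

    def __init__(self, lines):
        self._totals = Counter()
        for line in lines:
            fields = line.split()
            # Assumed layout: project, title, count, bytes. Skip anything else.
            if len(fields) != 4 or not fields[2].isdigit():
                continue
            project, title, count, _size = fields
            self._totals[(project, title)] += int(count)

    def views(self, project, title):
        # Counter returns 0 for pages that never appeared.
        return self._totals[(project, title)]
```

The lines can come from any source, for example a file opened with `gzip.open(path, "rt", encoding="utf-8", errors="replace")`; a later database-backed class would only need to implement the same `views` method.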