Dear Analytics Team,
I am an M.Sc. student at Copenhagen Business School. For my Master's thesis I would like to use page view data from certain Wikipedia articles. I found out that in July 2015 a new API was created which delivers this data. However, for my project I have to use data from before 2015. In my further search I found out that the old page view data exists (https://dumps.wikimedia.org/other/pagecounts-raw/) and that until March 2017 it could be queried using stats.grok.se. Unfortunately, this site no longer exists, which is why I can no longer filter and query the raw .gz data through a web page.
Is there any way to get page view data for certain articles from before July 2015?
Thanks a lot and best regards,
Lars Hillebrand
PS: I am conducting my research in R, and for the post-2015 data the package “pageviews” works great.
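For reference, this is roughly how I query the post-2015 data with it - a minimal sketch, with argument names as I understand them from the package documentation:

  library(pageviews)
  # Example article chosen purely for illustration.
  cbs <- article_pageviews(project = "en.wikipedia",
                           article = "Copenhagen_Business_School",
                           start   = "2015070100",  # the API's coverage begins July 2015
                           end     = "2017120100",
                           granularity = "daily")
  head(cbs)  # one row per day, including date and views columns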
Hi Lars,
You have a couple of options:
1. Download the data in lossless compressed form: https://dumps.wikimedia.org/other/pagecounts-ez/ . The format is clever and doesn't lose granularity, and it should be a lot quicker than pagecounts-raw (this is basically what stats.grok.se did with the data as well, so downloading this way should be equivalent). See the sketch below for what this can look like.
2. Work on Toolforge, a virtual cloud that's on the same network as the data, so getting the data is a lot faster and you can use our compute resources (free, of course): https://wikitech.wikimedia.org/wiki/Portal:Toolforge
If you decide to go with the second option, the IRC channel where they support folks like you is #wikimedia-cloud and you can always find me there as milimetric.
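To give you a feel for option 1, here is a rough R sketch (since you mentioned R) of pulling one article's rows out of a pagecounts-ez monthly file. The file name below is a made-up example and the line layout is from memory, so check the directory listing and the format notes on the pagecounts-ez page before relying on it:

  # Hypothetical file name -- check https://dumps.wikimedia.org/other/pagecounts-ez/merged/
  # for the real ones.
  url  <- "https://dumps.wikimedia.org/other/pagecounts-ez/merged/pagecounts-2014-01-views-ge-5.bz2"
  dest <- basename(url)
  download.file(url, dest, mode = "wb")

  con  <- bzfile(dest, open = "r")
  hits <- character(0)
  while (length(batch <- readLines(con, n = 500000)) > 0) {
    # Lines are space-separated: project code, title, total, compactly encoded hourly counts.
    hits <- c(hits, grep(" Copenhagen_Business_School ", batch, value = TRUE, fixed = TRUE))
  }
  close(con)
  hits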
Dan,
One clarification point I'd make is that while the data is lossless for 30M articles, it is 100% lossy for redirects, old page names, or pages created after September 2013, correct?
John
John: I think you may have gotten the wrong impression from some description, and I'm not sure what you were looking at. As far as I know, pagecounts-ez is the most comprehensive dataset we have with pageviews from as early as we started tracking them. It should have all articles, regardless when they were created, regardless whether they're redirects or not. If you find evidence to the contrary, either in docs or the data itself, please let me know.
Tilman: thanks very much for the docs update, I'm never quite sure what is and isn't clear, and I'm afraid we have a mountain of documentation that might defeat its own purpose.
Dan,
Thanks for the clarification - digging into the files, I see that there are redirects and more than 30M titles.
My view had been informed by the documentation at https://dumps.wikimedia.org/other/pagecounts-ez/:
Hourly page views per article for around 30 million article titles (Sept 2013) in around 800+ Wikimedia wikis. Repackaged (with extreme shrinkage, without losing granularity), corrected, reformatted. Daily files and two monthly files (see notes below).
Regarding the claim that pagecounts-ez has data going back to when Wikimedia started tracking pageviews, I'll point out another error in the documentation that may have led to that view. The documentation claims that data is available from 2007 onward:
From 2007 to May 2015: derived from Domas' pagecount/projectcount files
However, if you check out the actual files (https://dumps.wikimedia.org/other/pagecounts-ez/merged/), you'll see that the pagecounts only go back to late 2011.
I never bothered with pagecounts-ez because of the belief that only 30 million articles were covered in the datasets, and because the data isn't available for the 2008-2011 period. I always jumped straight to pagecounts-raw because we need access to both newer and older page titles, and to avoid merging multiple formats when building a dataset reaching back to 2008.
Now that I know that the dataset has broader coverage, I would find it extremely helpful if some jobs could be run to generate pagecounts-ez from 2008-2011.
Ah, yes, but the projectcount files go back to December 2007; that's where the confusion comes from. We should clarify the docs or generate the old data. I'm not sure how easy that is, but I think it's fairly straightforward, and I've opened a task for it: https://phabricator.wikimedia.org/T188041 (we have a lot of work in our backlog, though, so we probably won't be able to get to this for a bit).
Dear List-eners,
I'm writing in to argue the case for a Wikipedia effort to make something like stats.grok.se (page views per day per article, from 2007 onwards) available again.
I am the author of the first R package that provided easy access to pageview counts, by querying the stats.grok.se service and translating the results into neat little R data frames.
Since stats.grok.se went away, somebody writes in about once a month - mostly from academia - asking about the status of page view data for the time before late 2015: counts, per article, per day. To underline this further: the R pageviews package written by one of your former colleagues has over 7000 downloads within 2 years, while my package has 14000 within 4 years (and these are conservative numbers, because they stem from one particular CRAN mirror only).
I made some efforts to reconstruct the service that stats.grok.se provided, but it's not a trivial endeavour as far as I can see: BIG data, demanding computing time, storage, and bandwidth, plus some thinking about how to re-arrange and aggregate the data so it can be queried and served efficiently - not to mention that the data is raw, meaning it needs proper cleaning before use, and that hosting itself needs resources. And so my efforts have gone nowhere.
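To make the aggregation step concrete, this is roughly the kind of thing one ends up writing against pagecounts-raw (a sketch only: each hourly file has space-separated lines of project, title, count, bytes; exact timestamps in the file names vary, and memory becomes the real problem beyond a single wiki and day):

  day   <- "20110101"
  files <- sprintf("pagecounts-%s-%s0000.gz", day, sprintf("%02d", 0:23))  # 24 hourly files

  read_hour <- function(f) {
    parts <- strsplit(readLines(gzfile(f)), " ", fixed = TRUE)
    keep  <- vapply(parts, function(p) length(p) >= 3 && p[1] == "en", logical(1))
    data.frame(title = vapply(parts[keep], `[`, character(1), 2),
               count = as.integer(vapply(parts[keep], `[`, character(1), 3)),
               stringsAsFactors = FALSE)
  }

  hourly <- do.call(rbind, lapply(files, read_hour))           # one wiki, one day
  daily  <- aggregate(count ~ title, data = hourly, FUN = sum) # per-article daily totals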
Would it not be nice if Wikipedia could jump in and support research by going the whole mile and making those page counts available?
In regard to prioritizing - I am sure you have a long backlog - I would argue that this is a real multiplier: it enables a lot of people to start researching. Daily page counts are not that fancy, but without them people are simply blocked. They cannot start, because they can't even get a basic idea of how popular an article was on a given day.
Best Peter
PS: I would be willing to put in some time to help you folks in any way I can.
Peter, the data you mention here is quite large, and storage is cheap but not free. For now, we don't have capacity to serve that kind of timespan from the API, but we will work to improve the dumps version so it's more comprehensive.
Like dumps at the article-day level? That would already be super awesome - much better than the current state.
Best, Peter
Peter:
Do submit a Phabricator task with your request; it'll be easier to follow up on there than via e-mail. Our backlog: https://phabricator.wikimedia.org/tag/analytics/
I assume you know that per-article views are available since 2015; one way to see those: https://tools.wmflabs.org/pageviews/
Per-project views are available from early on, in either downloadable files or programmatic form: https://wikitech.wikimedia.org/wiki/Analytics/AQS/Legacy_Pagecounts
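For the programmatic form, a quick R sketch against the legacy pagecounts endpoint that the wikitech page above documents (the URL layout is taken from those docs, so double-check it there):

  library(jsonlite)
  url <- paste("https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate",
               "en.wikipedia", "all-sites", "daily", "2008010100", "2008020100",
               sep = "/")
  res <- fromJSON(url)
  head(res$items)  # per-day project-level totals with timestamp and count fields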
Thanks,
Nuria
FYI, there is a Phabricator task to load the legacy per-article pagecounts into AQS: https://phabricator.wikimedia.org/T173720
That task arose from a discussion on this mailing list mid-last year: https://www.mail-archive.com/analytics@lists.wikimedia.org/msg04349.html https://www.mail-archive.com/analytics@lists.wikimedia.org/msg04350.html
Cheers, Scott
Thanks, Scott - I failed to find that task and incorrectly assumed we had declined it. My fault; we'll see about loading that data, then.
And yes, Peter, per-article dumps are already there, but they're split across pagecounts-raw from 2008-2011 and pagecounts-ez after that. The conversation before you posted was about getting pagecounts-ez to include all available history at the per-article level, since pagecounts-ez is the most convenient and fastest way to get at this data.
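Until then, anyone stitching the full per-article history together has to dispatch on date; something like this hypothetical helper (the late-2011 boundary is from this thread, so verify the exact cutoff against the directory listings):

  # Hypothetical helper: which dump family covers a given date.
  source_for <- function(d) {
    if (as.Date(d) < as.Date("2011-12-01"))
      "https://dumps.wikimedia.org/other/pagecounts-raw/"  # hourly, per article
    else
      "https://dumps.wikimedia.org/other/pagecounts-ez/"   # compacted, per article
  }
  source_for("2009-06-15")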
Thanks, Dan! We should try to get this kind of information updated in the actual documentation; I just added your remarks to the page about pagecounts-raw (https://wikitech.wikimedia.org/wiki/Analytics/Archive/Data/Pagecounts-raw), where the pagecounts-ez alternative had not been mentioned yet.
On Wed, Feb 21, 2018 at 11:26 AM, Dan Andreescu <dandreescu@wikimedia.org> wrote:
2. work on Toolforge, a virtual cloud that's on the same network as the data, so getting the data is a lot faster [...]: https://wikitech.wikimedia.org/wiki/Portal:Toolforge
More specifically, there is https://wikitech.wikimedia.org/wiki/PAWS. (And I assume "getting the data" meant transferring the files from dumps.wikimedia.org, correct?)