Hi,
I am working with the pagecounts data for a project, and I have noticed that the data for September 21-30, 2009 are missing [1].
However, I also found that some data for that period are available from the Internet Archive [2].
I also see that https://stats.grok.se somehow has some data available (see, for example, the page view data for "Influenza" on en.wiki [3]).
So, my question is: are there issues with using the data from the IA? I checked that, apart from the file pagecounts-20090921-160000.gz, which seems incomplete on dumps.wikimedia.org, the rest of the files are the same.
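For anyone who wants to repeat that kind of comparison, here is a minimal sketch, assuming both copies have already been downloaded into two local directories. The function names and the directory layout are hypothetical, and SHA-256 is just one reasonable choice of checksum; this is not necessarily how the check above was done.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def differing_files(dir_a, dir_b, pattern="pagecounts-*.gz"):
    """Return sorted names of files present in both directories whose contents differ."""
    dir_a, dir_b = Path(dir_a), Path(dir_b)
    names_a = {p.name for p in dir_a.glob(pattern)}
    names_b = {p.name for p in dir_b.glob(pattern)}
    return sorted(
        name
        for name in names_a & names_b
        if sha256_of(dir_a / name) != sha256_of(dir_b / name)
    )
```

Calling `differing_files("wikimedia/2009-09", "archive_org/2009-09")` (both paths are placeholders) would list only the files that exist in both places but are not byte-identical; files present in just one directory are ignored.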
Thanks in advance for your help.
Cristian

[1]: https://dumps.wikimedia.org/other/pagecounts-raw/2009/2009-09/
[2]: https://archive.org/details/wikipedia_visitor_stats_200909
[3]: http://stats.grok.se/en/200909/Influenza
> are there issues with using the data from the IA?
Since that long predates our team's record keeping of data issues, the answer is that we do not know. Maybe someone on this list can chip in, and we will add the answer to our list of known dataset issues, which can be found here:
https://wikitech.wikimedia.org/wiki/Analytics/Archive/Data/Pagecounts-raw#Ev...
On Mon, May 1, 2017 at 7:51 AM, Cristian Consonni cristian@balist.es wrote:
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Hi,
On 01/05/2017 18:18, Nuria Ruiz wrote:
>> are there issues with using the data from the IA?
> Since that much predates our team record keeping of data issues the answer is that we do not know. Maybe someone in this list can chip in and we will add this answer to our dataset known issues which can be found here:
> https://wikitech.wikimedia.org/wiki/Analytics/Archive/Data/Pagecounts-raw#Ev...
I should add that a handful of files from October 2011 [1] are incorrect: they are not compressed and appear to be HTML pages (also, they are about 92 KB each instead of roughly 85 MB).
Again, the corresponding files from the Internet Archive [2] seem to be OK.
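A quick way to spot files like these is to look at the first two bytes: every gzip stream starts with the magic number 0x1f 0x8b, whereas an HTML page saved under a .gz name will not. The following is a minimal sketch, assuming the files are in a local directory; the function names are hypothetical, and it only checks the header, not whether the archive is complete.

```python
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def looks_like_gzip(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def suspicious_files(directory, pattern="pagecounts-*.gz"):
    """Return sorted names of matching files that do not start with the gzip header."""
    return sorted(
        p.name for p in Path(directory).glob(pattern) if not looks_like_gzip(p)
    )
```

A file that passes this check can still be truncated, so fully decompressing it (for example with `gzip -t`) remains the stronger test.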
Cristian
[1]: https://dumps.wikimedia.org/other/pagecounts-raw/2011/2011-10/
Specifically, the following:
* pagecounts-20111008-180001.gz
* pagecounts-20111008-190000.gz
* pagecounts-20111008-200000.gz
* pagecounts-20111008-210000.gz
* pagecounts-20111008-220000.gz
[2]: https://archive.org/details/wikipedia_visitor_stats_201110
Hi Cristian,
I also do not recall what happened, but those files may have been removed on purpose because the data were corrupt or had other issues.
Cheers!
On Fri, May 5, 2017 at 9:36 AM, Cristian Consonni cristian@balist.es wrote:
> Hi,
> On 01/05/2017 18:18, Nuria Ruiz wrote:
>>> are there issues with using the data from the IA?
>> Since that much predates our team record keeping of data issues the answer is that we do not know. Maybe someone in this list can chip in and we will add this answer to our dataset known issues which can be found here:
>> Data/Pagecounts-raw#Events_and_known_problems_since_2014-03-01
> I should add that there are a handful of files in October 2011[1] that are incorrect, as they are not compressed and appear to be HTML pages (also, they are 92KB files instead of being ~85 MB)
> Again, the files from Internet Archive[2] seem to be OK.
> Cristian
> [1] https://dumps.wikimedia.org/other/pagecounts-raw/2011/2011-10/ Specifically, the following:
> - pagecounts-20111008-180001.gz
> - pagecounts-20111008-190000.gz
> - pagecounts-20111008-200000.gz
> - pagecounts-20111008-210000.gz
> - pagecounts-20111008-220000.gz