Dear wiki analytics team
I am looking at your pagecounts as archived on https://dumps.wikimedia.org/other/pagecounts-raw/2015/2015-12/ Can you tell me from what timezone the time stamps originate?
The filename pagecounts-20151224-070000 indicates, I assume, pagecounts from 7 AM to 8 AM. In what timezone is this 7 AM to 8 AM period: GMT, or UTC plus or minus how many hours?
Thanks very much, Maurice Vergeer
Maurice Vergeer, 24/12/2015 10:16:
I am looking at your pagecounts as archived on https://dumps.wikimedia.org/other/pagecounts-raw/2015/2015-12/ Can you tell me from what timezone the time stamps originate?
Any and all timestamps in dumps.wikimedia.org are in UTC. Apparently this is not as obvious as generally thought, so I've added a note: https://wikitech.wikimedia.org/w/index.php?title=Analytics%2FData%2FPagecoun...
Nemo
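Since the timestamps are in UTC, relating a dump file to local events in the Netherlands means converting the filename's timestamp. The following is a minimal sketch, not something posted in this thread: the helper name `filename_to_local` is hypothetical, and `Europe/Amsterdam` is assumed as the relevant zone. It only converts the timestamp in the filename; whether that timestamp marks the start or the end of the counted hour is not settled here and should be checked against the dataset documentation.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library since Python 3.9


def filename_to_local(filename: str, tz_name: str = "Europe/Amsterdam") -> datetime:
    """Parse the UTC timestamp in a dump filename and convert it to local time.

    Hypothetical helper: assumes names shaped like "pagecounts-YYYYMMDD-HHMMSS",
    optionally followed by ".gz".
    """
    # Drop an optional ".gz" suffix, then keep everything after the first "-".
    stamp = filename.removesuffix(".gz").split("-", 1)[1]
    # The dumps use UTC, so attach the UTC zone explicitly before converting.
    utc_time = datetime.strptime(stamp, "%Y%m%d-%H%M%S").replace(tzinfo=timezone.utc)
    return utc_time.astimezone(ZoneInfo(tz_name))


local = filename_to_local("pagecounts-20151224-070000")
print(local.isoformat())  # 2015-12-24T08:00:00+01:00
```

Using a named zone rather than a fixed one-hour offset lets the same code handle summer dates, when the Netherlands is two hours ahead of UTC.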
Dear Federico Leva, thank you very much for the clarification and the quick reply.
Because I want to relate the pagecounts to events taking place in the Netherlands (e.g. televised political debates), it is important to know the timezone.
Again, thanks and best regards, Maurice Vergeer
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Maurice, if you're looking for recent data, we have a better source: dumps.wikimedia.org/other/pageviews/
This is better than pagecounts-raw because it excludes spider (crawler) traffic and some other automated traffic, and it includes mobile traffic. We have been slow to announce this and change the pages because it's a confusing change.
-- ________________________________________________ Maurice Vergeer To contact me, see http://mauricevergeer.nl/node/5 To see my publications, see http://mauricevergeer.nl/node/1 ________________________________________________
Dear Dan,
Thanks for this information. I looked at the new data. Are these data also available per subject, or only aggregated for the entire site? I am mainly interested in statistics about specific pages in Wikipedia. For now the traditional measurements are sufficient: I assume spider traffic is more or less random on a daily basis, and because I will do time-series analysis, this will have little effect on the findings. On a similar topic, when a user is logged in to edit a page and refreshes the Wikipedia page, does this register as a visit? Or do only visits by users who are not logged in register?
To conclude, I think Wikipedia data are very interesting for scientific study. I've seen some studies in information science, but in my field, communication science, very little. I hope to change that with my contribution :-)
Again, thanks Dan, Maurice
Maurice, the data per-article is available starting in October: http://dumps.wikimedia.org/other/pageviews/2015/2015-10/
We have the data, so we could backfill to May 2015, but not beyond that. The backfill would take quite a long time, however, because there's a lot of data to crunch through, so we have decided not to kick it off until people ask us for it. Do ask if that would be useful.
If you need data going further back, we have the pagecounts-all-sites dataset, which at least includes mobile data as well. This is available per-article starting in late 2014.
As you can see, this is confusing, which is why I just started a thread on this list about simplifying it. Please chime in there if you have an opinion about the cleanup.
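For readers who want per-article numbers out of an hourly file, here is a rough sketch of filtering for one page. It is not from this thread and rests on an assumption that should be checked against the dataset documentation: that each line has four whitespace-separated fields (project code, page title, request count, bytes). The function name `counts_for_page` and the sample lines are made up for illustration.

```python
from typing import Iterable, Iterator


def counts_for_page(lines: Iterable[str], project: str, title: str) -> Iterator[int]:
    """Yield the request count from every line matching one project and page.

    Assumed line layout (not confirmed in this thread):
    "<project> <page_title> <count> <bytes>"
    """
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip lines that do not fit the assumed layout
        proj, page, count, _size = parts
        if proj == project and page == title:
            yield int(count)


# Made-up sample lines, only to show the shape of the input.
sample = [
    "nl Kerstmis 120 3456789",
    "nl Amsterdam 45 987654",
    "en Christmas 900 12345678",
    "malformed line",
]
total = sum(counts_for_page(sample, "nl", "Kerstmis"))
print(total)  # 120
```

For a real hourly file, the same function could be fed the lines of the decompressed file, for example via `gzip.open(path, "rt", encoding="utf-8")`.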
Dear Dan, I just read your thread. It's much clearer to me now. I will chime in on a possible cleanup. It is important to keep old data available in some way, even if they are imperfect. There's no such thing as perfect data.
Best regards, Maurice
Hi Dan,
Yes, this dataset is of paramount importance to us in the research community. It would be awesome if you backfilled the newer dataset. The older data is still extremely useful, even with crawlers. Many people will be more than happy to help here.
Best, Ahmed