I would like to graph the correlation between file namespace page views and MediaViewer image views. Back when MediaViewer was launched, I added a namespace parameter to NavigationTiming to be able to track per-namespace pageviews, but I messed up and it only got deployed around the time MediaViewer was enabled on Commons, so we have no data for the early steps of the deploy process.
Do you know of any other source for per-namespace pageview data that is still available for the April-June 2014 period? Technically the raw pagecount files contain the information, but aggregating those would be a horribly complicated way of getting it. Does the Hadoop pageview data go back that far?
thanks Gergő
Someone else can probably provide details on your other questions; this is what I think I can help with:
> Does Hadoop pageview data go back that far?
Hadoop data is only for the last 30 days.
> Back when MediaViewer was launched, I added a namespace parameter to NavigationTiming to be able to track per-namespace pageviews,
NavigationTiming is heavily sampled, so I am not sure you could estimate pageviews from the scarce dataset it provides; I would say it is not possible.
We get 120,000 requests a second. We're not storing them all for six months. But we do have sampled logs going back that far.
On Wed, Jan 7, 2015 at 6:25 PM, Oliver Keyes okeyes@wikimedia.org wrote:
> We get 120,000 requests a second. We're not storing them all for six months. But we do have sampled logs going back that far.
That would be great! Are those in Hadoop?
On Wed, Jan 7, 2015 at 11:36 PM, Oliver Keyes okeyes@wikimedia.org wrote:
> Not particularly, I don't think - except to remember that namespace names are localised, so you're going to have a whale of a time matching them (unless you just look for file endings, I guess).
In the case of NavigationTiming the namespace ID is recorded, so that wasn't a problem; but it was only added around May, so for the period before that there is no namespace information at all.
The localized file namespaces don't sound so bad: I can look up all the translations on Translatewiki and construct a regexp or a similar condition. There could be fun exceptions, like namespace translations that have changed recently, but I would be fine with assuming the error caused by those is not significant.
On Thu, Jan 8, 2015 at 3:02 AM, Gergo Tisza gtisza@wikimedia.org wrote:
> That would be great! Are those in Hadoop?
They're on stat1002 in /a/squid/archive/sampled/
And the webrequest format is: https://wikitech.wikimedia.org/wiki/Cache_log_format
Note that the namespaces only show up in the page title in the raw URL, so it's still going to be a bit painful to parse them out. But folks around here have done stuff like that; maybe someone can chime in with some handy scripts?
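As a starting point, something like this rough sketch might do. The field layout and URL shape are assumptions based on the cache log format page above, and the prefix list is just a placeholder, so treat it as pseudocode to adapt against the real files rather than a working script:

# Rough sketch: count File-namespace page views in one sampled (1:1000) log file.
import gzip
import re
import sys
from urllib.parse import unquote_plus

# Hypothetical prefix list; in practice build it per wiki from the namespace
# names and aliases (see the API sketch further down in the thread).
FILE_PREFIXES = ('File:', 'Image:', 'Datei:', 'Fichier:', 'Archivo:')

# Assumes the requested URL appears somewhere in the line as a full http(s) URL.
url_re = re.compile(r'https?://[^/\s]+/wiki/([^?#\s]+)')

sampled_hits = 0
with gzip.open(sys.argv[1], 'rt', errors='replace') as log:
    for line in log:
        m = url_re.search(line)
        if m and unquote_plus(m.group(1)).startswith(FILE_PREFIXES):
            sampled_hits += 1

# The sampled logs keep roughly one request in a thousand.
print(f'sampled hits: {sampled_hits}, estimated requests: {sampled_hits * 1000}')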
On 8 January 2015 at 03:02, Gergo Tisza gtisza@wikimedia.org wrote:
> The localized file namespaces don't sound so bad: I can look up all the translations on Translatewiki and construct a regexp or a similar condition. There could be fun exceptions, like namespace translations that have changed recently, but I would be fine with assuming the error caused by those is not significant.
Well, yes; a 750-option regex run over 6 million rows for a day of data. A whale of a time ;p. You can also just use the API's namespaceNames and namespaceAliases code.
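Something along these lines, for instance. Untested, and it assumes the standard siteinfo API (action=query&meta=siteinfo&siprop=namespaces|namespacealiases) and that namespace ID 6 is File on every wiki; the example wiki at the bottom is just illustrative:

# Sketch: build a regex matching File-namespace titles on one wiki, from the
# localized namespace names and aliases returned by the siteinfo API.
import json
import re
from urllib.parse import urlencode
from urllib.request import urlopen

def file_namespace_regex(api_url):
    params = urlencode({
        'action': 'query',
        'meta': 'siteinfo',
        'siprop': 'namespaces|namespacealiases',
        'format': 'json',
    })
    with urlopen('%s?%s' % (api_url, params)) as resp:
        data = json.load(resp)

    names = set()
    ns = data['query']['namespaces'].get('6', {})      # 6 = File namespace
    if '*' in ns:
        names.add(ns['*'])                              # localized name, e.g. "Datei"
    if ns.get('canonical'):
        names.add(ns['canonical'])                      # canonical "File"
    for alias in data['query'].get('namespacealiases', []):
        if alias.get('id') == 6:
            names.add(alias['*'])                       # e.g. "Image", "Bild"

    alternatives = '|'.join(re.escape(n) for n in sorted(names))
    return re.compile(r'^(?:%s):' % alternatives, re.IGNORECASE)

# e.g. file_namespace_regex('https://de.wikipedia.org/w/api.php').match('Datei:Foo.jpg')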
On Wed, Jan 7, 2015 at 5:59 PM, Nuria Ruiz nuria@wikimedia.org wrote:
> > Back when MediaViewer was launched, I added a namespace parameter to NavigationTiming to be able to track per-namespace pageviews,
> NavigationTiming is heavily sampled, so I am not sure you could estimate pageviews from the scarce dataset it provides; I would say it is not possible.
It uses 1:1000 random sampling, so I have to count the log events and multiply by 1000 to get a good estimation. Am I missing something?
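(For what it's worth, with event counts in the thousands the pure sampling noise of that extrapolation is small. A quick sanity check with made-up numbers:)

import math

sampled_events = 4200                 # hypothetical number of NavigationTiming events counted
estimate = sampled_events * 1000      # 1:1000 sampling, so multiply back up
# Treating the sampled count as roughly Poisson, the relative standard error is ~1/sqrt(n).
stderr = estimate / math.sqrt(sampled_events)
print(f'estimated pageviews: {estimate} +/- {stderr:.0f}')   # about 4,200,000 +/- 65,000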
On 8 January 2015 at 02:12, Gergo Tisza gtisza@wikimedia.org wrote:
> It uses 1:1000 random sampling, so I have to count the log events and multiply by 1000 to get a good estimation. Am I missing something?
Not particularly, I don't think - except to remember that namespace names are localised, so you're going to have a whale of a time matching them (unless you just look for file endings, I guess).
> It uses 1:1000 random sampling, so I have to count the log events and multiply by 1000 to get a good estimation. Am I missing something?
Quite a bit, actually. Mostly that the reporting is only available in "some" browsers (the majority, but not all), and that only the main document is counted, while a pageview is more than the request for the main document. For example, you will not get all the 301s/302s or images, and there are many, many other details.
See pageview definition: https://meta.wikimedia.org/wiki/Research:Page_view
The good source for recent pageview data is Hadoop; going back a bit, the well-loved webstatscollector files provide that info: http://dumps.wikimedia.org/other/pagecounts-all-sites/
> The good source for recent pageview data is Hadoop; going back a bit, the well-loved webstatscollector files provide that info:
Sorry, I meant to send two links:
http://dumps.wikimedia.org/other/pagecounts-all-sites/ -> this is data from Hadoop
http://dumps.wikimedia.org/other/pagecounts-raw/ -> this is data from webstatscollector
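If you do end up aggregating the pagecounts files yourself, a per-hour tally isn't too bad. A rough sketch, assuming the usual "project title count bytes" line format; the project code and prefix list here are illustrative, and the prefixes should really come from the namespace names and aliases discussed above:

# Sketch: tally File-namespace views for one wiki from one hourly pagecounts file.
import gzip
import sys

PROJECT = 'en'                         # project code for English Wikipedia in these files
FILE_PREFIXES = ('File:', 'Image:')    # hypothetical; build from namespace names/aliases

views = 0
with gzip.open(sys.argv[1], 'rt', errors='replace') as f:    # one hourly pagecounts-*.gz file
    for line in f:
        fields = line.split(' ')
        if len(fields) != 4:
            continue
        project, title, count, _size = fields
        if project == PROJECT and title.startswith(FILE_PREFIXES):
            views += int(count)

print(f'File-namespace views in this hour for {PROJECT}: {views}')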