This depends on [1], so we won't need it immediately, but in order to help Erik Zachte with his RfC [2] to track unique media views in Media Viewer, I'm going to need something almost exactly like EventLogging. The main difference is that it should skip writing to the database and write to a log file instead.
That's because we'll be recording around 20-25M image views per day, which would needlessly overload EventLogging for little purpose since the data will be used for offline stats generation and doesn't need to be made available in a relational database. Of course if storage space and EventLogging capacity were no object, we could just use EL and keep the ever-growing table forever, but I have the impression that we want to be reasonable here and only write to a log, since that's what Erik needs.
So here's the question: for a specific schema, can EventLogging work the way it does but only record hits to a log file (maybe it already does that before hitting the DB?) and not write to the DB? If not, how difficult would it be to make EL capable of doing that?
[1] https://phabricator.wikimedia.org/T44815
[2] https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts
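To make the question concrete, here is a minimal Python sketch of per-schema routing: every event is written to a log, and events for selected schemas skip the database. This is not EventLogging's actual code or API. The function name, the writer callbacks, and the schema name MediaViewerImageView are all hypothetical and only illustrate the behaviour being asked about.

```python
import json
from typing import Callable, FrozenSet, Iterable

# Hypothetical schema name; the real schema had not been created yet.
LOG_ONLY_SCHEMAS: FrozenSet[str] = frozenset({"MediaViewerImageView"})


def route_events(
    events: Iterable[dict],
    write_log: Callable[[str], None],
    write_db: Callable[[dict], None],
    log_only_schemas: FrozenSet[str] = LOG_ONLY_SCHEMAS,
) -> None:
    """Write every event to the log; skip the DB for log-only schemas."""
    for event in events:
        # Each event becomes one JSON line in the log.
        write_log(json.dumps(event, sort_keys=True))
        # Only schemas outside the log-only set reach the database writer.
        if event.get("schema") not in log_only_schemas:
            write_db(event)


# Usage with in-memory stand-ins for the log file and the database.
log_lines, db_rows = [], []
route_events(
    [
        {"schema": "MediaViewerImageView", "event": {"file": "A.jpg"}},
        {"schema": "OtherSchema", "event": {"x": 1}},
    ],
    log_lines.append,
    db_rows.append,
)
```

After this runs, log_lines holds both events while db_rows holds only the OtherSchema event.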
Hi Gilles -- why won't the page view logs work by themselves for this purpose? EL can be configured to write into Hadoop, which is probably the best way to get the throughput you need, but it seems overcomplicated.
-Toby
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Just to clarify, this is about prefetched images which have not been shown to the public.
They were sent to the browser ahead of a possible request to speed things up, but in many cases were never actually requested.
https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts#Prefetched_images
- Erik
Right -- couldn't we just tag the URL?
You mean attach an X-Analytics parameter for extra images beyond the one the user initially requested?
But then we would undercount, basically missing all image views that come from clicking the right arrow in the image viewer.
I'm not sure how much we would miss then.
IIRC Gilles said this browsing feature was used quite a lot, but I'm not sure.
EventLogging data currently does go to files, as well as to the DB. Check it out on stat1003 at /srv/eventlogging/archive.
If you need something with higher throughput than EventLogging itself supports… then let's talk :D
-Ao
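For offline stats, pulling one schema out of those files could look roughly like the Python sketch below. It assumes each log line is a single JSON object with a top-level "schema" field. That layout is an assumption here, not something stated in this thread, and the schema name is hypothetical.

```python
import json
from typing import Iterable, Iterator


def events_for_schema(lines: Iterable[str], schema: str) -> Iterator[dict]:
    """Yield parsed events whose "schema" field equals `schema`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            # Skip malformed lines rather than aborting the whole scan.
            continue
        if isinstance(event, dict) and event.get("schema") == schema:
            yield event


# In-memory stand-in for the lines of one archived log file.
sample_lines = [
    '{"schema": "MediaViewerImageView", "event": {"file": "A.jpg"}}',
    '{"schema": "OtherSchema", "event": {}}',
    "not json",
]
matches = list(events_for_schema(sample_lines, "MediaViewerImageView"))
```

The function accepts any iterable of lines, so an open file object works as well. If the archived files turn out to be gzip-compressed, gzip.open(path, "rt") from the standard library yields lines the same way.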
> Right -- couldn't we just tag the URL?
The event of the user actually viewing the image is completely disconnected from the URL hit in Media Viewer, which is why we need EL and can't rely on existing server logs.
> Eventlogging data currently does go to files, as well as to the DB.
Great, then I guess it's a matter of only making the data go to files and not to DB for the particular schema we'll create. Does that sound like something feasible? How much work would be required to set it up?
This is a feature that other teams have requested in the past, and I agree it would be very helpful. In an ideal world, we would be able to specify the log configuration (where to write the data, pruning requirements, schema ownership) directly via a JSON object associated with the main schema.
Dario
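As an illustration of what such a JSON object might contain, here is a small Python sketch that parses and validates one. Every key, value, and name in it (owner, destinations, retention_days, the schema name, the team name) is hypothetical. Nothing like this exists in EventLogging as far as this thread indicates.

```python
import json
from dataclasses import dataclass
from typing import Tuple

# Hypothetical per-schema log configuration covering the three items above:
# where to write the data, pruning requirements, and schema ownership.
EXAMPLE_CONFIG = """
{
  "schema": "MediaViewerImageView",
  "owner": "multimedia",
  "destinations": ["file"],
  "retention_days": 90
}
"""

ALLOWED_DESTINATIONS = {"file", "db"}


@dataclass(frozen=True)
class LogConfig:
    schema: str
    owner: str
    destinations: Tuple[str, ...]
    retention_days: int

    @property
    def writes_to_db(self) -> bool:
        return "db" in self.destinations


def parse_log_config(text: str) -> LogConfig:
    """Parse the JSON text and reject unknown destinations or bad retention."""
    raw = json.loads(text)
    destinations = tuple(raw["destinations"])
    unknown = set(destinations) - ALLOWED_DESTINATIONS
    if unknown:
        raise ValueError(f"unknown destinations: {sorted(unknown)}")
    if raw["retention_days"] <= 0:
        raise ValueError("retention_days must be positive")
    return LogConfig(raw["schema"], raw["owner"], destinations, raw["retention_days"])


config = parse_log_config(EXAMPLE_CONFIG)
```

With this example, config.writes_to_db is False, which is the file-only behaviour discussed in this thread.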
> Great, then I guess it's a matter of only making the data go to files and not to DB for the particular schema we'll create. Does that sound like something feasible? How much work would be required to set it up?
I do not think this is feasible in the near term without changes on our end. I am also not sure it is really needed. You are concerned about sending data to the DB because of the volume, correct? I do not understand why logging every single data point would be needed. Could you explain the use case in a bit more detail so we can grasp it?
If it is a matter of identifying distinct requests, that can be done on a sample of your dataset if it is large enough. We can help with that, and Leila just put together some docs on the subject; while they are written for Hive queries, the principles can apply elsewhere: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive/Counting_uniques
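One common way to sample while still being able to estimate distinct counts is to sample by key rather than by event, so that either all events for a key are kept or none are. The Python sketch below shows that idea only. It is not necessarily the method described on the linked page, and the function names and the file-name keys are hypothetical.

```python
import hashlib
from typing import Iterable


def in_sample(key: str, one_in: int) -> bool:
    """Deterministically keep roughly 1 out of `one_in` keys."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as an integer bucket selector.
    return int.from_bytes(digest[:8], "big") % one_in == 0


def estimate_distinct(keys: Iterable[str], one_in: int) -> int:
    """Estimate the number of distinct keys from a hash-based sample."""
    sampled = {key for key in keys if in_sample(key, one_in)}
    # Each sampled key stands in for about `one_in` keys overall.
    return len(sampled) * one_in


# 50,000 simulated views spread over 10,000 distinct (hypothetical) files.
views = [f"File_{i % 10000}.jpg" for i in range(50000)]
estimate = estimate_distinct(views, 10)
```

Because the decision depends only on the key, repeated views of the same file never inflate the estimate. The estimate carries sampling error, so it is only useful when the dataset is large relative to the sampling rate.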
I'd also like us to consider routing this dataset to Hadoop. I believe there is already an EL-Kafka pipeline, and this would make it easy to integrate page views with our regular processing.
Gilles -- are mobile page views included in your stream?
-Toby
On Wed, Jan 7, 2015 at 9:27 AM, Nuria Ruiz nuria@wikimedia.org wrote:
Great, then I guess it's a matter of only making the data go to files
and not to DB for the particular schema we'll create. Does >that sound like something feasible? How much work would be required to set it up? I do not think this is feasible on the near term w/o changes in our end. I also am not sure it is really needed. You are concern about sending stuff to db due to "volume", correct? I do not understand why logging every single data point would be needed. Maybe you can explain that with a bit more detail for us to grasp the use case?
If it is a matter of identifying distinct requests that can be done having sampled your dataset if it is large enough, we can help with that and leila just put together some docs on this regard, while this is for hive queries principles can apply elsewhere: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive/Counting_uniques
On Wed, Jan 7, 2015 at 6:42 AM, Gilles Dubuc gilles@wikimedia.org wrote:
Right -- couldn't we just tag the URL?
The event of the user actually viewing the image is completely disconnected from the URL hit in Media Viewer, which is why we need EL and can't rely on existing server logs.
Eventlogging data currently does go to files, as well as to the DB.
Great, then I guess it's a matter of only making the data go to files and not to DB for the particular schema we'll create. Does that sound like something feasible? How much work would be required to set it up?
On Tue, Jan 6, 2015 at 7:45 PM, Andrew Otto aotto@wikimedia.org wrote:
Eventlogging data currently does go to files, as well as to the DB. Check it out on stat1003 at /srv/eventlogging/archive.
If you need something with higher throughput than eventlogging itself supports…then let’s talk :D
-Ao
On Jan 6, 2015, at 13:28, Erik Zachte ezachte@wikimedia.org wrote:
You mean attach an X-analytics parameter for extra images beyond the one the user initially requested?
But then we would undercount, basically missing all image views from clicking the right arrow in the image viewer. I'm not sure how much we would miss then. IIRC Gilles said this browsing feature was used quite a lot, but I'm not sure.
*From:* analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] *On Behalf Of* Toby Negrin *Sent:* Tuesday, January 06, 2015 19:16 *To:* A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics. *Subject:* Re: [Analytics] Making EventLogging output to a log file instead of the DB
Right -- couldn't we just tag the URL?
On Tue, Jan 6, 2015 at 10:10 AM, Erik Zachte ezachte@wikimedia.org wrote:
Just to clarify, this is about prefetched images which have not been shown to the public.
They were sent to the browser ahead of a possible request to speed things up but in many cases never actually requested.
https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_count...
- Erik
*From:* analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] *On Behalf Of* Toby Negrin *Sent:* Tuesday, January 06, 2015 18:49 *To:* A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics. *Subject:* Re: [Analytics] Making EventLogging output to a log file instead of the DB
Hi Gilles -- why won't the page view logs work by themselves for this purpose? EL can be configured to write into Hadoop which is probably the best way to get the throughput you need but it seems overcomplicated.
-Toby
> I believe there is already an EL-Kafka pipeline and this would make it easy to integrate page views with our regular processing.
Note that the pipeline was disabled 6 months ago, hence my comment "in the near term": https://github.com/wikimedia/operations-puppet/commit/f85b1dbcd61bbb58684ff9...
On Wed, Jan 7, 2015 at 9:39 AM, Toby Negrin tnegrin@wikimedia.org wrote:
I'd also like us to consider routing this dataset to hadoop. I believe there is already an EL-Kafka pipeline and this would make it easy to integrate page views with our regular processing.
Gilles -- are mobile page views included in your stream?
-Toby
Yes -- we disabled it because there wasn't a use case. We have one now :)
On Wed, Jan 7, 2015 at 10:32 AM, Nuria Ruiz nuria@wikimedia.org wrote:
> I believe there is already an EL-Kafka pipeline and this would make it easy to integrate page views with our regular processing.
Note that the pipeline was disabled 6 months ago, hence my comment "in the near term": https://github.com/wikimedia/operations-puppet/commit/f85b1dbcd61bbb58684ff9...
I see. My main point was that -regardless of collection method- we might not need every single data point to calculate uniques.
On Wed, Jan 7, 2015 at 10:38 AM, Toby Negrin tnegrin@wikimedia.org wrote:
Yes -- we disabled it because there wasn't a use case. We have one now :)
I think Gilles and Erik want to calculate page views for GLAM mainly (although there are some other good reasons too) -- sampling would probably be ok but we'd miss the long tail of views.
On Wed, Jan 7, 2015 at 10:56 AM, Nuria Ruiz nuria@wikimedia.org wrote:
I see. My main point was that -regardless of collection method- we might not need every single data point to calculate uniques.
> I think Gilles and Erik want to calculate page views for GLAM mainly (although there are some other good reasons too) -- sampling would probably be ok but we'd miss the long tail of views.
That's correct. We're looking to compile media view counts as accurate as the ones we have for article views at the moment. Sampling would be fine to identify the X most viewed media across a wiki, but it definitely wouldn't help small GLAMs who want to get that information about their own collection, if their media happen to be "low traffic" in the grand scheme of things. I think that the latter is the main use case for doing this, which is why I'm looking for a solution that wouldn't involve sampling.
Compiling the top list has entertainment value, but letting GLAM contributors get accurate statistics about their content improves the chances that they will keep contributing. I think that's more valuable than the entertainment factor of the top list.
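To make the long-tail concern concrete, here is a minimal sketch of what 1:1000 sampling does to per-file counts. The file names, view counts, and sampling rate are made-up assumptions for illustration, not measurements from Media Viewer or EventLogging.

```python
import random


def sampled_estimates(true_views, sample_one_in, seed=0):
    """Keep roughly one view in `sample_one_in`, then scale the kept count back up."""
    rng = random.Random(seed)
    estimates = {}
    for name, views in true_views.items():
        kept = sum(1 for _ in range(views) if rng.randrange(sample_one_in) == 0)
        estimates[name] = kept * sample_one_in
    return estimates


# Hypothetical daily view counts: one popular file and a small GLAM collection.
true_views = {
    "popular.jpg": 200_000,
    "glam_a.jpg": 12,
    "glam_b.jpg": 7,
    "glam_c.jpg": 3,
}
estimates = sampled_estimates(true_views, sample_one_in=1000)
for name, views in true_views.items():
    print(f"{name}: true={views} estimated={estimates[name]}")
```

With numbers like these, the popular file's estimate lands near its true count, while each small-collection file is reported as either 0 or at least 1000 views, which is the inaccuracy a small GLAM would see.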
If I were to venture into writing a changeset for this (made into a task: https://phabricator.wikimedia.org/T87177 ), is everything self-contained in the EventLogging extension or are there external parts involved in the current pipeline sending events to the DB in production that I need to be aware of?
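As a rough illustration of the behaviour being asked for, the sketch below routes events per schema: everything is appended to a log, and schemas on a log-only list skip the database step. This is not EventLogging's actual code or API; the function, the schema names, and the `schema` field lookup are assumptions made for the example, and an in-memory list stands in for the DB.

```python
import io
import json

# Hypothetical schema name; schemas listed here are logged but never sent to the DB.
LOG_ONLY_SCHEMAS = {"MediaViewerImageView"}


def route_event(raw_line, log_file, db_rows):
    """Append the event to the log, and to db_rows unless its schema is log-only."""
    event = json.loads(raw_line)
    log_file.write(json.dumps(event, sort_keys=True) + "\n")
    if event.get("schema") not in LOG_ONLY_SCHEMAS:
        db_rows.append(event)  # stand-in for a real database insert


log_file = io.StringIO()  # stand-in for a file in the log archive directory
db_rows = []
incoming = [
    '{"schema": "MediaViewerImageView", "event": {"file": "glam_a.jpg"}}',
    '{"schema": "SomeOtherSchema", "event": {"action": "click"}}',
]
for line in incoming:
    route_event(line, log_file, db_rows)

print(len(log_file.getvalue().splitlines()), "events logged,", len(db_rows), "stored in DB")
```

Keeping the decision in a per-schema list means other schemas would continue to reach the database unchanged; where such a list would live in the real pipeline is exactly the open question above.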
> If I were to venture into writing a changeset for this (made into a task: https://phabricator.wikimedia.org/T87177 ), is everything self-contained in the EventLogging extension
For the proposed solution of sending events to Kafka/Hadoop, the answer is that there is work to do in the EL extension, in puppet, and likely in refinery, as you would need to create a partition where your data would go. I think a meeting is in order to get a concrete idea of what we want to do.