Hello!
Today qchris and I deployed some changes[1][2] to bring logs from misc_web varnishes into HDFS and Hive via Kafka. This isn’t a huge deal, but it does mean that we are now collecting webrequest logs for things like phabricator, annual.wikimedia.org, graphite, stats.wikimedia.org, etc.
That is all,
:) -Ao
[1] https://gerrit.wikimedia.org/r/#/c/184183
[2] https://gerrit.wikimedia.org/r/#/c/184191
On Mon, Jan 26, 2015 at 03:54:42PM -0800, Andrew Otto wrote:
Today qchris and I deployed some changes[1][2] to bring logs from misc_web varnishes into HDFS and Hive via Kafka. This isn’t a huge deal, but it does mean that we are now collecting webrequest logs for things like phabricator, annual.wikimedia.org, graphite, stats.wikimedia.org, etc.
What is this going to be used for? Are we just logging everything from misc? I personally hate to log things unless we absolutely need to.
Regards, Faidon
I’ll let qchris respond in more detail, but I believe (maybe?) there was a request to be able to do some analysis for a site or sites served by misc-web. More immediately, it will be used to generate the ‘5xx’ legacy tsv file. This is one of the outputs still generated by udp2log, and recreating it from Kafka is part of the effort to turn off udp2log.
On Jan 26, 2015, at 15:56, Faidon Liambotis faidon@wikimedia.org wrote:
What is this going to be used for? [...]
Hi Faidon,
On Mon, Jan 26, 2015 at 04:09:32PM -0800, Andrew Otto wrote:
I’ll let qchris respond in more detail, [...]
I do not have many further details.
Currently, udp2log contains misc (not directly via varnish, but indirectly via nginx), and hence misc logs can be queried live. They also make it onto disk, for example into oxygen's 5xx tsvs (~1.8K misc log lines/day in the 5xx tsvs).
When preparing to switch the tsvs from udp2log to kafka, the guiding principle was that the kafka-based tsvs should not needlessly discard parts of the traffic that have been in the udp2log-based tsvs before.
Hence, when recreating the 5xx tsvs using kafka, it seemed expected that misc logs would continue to be in those tsvs.
But if you want to make the point that misc need not be logged and misc wasn't intentionally in udp2log and the 5xx tsvs, then by all means: Yes, agreed, let's remove it. From both kafka and udp2log. I am all for it.
The less we need to log, the better.
Have fun, Christian
P.S.: Bits and misc are quite alike in terms of logging setup [1] and, from my point of view, also in terms of motivation for being in udp2log/kafka. Does this mean bits can/should be dropped from udp2log/kafka too, by the same reasoning?
(This is totally not sarcastic. I am serious. If there is a chance of logging less, we should consider it.)
P.P.S.: There are occasional one-off requests on both misc and bits (like “Are people still requesting $DEPRECATED_URL_FOO?”) but those can also be answered through temporary means instead of permanent logging.
[1] Both are in udp2log not because of the varnishes, but because of the nginxs. But into kafka, both of them feed their varnish logs.
(Ok, kafka's bits is currently temporarily turned off. But still.)
On Tue, Jan 27, 2015 at 01:23:10PM +0100, Christian Aistleitner wrote:
But if you want to make the point that misc need not be logged and misc wasn't intentionally in udp2log and the 5xx tsvs, then by all means: Yes, agreed, let's remove it. From both kafka and udp2log. I am all for it.
I don't think it was intentional, no. Even if it was at the time, I think it'd be wrong to put everything into the same pool of logs/statistics. Production should be separate and we shouldn't have to grep production 5xxs in the same log that also has e.g. git.wm.org's 5xx.
All that said, a (separate) 5xx log of misc services can be useful, so I wouldn't object.
Faidon
Just speaking for the Kafka/Hadoop use case, you'd be perfectly able to grep through without having to hit production-level requests; HDFS files are very deliberately partitioned on the class of source varnish (mobile, text, misc, upload, etc): you can just grep through the misc files.
(Unless you meant a literal grep rather than a figurative one. In which case, ignore this ;p)
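To illustrate the partitioning point, here is a minimal Python sketch, not the actual cluster tooling. The `webrequest_source=<class>` directory naming, the example paths, and the position of the HTTP status column are all assumptions made up for illustration; the real layout may well differ.

```python
from pathlib import PurePosixPath


def paths_for_source(paths, source):
    """Keep only paths that sit under a 'webrequest_source=<source>' directory.

    The 'webrequest_source=' naming is a hypothetical partition layout.
    """
    wanted = f"webrequest_source={source}"
    return [p for p in paths if wanted in PurePosixPath(p).parts]


def is_5xx(line, status_column=0):
    """Return True if the tab-separated line carries a 500-599 HTTP status.

    status_column is an assumed position, not the real webrequest schema.
    """
    fields = line.rstrip("\n").split("\t")
    if status_column >= len(fields):
        return False
    status = fields[status_column]
    return len(status) == 3 and status.isdigit() and status[0] == "5"


# Hypothetical file listing covering two source classes.
all_paths = [
    "/data/webrequest_source=misc/hour=01/part-0",
    "/data/webrequest_source=text/hour=01/part-0",
]
misc_paths = paths_for_source(all_paths, "misc")

# Hypothetical log lines: status, host, path.
sample_lines = [
    "503\tgit.wikimedia.org\t/foo\n",
    "200\tstats.wikimedia.org\t/bar\n",
]
misc_5xx = [line for line in sample_lines if is_5xx(line)]
```

Because the source class is part of the path, a search for misc 5xxs only has to open the misc files and never touches the text, mobile, or upload data.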
On 30 January 2015 at 04:51, Faidon Liambotis faidon@wikimedia.org wrote:
[...] we shouldn't have to grep production 5xxs in the same log that also has e.g. git.wm.org's 5xx. [...]
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics