http://arxiv.org/pdf/1208.4171.pdf
This is a pretty interesting and accessible description of best practices and design decisions driven by practical problems Twitter had to solve in the areas of client-side event logging, funnel analysis, and user modeling. E3 folks: check out section "3.2 Client Events" in particular, which is quite relevant to EventLogging.
Dario
Yes! We've talked a bit about this paper when thinking about the structure of our data storage and processing. To me the path Twitter followed seems very reasonable, so it's encouraging to hear that it looks that way to someone who gets dirty with data on a daily basis.
As it stands now, we weren't planning on enforcing any schema requirements in Kraken, but it'd be interesting to experiment with a standardized event-data format if y'all were in favor of it. Our most recent pass at a schema[1] -- mostly for binary serialization, to save bits -- has an otherwise-untyped (String-String) map for the KV pairs of the data payload. We intended to use an additional, optional field to permit specifying a sub-schema to apply strong typing to incoming event data. (We plan on storing things with Avro, but it's easy enough to convert between it and JSONSchema.) Event subclasses would be more flexible but require custom processing for each class. I'd normally oppose a standard model (Google doesn't use one internally, for example) but as Twitter made it work, I think it's worth exploring.
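To make that concrete, here's a rough sketch of that envelope as an Avro record, written as a Python dict; the field names are illustrative, not the actual schema from [1]:

    # Hypothetical sketch of the event envelope described above.
    EVENT_ENVELOPE = {
        "type": "record",
        "name": "Event",
        "fields": [
            {"name": "timestamp", "type": "long"},
            # Optional name of a sub-schema that applies strong typing to
            # the payload; null means "treat the payload as opaque strings".
            {"name": "schema", "type": ["null", "string"], "default": None},
            # The otherwise-untyped String-to-String map of KV pairs.
            {"name": "payload", "type": {"type": "map", "values": "string"}},
        ],
    }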
Thoughts?
[1] https://www.mediawiki.org/wiki/Analytics/Kraken/Data_Formats#Event_Data_Sche...
Sort out the field separator issue in your handling of squid logs first.
To summarize:
- Kafka byte offset is separated from hostname by a tab.
- Other fields are separated by a space.
- The content-type field contains unescaped spaces.
- Beeswax only supports splitting on a single character.
As a result:
- The byte offset is not separable from the hostname ("316554683463cp1043.wikimedia.org").
- Spaces in content-type cause the field to span a variable number of columns, making it impossible to select the user-agent string.
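To make the failure mode concrete, a toy illustration in Python (the line is made up, not real log data):

    line = ("316554683463\tcp1043.wikimedia.org 2013-01-22T10:00:00 200 "
            "text/plain; charset=utf8 Mozilla/5.0 (X11; Linux x86_64)")

    fields = line.split(" ")
    # split(" ") never sees the tab, so fields[0] is the fused
    # "316554683463\tcp1043.wikimedia.org".
    # The space inside "text/plain; charset=utf8" splits content-type across
    # two columns, so every later field (including the user agent) shifts
    # right -- no fixed column index is reliable.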
I'd like a solution to this that does not require that I provide a jar file for customized string processing.
-- Ori Livneh
Yeah, irregularities like that are obviously an issue. I believe the inclusion of the byte offset (and thus the tab character) is an artifact of the Kafka2Hadoop importer; it was never intended to end up in the files at all. The use of semicolon+space in extended headers like "Content-Type: text/plain; charset=utf8;" is in-spec, but the edge should obviously be escaping the space. Additionally, we've been holding off on migrating edge logging to a tab delimiter until the main fundraiser concluded; I believe that should move forward in the next week.
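For illustration, the shape of the fix at the edge might be something like this (a sketch, not what the edge loggers actually do; the function name is made up):

    def escape_log_field(value):
        # Percent-encode characters that collide with the log delimiters.
        return value.replace("%", "%25").replace("\t", "%09").replace(" ", "%20")

    escape_log_field("text/plain; charset=utf8")
    # -> 'text/plain;%20charset=utf8'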
That said, I was more interested in whether a unified event format (with a limited, standardized set of fields) seems like a good idea. Twitter's data stream doesn't look all that different from ours, and the six fields they propose seem close to our needs.
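For reference, if I'm remembering section 3.2 right, their client events are named along a six-part hierarchy -- client, page, section, component, element, action -- so a unified format on our side might look roughly like this (illustrative sketch):

    # Colon-delimited event names in the style of the paper, e.g. something
    # like web:home:mentions:stream:avatar:profile_click.
    def event_name(client, page, section, component, element, action):
        return ":".join((client, page, section, component, element, action))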
On Tuesday, January 22, 2013 at 2:28 AM, David Schoonover wrote:
Yeah, irregularities like that are obviously an issue. I believe the inclusion of the byte offset (and thus the tab character) is an artifact of the Kafka2Hadoop importer; it was never intended to end up in the files at all. The use of semicolon+space in extended headers like "Content-Type: text/plain; charset=utf8;" is in-spec, but the edge should obviously be escaping the space.
I filed a bug to track this issue here: https://bugzilla.wikimedia.org/show_bug.cgi?id=44236
-- Ori Livneh
Cool, thanks. Yeah, we know these are problems for sure. Right now we're really focused on making the data you see reliable, which is one of the reasons this hasn't been a priority.
- Kafka byte offset is separated from hostname by a tab.
This is annoying, but since thus far I haven't cared about doing any analysis on hostname or byte offset, I can split on space and treat these as a single field that I ignore anyway.
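Concretely, the hack is something like this (Python sketch, made-up line):

    line = "316554683463\tcp1043.wikimedia.org 2013-01-22T10:00:00 200 ..."
    fields = line.split(" ")
    offset_and_host = fields[0]  # the fused "offset\thostname" field -- ignored
    rest = fields[1:]            # the columns I actually care about
    # The pieces are still recoverable if ever needed:
    offset, host = offset_and_host.split("\t", 1)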
- Other fields are separated by a space.
- The content-type field contains unescaped spaces.
We know that there are spaces, but I had no idea they were coming from Varnish! I thought Varnish was the best at escaping all of its fields. Grrrrr! Using tabs as the field delimiter is high on our priority list. We didn't change it a few months ago because the fundraiser was happening, and also changing this can break other downstream scripts, especially Erik Zachte's workflow that generates everything you see on stats.wikimedia.org. We want to get this taken care of in the next few weeks.
Also, the way we are importing stuff into Kraken right now is really hacky. I've been struggling over the last few months to figure out how to make a single machine that has to look at ALL of the udp2log webrequest packets save unsampled data into HDFS. It's tough. Another reason I haven't worried much about the byte-offset\thostname problem is that I haven't been sure if we will be using the Kafka Hadoop Consumer class (the thing that inserts the byte offsets) in the long term.
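If we do keep that consumer, a cheap post-import pass could strip the prefix it prepends; a sketch (the function name is mine):

    def strip_kafka_offset(record):
        # Drop the leading "<byte offset>\t" if present; pass through otherwise.
        prefix, sep, rest = record.partition("\t")
        return rest if sep else record

    strip_kafka_offset("316554683463\tcp1043.wikimedia.org 2013-01-22 ...")
    # -> 'cp1043.wikimedia.org 2013-01-22 ...'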