Yeah, irregularities like that are obviously an issue. I believe the
inclusion of the byte offset (and thus the tab character) is an artifact
of the Kafka2Hadoop importer; it's certainly not intended to be included
in the files at all. The use of semicolon+space in extended headers like
"Content-Type: text/plain; charset=utf8;" is in-spec, but the edge should
obviously be escaping the space. Additionally, we've been holding off on
migrating edge logging to a tab delimiter until the main fundraiser
concluded. I believe that should move forward in the next week.
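The escaping in question could look something like this minimal sketch (percent-encoding is one option; the actual edge logger may well do it differently, and the field values here are invented):

```python
# Hypothetical sketch: percent-encode characters that collide with the
# delimiter, so space-delimited log lines keep a fixed column count.
def escape_field(value: str) -> str:
    """Escape '%' first, then the delimiter characters themselves."""
    return value.replace("%", "%25").replace(" ", "%20").replace("\t", "%09")

fields = ["cp1043.wikimedia.org", "text/plain; charset=utf8;"]
print(" ".join(escape_field(f) for f in fields))
# -> cp1043.wikimedia.org text/plain;%20charset=utf8;
```

With this in place, a plain `split(" ")` downstream always yields the same column count regardless of spaces inside header values.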
That said, I was more interested in whether a unified event format (with
limited and standardized fields) seems like a good idea. Twitter's data
stream doesn't seem to look all that different from ours, and the six
fields they propose seem like they're close to our needs.
On Tue, Jan 22, 2013 at 2:18 AM, Ori Livneh <ori@wikimedia.org> wrote:
Sort out the field separator issue in your handling of
squid logs first.
To summarize:
1) Kafka byte offset is separated from hostname by a tab.
2) Other fields are separated by a space.
3) The content-type field contains unescaped spaces.
4) Beeswax only supports splitting on a single character.
As a result:
1) The byte offset is not separable from the hostname
("316554683463cp1043.wikimedia.org").
2) Spaces in content-type cause the field to span a variable number of
columns, making it impossible to select the user-agent string.
I'd like a solution to this that does not require that I provide a jar
file for customized string processing.
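The mismatch summarized above can be reproduced with a toy line (the sample values below are invented, not real log data):

```python
# The Kafka byte offset is joined to the hostname by a tab, while every
# other field is space-separated, so a single-character split (all that
# Beeswax supports) mangles the first column and can't isolate the rest.
line = ("316554683463\tcp1043.wikimedia.org 1358822400.000 "
        "text/plain; charset=utf8; Mozilla/5.0")

by_space = line.split(" ")
print(by_space[0])    # offset and hostname fused by the embedded tab
print(len(by_space))  # column count varies with spaces in content-type
```

Splitting on tab instead yields only two columns, so neither single-character delimiter recovers the full field list.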
--
Ori Livneh
On Tuesday, January 22, 2013 at 2:05 AM, David Schoonover wrote:
Yes! We've talked a bit about this paper when thinking about the
structure of our data storage and processing. To me, the path Twitter
followed seems very reasonable, so it's encouraging to hear that it looks
that way to someone who gets dirty with data on a daily basis.
As it stands now, we weren't planning on enforcing any schema
requirements in Kraken, but it'd be interesting to experiment with a
standardized event-data format if y'all were in favor of it. Our most
recent pass at a schema[1] -- mostly for binary serialization, to save
bits -- has an otherwise-untyped (String-String) map for the KV pairs of
the data payload. We intended to use an additional, optional field to
permit specifying a sub-schema that applies strong typing to incoming
event data. (We plan on storing things with Avro, but it's easy enough to
convert between it and JSONSchema.) Event subclasses would be more
flexible, but would require custom processing for each class. I'd
normally oppose a standard model (Google doesn't use one internally, for
example), but since Twitter made it work, I think it's worth exploring.
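A rough sketch of the envelope described above, written as an Avro-style schema in Python (the field names are illustrative guesses, not the actual Kraken schema from the wiki page):

```python
# Typed common fields, plus an untyped string->string map for the event
# payload, plus an optional sub-schema reference that downstream
# consumers can use to apply strong typing. Names are hypothetical.
event_schema = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "timestamp", "type": "long"},
        {"name": "site",      "type": "string"},
        {"name": "payload",   "type": {"type": "map", "values": "string"}},
        # Optional pointer to a sub-schema that types the payload.
        {"name": "schemaRef", "type": ["null", "string"], "default": None},
    ],
}

# An event conforming to the envelope; all payload values are strings.
sample_event = {
    "timestamp": 1358822400,
    "site": "enwiki",
    "payload": {"action": "edit", "namespace": "0"},
    "schemaRef": None,
}
```

The appeal of this shape is that generic tooling only ever needs the envelope, while subclass-specific typing remains opt-in via the schema reference.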
Thoughts?
[1]
https://www.mediawiki.org/wiki/Analytics/Kraken/Data_Formats#Event_Data_Sch…
--
David Schoonover
dsc@wikimedia.org
On Thursday, 17 January 2013 at 2:00 pm, Dario Taraborelli wrote:
> http://arxiv.org/pdf/1208.4171.pdf
>
> This is a pretty interesting and accessible description of best
> practices and design decisions driven by practical problems they had to
> solve at Twitter in the areas of client-side event logging, funnel
> analysis, and user modeling.
> E3: check out section "3.2 Client Events" in particular, which is
> quite relevant to EventLogging.
Dario
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
_______________________________________________
E3-team mailing list
E3-team@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/e3-team