Cool, thanks. Yeah we know these are problems for sure. We're really focused on
trying to make the data you see reliable right now, which is one of the reasons why this
has not been a priority.
1) Kafka byte offset is separated from hostname by a tab.

This is annoying, but since thus far I haven't cared about doing any analysis on hostname or byte offset, I can split on space and treat these as a single field that I ignore anyway.
2) Other fields are separated by a space.
3) The content-type field contains unescaped spaces.
We know that there are spaces, but I had no idea they were coming from Varnish! I
thought Varnish was the best at escaping all of its fields. Grrrrr! Using tabs as the
field delimiter is high on our priority list. We didn't change it a few months ago
because the fundraiser was happening, and also changing this can break other downstream
scripts, especially Erik Zachte's workflow that generates everything you see on
stats.wikimedia.org. We want to get this taken care of in the next few weeks.
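For what it's worth, the split-on-space workaround I described above can be sketched in a few lines of Python. The field positions here are illustrative, not the real log schema -- adjust n_before/n_after to wherever content-type actually sits:

```python
def parse_log_line(line, n_before=5, n_after=1):
    """Split a log line whose content-type field may contain unescaped spaces.

    n_before / n_after are the counts of fixed fields to the left and right
    of content-type (illustrative values, not the real schema).
    """
    # Split on a literal space: the tab between the Kafka byte offset and
    # the hostname keeps them fused into a single token, which we ignore.
    tokens = line.split(' ')
    fixed_left = tokens[:n_before]        # offset\thostname + leading fields
    fixed_right = tokens[len(tokens) - n_after:]  # trailing fields, user agent last
    # Whatever is left in the middle is the content-type, re-joined on space.
    content_type = ' '.join(tokens[n_before:len(tokens) - n_after])
    return fixed_left, content_type, fixed_right
```

Because the spaces only occur inside one field, counting fixed fields in from both ends recovers the variable-width content-type and leaves the user agent selectable again.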
On Jan 22, 2013, at 5:18 AM, Ori Livneh <ori@wikimedia.org> wrote:
> Sort out the field separator issue in your handling of squid logs first.
>
> To summarize:
> 1) Kafka byte offset is separated from hostname by a tab.
> 2) Other fields are separated by a space.
> 3) The content-type field contains unescaped spaces.
> 4) Beeswax only supports splitting on a single character.
>
> As a result:
> 1) Byte offset is not separable from the hostname ("316554683463cp1043.wikimedia.org")
> 2) Spaces in content-type cause the field to span a variable number of columns, making it impossible to select the user agent string.
>
> I'd like a solution to this that does not require that I provide a jar file for customized string processing.
>
> --
> Ori Livneh
>
>
> On Tuesday, January 22, 2013 at 2:05 AM, David Schoonover wrote:
>
>> Yes! We've talked a bit about this paper when thinking about the structure of our data storage and processing. To me the path Twitter followed seems very reasonable, so it's encouraging to hear that it looks that way to someone who gets dirty with data on a daily basis.
>>
>> As it stands now, we weren't planning on enforcing any schema requirements in Kraken, but it'd be interesting to experiment with a standardized event-data format if y'all were in favor of it. Our most recent pass at a schema[1] -- mostly for binary serialization, to save bits -- has an otherwise-untyped (String-String) map for the KV pairs of the data payload. We intended to use an additional, optional field to permit specifying a sub-schema that applies strong typing to incoming event data. (We plan on storing things with Avro, but it's easy enough to convert between it and JSONSchema.) Event subclasses would be more flexible but would require custom processing for each class. I'd normally oppose a standard model (Google doesn't use one internally, for example), but since Twitter made it work, I think it's worth exploring.
>>
>> Thoughts?
>>
>> [1] https://www.mediawiki.org/wiki/Analytics/Kraken/Data_Formats#Event_Data_Sch…
>>
>> --
>> David Schoonover
>> dsc@wikimedia.org
>>
>>
>> On Thursday, 17 January 2013 at 2:00 pm, Dario Taraborelli wrote:
>>
>>> http://arxiv.org/pdf/1208.4171.pdf
>>>
>>> This is a pretty interesting and accessible description of best practices and design decisions driven by practical problems they had to solve at Twitter in the areas of client-side event logging, funnel analysis, and user modeling.
>>> E3: check out section "3.2 Client Events" in particular, which is quite relevant to EventLogging.
>>>
>>> Dario
>>> _______________________________________________
>>> Analytics mailing list
>>> Analytics@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>>
>> _______________________________________________
>> E3-team mailing list
>> E3-team@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/e3-team
>
>
>
>