Thanks for such comprehensive notes! 

On Friday, June 20, 2014, Christian Aistleitner <christian@quelltextlich.at> wrote:
Hi,

TL;DR: When consuming EventLogging data, only rely on the 'log'
database available from m2 replicas, like analytics-store.eqiad.wmnet.

Other representations might not get updated, might not get fix-ups or
may (on purpose) give you unvalidated data.


----------------------------------


Because of EventLogging's versatile design, its data exists (or
existed) in many different representations, which left me confused
about what data quality to expect from each of them. I also could not
find those expectations publicly documented. After talking about
different aspects with a few people, I want to put my current
understanding up for public discussion.

Please let me know (either in private or on list) if something looks
wrong or does not match your use of EventLogging data.


* MySQL / MariaDB database on m2

This database is the best place to consume EventLogging data from.

Available as 'log' database on m2 replicas, such as
analytics-store.eqiad.wmnet.

Only validated events enter the database.

In case of bugs, this database is the only place that gets fixes,
be it cleanup of historic data or live fixes.



* 'all-events' JSON log files [1]

Use this data source only to debug issues around ingestion into the m2
database.

Entries are JSON objects.

Only validated events get written.

In case of bugs, historic data does not get fixed.
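
For debugging, a minimal Python sketch of reading such a file could
look like the following. It assumes one JSON object per line, which
the description above does not spell out. The field names in the
sample ("schema", "event") and the helper names are made up for
illustration, not taken from the real files.

```python
import gzip
import io
import json


def iter_events(fileobj):
    """Yield one decoded JSON object per non-empty line (assumed layout)."""
    for line in fileobj:
        line = line.strip()
        if line:
            yield json.loads(line)


def read_archive(path):
    """Yield events from a gzipped log such as all-events.log-$DATE.gz."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for event in iter_events(f):
            yield event


# Stand-in for a real archive; the field names here are hypothetical.
sample = io.StringIO(
    '{"schema": "Example", "event": {"n": 1}}\n'
    "\n"
    '{"schema": "Other", "event": {}}\n'
)
for event in iter_events(sample):
    print(event["schema"])
```

Since historic data in these files does not get fixed, results from
such a script should be cross-checked against the m2 database.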



* Raw client and server side log files [2]

Use this data source only to debug issues around ingestion into the m2
database.

Entries are the parameters of the event.gif request. They are not
decoded at all.

In case of bugs, historic data does not get fixed. Hot-fixes need not
reach those files either.
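
As a rough illustration of what "not decoded" means for debugging,
here is a Python sketch that decodes a single raw query string by
hand. It assumes the parameters are a percent-encoded JSON object,
optionally preceded by "?" and followed by ";". That layout is an
assumption on my side and is not stated above. The function name and
the sample payload are made up.

```python
import json
from urllib.parse import unquote


def decode_raw_query(raw):
    """Decode one raw query string into a dict.

    Assumes percent-encoded JSON with an optional leading "?" and an
    optional trailing ";". No schema validation happens here.
    """
    trimmed = raw.strip().lstrip("?").rstrip(";")
    return json.loads(unquote(trimmed))


# Hypothetical sample; real entries may carry further fields around it.
raw = "?%7B%22schema%22%3A%22Example%22%2C%22event%22%3A%7B%22n%22%3A1%7D%7D;"
print(decode_raw_query(raw))
```

Anything decoded this way is unvalidated, so a payload that parses
fine here may still never have entered the m2 database.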



* Kafka:
EventLogging data has not been fed into Kafka since 2014-06-12 [3].
The EventLogging data in Kafka had no users.
Turning it on again is tracked in bug 66528 [4].



* MongoDB:
EventLogging data has not been fed into MongoDB since 2014-02-13 [5].
The EventLogging data in MongoDB did not appear to get used.
I am not aware of plans to revive feeding the data into MongoDB.



* ZMQ:
ZMQ is available from vanadium.
In case of bugs, historic data cannot get fixed :-)
Data coming from the forwarders (ports 8421, 8422) is not validated
and is not guaranteed to see hot-fixes.
Data coming from the processors (ports 8521, 8522) and the
multiplexer (port 8600) is validated.
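
To make the port list above easier to use from a script, it could be
written down as a small lookup table. This is only a sketch: the port
numbers and roles are the ones listed above, while the names
ZMQ_STREAMS and is_validated are made up for illustration.

```python
# Port -> (role, validated?), as listed in the ZMQ section above.
ZMQ_STREAMS = {
    8421: ("forwarder", False),
    8422: ("forwarder", False),
    8521: ("processor", True),
    8522: ("processor", True),
    8600: ("multiplexer", True),
}


def is_validated(port):
    """Return True if the stream on this port carries validated events."""
    if port not in ZMQ_STREAMS:
        raise ValueError("unknown EventLogging ZMQ port: %d" % port)
    return ZMQ_STREAMS[port][1]


for port in sorted(ZMQ_STREAMS):
    role, validated = ZMQ_STREAMS[port]
    print(port, role, "validated" if validated else "not validated")
```

A consumer could call is_validated() before subscribing, to avoid
picking up a forwarder stream by accident.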



Have fun,
Christian



[1] Available as
  stats1002:/a/eventlogging/archive/all-events.log-$DATE.gz
  stats1003:/srv/eventlogging/archive/all-events.log-$DATE.gz
  vanadium:/var/log/eventlogging/...

[2] Available as
  stats1002:/a/eventlogging/archive/client-side-events.log-$DATE.gz
  stats1002:/a/eventlogging/archive/server-side-events.log-$DATE.gz
  stats1003:/srv/eventlogging/archive/client-side-events.log-$DATE.gz
  stats1003:/srv/eventlogging/archive/server-side-events.log-$DATE.gz
  vanadium:/var/log/eventlogging/...

[3] https://git.wikimedia.org/commitdiff/operations%2Fpuppet.git/f85b1dbcd61bbb58684ff93704c1804e808a5d6e

[4] https://bugzilla.wikimedia.org/show_bug.cgi?id=66528

[5] https://git.wikimedia.org/commitdiff/operations%2Fpuppet.git/05b4027973c59b0a786433f8dae2bd1fe28b614f




--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
                           Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3     Email:  christian@quelltextlich.at
4293 Gutau, Austria          Phone:          +43 7946 / 20 5 81
                             Fax:            +43 7946 / 20 5 81
                             Homepage: http://quelltextlich.at/
---------------------------------------------------------------


--
Oliver Keyes
Research Analyst
Wikimedia Foundation