I am kicking off this thread after a good conversation with Nuria and Kaldari on pain points and opportunities we have around data QA for EventLogging.
Kaldari, Leila and I have gone through several rounds of data QA before and after the deployment of new features on Mobile, and we have not yet found a good way to catch data quality issues early enough in the deployment cycle. Data quality issues with EventLogging typically fall under one of these 5 scenarios:
1) events are logged and schema-compliant but don’t capture data correctly (for example: a wrong value is logged; event counts that should match don’t)
2) events are logged but are not schema-compliant (e.g.: a required field is missing)
3) events are missing due to issues with the instrumentation (e.g.: a UI element is not instrumented)
4) events are missing due to client issues (e.g.: a specific UI element is not correctly rendered on a given browser/platform and as a result the event is not fired)
5) events are missing due to EventLogging outages
In the early days, Ori and I floated the idea of unit tests for instrumentation to capture constraint violations that are not easily detected via manual testing or the existing client-side validation, but this never happened. When it comes to feature deployments, beta labs is a great starting point for running manual data QA in an environment that is as close as possible to prod. However, there are types of data quality issues that we only discover when collecting data at scale and in the wild (on browsers/platforms that we don’t necessarily test for internally).
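To make the unit-test idea a bit more concrete, here is a rough sketch of what a check for scenarios 1 and 2 could look like when run against a batch of captured events. This is not existing EventLogging code: the field names, the allowed actions and the "every start must end in a success or an error" rule are all made up for illustration.

```python
from collections import Counter

# Hypothetical constraints for an imaginary schema. None of these names
# come from a real EventLogging schema.
REQUIRED_FIELDS = {"action", "token"}
ALLOWED_ACTIONS = {"start", "success", "error"}


def check_events(events):
    """Return a list of human-readable constraint violations."""
    violations = []
    for i, event in enumerate(events):
        # Scenario 2: a required field is missing.
        missing = REQUIRED_FIELDS - set(event)
        if missing:
            violations.append(f"event {i}: missing {sorted(missing)}")
        # Scenario 1: the event is well-formed but carries a wrong value.
        action = event.get("action")
        if action is not None and action not in ALLOWED_ACTIONS:
            violations.append(f"event {i}: unexpected action {action!r}")

    # Scenario 1: counts that should match do not. Assumed rule: every
    # "start" is followed by exactly one "success" or "error".
    counts = Counter(e.get("action") for e in events)
    ended = counts["success"] + counts["error"]
    if counts["start"] != ended:
        violations.append(
            f"count mismatch: {counts['start']} start vs {ended} success/error"
        )
    return violations
```

A test would then feed in the events captured during a scripted session (for example on beta labs) and assert that `check_events(...)` returns an empty list.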
Having a full-fledged set of unit tests for data would be terrific, but in the short term I’d like to find a better way to at least identify events that fail validation as early as possible. Neither of the two data sources we can use today does the job:
- the SQL log database has real-time data but only for events that pass client-side validation
- the JSON logfiles on stat1003 include invalid events, but the data is only rsync’ed from vanadium once a day
Is there a way to inspect invalid events in near real time without having access to vanadium? For example, could we create either a dedicated database to write invalid events only or a logfile for validation errors rsync’ed to stat1003 more frequently than once a day?
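As a strawman for the "logfile for validation errors" option, the sketch below reads raw JSON lines and copies only the failing ones, with a reason, to a separate sink. It is not how EventLogging validates events today: the check is a stand-in for real schema validation, and the required field names and function names are assumptions made for the example.

```python
import json

# Assumed top-level fields for the example. A real check would validate
# each event against its schema instead.
REQUIRED_FIELDS = {"schema", "revision", "event"}


def validation_error(line):
    """Return None if the line looks acceptable, else a short reason."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return f"unparseable JSON: {exc.msg}"
    if not isinstance(record, dict):
        return "not a JSON object"
    missing = REQUIRED_FIELDS - set(record)
    if missing:
        return f"missing fields: {sorted(missing)}"
    return None


def write_invalid(lines, sink):
    """Copy failing lines to sink, one JSON record each; return the count."""
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        reason = validation_error(line)
        if reason is not None:
            # Keep the raw line so the original payload can be inspected.
            sink.write(json.dumps({"reason": reason, "raw": line}) + "\n")
            count += 1
    return count
```

Here `sink` can be any writable file object, so the same function could feed a small errors-only logfile that is cheap to sync more often than the full logs.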
Thoughts?
Dario