I am kicking off this thread after a good conversation with Nuria and Kaldari on pain points and opportunities we have around data QA for EventLogging.
Kaldari, Leila and I have gone through several rounds of data QA before and after the deployment of new features on Mobile, and we haven’t yet found a good solution for catching data quality issues early enough in the deployment cycle. Data quality issues with EventLogging typically fall under one of these 5 scenarios:
1) events are logged and schema-compliant but don’t capture data correctly (for example: a wrong value is logged; event counts that should match don’t)
2) events are logged but are not schema-compliant (e.g.: a required field is missing)
3) events are missing due to issues with the instrumentation (e.g.: a UI element is not instrumented)
4) events are missing due to client issues (a specific UI element is not correctly rendered on a given browser/platform and as a result the event is not fired)
5) events are missing due to EventLogging outages
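Of these five, scenario 2 is the most amenable to automated checking. As a rough sketch of the kind of check involved (the schema and field names below are invented for illustration; EventLogging's actual validation is schema-driven and more complete):

```python
# Minimal sketch of a schema-compliance check. The schema format here
# is a simplification; EventLogging uses full JSON Schema.
def find_violations(event, schema):
    """Return a list of human-readable constraint violations."""
    violations = []
    for field in schema.get("required", []):
        if field not in event:
            violations.append("missing required field: %s" % field)
    for field, expected in schema.get("types", {}).items():
        if field in event and not isinstance(event[field], expected):
            violations.append("wrong type for %s" % field)
    return violations

# Hypothetical schema, for illustration only.
SCHEMA = {"required": ["action", "pageId"], "types": {"pageId": int}}

print(find_violations({"action": "click"}, SCHEMA))
```

A check like this catches scenario 2 mechanically; scenarios 1 and 3-5 need counts and cross-checks rather than per-event validation.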
In the early days, Ori and I floated the idea of unit tests for instrumentation to capture constraint violations that are not easily detected via manual testing or the existing client-side validation, but this never happened. When it comes to feature deployments, beta labs is a great starting point for running manual data QA in an environment that is as close as possible to prod. However, there are types of data quality issues that we only discover when collecting data at scale and in the wild (on browsers/platforms that we don’t necessarily test for internally).
Having a full-fledged set of unit tests for data would be terrific, but in the short term I’d like to find a better way to at least identify events that fail validation as early as possible.
- the SQL log database has real-time data, but only for events that pass client-side validation
- the JSON logfiles on stat1003 include invalid events, but the data is only rsync’ed from vanadium once a day
Is there a way to inspect invalid events in near real time without having access to vanadium? For example, could we either create a dedicated database for invalid events only, or rsync a logfile of validation errors to stat1003 more frequently than once a day?
Thoughts?
Dario
Thanks Dario, et al.
A +1 from me -- this will make integration a lot easier. Let's see if we can address this in the Q3 project about dashboarding.
-Toby
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
On Thu, Dec 11, 2014 at 4:11 PM, Dario Taraborelli < dtaraborelli@wikimedia.org> wrote:
is there a way to inspect invalid events in near real time without having access to vanadium?
There's this graph: https://graphite.wikimedia.org/render/?width=586&height=308&_salt=1418343627.977&from=-1weeks&target=movingMedian(diffSeries(eventlogging.overall.raw.rate%2Ceventlogging.overall.valid.rate)%2C20)
The key is 'diffSeries(eventlogging.overall.raw.rate,eventlogging.overall.valid.rate)', which gets you the rate of invalid events per second.
It is not broken down by schema, though.
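For scripted monitoring, the same target can be pulled from the render API as JSON rather than as a PNG. A sketch (the endpoint and target are from the link above; the helper functions and their names are my own):

```python
# Sketch: build a Graphite render-API URL for the invalid-event rate
# and extract the latest non-null datapoint from the JSON response.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GRAPHITE = "https://graphite.wikimedia.org/render/"
TARGET = "diffSeries(eventlogging.overall.raw.rate,eventlogging.overall.valid.rate)"

def render_url(target, since="-1hours"):
    return GRAPHITE + "?" + urlencode(
        {"target": target, "from": since, "format": "json"})

def latest_value(series_json):
    """series_json is the decoded render-API response: a list of
    {"target": ..., "datapoints": [[value, timestamp], ...]}."""
    points = series_json[0]["datapoints"]
    non_null = [v for v, t in points if v is not None]
    return non_null[-1] if non_null else None

# e.g.: latest_value(json.load(urlopen(render_url(TARGET))))
```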
We can't write invalid events to a database -- at least not the same way we write well-formed events. The table schema is derived from the event schema, so an invalid event would violate the constraints of the table as well.
It's possible (and easy) to set something up that watches invalid events in real-time and does something with them. The question is: what? E-mail an alert? Produce a daily report? Generate a graph?
If you describe how you'd like to consume the data, I can try to hash out an implementation with Nuria and Christian.
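One concrete shape for that "does something with them" step could be a per-schema threshold alert. A sketch, with the threshold, window size, and notification hook all placeholders of my own:

```python
# Sketch of a per-schema alert on invalid-event counts. Threshold,
# window, and the notify hook are illustrative, not a real design.
from collections import Counter, deque
import time

class InvalidEventWatcher:
    def __init__(self, threshold=10, window_seconds=60, notify=print):
        self.threshold = threshold
        self.window = window_seconds
        self.notify = notify
        self.events = deque()  # (timestamp, schema) pairs

    def observe(self, schema, now=None):
        """Record one invalid event; alert when a schema crosses the threshold."""
        now = time.time() if now is None else now
        self.events.append((now, schema))
        # Drop observations that have fallen out of the sliding window.
        while self.events and self.events[0][0] < now - self.window:
            self.events.popleft()
        counts = Counter(s for _, s in self.events)
        if counts[schema] == self.threshold:  # fire once per crossing
            self.notify("schema %s: %d invalid events in %ds"
                        % (schema, counts[schema], self.window))
```

The same skeleton could feed a daily report or a graph instead of an alert; only the notify hook changes.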
Captured in Phab:
https://phabricator.wikimedia.org/T78355
Please wordsmith and add other projects as appropriate. Thanks!
thanks for the quick turnaround.
On Dec 11, 2014, at 4:28 PM, Ori Livneh ori@wikimedia.org wrote:
There's this graph: https://graphite.wikimedia.org/render/?width=586&height=308&_salt=1418343627.977&from=-1weeks&target=movingMedian(diffSeries(eventlogging.overall.raw.rate%2Ceventlogging.overall.valid.rate)%2C20)
The key is 'diffSeries(eventlogging.overall.raw.rate,eventlogging.overall.valid.rate)', which gets you the rate of invalid events per second.
It is not broken down by schema, though.
this is great for monitoring, but for QA purposes we really need the raw data
We can't write invalid events to a database -- at least not the same way we write well-formed events. The table schema is derived from the event schema, so an invalid event would violate the constraints of the table as well.
rrright
It's possible (and easy) to set something up that watches invalid events in real-time and does something with them. The question is: what? E-mail an alert? Produce a daily report? Generate a graph?
If you describe how you’d like to consume the data, I can try to hash out an implementation with Nuria and Christian.
a JSON log like all-events.log but sync’ed from vanadium more frequently would do the job for me. It can also be truncated as we probably only need a relatively short time window and the complete data is captured in all-events anyway.
D
Team:
Besides the ability to test in beta labs and the monitoring that Ori highlighted, the incoming raw stream of events is available on stat1003/stat1002 on port 8600.
From stat1002 or stat1003 you can run: zsub vanadium.eqiad.wmnet:8600 and see the incoming stream.
I am not sure that anything beyond that is needed; please check it out and let us know.
Thanks,
Nuria
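For reference, zsub is roughly a plain ZeroMQ SUB socket with no topic filter. A sketch of the equivalent in Python (requires pyzmq; the endpoint is the one given above, and one-JSON-object-per-message is an assumption about the stream format, so adjust to what actually arrives):

```python
# Sketch: roughly what `zsub vanadium.eqiad.wmnet:8600` does.
# Requires pyzmq (third-party). One JSON object per message is an
# assumption; lines that don't parse are skipped.
import json

def parse_event(message):
    """Decode one stream message; return None if it isn't JSON."""
    try:
        return json.loads(message)
    except ValueError:
        return None

def tail_stream(endpoint="tcp://vanadium.eqiad.wmnet:8600"):
    import zmq  # pip install pyzmq
    sock = zmq.Context().socket(zmq.SUB)
    sock.setsockopt_string(zmq.SUBSCRIBE, "")  # subscribe to everything
    sock.connect(endpoint)
    try:
        while True:
            event = parse_event(sock.recv_string())
            if event is not None:
                print(event.get("schema"), event)
    finally:
        sock.close()  # disconnect promptly: subscribing adds load to vanadium
```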
Hi,
On Thu, Dec 11, 2014 at 06:03:15PM -0800, Nuria Ruiz wrote:
Besides the ability of testing in beta labs and the monitoring that ori highlited the incoming raw stream of events is available in 1003/1002 on port 8600.
That's not the raw stream, but the multiplexed stream of validated events. Hence, it does not contain the invalid events Dario is looking for.
From 1002 or 1003 you can run: zsub vanadium.eqiad.wmnet:8600 and see the incoming stream.
This command adds to the load on vanadium -- especially the network load, which jumps up by 20% when the command is running. Since some parts of the pipeline on vanadium use UDP, it's better not to saturate the network.
So please only run the command if you have to, and only as long as you have to.
Have fun, Christian
Hi Dario,
On Thu, Dec 11, 2014 at 04:11:49PM -0800, Dario Taraborelli wrote:
I am kicking off this thread [...]
Thanks!
However, there are types of data quality issues that we only discover when collecting data at scale and in the wild (on browsers/platforms that we don’t necessarily test for internally).
Full ACK.
However, that sounds like we're only talking about schemas where the collection code got tested using Vagrant or beta, and is known to work on the relevant portion of the traffic.
And since you say that it's on browsers/platforms that we don't necessarily test for internally, I assume we're actually talking only about a small fraction of the traffic.
I assume that scope for the rest of the reply.
is there a way to inspect invalid events in near real time without having access to vanadium?
* Urgent, ad-hoc needs
For urgent, ad-hoc needs, (which should happen really seldom, given the scope), ping us in IRC in #wikimedia-analytics. At least qchris, milimetric, and nuria should be able to ssh into vanadium and can take a look right away.
If none of them are around, Ops of course have access to the relevant files on vanadium [1]. And since we're in the case of urgent, ad-hoc needs, I am sure they'd help out.
* Not so urgent needs
For not so urgent needs, since it's only a small fraction of the traffic, I am not sure real-time access is worth it.
Sure, it would be nice to provide near real-time access to those files, but we should also get the cluster into a more reliable state, implement UDFs for researchers to make their lives easier, and get the data-warehouse up and running ;-)
But I see that meanwhile a Phabricator task got added, and I guess I am alone with my judgement :-)
Have fun, Christian
[1] Either
/srv/log/eventlogging/client-side-events.log
or
/srv/log/eventlogging/server-side-events.log
depending on the kind of event you're looking for.
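Given those paths, a one-off scan on vanadium could look like the sketch below (the schema name is a placeholder, and one-JSON-object-per-line is an assumption about the log format; unparseable lines are skipped):

```python
# Sketch: pull events for one schema out of an EventLogging logfile.
# File path from the footnote above; the schema name is hypothetical.
import json

def scan(lines, schema):
    """Yield decoded events for the given schema, skipping unparseable lines."""
    for line in lines:
        try:
            event = json.loads(line)
        except ValueError:
            continue  # raw or garbled line
        if event.get("schema") == schema:
            yield event

# e.g.:
# with open("/srv/log/eventlogging/client-side-events.log") as f:
#     for ev in scan(f, "MobileWebClickTracking"):
#         print(ev)
```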
But I see that meanwhile a Phabricator task got added, and I guess I am alone with my judgement :-)
Actually, I fully agree with you that no more infrastructure in this regard is needed, and I think we were a little fast filing tasks here. I really think that every time we find ourselves testing in production we should evaluate what we can do better in the testing pipeline, rather than augment production with more "testing" tools.
For now we should be able to help in irc and do as much testing as possible in beta labs. How to access data in beta labs is documented here: https://wikitech.wikimedia.org/wiki/EventLogging/Testing/BetaLabs
I talked to the mobile team about testing in beta labs (as it was an issue with mobile instrumentation that sparked this discussion) and they have used it recently.
Thanks,
Nuria
I closed the Phabricator task with links to this thread and the wikitech doc for testing on the beta cluster. https://phabricator.wikimedia.org/T78355
Hi,
On Mon, Dec 15, 2014 at 08:34:39AM -0800, Kevin Leduc wrote:
I closed the Phabricator task with links to this thread and the wikitech doc for testing on the beta cluster.
I am fine with keeping the task closed.
But I am somewhat surprised to see beta mentioned in the resolution. Note that Dario's request set scope as [1]
However, there are types of data quality issues that we only discover when collecting data at scale and in the wild (on browsers/platforms that we don’t necessarily test for internally).
That's a valid scope, but from my point of view, beta does not match that scope.
Neither is beta large scale, nor is it hammered on with crazy devices.
Beta just halves the distance between EventLogging's devserver (Vagrant!) and production.
Have fun, Christian
[1] https://lists.wikimedia.org/pipermail/analytics/2014-December/002884.html
I share Christian's concerns -
Dario/Leila - can you comment based on your recent experiences with WikiGrok?
Thanks
-Toby
I filed a bug about the difficulty of debugging schema failures back in November, but no one ever responded to it: https://phabricator.wikimedia.org/T75678
On Mon, Dec 15, 2014 at 10:06 AM, Toby Negrin tnegrin@wikimedia.org wrote:
I share Christian's concerns -
Dario/Leila - can you comment based on your recent experiences with WikiGrok?
I agree with Christian.
QA in beta labs is good but not enough. We still need to do QA when a feature goes to production and currently, it's very hard to figure out if there's a problem with logging. An example:
While testing WikiGrok in production, we learned that at some point tests from the Firefox browser on my machine were not being logged. We did not get any errors for this. I found out about it because I was trying to manually trace my activities and see if I could stitch them together and make sense of them. We eventually figured out what was going on in that case [1], but it concerns me that there may be other important events that we don't log in the DB, and that we never know we're not logging.
Leila [1] https://lists.wikimedia.org/pipermail/analytics/2014-December/002864.html
QA in beta labs is good but not enough. We still need to do QA when a feature goes to production and currently...
This is true, but at the same time I do not see anything in the description of your FF events that could not be tested on beta labs. If we are talking ad-block, that can be tested even earlier; vagrant would be a fine venue. All the issues related to the client (browser) not emitting events can be tested on the development environment with ease.
I reopened the task because discussions on this are still ongoing and the issue isn't entirely resolved.
I'd like to move this to a video conference call between analytics developers and analytics engineering to come to a mutual understanding of what the current pain points are and what's the biggest priority. We'll then communicate a plan back to the list and update the tasks involved.
On Mon, Dec 15, 2014 at 4:37 PM, Nuria Ruiz nuria@wikimedia.org wrote:
QA in beta labs is good but not enough. We still need to do QA when a
feature goes to production and currently This is true but at the same time, I do not see anything in the description of your FF events that could not be tested on beta-labs. If we are talking add-block that can be tested even earlier, vagrant will be a fine venue. All the issues related to the client (browser) not emitting events can be tested on the development environment with ease.
On Mon, Dec 15, 2014 at 4:18 PM, Leila Zia leila@wikimedia.org wrote:
On Mon, Dec 15, 2014 at 10:06 AM, Toby Negrin tnegrin@wikimedia.org wrote:
I share Christian's concerns -
Dario/Leila - can you comment based on your recent experiences with WikiGrok?
I agree with Christian.
QA in beta labs is good but not enough. We still need to do QA when a feature goes to production and currently, it's very hard to figure out if there's a problem with logging. An example:
While testing WikiGrok in production, we learned that after some point tests from Firefox browser from my machine were not logged. We did not get any errors for this. I found out about this because I was trying to manually make a trace of activities and see if I can stitch them together and make sense of them. We eventually figured out what was going on in that case [1], but it concerns me that there may be other important events that we don't log in the DB and we never know that we're not logging.
Leila [1] https://lists.wikimedia.org/pipermail/analytics/2014-December/002864.html
Thanks
-Toby
On Dec 15, 2014, at 9:42 AM, Christian Aistleitner <
christian@quelltextlich.at> wrote:
Hi,
On Mon, Dec 15, 2014 at 08:34:39AM -0800, Kevin Leduc wrote: I closed the Phabricator task with a links to this thread and the
wikitech
doc for testing on beta cluster.
I am fine with keeping the task closed.
But I am somewhat surprised to see beta mentioned in the resolution. Note that Dario's request set the scope as [1]:

> However, there are types of data quality issues that we only discover when collecting data at scale and in the wild (on browsers/platforms that we don’t necessarily test for internally).

That's a valid scope, but from my point of view, beta does not match it. Beta is neither large scale, nor is it hammered on by crazy devices. Beta just halves the distance between EventLogging's devserver (Vagrant!) and production.
Have fun, Christian
[1]
https://lists.wikimedia.org/pipermail/analytics/2014-December/002884.html
--
---- quelltextlich e.U. ----  \  ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3     Email: christian@quelltextlich.at
4293 Gutau, Austria          Phone: +43 7946 / 20 5 81
                             Fax:   +43 7946 / 20 5 81
                             Homepage: http://quelltextlich.at/
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
On Monday, December 15, 2014, Kevin Leduc kevin@wikimedia.org wrote:
I'd like to move this to a video conference call between analytics developers and analytics engineering to come to a mutual understanding of what the current pain points are and what's the biggest priority.
It probably makes sense to have someone from R&D with experience in QA in that meeting (Dario if you want a more experienced person, myself otherwise). Not sure if you meant the same when you said analytics engineering.
Leila
I added a comment to the ticket requesting a simple error log for validation errors. I think that would solve about 50% of the problem and should be easy to implement.
Kaldari
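To make the request concrete, here is a minimal sketch of what such a validation-error log could look like. Everything here is hypothetical: the field names, the log path, and the single-schema check are stand-ins, and the real EventLogging pipeline does full JSON Schema validation rather than this simplified required-fields test. The point is only that events failing validation get written somewhere inspectable instead of being silently dropped.

```python
import json
import logging

# Hypothetical required fields for one event schema; real EventLogging
# validates against full JSON Schemas fetched from the schema wiki.
REQUIRED_FIELDS = {"name", "mobileMode", "userEditCount"}

# Hypothetical destination; the idea is a file that could be rsync'ed to
# stat1003 (or tailed) far more often than once a day.
logging.basicConfig(
    filename="eventlogging-validation-errors.log",
    level=logging.WARNING,
    format="%(asctime)s %(message)s",
)

def validate_event(raw_event: str) -> bool:
    """Return True if the event passes; otherwise log why it failed."""
    try:
        event = json.loads(raw_event)
    except ValueError as err:
        logging.warning("malformed JSON: %r (%s)", raw_event, err)
        return False
    missing = REQUIRED_FIELDS - event.keys()
    if missing:
        logging.warning("missing required fields %s in %r",
                        sorted(missing), raw_event)
        return False
    return True
```

A consumer could then watch this log (or a dedicated invalid-events table fed the same way) to catch schema-compliance problems within minutes of a deployment instead of after the daily rsync.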
On Mon, Dec 15, 2014 at 5:58 PM, Leila Zia leila@wikimedia.org wrote:
On Monday, December 15, 2014, Kevin Leduc kevin@wikimedia.org wrote:
I'd like to move this to a video conference call between analytics developers and analytics engineering to come to a mutual understanding of what the current pain points are and what's the biggest priority.
It probably makes sense to have someone from R&D with experience in QA in that meeting (Dario if you want a more experienced person, myself otherwise). Not sure if you meant the same when you said analytics engineering.
Leila