(This is a little note I have meant to write for a while. I am sending it both as a heads-up for other people who work with this data - many may have encountered some of these issues, but not everybody may be aware of all of them - and as a contribution to the discussion about the Analytics team's "operational excellence" quarterly goal https://www.mediawiki.org/wiki/Wikimedia_Engineering/2016-17_Q3_Goals#Analytics for Q3.)
So, EventLogging has been a highly useful part of our analytics infrastructure for years now, critical for the work of many teams. However, over the course of this year there have been several longstanding issues that make me wonder if we are giving it enough attention infrastructure-wise.
1. https://phabricator.wikimedia.org/T146840 Major loss of events in many different schemas, apparently differing by browser family. This affected e.g. one of the main metrics we've been using to evaluate hovercards (page previews) in the reading Web team and was the reason we had to restrict the analysis of recent A/B tests there to Firefox only. It also created confusion for users of the Discovery department's mobile search dashboard and affected the Edit schema as well. No reaction on the task from Analytics since September 28.
2. https://phabricator.wikimedia.org/T142667 Duplicate (spurious) EventLogging rows, a long-term issue first observed, independently, by people from the Reading Web team and myself around April/May. The effect on query results is small in most cases, but significant in some, and in any case does not raise confidence in the quality of the data - we would at least like to know what the most likely explanations are. No reaction from Analytics since August, despite four "The World Burns" tokens by other data analysts and a reminder from Reading management.
3. "ERROR 2013 (HY000): Lost connection to MySQL server during query" and "ERROR 2006 (HY000): MySQL server has gone away" when trying to query EL data from stat1003. This happens infrequently, but often enough to be a major nuisance at times. (I haven't filed a Phabricator task for this yet, but have brought it up on IRC various times. Arguably this is more of a database/service quality issue, but I'm not certain it can't affect query results as well.)
There are various other EL issues I have been encountering more sporadically (and in some cases still need to file Phabricator tasks for), but these are some of the most important.
I am wondering whether this list may be a better venue for raising awareness when things get stale on Phabricator.
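For the duplicate rows described under T142667 above, one rough way to gauge their effect on a query result is to count how many rows share an identifier that should be unique per event. The sketch below is only an illustration: it assumes each row carries a `uuid` field that is meant to be unique, and the sample rows and the `event_action` field are hypothetical, not taken from any real schema.

```python
from collections import Counter

def duplicate_share(rows, key="uuid"):
    """Return (surplus_rows, share_of_all_rows) for rows repeating the same key."""
    counts = Counter(row[key] for row in rows)
    # Every occurrence beyond the first one counts as a surplus (duplicate) row.
    surplus = sum(n - 1 for n in counts.values() if n > 1)
    total = sum(counts.values())
    return surplus, (surplus / total if total else 0.0)

# Hypothetical sample; real rows would come from a query against the EL tables.
sample = [
    {"uuid": "a1", "event_action": "click"},
    {"uuid": "a1", "event_action": "click"},
    {"uuid": "b2", "event_action": "hover"},
    {"uuid": "c3", "event_action": "click"},
]

surplus, share = duplicate_share(sample)
print(surplus, share)
```

Comparing this share across schemas or time ranges would show where the effect is negligible and where it is large enough to distort a metric.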
Thanks for the email, Tilman. I will read it in depth and look closely at the issues, but I want to point out something majorly important:
*** We are NOT certain to see tasks unless they're tagged with "ANALYTICS". We have an outstanding ask from the phab team and upstream to solve issues that will help us get around this limitation. But in the meantime, if you want us to see a task you MUST tag it with Analytics. ***
As a result, I personally didn't see these tasks until your email just now. I hope my instant response and reaction will help prove that I take them seriously. I have tagged those tasks with Analytics and also put them in our working board to give them immediate priority.
p.s. the "ERROR 2013 (HY000): Lost connection to MySQL server during query" errors are, as far as I understand, just time-outs that help the DBA teams manage performance on the analytics servers. I have never seen them affect results, and wikimetrics has a way of actively waking up connections that die in this way.
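If those errors are indeed time-outs, a client-side workaround is to reconnect and rerun the whole query when one of the two error codes quoted above (2006, 2013) comes back. The following is a minimal sketch under that assumption; it is not how wikimetrics does it. `LostConnection`, `run_with_retry`, `connect` and `run_query` are hypothetical names standing in for whatever the actual database driver provides.

```python
import time

# Error codes quoted in the thread: 2006 (server has gone away)
# and 2013 (lost connection during query).
RETRYABLE_CODES = {2006, 2013}

class LostConnection(Exception):
    """Hypothetical stand-in for a driver error that carries a MySQL error code."""
    def __init__(self, code):
        super().__init__(code)
        self.code = code

def run_with_retry(connect, run_query, attempts=3, delay=0.0):
    """Open a fresh connection and rerun the whole query after a retryable error."""
    for attempt in range(1, attempts + 1):
        conn = connect()
        try:
            return run_query(conn)
        except LostConnection as err:
            # Give up on unknown errors or once the last attempt has failed.
            if err.code not in RETRYABLE_CODES or attempt == attempts:
                raise
            time.sleep(delay)

# Usage with a fake query that fails once and then succeeds.
calls = {"n": 0}

def flaky_query(conn):
    calls["n"] += 1
    if calls["n"] == 1:
        raise LostConnection(2013)
    return [("Firefox", 42)]

result = run_with_retry(lambda: object(), flaky_query)
print(result, calls["n"])
```

Rerunning the query from scratch, rather than trying to resume it, means a dropped connection can never leave a partial result behind, which is the concern raised in the original note.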
On Fri, Dec 16, 2016 at 1:59 PM, Tilman Bayer tbayer@wikimedia.org wrote:
-- Tilman Bayer Senior Analyst Wikimedia Foundation IRC (Freenode): HaeB
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
I was about to send a similar e-mail: unless you tag us with Analytics, we will not see the issues. Thus, if issues are important enough and we have not responded (for the most part we respond quite fast to operational issues), DO ping us on IRC. That is what the channel is for.
It is very unlikely that, 3 months after the fact, we would know what happened on EL on September 9th. We retain neither operational logs nor data logs that long, so that ticket would probably be closed without resolution because we did not learn about it promptly enough.
Thanks,
Nuria
Update on https://phabricator.wikimedia.org/T146840
I will be closing this ticket shortly, as we actually already looked at that problem. It was related to a MediaWiki performance regression ultimately caused by a bug in Chrome.
Thanks for looking into these issues, folks, really appreciated!
On Fri, Dec 16, 2016 at 11:35 AM, Nuria Ruiz nuria@wikimedia.org wrote:
> I was about to send similar e-mail, unless you tag us with Analytics we will not see the issues.
Indeed, the tasks were not tagged with "Analytics" - only with "Analytics-EventLogging." Thanks to you and Dan for explaining the process and the Phabricator issue that prevents the team from seeing new tasks on "Analytics-EventLogging" - I will try to remember to add everything to "Analytics" from now on. While we are waiting for Phacility to fix this upstream, how about documenting this somewhere, for example by adding a warning to the Analytics-Eventlogging board https://phabricator.wikimedia.org/project/profile/589/ that it is not being watched?
> Thus, if issues are important enough and we have not responded (for the most part we respond quite fast to operational issues) DO ping us on irc. That is what the channel is for.
But Nuria, Jon had pinged you directly at https://phabricator.wikimedia.org/T142667#2555765 four months ago. Are you saying that personal Phabricator pings are also not a reliable way to get the Analytics team's attention?
In any case, I'm a regular IRC user and know how to find you folks there, so yes, I can go that extra step too (in this case, I chose another public venue, this list, as do other people to get the team's attention). I'll note though that these tasks were not "operational issues" like say an analytics server outage that needs immediate attention and justifies interrupting the team's work - but also not something that should have been left unattended for many months.
I will also note that the wikibugs bot in #wikimedia-analytics does actually capture tasks from Analytics-EventLogging too, i.e. over the course of this year numerous updates about these tasks have scrolled by in this channel with a subject line indicating serious data quality issues. Sure, not a formal notification, but one would have hoped it increased the chances of these tasks catching some attention (as they did from several people outside the team, including one DBA).
On Sat, Dec 17, 2016 at 12:47 AM, Tilman Bayer tbayer@wikimedia.org wrote:
While we are waiting for Phacility to fix this upstream, how about documenting this somewhere, for example by adding a warning to the Analytics-Eventlogging board https://phabricator.wikimedia.org/project/profile/589/ that it is not being watched?
Done
> Are you saying that personal Phabricator pings are also not a reliable way to get the Analytics team's attention?
If they are not answered, no. I will either ping again or ping on IRC. I get pinged a lot in Phab and while I try to respond promptly I might miss a ping. I am sorry I missed this one instance.
On Sat, Dec 17, 2016 at 8:31 AM, Dan Andreescu dandreescu@wikimedia.org wrote:
On Sat, Dec 17, 2016 at 12:47 AM, Tilman Bayer tbayer@wikimedia.org wrote:
While we are waiting for Phacility to fix this upstream, how about documenting this somewhere, for example by adding a warning to the Analytics-Eventlogging board https://phabricator.wikimedia.org/project/profile/589/ that it is not being watched?
Done
On Fri, 2016-12-16 at 21:47 -0800, Tilman Bayer wrote:
While we are waiting for Phacility to fix this upstream
Only for the records: There is nothing to fix "upstream".
andre
Yes, we were working on some outdated information that seems to have been fixed since we originally had the problem. I have a request to get permissions and then I'll set up the rules in phab so that this kind of thing doesn't happen again.
Original Message
From: Andre Klapper
Sent: Monday, December 19, 2016 22:20
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Reply To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Subject: Re: [Analytics] EventLogging data quality issues
On Fri, 2016-12-16 at 21:47 -0800, Tilman Bayer wrote:
While we are waiting for Phacility to fix this upstream
Only for the records: There is nothing to fix "upstream".
andre
On Fri, 2016-12-16 at 14:16 -0500, Dan Andreescu wrote:
*** We are NOT certain to see tasks unless they're tagged with "ANALYTICS". We have an outstanding ask from the phab team and upstream to solve issues that will help us get around this limitation.
Can you please provide more context (= a link to a task explaining the problem)?
andre
> Can you please provide more context (= a link to a task explaining the problem)?
Here it is: https://phabricator.wikimedia.org/T146042
On Mon, Dec 19, 2016 at 8:28 AM, Andre Klapper aklapper@wikimedia.org wrote:
On Fri, 2016-12-16 at 14:16 -0500, Dan Andreescu wrote:
*** We are NOT certain to see tasks unless they're tagged with "ANALYTICS". We have an outstanding ask from the phab team and upstream to solve issues that will help us get around this limitation.
Can you please provide more context (= a link to a task explaining the problem)?
andre
Andre Klapper | Wikimedia Bugwrangler http://blogs.gnome.org/aklapper/
<quote name="Nuria Ruiz" date="2016-12-19" time="09:02:00 -0800">
Can you please provide more context (= a link to a task explaining the problem)?
Here it is: https://phabricator.wikimedia.org/T146042
This is just a task about removing one Herald rule that applies #analytics to everything reported in #pageview-api. I found the relevant Herald rule and will disable it if someone with a stake in it replies on that task.
You also mentioned an upstream issue, which one is that?
Greg
On Mon, Dec 19, 2016 at 1:03 PM, Greg Grossmeier greg@wikimedia.org wrote:
You also mentioned an upstream issue, which one is that?
The problem is, we use #Analytics as our backlog and #Analytics-Kanban as our working board. When we decide to work on something, we remove #Analytics and add #Analytics-Kanban. But if there's a Herald rule in place to auto-tag it, it will add #Analytics to it again if it's also tagged in such a way as to trigger the Herald rule. Therefore, we can't use #Analytics-EventLogging and such a rule, so we didn't make these rules.
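The tagging loop described above can be made concrete with a tiny model. This is only a sketch of the workflow as stated in this message, not of Phabricator's or Herald's actual behaviour or API: `apply_herald_rule` and `move_to_kanban` are hypothetical helpers, and the model assumes the rule is re-evaluated on every edit of a task.

```python
def apply_herald_rule(tags, trigger="Analytics-EventLogging", add="Analytics"):
    """Hypothetical auto-tagging rule: add `add` whenever `trigger` is present."""
    return tags | {add} if trigger in tags else set(tags)

def move_to_kanban(tags):
    """Triage step as described: drop the backlog tag, add the working board."""
    return (tags - {"Analytics"}) | {"Analytics-Kanban"}

task = {"Analytics-EventLogging"}
task = apply_herald_rule(task)  # the rule puts the new task into the backlog
task = move_to_kanban(task)     # the team starts working on it
task = apply_herald_rule(task)  # the rule fires again on the next edit

# The task is now on the backlog and the working board at the same time.
print(sorted(task))
```

In this model the backlog tag keeps coming back for as long as the triggering tag stays on the task, which is the conflict with the backlog/working-board split.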
I tried to find a task for this in Phab but failed. Sorry if my memory invented such a task, should I file it now?
<quote name="Dan Andreescu" date="2016-12-19" time="17:25:37 -0500">
I tried to find a task for this in Phab but failed. Sorry if my memory invented such a task, should I file it now?
Yeah, that'd be useful. We can't help work on things we don't know about :) (I sense a theme in this thread....)
Feel free to ping me on IRC, we can leave this list for more on-topic things now.
Greg