Well, this was a snapshot of the situation then. It doesn't preclude
other issues (possibly caused by other, similarly-sized queries) in the
previous hours/days. That said, tendril definitely shows a correlation
between all kinds of metrics (page/disk I/O, write traffic etc.) and the
aforementioned timeline of the past 1.5 days.
Plus, replag for s1 was at the time ~133,900 seconds and rising, which
matches the timeline of that large query too. Since I killed it, the lag
has been steadily dropping, albeit slowly (currently at 132,445). It
will probably take a couple of days to recover. Since the server is both
backlogged and I/O-saturated, recovery will depend a lot on how much
load the server gets from other queries (for example, it's currently
getting hammered by two other large queries that have been running for
over 27,000 and 4,000 seconds respectively).
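
For reference, the lag can also be watched from the host itself; a
minimal sketch, assuming dbstore1002 runs MariaDB multi-source
replication with per-shard connection names:

    SHOW ALL SLAVES STATUS\G
    -- look for the row with Connection_name = 's1';
    -- Seconds_Behind_Master is the replication lag, in seconds
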
Faidon
On Wed, Jan 13, 2016 at 09:44:50PM -0800, Oliver Keyes wrote:
> Indeed, but 1.5 days is less than half the time the problem has been occurring for.
>
> On 13 January 2016 at 21:01, Faidon Liambotis <faidon@wikimedia.org> wrote:
> > "SELECT * FROM information_schema.processlist ORDER BY time DESC"
> > informs us of this:
> >
> > | 5599890 | research | 10.64.36.103:53669 | enwiki | Query | 133527 | Queried about 890000 rows | CREATE TEMPORARY TA
> >     SELECT
> >         page_id,
> >         views
> >     FROM (
> >         SELECT
> >             page_namespace,
> >             page_title,
> >             SUM(views) AS views
> >         FROM staging.page_name_views_dupes
> >         WHERE page_namespace = 0
> >         GROUP BY 1,2
> >     ) AS group_page_name_views
> >     INNER JOIN enwiki.page USING (page_namespace, page_title)
> >
> > Column 6 is "time": this query had been running for 133527 seconds at
> > that point (i.e. ~1.5 days!), which is obviously Not Good™. I just ran
> > "KILL QUERY 5599890;"; hopefully this will help.
> >
> > The next-longest-running query has been going for over 6 hours now and
> > is way too long to paste (87 lines, starts with "INSERT INTO
> > editor_month_global", inline comments, all kinds of subselects in inner
> > joins etc., queried "about 2470000 rows"). I've left it be for now;
> > we'll see how that goes, and I may eventually kill it too, as I/O is
> > still pegged at 100%.
> >
> > I believe long-running queries targeted at the research slaves aren't
> > particularly new, but they are often the source of such problems, so
> > they're a good place to start when investigating issues like this.
> > There is only so much a poor database server (and software) can do :)
> >
> > Regards,
> > Faidon
> >
> > On Wed, Jan 13, 2016 at 06:55:26PM -0500, Andrew Otto wrote:
> >> Hi all,
> >>
> >> Replication to dbstore1002 is having a lot of trouble. From
> >> https://tendril.wikimedia.org/host/view/dbstore1002.eqiad.wmnet/3306, we
> >> see that normal replication is about 9 hours behind at the moment.
> >> However, the EventLogging `log` database is not replicated with usual MySQL
> >> replication. Instead, a custom bash script[1] periodically uses mysqldump
> >> to copy data from m4-master (dbproxy1004) into dbstore1002. (I just
> >> recently found out that this wasn’t regular replication, and I’m not
> >> familiar with the reasoning behind using a custom script.)
> >>
> >> The EventLogging `log` custom replication has been lagging for days now.
> >> Also, at about 18:00 UTC today (Jan 13), we can see a huge increase in
> >> write traffic on dbstore1002. I looked at each of the normal replication
> >> masters, and don’t see this write traffic there. EventLogging traffic also
> >> seems to be about the same over the last week or so (although there was an
> >> increase in events being produced by the MobileWebSectionUsage schema
> >> starting Dec 18, but I don’t think this is the problem).
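> >>
> >> As a rough cross-check, independent of the graphs, raw write rates could
> >> be compared across hosts by sampling the write counters twice, a minute
> >> apart, and diffing, e.g.:
> >>
> >>     SHOW GLOBAL STATUS LIKE 'Com_insert';
> >>     SHOW GLOBAL STATUS LIKE 'Com_update';
> >>     SHOW GLOBAL STATUS LIKE 'Com_delete';
> >>
> >> but I haven't done that here.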
> >>
> >> Disk util is around 100%, but this has been the case for a while now.
> >> Today I filed https://phabricator.wikimedia.org/T123546 for a bad mem chip
> >> or slot on dbproxy1004, but this also seems to have been the status quo for
> >> quite a while, and doesn’t correlate with this lag.
> >>
> >> I’m not sure where else to look at the moment, and I need to run for the
> >> day. I’ll try to look at this more tomorrow in my morning.
> >>
> >> -AO
> >>
> >> [1]
> >> https://github.com/wikimedia/operations-puppet/blob/f0df1ec45b3f70a5c041cef217751014824ca6ec/files/mariadb/eventlogging_sync.sh
> >
>
>
>
> --
> Oliver Keyes
> Count Logula
> Wikimedia Foundation