Adding Analytics list and Neil P Quinn,
FYI, Nuria filed this ticket to track this issue: https://phabricator.wikimedia.org/T123634 Moving discussion there.
On Thu, Jan 14, 2016 at 1:10 AM, Faidon Liambotis faidon@wikimedia.org wrote:
Well, this was a snapshot of the situation then. It doesn't preclude other issues (possibly caused by other, similarly-sized queries) in the previous hours/days. That said, tendril definitely shows a correlation between all kinds of metrics (page/disk I/O, write traffic, etc.) and the aforementioned timeline over the past 1.5 days.
Plus, replag for s1 was at the time ~133,900 seconds and rising, which matches the timeline of that large query too. Since I killed it, it has been steadily dropping, albeit slowly (currently at 132,445). It will probably take a couple of days to recover. Since the server is both backlogged and I/O-saturated, recovery will depend a lot on how much load the server gets from other queries (currently it's being hammered by two other large queries that have been running for over 27,000 and 4,000 seconds respectively, for example).
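For reference, the same lag figure can be read on the replica itself (assuming the standard MySQL replication that the s* shards use):

    SHOW SLAVE STATUS\G
    -- Seconds_Behind_Master is the replag figure quoted above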
Faidon
On Wed, Jan 13, 2016 at 09:44:50PM -0800, Oliver Keyes wrote:
Indeed, but 1.5 days is less than half the time the problem has been occurring for.
On 13 January 2016 at 21:01, Faidon Liambotis faidon@wikimedia.org wrote:
"SELECT * FROM information_schema.processlist ORDER BY time DESC" informs us of this:
| 5599890 | research | 10.64.36.103:53669 | enwiki | Query | 133527 | Queried about 890000 rows | CREATE TEMPORARY TA
SELECT page_id, views FROM ( SELECT page_namespace, page_title, SUM(views) AS views FROM staging.page_name_views_dupes WHERE page_namespace = 0 GROUP BY 1,2 ) AS group_page_name_views INNER JOIN enwiki.page USING (page_namespace, page_title)
Column 6 is "time", i.e. this query was running for 133527 seconds at the time (i.e. ~1.5 days!), which is obviously Not Good™. I just ran "KILL QUERY 5599890;", hopefully this will help.
The next-longest-running query has been going for over 6 hours now and is way too long to paste (87 lines; it starts with "INSERT INTO editor_month_global" and has inline comments, all kinds of subselects in inner joins, etc., and has queried "about 2470000 rows"). I've left it be for now; we'll see how that goes, and I may eventually kill it too, as I/O is still pegged at 100%.
I believe long-running queries targeted at the research slaves aren't particularly new, but they are often the source of such problems, so that's a good place to start when investigating these issues. There is only so much a poor database server (and software) can do :)
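If this keeps recurring, one low-tech option (just a sketch, not something we run today; the user filter and the 4-hour cutoff are illustrative) would be to generate the KILL statements for review instead of hunting them down by hand:

    SELECT CONCAT('KILL QUERY ', id, ';') AS kill_stmt,
           time, LEFT(info, 80) AS query_head
    FROM information_schema.processlist
    WHERE user = 'research' AND command = 'Query' AND time > 14400
    ORDER BY time DESC;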
Regards, Faidon
On Wed, Jan 13, 2016 at 06:55:26PM -0500, Andrew Otto wrote:
Hi all,
Replication to dbstore1002 is having a lot of trouble. From https://tendril.wikimedia.org/host/view/dbstore1002.eqiad.wmnet/3306, we see that normal replication is about 9 hours behind at the moment. However, the EventLogging `log` database is not replicated with usual MySQL replication. Instead, a custom bash script[1] periodically uses mysqldump to copy data from m4-master (dbproxy1004) into dbstore1002. (I just recently found out that this wasn't regular replication, and I'm not familiar with the reasoning behind using a custom script.)
The EventLogging `log` custom replication has been lagging for days now. Also, at about 18:00 UTC today (Jan 13), we can see a huge increase in write traffic on dbstore1002. I looked at each of the normal replication masters, and don't see this write traffic there. EventLogging traffic also seems to be about the same over the last week or so (there was an increase in events produced by the MobileWebSectionUsage schema starting Dec 18, but I don't think this is the problem).
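For what it's worth, per-schema volume can be eyeballed straight from the `log` database with something like the query below (a sketch: the revision suffix in the table name is hypothetical, and it assumes the standard EventLogging `timestamp` column in YYYYMMDDHHMMSS format):

    SELECT LEFT(timestamp, 8) AS day, COUNT(*) AS events
    FROM log.MobileWebSectionUsage_14321266  -- hypothetical revision suffix
    WHERE timestamp >= '20151218'
    GROUP BY day
    ORDER BY day;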
Disk util is around 100%, but this has been the case for a while now. Today I filed https://phabricator.wikimedia.org/T123546 for a bad mem chip or slot on dbproxy1004, but this also seems to have been the status quo for quite a while, and doesn't correlate with this lag.
I'm not sure where else to look at the moment, and I need to run for the day. I'll try to look at this more tomorrow in my morning.
-AO
[1] https://github.com/wikimedia/operations-puppet/blob/f0df1ec45b3f70a5c041cef2...
--
Oliver Keyes
Count Logula
Wikimedia Foundation