Just a heads up that the last of the 1.19 migrations, adding the sha1 column to enwiki.revision, is going to be running throughout this week. Don't be alarmed by replication lag messages for s1 DBs in IRC. I'm going to juggle which DB watchlist queries go to during the migration, so nothing should be noticeable on the site.
-Asher
Asher Feldman wrote:
Just a heads up that the last of the 1.19 migrations, adding the sha1 column to enwiki.revision, is going to be running throughout this week. Don't be alarmed by replication lag messages for s1 DBs in IRC. I'm going to juggle which DB watchlist queries go to during the migration, so nothing should be noticeable on the site.
Hi.
This is causing significant problems for the scripts that I run that rely on the MediaWiki API. The scripts were designed to be respectful of database lag, but with db36 reporting 20879 seconds of lag (and rising), the scripts all hang as they patiently wait for the lag to go back down.
I'm not sure of the exact configuration, but it seems like nearly every API request is being handled by the lagged server (db36)? Or perhaps my scripts just have terrible luck.
In any case, is it possible to depool the server or do something to mitigate this? I don't think we want to encourage script writers to write scripts that simply ignore the lag, but there aren't many other options when it's outrageously high. Alternatively, could the other servers be assigned to API duty?
MZMcBride
MZMcBride wrote:
I'm not sure of the exact configuration, but it seems like nearly every API request is being handled by the lagged server (db36)? Or perhaps my scripts just have terrible luck.
I added some prints to the code. Different servers are responding, but they're all unable to get past the lag, apparently:
{u'servedby': u'srv234', u'error': {u'info': u'Waiting for 10.0.6.46: 21948 seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'srv242', u'error': {u'info': u'Waiting for 10.0.6.46: 21982 seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'mw20', u'error': {u'info': u'Waiting for 10.0.6.46: 21984 seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'mw45', u'error': {u'info': u'Waiting for 10.0.6.46: 21986 seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'mw14', u'error': {u'info': u'Waiting for 10.0.6.46: 21988 seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'mw42', u'error': {u'info': u'Waiting for 10.0.6.46: 21989 seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'mw3', u'error': {u'info': u'Waiting for 10.0.6.46: 21991 seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'srv230', u'error': {u'info': u'Waiting for 10.0.6.46: 22005 seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'srv259', u'error': {u'info': u'Waiting for 10.0.6.46: 22006 seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'srv274', u'error': {u'info': u'Waiting for 10.0.6.46: 22008 seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'srv280', u'error': {u'info': u'Waiting for 10.0.6.46: 22009 seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'srv236', u'error': {u'info': u'Waiting for 10.0.6.46: 22010 seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'srv230', u'error': {u'info': u'Waiting for 10.0.6.46: 22011 seconds lagged', u'code': u'maxlag'}}
And it goes on and on.
The relevant branch of code is:
---
def __parseJSON(self, data):
    maxlag = True
    while maxlag:
        try:
            maxlag = False
            parsed = json.loads(data.read())
            content = None
            if isinstance(parsed, dict):
                content = APIResult(parsed)
                content.response = self.response.items()
            elif isinstance(parsed, list):
                content = APIListResult(parsed)
                content.response = self.response.items()
            else:
                content = parsed
            if 'error' in content:
                error = content['error']['code']
                if error == "maxlag":
                    lagtime = int(re.search("(\d+) seconds", content['error']['info']).group(1))
                    if lagtime > self.wiki.maxwaittime:
                        lagtime = self.wiki.maxwaittime
                    print("Server lag, sleeping for "+str(lagtime)+" seconds")
                    maxlag = True
                    time.sleep(int(lagtime)+0.5)
                    return False
---
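(As an aside, here's a minimal sketch of how a client could cap the total wait instead of looping indefinitely. The helper, names, and limits below are made up for illustration; this isn't the actual wikitools code.)
---
import re
import time

MAX_TOTAL_WAIT = 600    # give up after ten minutes of cumulative waiting (arbitrary cap)
MAX_SINGLE_WAIT = 120   # never sleep more than two minutes per retry (arbitrary cap)

def wait_or_give_up(error_info, waited_so_far):
    """Parse the 'Waiting for <host>: N seconds lagged' message and sleep.
    Returns the updated cumulative wait, or raises once the cap is exceeded.
    (Hypothetical helper, for illustration only.)"""
    lag = int(re.search(r"(\d+) seconds", error_info).group(1))
    sleep_for = min(lag, MAX_SINGLE_WAIT)
    if waited_so_far + sleep_for > MAX_TOTAL_WAIT:
        raise RuntimeError("Giving up: replication lag is not recovering (%d s reported)" % lag)
    print("Server lag, sleeping for %d seconds" % sleep_for)
    time.sleep(sleep_for)
    return waited_so_far + sleep_for
---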
MZMcBride
I've temporarily commented out db36 from db.php on the cluster.
This is a flaw in how the client-side use of maxlag interacts with our schema migration process - we run migrations on slaves one by one in an automated fashion, only moving on to the next after replication lag catches up. MediaWiki takes care of not sending queries to the lagged slave that is under migration. Meanwhile, maxlag always reports the value of the most lagged slave. Not a new issue, but this particular ALTER TABLE on enwiki is likely the most time-intensive ever run at WMF. It's slightly ridiculous.
For this one ALTER, I can stop the migration script and run each statement by hand, pulling and re-adding DBs one by one along the way, but this isn't a sustainable process. Perhaps we can add a migration flag to MediaWiki which, if enabled, changes the behavior of maxlag and wfWaitForSlaves() to ignore one highly lagged slave so long as others are available without lag.
-A
On Mon, Mar 19, 2012 at 9:28 PM, MZMcBride z@mzmcbride.com wrote:
I added some prints to the code. Different servers are responding, but they're all unable to get past the lag, apparently.
Asher Feldman wrote:
I've temporarily commented out db36 from db.php on the cluster.
This is a flaw in how the client-side use of maxlag interacts with our schema migration process - we run migrations on slaves one by one in an automated fashion, only moving on to the next after replication lag catches up. MediaWiki takes care of not sending queries to the lagged slave that is under migration. Meanwhile, maxlag always reports the value of the most lagged slave. Not a new issue, but this particular ALTER TABLE on enwiki is likely the most time-intensive ever run at WMF. It's slightly ridiculous.
For this one ALTER, I can stop the migration script and run each statement by hand, pulling and re-adding DBs one by one along the way, but this isn't a sustainable process. Perhaps we can add a migration flag to MediaWiki which, if enabled, changes the behavior of maxlag and wfWaitForSlaves() to ignore one highly lagged slave so long as others are available without lag.
Thank you! My scripts are running fine again. :-)
Some people have suggested that the current API behavior is intentional. That is, having every server consistently return the same error code is considered better than having some servers return an error code and others not. I think this logic is flawed because of the problem it presents (scripts unable to get past the error code), but it's definitely something that needs investigation for the future.
MZMcBride
On Tue, Mar 20, 2012 at 7:12 AM, MZMcBride z@mzmcbride.com wrote:
Some people have suggested that the current API behavior is intentional. That is, having every server consistently return the same error code is considered better than having some servers return an error code and others not. I think this logic is flawed because of the problem it presents (scripts unable to get past the error code), but it's definitely something that needs investigation for the future.
You're confusing Apache servers and database servers. You're hitting different Apache servers at random, those are the srv numbers. Each Apache requests the current lag from all the DB servers serving that wiki, computes the maximum (in this case db36 wins easily) and checks that against your maxlag parameter. Obviously the highest lag is greater than your maxlag setting, so it throws an error and bails.
Had it continued, or had you not set a maxlag parameter, you would have hit a DB server at random (except that it wouldn't have let you hit db36 because it was lagged to hell and back). So yeah /normally/ you hit DB servers at random and different servers might respond differently (or be lagged to different degrees), but in this particular case it was always the same DB server returning the same lag value. Nothing strange going on here, this is how the maxlag parameter works.
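To sketch the same check in rough Python (illustration only; the real logic lives in MediaWiki's PHP load balancer and the host names other than db36 are made up):
---
# Illustrative sketch of the server-side maxlag check described above.
def check_maxlag(db_lags, requested_maxlag):
    """db_lags: mapping of DB host -> replication lag in seconds.
    Returns None if the request may proceed, or an error dict otherwise."""
    host, lag = max(db_lags.items(), key=lambda item: item[1])
    if requested_maxlag is not None and lag > requested_maxlag:
        return {
            'error': {
                'code': 'maxlag',
                'info': 'Waiting for %s: %d seconds lagged' % (host, lag),
            }
        }
    return None

# e.g. check_maxlag({'db36': 21948, 'db52': 1, 'db53': 0}, requested_maxlag=5)
# -> error naming db36, no matter which Apache handled the request.
---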
Roan
On Tuesday, March 20, 2012, Roan Kattouw roan.kattouw@gmail.com wrote:
So yeah /normally/ you hit DB servers at random and different servers might respond differently (or be lagged to different degrees), but in this particular case it was always the same DB server returning the same lag value. Nothing strange going on here, this is how the maxlag parameter works.
How do you feel about a switch to change that behavior (maxlag - 1)? It would be nice to continue guiding developers towards throttling API requests around maxlag without complicating schema migrations by requiring config deployments before and after every DB for this reason only.
On Tue, Mar 20, 2012 at 11:35 AM, Asher Feldman afeldman@wikimedia.org wrote:
How do you feel about a switch to change that behavior (maxlag - 1)? It would be nice to continue guiding developers towards throttling API requests around maxlag without complicating schema migrations by requiring config deployments before and after every DB for this reason only.
That sounds reasonable to me, what alternative behavior do you propose? A flag that, when enabled, causes maxlag to use the 2nd highest lag instead of the highest lag?
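Conceptually, something like this (just a sketch; the flag and function names are made up, and the real change would be in MediaWiki's PHP):
---
# Sketch of the proposed behavior: with a hypothetical migration switch
# enabled, maxlag would be compared against the second-highest slave lag,
# so one slave undergoing an ALTER TABLE doesn't fail every API request.
def effective_max_lag(db_lags, migration_in_progress):
    lags = sorted(db_lags.values(), reverse=True)
    if migration_in_progress and len(lags) > 1:
        return lags[1]   # ignore the single most lagged slave
    return lags[0]
---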
Roan
On Tuesday, March 20, 2012, Roan Kattouw roan.kattouw@gmail.com wrote:
On Tue, Mar 20, 2012 at 11:35 AM, Asher Feldman afeldman@wikimedia.org wrote:
How do you feel about a switch to change that behavior (maxlag - 1)? It would be nice to continue guiding developers towards throttling API requests around maxlag without complicating schema migrations by requiring config deployments before and after every DB for this reason only.
That sounds reasonable to me, what alternative behavior do you propose? A flag that, when enabled, causes maxlag to use the 2nd highest lag instead of the highest lag?
That was my original thought. Jeremy's idea is good too, though I wonder if we could do something similar without depending on deployments.
On Tue, 20 Mar 2012 11:35:02 -0700, Asher Feldman afeldman@wikimedia.org wrote:
On Tuesday, March 20, 2012, Roan Kattouw roan.kattouw@gmail.com wrote:
So yeah /normally/ you hit DB servers at random and different servers might respond differently (or be lagged to different degrees), but in this particular case it was always the same DB server returning the same lag value. Nothing strange going on here, this is how the maxlag parameter works.
How do you feel about a switch to change that behavior (maxlag - 1)? It would be nice to continue guiding developers towards throttling API requests around maxlag without complicating schema migrations by requiring config deployments before and after every DB for this reason only.
Why don't we have a system to allow databases to be automatically taken out of the LB temporarily?
On Tue, Mar 20, 2012 at 14:35, Asher Feldman afeldman@wikimedia.org wrote:
On Tuesday, March 20, 2012, Roan Kattouw roan.kattouw@gmail.com wrote:
So yeah /normally/ you hit DB servers at random and different servers might respond differently (or be lagged to different degrees), but in this particular case it was always the same DB server returning the same lag value. Nothing strange going on here, this is how the maxlag parameter works.
How do you feel about a switch to change that behavior (maxlag - 1)? It would be nice to continue guiding developers towards throttling API requests around maxlag without complicating schema migrations by requiring config deployments before and after every DB for this reason only.
How about just having a new file in the config that lists the servers currently migrating, which the migration script can update and deploy as it goes? Then the Apaches can check whether a highly lagged server is on the expected list and ignore it if it is. Should be fairly safe as long as it's a file that no human will edit and the deploy only pushes the one file. But I don't actually know if it's possible (with the current scripts) to push just one file for the config.
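Roughly something like this (sketch only; the file name, format, and helper are invented):
---
# migrating-dbs.json (hypothetical file, deployed only by the migration script):
#   ["db36"]
import json

def lag_is_expected(host, migrating_list_path="migrating-dbs.json"):
    """Return True if this host's lag should be ignored because the
    migration script has listed it as currently being altered."""
    try:
        with open(migrating_list_path) as f:
            migrating = json.load(f)
    except IOError:
        migrating = []
    return host in migrating
---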
-Jeremy
Roan Kattouw wrote:
On Tue, Mar 20, 2012 at 7:12 AM, MZMcBride z@mzmcbride.com wrote:
Some people have suggested that the current API behavior is intentional. That is, that having different servers return the same error code is better than having different servers return an error code or not. I think this is flawed logic due to the problems that it presents (scripts unable to get past the error code), but it's definitely something that needs investigation for the future.
You're confusing Apache servers and database servers. You're hitting different Apache servers at random, those are the srv numbers. Each Apache requests the current lag from all the DB servers serving that wiki, computes the maximum (in this case db36 wins easily) and checks that against your maxlag parameter. Obviously the highest lag is greater than your maxlag setting, so it throws an error and bails.
Had it continued, or had you not set a maxlag parameter, you would have hit a DB server at random (except that it wouldn't have let you hit db36 because it was lagged to hell and back). So yeah /normally/ you hit DB servers at random and different servers might respond differently (or be lagged to different degrees), but in this particular case it was always the same DB server returning the same lag value. Nothing strange going on here, this is how the maxlag parameter works.
Thank you for this explanation. It was very helpful.
I had some follow-up questions about what to do to mitigate this issue going forward (particularly server-side), but you and others are already brainstorming in this thread, so I'll slink back to my quiet world of client-side scripting. :-)
(And thanks again to Asher for depooling the lagged server, even as a temporary fix. I was able to do about 2500 talk page deliveries this morning without any issue.)
MZMcBride