I've temporarily commented out db36 from db.php on the cluster.
This is a flaw in how the client-side use of maxlag interacts with our schema migration process: we run migrations on the slaves one by one in an automated fashion, moving to the next only after replication lag catches up. MediaWiki takes care of not sending queries to the lagged slave that is under migration. Meanwhile, maxlag always reports the value of the most lagged slave. Not a new issue, but this particular ALTER TABLE on enwiki is likely the most time-intensive ever run at WMF. It's slightly ridiculous.
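For context on the client side: a bot passes maxlag=N with each request, and if the most lagged slave exceeds N seconds, the API refuses the request with error code "maxlag" (plus a Retry-After header suggesting how long to wait) rather than serving it. A minimal sketch with the requests library; the endpoint, function name, and retry policy here are mine, for illustration only:

    import time
    import requests

    API_URL = "https://en.wikipedia.org/w/api.php"  # illustrative endpoint

    def api_get(params, maxlag=5, max_retries=10):
        """Issue an API request with maxlag, sleeping and retrying on lag errors."""
        params = dict(params, maxlag=maxlag, format="json")
        for _ in range(max_retries):
            r = requests.get(API_URL, params=params)
            data = r.json()
            if data.get("error", {}).get("code") != "maxlag":
                return data
            # On maxlag errors the API sends a suggested wait via Retry-After.
            time.sleep(int(r.headers.get("Retry-After", 5)))
        raise RuntimeError("gave up waiting for replication lag to clear")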
For this one alter, I can stop the migration script and run each statement by hand, pulling and re-adding DBs one by one along the way, but that isn't a sustainable process. Perhaps we can add a migration flag to MediaWiki which, if enabled, changes the behavior of maxlag and wfWaitForSlaves() to ignore one highly lagged slave so long as others are available without lag.
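The real change would live in MediaWiki's load balancer code (PHP), but the intended behavior is easy to sketch. Here's a hypothetical Python version of the lag calculation, where migration mode drops the single worst slave from consideration before computing the value maxlag reports; all names here are made up for illustration:

    def effective_max_lag(lags, migration_mode=False):
        """Return the lag value maxlag should report.

        lags: per-slave replication lag in seconds.
        In migration mode, discard the single most lagged slave
        (the one under schema migration), as long as at least
        one other slave remains to serve reads.
        """
        lags = sorted(lags)
        if migration_mode and len(lags) > 1:
            lags = lags[:-1]  # drop the one outlier under migration
        return lags[-1]

    # Normal mode: the slave under migration dominates.
    print(effective_max_lag([0, 1, 2, 21948]))                       # -> 21948
    # Migration mode: report the worst of the remaining slaves.
    print(effective_max_lag([0, 1, 2, 21948], migration_mode=True))  # -> 2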
-A
On Mon, Mar 19, 2012 at 9:28 PM, MZMcBride z@mzmcbride.com wrote:
MZMcBride wrote:
I'm not sure of the exact configuration, but it seems like nearly every API request is being handled by the lagged server (db36)? Or perhaps my scripts just have terrible luck.
I added some prints to the code. Different servers are responding, but they're all unable to get past the lag, apparently:
{u'servedby': u'srv234', u'error': {u'info': u'Waiting for 10.0.6.46: 21948 seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'srv242', u'error': {u'info': u'Waiting for 10.0.6.46: 21982 seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'mw20', u'error': {u'info': u'Waiting for 10.0.6.46: 21984 seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'mw45', u'error': {u'info': u'Waiting for 10.0.6.46: 21986 seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'mw14', u'error': {u'info': u'Waiting for 10.0.6.46: 21988 seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'mw42', u'error': {u'info': u'Waiting for 10.0.6.46: 21989 seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'mw3', u'error': {u'info': u'Waiting for 10.0.6.46: 21991 seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'srv230', u'error': {u'info': u'Waiting for 10.0.6.46: 22005 seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'srv259', u'error': {u'info': u'Waiting for 10.0.6.46: 22006 seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'srv274', u'error': {u'info': u'Waiting for 10.0.6.46: 22008 seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'srv280', u'error': {u'info': u'Waiting for 10.0.6.46: 22009 seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'srv236', u'error': {u'info': u'Waiting for 10.0.6.46: 22010 seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'srv230', u'error': {u'info': u'Waiting for 10.0.6.46: 22011 seconds lagged', u'code': u'maxlag'}}
And it goes on and on.
The relevant branch of code is:
    def __parseJSON(self, data):
        # Uses the json, re, and time modules; APIResult and APIListResult
        # are wrapper classes defined elsewhere in this client library.
        maxlag = True
        while maxlag:
            try:
                maxlag = False
                parsed = json.loads(data.read())
                content = None
                if isinstance(parsed, dict):
                    content = APIResult(parsed)
                    content.response = self.response.items()
                elif isinstance(parsed, list):
                    content = APIListResult(parsed)
                    content.response = self.response.items()
                else:
                    content = parsed
                if 'error' in content:
                    error = content['error']['code']
                    if error == "maxlag":
                        lagtime = int(re.search(r"(\d+) seconds", content['error']['info']).group(1))
                        if lagtime > self.wiki.maxwaittime:
                            lagtime = self.wiki.maxwaittime
                        print("Server lag, sleeping for "+str(lagtime)+" seconds")
                        maxlag = True
                        time.sleep(int(lagtime)+0.5)
            except ValueError:
                # Assumed handler: the response wasn't valid JSON, so give up.
                pass
        return False
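One sharp edge in that loop: each individual sleep is capped at self.wiki.maxwaittime, but there's no bound on the number of retries, so against a slave that's ~22000 seconds lagged it simply spins for hours. A sketch of bounding the total wait instead; wait_for_lag and MaxLagTimeout are hypothetical names, not part of any existing library:

    import time

    class MaxLagTimeout(Exception):
        """Raised when replication lag does not clear within the deadline."""

    def wait_for_lag(check_lag, max_sleep=120, deadline=300):
        """Poll check_lag() until it returns None (no maxlag error).

        check_lag: callable returning the reported lag in seconds, or None.
        max_sleep: cap on any single sleep (like self.wiki.maxwaittime).
        deadline: cap on the total time spent waiting, in seconds.
        """
        start = time.time()
        lag = check_lag()
        while lag is not None:
            if time.time() - start > deadline:
                raise MaxLagTimeout("lag still %d seconds after %ds" % (lag, deadline))
            time.sleep(min(lag, max_sleep))
            lag = check_lag()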
MZMcBride