I've temporarily commented out db36 from db.php on the cluster.

This is a flaw in how the client-side use of maxlag interacts with our
schema migration process: we run migrations on slaves one by one in an
automated fashion, only moving on to the next after replication lag catches
up. MediaWiki takes care of not sending queries to the lagged slave that
is under migration, but maxlag always reports the value of the most
lagged slave. Not a new issue, but this particular ALTER TABLE on enwiki
is likely the most time-intensive ever run at WMF. It's slightly
ridiculous.
For this one alter, I can stop the migration script and run each statement
by hand, pulling and re-adding DBs one by one along the way, but that
isn't a sustainable process. Perhaps we could add a migration flag to
MediaWiki which, if enabled, changes the behavior of maxlag and
wfWaitForSlaves() to ignore one highly lagged slave so long as other
slaves are available without lag.
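Purely as a sketch of the idea (the real logic lives in PHP, in the load
balancer and wfWaitForSlaves(); the names and the 5-second "healthy"
threshold below are made up for illustration), the flag would change the
reported lag roughly like this:

def effective_max_lag(lag_by_slave, migration_mode=False, healthy_threshold=5):
    # lag_by_slave maps slave name -> seconds of replication lag;
    # migration_mode is the proposed flag.
    lags = sorted(lag_by_slave.values())
    if not migration_mode or len(lags) < 2:
        return lags[-1]            # today: the single worst slave defines maxlag
    if lags[-2] <= healthy_threshold:
        return lags[-2]            # ignore one outlier if the rest are caught up
    return lags[-1]                # more than one lagged slave: behave as before

# With db36 six hours behind but every other slave caught up:
print(effective_max_lag({"db36": 22011, "db38": 1, "db42": 0}))                       # 22011
print(effective_max_lag({"db36": 22011, "db38": 1, "db42": 0}, migration_mode=True))  # 1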
-A
On Mon, Mar 19, 2012 at 9:28 PM, MZMcBride <z(a)mzmcbride.com> wrote:
MZMcBride wrote:
I'm not sure of the exact configuration, but it seems like nearly every
API request is being handled by the lagged server (db36)? Or perhaps my
scripts just have terrible luck.
I added some prints to the code. Different servers are responding, but
they're all unable to get past the lag, apparently:
{u'servedby': u'srv234', u'error': {u'info': u'Waiting for 10.0.6.46: 21948 seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'srv242', u'error': {u'info': u'Waiting for 10.0.6.46: 21982 seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'mw20', u'error': {u'info': u'Waiting for 10.0.6.46: 21984 seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'mw45', u'error': {u'info': u'Waiting for 10.0.6.46: 21986 seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'mw14', u'error': {u'info': u'Waiting for 10.0.6.46: 21988 seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'mw42', u'error': {u'info': u'Waiting for 10.0.6.46: 21989 seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'mw3', u'error': {u'info': u'Waiting for 10.0.6.46: 21991 seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'srv230', u'error': {u'info': u'Waiting for 10.0.6.46: 22005 seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'srv259', u'error': {u'info': u'Waiting for 10.0.6.46: 22006 seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'srv274', u'error': {u'info': u'Waiting for 10.0.6.46: 22008 seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'srv280', u'error': {u'info': u'Waiting for 10.0.6.46: 22009 seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'srv236', u'error': {u'info': u'Waiting for 10.0.6.46: 22010 seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'srv230', u'error': {u'info': u'Waiting for 10.0.6.46: 22011 seconds lagged', u'code': u'maxlag'}}
And it goes on and on.
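(For reference, any request that passes the maxlag parameter gets the same
refusal while the reported lag exceeds it. A minimal way to see the same
reply, as a sketch using the public api.php endpoint and the requests
library rather than my script:

import requests

reply = requests.get("https://en.wikipedia.org/w/api.php", params={
    "action": "query",
    "meta": "siteinfo",
    "format": "json",
    "maxlag": 5,
}).json()

if reply.get("error", {}).get("code") == "maxlag":
    # e.g. "Waiting for 10.0.6.46: 21948 seconds lagged"
    print(reply.get("servedby"), reply["error"]["info"])
else:
    print("No maxlag error; the slaves have caught up.")
)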
The relevant branch of code is:
---
def __parseJSON(self, data):
    maxlag = True
    while maxlag:
        try:
            maxlag = False
            parsed = json.loads(data.read())
            content = None
            if isinstance(parsed, dict):
                content = APIResult(parsed)
                content.response = self.response.items()
            elif isinstance(parsed, list):
                content = APIListResult(parsed)
                content.response = self.response.items()
            else:
                content = parsed
            if 'error' in content:
                error = content['error']['code']
                if error == "maxlag":
                    # Extract the lag from e.g.
                    # "Waiting for 10.0.6.46: 21948 seconds lagged"
                    lagtime = int(re.search(r"(\d+) seconds",
                                            content['error']['info']).group(1))
                    # Cap the sleep at the configured maximum wait time
                    if lagtime > self.wiki.maxwaittime:
                        lagtime = self.wiki.maxwaittime
                    print("Server lag, sleeping for "+str(lagtime)+" seconds")
                    maxlag = True
                    time.sleep(int(lagtime)+0.5)
                    return False
        except:  # exception handling trimmed from this excerpt
            raise
---
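One thing that loop will happily do is keep sleeping behind a six-hour
ALTER forever. Just as a sketch of a client-side mitigation (not part of
wikitools as quoted; make_request is a hypothetical stand-in for
re-issuing the API call), the total time spent waiting could be capped:

import re
import time

def query_with_maxlag_budget(make_request, maxwaittime=120, budget=600):
    """Retry on maxlag errors, but give up after `budget` seconds of waiting."""
    waited = 0.0
    while True:
        reply = make_request()  # re-issues the API call, returns parsed JSON
        error = reply.get("error", {}) if isinstance(reply, dict) else {}
        if error.get("code") != "maxlag":
            return reply
        lagtime = int(re.search(r"(\d+) seconds", error["info"]).group(1))
        sleep_for = min(lagtime, maxwaittime)
        if waited + sleep_for > budget:
            raise RuntimeError("Gave up waiting out maxlag: " + error["info"])
        print("Server lag, sleeping for " + str(sleep_for) + " seconds")
        time.sleep(sleep_for + 0.5)
        waited += sleep_for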
MZMcBride
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l