Hi all
Today, we had another replag problem, caused by Aka's user stats script. Replag kept rising even after the script was stopped (not killed). Apparently, the reason lies with the transaction mode used: a long running query may lock (parts of) the tables it uses, denying any updates the replication daemon wants to make.
As far as I know, this should not happen if the long running query uses READ UNCOMMITTED (dirty read). Using it should not cause serious problems in our context, save a bit of resources and allow replication to continue while reading from tables. Details on transaction isolation levels can be found here: http://dev.mysql.com/doc/refman/5.0/en/innodb-transaction-isolation.html
DaB. has set the global default isolation level to READ UNCOMMITTED for now. If you are experiencing problems or need data that is guarantied to be consistent under all circumstances, *and* your query does not take long (no more than, say, a minute or so), you can change the isolation level like this:
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
REPEATABLE READ is MySQL's "normal" default, i.e. what you have been using until now if you did not set the isolation level explicitly. There are other options, look at the documentation link above.
I hope this actually does solve the problem, and does not cause you too much trouble.
Regards -- Daniel
Hi,
Today, we had another replag problem, caused by Aka's user stats script. Replag kept rising even after the script was stopped (not killed). Apparently, the reason lies with the transaction mode used: a long running query may lock (parts of) the tables it uses, denying any updates the replication daemon wants to make.
the script already starts with
"set session transaction isolation level read uncommitted"
as told by River on this list a few months ago. Beside this, the last time it run it did not affect the replication lag. So, I'm not sure that this script caused these problems.
-- Aka
Hi again,
the script already starts with
"set session transaction isolation level read uncommitted"
as told by River on this list a few months ago. Beside this, the last time it run it did not affect the replication lag. So, I'm not sure that this script caused these problems.
.. and the script pauses, if the replication lag raises above a few seconds, just to make sure it gets not killed again ... During these pauses it sleeps and there are no pending database queries.
How can it affect the replication lag then? Please find out the real source of these problems and do not kill random processes.
-- Aka
How can it affect the replication lag then? Please find out the real source of these problems and do not kill random processes.
It was definitely the cause. it was the only thing that was started when replag started to rise, and when it was killed, replag dropped instantly (it did not drop while the script was paused). it *was* your script. I'm not sure why, but it was.
I asked DaB to save a copy of SHOW FULL PROCESSLIST next time something like this happens, so we can investigate more closely. Right now I don't have enough information to tell you what precisely happened.
-- Daniel
Hi,
How can it affect the replication lag then? Please find out the real source of these problems and do not kill random processes.
It was definitely the cause. it was the only thing that was started when replag started to rise, and when it was killed, replag dropped instantly (it did not drop while the script was paused). it *was* your script. I'm not sure why, but it was.
I asked DaB to save a copy of SHOW FULL PROCESSLIST next time something like this happens, so we can investigate more closely. Right now I don't have enough information to tell you what precisely happened.
I have no idea how this can happen and how a script can affect the replication lag while it does nothing than sleeping. But I will not start it again - then there are just no statistic updates anymore.
BTW, the lag raises very often even if this scripts does'nt run ..
-- Aka
I have no idea how this can happen and how a script can affect the replication lag while it does nothing than sleeping.
By holding a lock. I don't know if this is the case here, but it would be an explanation. Are you using explicit transactions?
But I will not start it again - then there are just no statistic updates anymore.
Please tell me what queries exactly the script performs. I'm pretty sure this could be done more efficiently somehow - for example, by using at the recentchanges table instead of revisions.
BTW, the lag raises very often even if this scripts does'nt run ..
Every time the replag rises like this, we find a script that's causing it. It's not always yours, sure, but it has been before.
Don't think we just want to blame you - your efforts are appreciated. The toolserver is just at its limits right now (we *still* don't have the fast disk online - this is really annoying). We have to cope somehow.
Regards Daniel
Daniel Kinzler wrote:
I have no idea how this can happen and how a script can affect the replication lag while it does nothing than sleeping.
By holding a lock. I don't know if this is the case here, but it would be an explanation. Are you using explicit transactions?
But I will not start it again - then there are just no statistic updates anymore.
Please tell me what queries exactly the script performs. I'm pretty sure this could be done more efficiently somehow - for example, by using at the recentchanges table instead of revisions.
BTW, the lag raises very often even if this scripts does'nt run ..
Every time the replag rises like this, we find a script that's causing it. It's not always yours, sure, but it has been before.
Don't think we just want to blame you - your efforts are appreciated. The toolserver is just at its limits right now (we *still* don't have the fast disk online - this is really annoying). We have to cope somehow.
Regards Daniel
Sorry to interrupt here, but I'm only getting one side of these emails (those from Daniel); I've checked that it's not a filtering problem on my end, so I'm a bit confused. Wondering if anyone else is having this problem?
Essjay
Hello, Am Dienstag, den 18.07.2006, 15:21 -0400 schrieb Essjay:
Sorry to interrupt here, but I'm only getting one side of these emails (those from Daniel); I've checked that it's not a filtering problem on my end, so I'm a bit confused. Wondering if anyone else is having this problem?
No, no problem here.
Essjay
Sincerly, DaB.
Sorry to interrupt here, but I'm only getting one side of these emails (those from Daniel); I've checked that it's not a filtering problem on my end, so I'm a bit confused. Wondering if anyone else is having this problem?
Essjay
I am also only receiving one side of the conversation -- only the messages from Daniel.
~mdd4696
Sorry to interrupt here, but I'm only getting one side of these emails (those from Daniel); I've checked that it's not a filtering problem on my end, so I'm a bit confused. Wondering if anyone else is having this problem?
Essjay
I am also only receiving one side of the conversation -- only the messages from Daniel.
~mdd4696
Me too, only from Daniel.
Platonides schrieb:
Sorry to interrupt here, but I'm only getting one side of these emails (those from Daniel); I've checked that it's not a filtering problem on my end, so I'm a bit confused. Wondering if anyone else is having this problem?
Essjay
I am also only receiving one side of the conversation -- only the messages from Daniel.
Recieving all emails. Sure you checked your spam directory (maybe a special char in topic is reason for alarming your junk filter)
Regards, Marco
Was it. Gmail caught as spam, 4 Aka messages (it'll show as only an item). Now i've recovered them :) Essjay also uses gmail so i guess it was his problem too.
I don't see any 'special char' or anything wrong in the emails. And they're simple plain text. :s Strange.
Recieving all emails. Sure you checked your spam directory (maybe a special char in topic is reason for alarming your junk filter)
Platonides schrieb:
Was it. Gmail caught as spam, 4 Aka messages (it'll show as only an item). Now i've recovered them :) Essjay also uses gmail so i guess it was his problem too.
I don't see any 'special char' or anything wrong in the emails. And they're simple plain text. :s Strange.
Hm, I got (in mail topic) a black square between setto and READUNCOMMITED. It is a tabulator, maybe email clients filter it out.
Greets, Marco
Hi,
By holding a lock. I don't know if this is the case here, but it would be an explanation. Are you using explicit transactions?
No.
But I will not start it again - then there are just no statistic updates anymore.
Please tell me what queries exactly the script performs. I'm pretty sure this could be done more efficiently somehow - for example, by using at the recentchanges table instead of revisions.
No, the recentchances does not help. For the statements, please see "~aka/Cronjobs/UpdateDEUserStat.pl". The long running function is called "GetUserCounts". Its pretty straight forward.
I already played with caches and so on, but this introduces other problems. So, I have no idea how I can still further improve the script without loosing functionality.
If you have an idea, please drop me a (private) mail, and you can write in German then ;)
Regards, Aka
On 7/18/06, Daniel Kinzler daniel@brightbyte.de wrote:
As far as I know, this should not happen if the long running query uses READ UNCOMMITTED (dirty read). Using it should not cause serious problems in our context, save a bit of resources and allow replication to continue while reading from tables. Details on transaction isolation levels can be found here: http://dev.mysql.com/doc/refman/5.0/en/innodb-transaction-isolation.html
DaB. has set the global default isolation level to READ UNCOMMITTED for now. If you are experiencing problems or need data that is guarantied to be consistent under all circumstances, *and* your query does not take long (no more than, say, a minute or so), you can change the isolation level like this:
I already run my queries that can run safely with dirty reads, with dirty reads. Yet DaB has still killed one of my queries, and even accused my query of impacting replag when I wasn't even running one.
The replag paranoia is out of hand...
There is no reason that a select should be blocking other reads no matter what the isolation level is.... Mysql had that sort of behavior with ISAM tables, but it should not exist with innodb..
If this is the case, can we *please* ask mysql to fix their software?
The replag paranoia is out of hand...
It would help to have more than one person looking at it.
There is no reason that a select should be blocking other reads no matter what the isolation level is.... Mysql had that sort of behavior with ISAM tables, but it should not exist with innodb..
If this is the case, can we *please* ask mysql to fix their software?
If we find the exact problem, sure. At the time, DaB's replication tools where getting transaction timeouts while trying to perform updates. So *something* was locking the tables. It would be great if anyone could tell use what except long running selects could cause this (considering no one else can write to the db).
If setting the default isolation level to dirty *really* does not help, or if it causes *serous* problems to anyone, we should undo it. OTOH, if you rely on a specific transactional behavior, you should probably request it explicitly anyway.
Regards, Daniel
PS: the subject line is so long that it got wrapped into a new header line, and a tab was inserted at the start of the wrapped part. This is compliant with the mail RFC, afaik. But apparently, some clients or MTA's choke on it.
On 7/19/06, Daniel Kinzler daniel@brightbyte.de wrote:
The replag paranoia is out of hand...
It would help to have more than one person looking at it.
There is no reason that a select should be blocking other reads no matter what the isolation level is.... Mysql had that sort of behavior with ISAM tables, but it should not exist with innodb..
If this is the case, can we *please* ask mysql to fix their software?
If we find the exact problem, sure.
[snip]
We know whats happening.. the system is IO starved and has been for a long time. The system sits and thrashes on the disk while only doing <6mbytes/sec of work. There might be additional complications, but it is hard to troubleshoot something that is burried under another problem.
Weither this is a pure hardware issue (lack of IO capacity) or something that can be partially adressed through tunables or checking for FS fragmentation (have updates to page and *links fragmented the disk? I'm not sure. Does mysql support tablespaces? How does inno store updates? Does solaris behave smartly with interleaved appends to multiple files?) is not something I'm sure of..
It is clear that replication playback even with no users on the system is just able to catch up.. in other words, it'sunacceptably slow... so why all the cycles wasted on user oriented solutions?
As far as read uncommited goes, any time you join one of the updated tables (page or *links) to another table on a live, you'll risk getting bogus results because of the referenced rows on the other side may not yet exist.
In any case, MySQL AB claims that with innodb tables *readers never block writers*. If thats not true it is a pretty serious problem.
Hello, Am Mittwoch, den 19.07.2006, 21:14 +0200 schrieb Daniel Kinzler:
The replag paranoia is out of hand...
It would help to have more than one person looking at it.
I must agree with daniel in this point. There are 4 people (if I calc right) with root-rights on zedler at the moment, but I'm the only one, who is realy in the chat at the moment. And I guess, that I'm the person with the lowest experience. So please come and help us/me :)
@river: If hemlock would be avaible, the huge amount of replag reasons (scripts which torture the discs) could be eliminate. I had said it and say it again: I will help you in any way, you must just say, how (I would even speak with brion, if you see a access-problem).
If we find the exact problem, sure. At the time, DaB's replication tools where getting transaction timeouts while trying to perform updates.
It was not my sync-programm, it was the mysql-one. You should see, how critical that is, because this tool is not programm by an amateur like me, but from the profis by mysql.
I'm very unhappy if I annoy somebody, because I kill his/her mysql-query or his/her programm. I'm sorry about this.
Sincerly, DaB.
Daniel Kinzler:
DaB. has set the global default isolation level to READ UNCOMMITTED
i disagree with this change. Aka has said that he already runs his queries--which were the justification for this change--with READ UNCOMMITTED isolation level, so this change will not help.
by default, we should always provide correct results to users, unless they request otherwise.
- river.
I do too. Fix the goddamned hardware instead of finding ways to make the box even more unusable for its intended purpose.
Kelly
On 7/19/06, River Tarnell river@attenuate.org wrote:
Daniel Kinzler:
DaB. has set the global default isolation level to READ UNCOMMITTED
i disagree with this change. Aka has said that he already runs his queries--which were the justification for this change--with READ UNCOMMITTED isolation level, so this change will not help.
by default, we should always provide correct results to users, unless they request otherwise.
- river.
-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.1 (GNU/Linux)
iD8DBQFEvmNprCkrk04PPEoRAj1XAJ9ql8uov0hVZTDMG6U20zXtBNelxgCcCZrE FmdaJXNpOQRfdoT7IG32GM4= =3ZV2 -----END PGP SIGNATURE-----
Toolserver-l mailing list Toolserver-l@Wikipedia.org http://mail.wikipedia.org/mailman/listinfo/toolserver-l
toolserver-l@lists.wikimedia.org