I notice the dumps currently seem to be frozen. Is this the best place to ask for information, or is it publicly available somewhere else? (In which case, sorry for pestering.)
Conrad
Conrad Irwin wrote:
> I notice the dumps seem currently frozen, is this the best place to ask for information, or is it publicly available somewhere else? (in which case sorry for pestering).
> Conrad
Thanks Conrad,

I just got into the office and will be taking a look today to see why the dumps halted.

As for the right list ... I had originally imagined:

Xmldatadumps-l@ - general discussion list
Xmldatadumps-admin-l@ - operations-related issues

How does everyone think that divide has been working? The list was originally split from wikitech to remove the extra noise that non-XML posts provided. Then I further split it into ops and general tech.

Let me know how well you think that has worked.

--tomasz
Tomasz Finc wrote:
> I just got into the office and will be taking a look today to see why the dumps halted.
All worker threads were stuck on the write() system call, as NFS had started to flap around the time of our outage. The storage node itself is showing WARNING tracebacks from the kernel.

I've mailed the ops crew to start digging into this, and hopefully our new dataset1 node can be cleared for production use so that we don't have to worry about this anymore.

And ... I've kicked all the old threads to stop, since they weren't going to do anything useful. We're now seeing work go through the system :D

--tomasz
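For anyone curious how this kind of hang shows up in practice, here is a rough diagnostic sketch (my own, not from the thread), assuming the workers run on Linux with procps: threads blocked in write() against a flapping NFS mount end up in uninterruptible sleep (state "D"), which you can spot from `ps` and `/proc`.

```shell
#!/bin/sh
# List processes in uninterruptible sleep (STAT contains "D"), the classic
# symptom of a thread stuck in write() against an unresponsive NFS mount.
ps -eo pid,stat,wchan:32,comm | awk 'NR == 1 || $2 ~ /D/ {print}'

# For a specific worker PID (placeholder), /proc shows where it is blocked:
#   cat /proc/<pid>/wchan    # kernel function the process sleeps in
#   cat /proc/<pid>/stack    # full kernel stack (needs root)
```

If the WCHAN column shows NFS- or RPC-related functions for the stuck workers, the storage path is the bottleneck rather than the dump code itself.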
And I made sure to kick off replacement runs for the jobs that were stuck.
--tomasz
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
On 28/04/10 03:36, Tomasz Finc wrote:
> And I made sure to kick off replacement runs for the jobs that were stuck.
> --tomasz
Tomasz,

Since enwiki is such a special case in terms of size, the length of time it takes to dump, and the resulting difficulty of achieving complete dumps, would it be possible to give it special treatment, both in how often its dumps are run and in how many older dumps remain online at any given time?

Neil
Neil Harris wrote:
> Since enwiki is such a special case in terms of size and the length of time it takes to dump [...] would it be possible to give it special treatment, both in terms of how often dumps are run, and how many older dumps remain online at any given time?
Having special cases with our limited staff is really tough, so instead of making exceptions I'd rather commit to releasing at least once a month for each project language.

Rather than retaining en for a longer period ... let's actually fix it so that we don't have to treat it as a special case.

I'm going to bring up another worker box today to move through the backlog even faster.

--tomasz
Tomasz Finc wrote:
> All worker threads were stuck on the write() system call, as NFS had started to flap around the time of our outage.
The dumps run on a pool of worker servers that write to storage over NFS? It may be more efficient to save locally and transfer asynchronously to the storage node.
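A minimal sketch of that idea, purely illustrative: the host name, paths, and dump command below are placeholders, not anything from the actual infrastructure. The point is that the dump job only ever write()s to local disk, so a storage outage can delay the transfer but can never wedge the worker itself.

```shell
#!/bin/sh
# Sketch only: paths, the storage target, and the dump command are placeholders.
LOCAL_DIR=/var/tmp/dumps
REMOTE="dataset1:/data/xmldatadumps"    # hypothetical storage-node target
DUMP="enwiki-pages-articles.xml.bz2"

# Stand-in for the real dump worker; replace with the actual command.
run_dump_job() { echo "<mediawiki/>" | bzip2; }

mkdir -p "$LOCAL_DIR"

# 1. Write only to fast local disk, so flapping storage cannot block write().
run_dump_job > "$LOCAL_DIR/$DUMP"

# 2. Push to the storage node asynchronously; --partial lets an
#    interrupted copy resume, --timeout bounds how long a dead
#    connection can stall the transfer process.
rsync --partial --timeout=300 "$LOCAL_DIR/$DUMP" "$REMOTE/" &
```

The trade-off is that the workers need enough local scratch space for the largest dump in flight, and a separate retry/cleanup step for transfers that fail.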
Platonides wrote:
> The dumps run on a pool of worker servers that write to storage over NFS? It may be more efficient to save locally and transfer asynchronously to the storage node.
It turns out that NFS is likely not the root cause of the issue. We've been debugging it in bug #23264 as we make progress.
-tomasz
Tomasz Finc wrote:
> It turns out that NFS is likely not the root cause of the issue.

That was a general reflection about improving it, not about the causes of this failure.

> We've been debugging it in bug #23264 as we make progress.

I'm subscribed to it; however, I fail to see from there what the core issue was.