I notice the dumps currently seem to be frozen. Is this the best place to ask for information, or is it publicly available somewhere else? (In which case, sorry for pestering.)
Conrad
Conrad Irwin wrote:
> I notice the dumps seem currently frozen, is this the best place to ask for information, or is it publicly available somewhere else? (in which case sorry for pestering).
> Conrad
Thanks Conrad,

I just got into the office and will be taking a look today to see why the dumps halted.

As for the right list ... I had originally imagined:

Xmldatadumps-l@ - general discussion list
Xmldatadumps-admin-l@ - operations-related issues

How does everyone think that divide has been working? The list was originally split from wikitech to remove the extra noise that non-XML posts provided. Then I further split it into ops and general tech.

Let me know how well you think that has worked.

--tomasz
Tomasz Finc wrote:
> I just got into the office and will be taking a look today to see why the dumps halted.
All worker threads were stuck on the write() system call, as NFS had started to flap around the time of our outage. The storage node itself is showing WARNING tracebacks from the kernel.

I've mailed the ops crew to start digging into this, and hopefully our new dataset1 node can be cleared for production use so that we don't have to worry about this anymore.

And ... I've kicked all the old threads to stop, since they weren't going to do anything useful. We're now seeing work go through the system :D

--tomasz
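For anyone curious how this kind of hang shows up in practice, here is a rough diagnostic sketch (my own, not from the thread), assuming the workers run on Linux with procps: threads blocked in write() against a flapping NFS mount end up in uninterruptible sleep (state "D"), which you can spot from `ps` and `/proc`.

```shell
#!/bin/sh
# List processes in uninterruptible sleep (STAT contains "D"), the classic
# symptom of a thread stuck in write() against an unresponsive NFS mount.
ps -eo pid,stat,wchan:32,comm | awk 'NR == 1 || $2 ~ /D/ {print}'

# For a specific worker PID (placeholder), /proc shows where it is blocked:
#   cat /proc/<pid>/wchan    # kernel function the process sleeps in
#   cat /proc/<pid>/stack    # full kernel stack (needs root)
```

If the WCHAN column shows NFS- or RPC-related functions for the stuck workers, the storage path is the bottleneck rather than the dump code itself.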
And I made sure to kick off replacement runs for the jobs that were stuck.
--tomasz
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
On 28/04/10 03:36, Tomasz Finc wrote:
> And I made sure to kick off replacement runs for the jobs that were stuck.
> --tomasz
Tomasz,

Since enwiki is such a special case in terms of size, the length of time it takes to dump, and the resulting difficulty of achieving complete dumps, would it be possible to give it special treatment, both in how often its dumps are run and in how many older dumps remain online at any given time?

Neil
Neil Harris wrote:
> Since enwiki is such a special case in terms of size and the length of time it takes to dump [...] would it be possible to give it special treatment, both in terms of how often dumps are run, and how many older dumps remain online at any given time?
Having special cases with our limited staff is really tough, so instead of making exceptions I'd rather commit to releasing at least once a month for each project language.

Rather than retaining en for a longer period ... let's actually fix it so that we don't have to treat it as a special case.

I'm going to bring up another worker box today to move through the backlog even faster.

--tomasz
Tomasz Finc wrote:
> All worker threads were stuck on the write() system call, as NFS had started to flap around the time of our outage.
The dumps run on a pool of worker servers that write to storage over NFS? It may be more efficient to save locally and transfer asynchronously to the storage node.
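A minimal sketch of that idea, purely illustrative: the host name, paths, and dump command below are placeholders, not anything from the actual infrastructure. The point is that the dump job only ever write()s to local disk, so a storage outage can delay the transfer but can never wedge the worker itself.

```shell
#!/bin/sh
# Sketch only: paths, the storage target, and the dump command are placeholders.
LOCAL_DIR=/var/tmp/dumps
REMOTE="dataset1:/data/xmldatadumps"    # hypothetical storage-node target
DUMP="enwiki-pages-articles.xml.bz2"

# Stand-in for the real dump worker; replace with the actual command.
run_dump_job() { echo "<mediawiki/>" | bzip2; }

mkdir -p "$LOCAL_DIR"

# 1. Write only to fast local disk, so flapping storage cannot block write().
run_dump_job > "$LOCAL_DIR/$DUMP"

# 2. Push to the storage node asynchronously; --partial lets an
#    interrupted copy resume, --timeout bounds how long a dead
#    connection can stall the transfer process.
rsync --partial --timeout=300 "$LOCAL_DIR/$DUMP" "$REMOTE/" &
```

The trade-off is that the workers need enough local scratch space for the largest dump in flight, and a separate retry/cleanup step for transfers that fail.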
Platonides wrote:
> The dumps run on a pool of worker servers that write to storage over NFS? It may be more efficient to save locally and transfer asynchronously to the storage node.
It turns out that NFS is likely not the root cause of the issue. We've been debugging it in bug #23264 as we make progress.
-tomasz
Tomasz Finc wrote:
> It turns out that NFS is likely not the root cause of the issue.

That was a general reflection about improving it, not about the causes of this failure.

> We've been debugging it in bug #23264 as we make progress.

I'm subscribed to it; however, I fail to see from there what the core issue was.