Hello admins and hostmasters,
download.wikimedia.org/backup-index.html says: "Dumps are currently halted pending resolution of disk space issues. Hopefully will be resolved shortly."
Meanwhile, some weeks have passed and the German dump is six weeks old. May we still stay hopeful?
Thank you!
jo
And now another 3 weeks.
The en.wikt has not seen a dump since 13 June.
What does it take?
Robert
On Wed, Sep 3, 2008 at 4:25 PM, Jochen Magnus j@chenmagnus.de wrote:
Hello admins and hostmasters,
download.wikimedia.org/backup-index.html says: "Dumps are currently halted pending resolution of disk space issues. Hopefully will be resolved shortly."
Meanwhile, some weeks have passed and the German dump is six weeks old. May we still stay hopeful?
Thank you!
jo
Robert Ullmann wrote:
And now another 3 weeks.
The en.wikt has not seen a dump since 13 June.
What does it take?
We ended up with an incompatible disk array for the new dumps server; replacement delivery ETA is September 29.
- -- brion
On Mon, Sep 22, 2008 at 6:59 PM, Brion Vibber brion@wikimedia.org wrote:
We ended up with an incompatible disk array for the new dumps server; replacement delivery ETA is September 29.
Thanks for the info. In the meantime, would it be possible just to produce pages-articles.xml.bz2 files without the history part, saving enough disk space for this task to be run?
There is a huge number of projects that rely on at least a sporadic dump process.
Mathias
Mathias Schindler wrote:
On Mon, Sep 22, 2008 at 6:59 PM, Brion Vibber brion@wikimedia.org wrote:
We ended up with an incompatible disk array for the new dumps server; replacement delivery ETA is September 29.
Thanks for the info. In the meantime, would it be possible just to produce pages-articles.xml.bz2 files without the history part, saving enough disk space for this task to be run?
Well, there's no *meantime* left -- I'll just start them all today.
- -- brion
On Mon, Oct 6, 2008 at 6:39 PM, Brion Vibber brion@wikimedia.org wrote:
Well, there's no *meantime* left -- I'll just start them all today.
Even better, thanks a million.
Mathias
Hi,
That is excellent.
However, it does not solve the longstanding problem of having current pages dumps and all-history dumps in the same queue. The current pages dump for a small project that takes a few minutes is thus queued behind history dumps for large projects that take weeks.
It is essential that the history dumps be in a separate queue, or that threads are reserved for smaller projects.
best, Robert
FYI, for anyone interested (though I suspect anyone on the en.wikt already knows this): there are daily XML dumps for the en.wikt available at http://devtionary.info/w/dump/xmlu/ ... These are done by applying incremental revisions to the previous dump (i.e. not by magic ;-)
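[For context, here is a rough sketch of what such an incremental update could look like, assuming the MediaWiki API is used to find the pages changed since the previous dump. It is illustrative only, not the actual devtionary tooling:]

    # Illustrative sketch only -- not the real devtionary script.  The idea:
    # ask the MediaWiki API which pages changed since the previous dump, then
    # splice fresh text for just those pages over the copies in yesterday's XML.
    import json
    from urllib.parse import urlencode
    from urllib.request import urlopen

    API = 'https://en.wiktionary.org/w/api.php'

    def changed_titles(since):
        """Titles edited since `since` (an ISO timestamp); continuation omitted."""
        params = {'action': 'query', 'list': 'recentchanges',
                  'rcend': since, 'rclimit': 'max', 'format': 'json'}
        data = json.load(urlopen(API + '?' + urlencode(params)))
        return {rc['title'] for rc in data['query']['recentchanges']}

    # A real incremental dumper would then re-export only these titles
    # (e.g. via Special:Export) and replace their entries in the previous dump.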
On Mon, Oct 6, 2008 at 7:39 PM, Brion Vibber brion@wikimedia.org wrote:
Mathias Schindler wrote:
On Mon, Sep 22, 2008 at 6:59 PM, Brion Vibber brion@wikimedia.org wrote:
We ended up with an incompatible disk array for the new dumps server; replacement delivery ETA is September 29.
Thanks for the info. In the meantime, would it be possible just to produce pages-articles.xml.bz2 files without the history part, saving enough disk space for this task to be run?
Well, there's no *meantime* left -- I'll just start them all today.
- -- brion
Robert Ullmann wrote:
Hi,
That is excellent.
However, it does not solve the longstanding problem of having current pages dumps and all-history dumps in the same queue. The current pages dump for a small project that takes a few minutes is thus queued behind history dumps for large projects that take weeks.
It is essential that the history dumps be in a separate queue, or that threads are reserved for smaller projects.
Should be pretty easy to set up a small-projects-only thread.
Bigger changes to the dumps generation should come in the next couple months to make it quicker and more reliable...
- -- brion
On Thu, Oct 9, 2008 at 9:25 PM, Brion Vibber brion@wikimedia.org wrote:
Should be pretty easy to set up a small-projects-only thread.
Bigger changes to the dumps generation should come in the next couple months to make it quicker and more reliable...
If we look at the process right now, there are two threads: one is doing enwiki, the other hewiki. The enwiki thread isn't even to pages-articles yet, and will run for weeks. The hewiki dump will complete in a day or so, but then next on deck is dewiki, which takes at least a week.
So with things running now, it will be a week or two before any other projects get anything.
As you say, it would be easy to make threads limited to smaller projects; I'd suggest adding an option (-small or something) that just has the code skip [en, de, zh, he ...]wiki when looking for the least-recently completed task. The code list should be 10-15 of the biggest 'pedias, and possibly commons. Then start two small threads, and everything should go well?
best, Robert
If we look at the process right now, there are two threads: one is doing enwiki, the other hewiki. The enwiki thread isn't even to pages-articles yet, and will run for weeks. The hewiki dump will complete in a day or so, but then next on deck is dewiki, which takes at least a week.
So with things running now, it will be a week or two before any other projects get anything.
As you say, it would be easy to make threads limited to smaller projects; I'd suggest adding an option (-small or something) that just has the code skip [en, de, zh, he ...]wiki when looking for the least-recently completed task. The code list should be 10-15 of the biggest 'pedias, and possibly commons. Then start two small threads, and everything should go well?
Is it necessary to skip the 10 biggest? I think skipping just the top 3 would be a massive help.
There are more large wikis than you might think. If you "skip" only the three largest, there will still be a serious queuing problem. Better to have 2-3 threads unrestricted, with the understanding that 95% of the time one of those will be on enwiki, and 1-2 doing all the other projects.
In any case, as it is right now, don't expect anything for a week or two ...
Robert
On Fri, Oct 10, 2008 at 8:03 PM, Thomas Dalton thomas.dalton@gmail.com wrote:
If we look at the process right now, there are two threads: one is doing enwiki, the other hewiki. The enwiki thread isn't even to pages-articles yet, and will run for weeks. The hewiki dump will complete in a day or so, but then next on deck is dewiki, which takes at least a week.
So with things running now, it will be a week or two before any other projects get anything.
As you say, it would be easy to make threads limited to smaller projects; I'd suggest adding an option (-small or something) that just has the code skip [en, de, zh, he ...]wiki when looking for the least-recently completed task. The code list should be 10-15 of the biggest 'pedias, and possibly commons. Then start two small threads, and everything should go well?
Is it necessary to skip the 10 biggest? I think skipping just the top 3 would be a massive help.
I'm trying to work out if it is actually desirable to separate the larger projects onto one thread. The only way you can have a smaller project dumped more often is to have the larger ones dumped less often, but do we really want less frequent enwiki dumps? By separating them and sharing them fairly between the threads you can get more regular dumps, but the significant number is surely the amount of time between one dump of your favourite project and the next, which will only change if you share the projects unfairly. Why do we want small projects to be dumped more frequently than large projects?
I guess the answer, really, is to get more servers doing dumps - I'm sure that will come in time.
Look at it this way: you can't get enwiki dumps more than once every six weeks. Each one TAKES SIX WEEKS. (modulo lots of stuff, I'm simplifying a bit ;-)
The example I have used before is going into my bank: in the main Queensway office, there will be 50-100 people on the queue. When there are 8-10 tellers, it will go well; except that some transactions (depositing some cash) take a minute or so, and some take many, many minutes. If there are 8 tellers, and 8 people in front of you with 20-30 minute transactions, you are toast. (They handle this by having fast lines for deposits and such ;-)
In general, one queue feeding multiple servers/threads works very nicely if the tasks are about the same size.
But what we have here is projects that take less than a minute, in the same queue with projects that take weeks. That is 5 orders of magnitude: in the time it takes to do the enwiki dump, the same thread could do ONE HUNDRED THOUSAND small projects.
Imagine walking into your bank with a 30 second transaction, and being told it couldn't be completed for 6 weeks because there were 3 officers available, and 5 people who needed complicated loan approvals on the queue in front of you.
That's the way the dumps are set up right now.
On Sat, Oct 11, 2008 at 2:49 AM, Thomas Dalton thomas.dalton@gmail.com wrote:
I'm trying to work out if it is actually desirable to separate the larger projects onto one thread. The only way you can have a smaller project dumped more often is to have the larger ones dumped less often, but do we really want less frequent enwiki dumps? By separating them and sharing them fairly between the threads you can get more regular dumps, but the significant number is surely the amount of time between one dump of your favourite project and the next, which will only change if you share the projects unfairly. Why do we want small projects to be dumped more frequently than large projects?
I guess the answer, really, is to get more servers doing dumps - I'm sure that will come in time.
2008/10/11 Robert Ullmann rlullmann@gmail.com:
Look at it this way: you can't get enwiki dumps more than once every six weeks. Each one TAKES SIX WEEKS. (modulo lots of stuff, I'm simplifying a bit ;-)
The example I have used before is going into my bank: in the main Queensway office, there will be 50-100 people on the queue. When there are 8-10 tellers, it will go well; except that some transactions (depositing some cash) take a minute or so, and some take many, many minutes. If there are 8 tellers, and 8 people in front of you with 20-30 minute transactions, you are toast. (They handle this by having fast lines for deposits and such ;-)
Your analogy is flawed. In that analogy the desire is to minimise the amount of time between walking in the door and completing your transaction, but in our case we desire to minimise the amount of time between a person completing one transaction and that person completing their next transaction in an ever repeating loop. The circumstances are not the same.
And you can have enwiki dumps less than 6 weeks apart; it will just involve having more than one running at a time.
Thomas Dalton wrote:
2008/10/11 Robert Ullmann rlullmann@gmail.com:
Look at it this way: you can't get enwiki dumps more than once every six weeks. Each one TAKES SIX WEEKS. (modulo lots of stuff, I'm simplifying a bit ;-)
The example I have used before is going into my bank: in the main Queensway office, there will be 50-100 people on the queue. When there are 8-10 tellers, it will go well; except that some transactions (depositing some cash) take a minute or so, and some take many, many minutes. If there are 8 tellers, and 8 people in front of you with 20-30 minute transactions, you are toast. (They handle this by having fast lines for deposits and such ;-)
Your analogy is flawed. In that analogy the desire is to minimise the amount of time between walking in the door and completing your transaction, but in our case we desire to minimise the amount of time between a person completing one transaction and that person completing their next transaction in an ever repeating loop. The circumstances are not the same.
And you can have enwiki dumps less than 6 weeks apart; it will just involve having more than one running at a time.
AIUI (but please correct me if I'm wrong), you can't. At least not without throwing more hardware at it. Otherwise, if you try to run two enwiki dumps concurrently on the same hardware, you'll find that they both finish in _twelve_ weeks instead of six.
If not, let's just run _all_ the dumps in parallel simultaneously, and the problem is solved! ...right?
AIUI (but please correct me if I'm wrong), you can't. At least not without throwing more hardware at it. Otherwise, if you try to run two enwiki dumps concurrently on the same hardware, you'll find that they both finish in _twelve_ weeks instead of six.
If not, let's just run _all_ the dumps in parallel simultaneously, and the problem is solved! ...right?
We're already running multiple threads; I can't see why you can't have more than one of them doing the same project. We could run them all in parallel if you want to donate enough servers to have one thread per project.
On Sat, Oct 11, 2008 at 3:50 AM, Thomas Dalton thomas.dalton@gmail.com wrote:
2008/10/11 Robert Ullmann rlullmann@gmail.com:
Look at it this way: you can't get enwiki dumps more than once every six weeks. Each one TAKES SIX WEEKS. (modulo lots of stuff, I'm simplifying a bit ;-)
The example I have used before is going into my bank: in the main Queensway office, there will be 50-100 people on the queue. When there are 8-10 tellers, it will go well; except that some transactions (depositing some cash) take a minute or so, and some take many, many minutes. If there are 8 tellers, and 8 people in front of you with 20-30 minute transactions, you are toast. (They handle this by having fast lines for deposits and such ;-)
Your analogy is flawed. In that analogy the desire is to minimise the amount of time between walking in the door and completing your transaction, but in our case we desire to minimise the amount of time between a person completing one transaction and that person completing their next transaction in an ever repeating loop. The circumstances are not the same.
No, the analogy is exactly correct; your statement of the problem is not. There is no reason whatever that a hundred other projects should have to wait six weeks to be "fair", just because the enwiki takes that long. Just as there is no reason for the person with the 30 second daily transaction to wait behind someone spending 30 minutes settling their monthly KRA (tax authority) accounts.
We aren't going to get enwiki dumps more often than 6 weeks. (Unless/until whatever rearrangement Brion is planning.) But at the same time, there is no reason whatever that smaller projects can't get dumps every week consistently; they just need a thread that only serves them. Just like that "deposits only in 500's and '1000's bills" teller at the bank.
On Sat, Oct 11, 2008 at 3:32 PM, Robert Ullmann rlullmann@gmail.com wrote:
On Sat, Oct 11, 2008 at 3:50 AM, Thomas Dalton thomas.dalton@gmail.com wrote:
2008/10/11 Robert Ullmann rlullmann@gmail.com:
Look at it this way: you can't get enwiki dumps more than once every six weeks. Each one TAKES SIX WEEKS. (modulo lots of stuff, I'm simplifying a bit ;-)
The example I have used before is going into my bank: in the main Queensway office, there will be 50-100 people on the queue. When there are 8-10 tellers, it will go well; except that some transactions (depositing some cash) take a minute or so, and some take many, many minutes. If there are 8 tellers, and 8 people in front of you with 20-30 minute transactions, you are toast. (They handle this by having fast lines for deposits and such ;-)
Your analogy is flawed. In that analogy the desire is to minimise the amount of time between walking in the door and completing your transaction, but in our case we desire to minimise the amount of time between a person completing one transaction and that person completing their next transaction in an ever repeating loop. The circumstances are not the same.
No, the analogy is exactly correct; your statement of the problem is not. There is no reason whatever that a hundred other projects should have to wait six weeks to be "fair", just because the enwiki takes that long. Just as there is no reason for the person with the 30 second daily transaction to wait behind someone spending 30 minutes settling their monthly KRA (tax authority) accounts.
We aren't going to get enwiki dumps more often than 6 weeks. (Unless/until whatever rearrangement Brion is planning.) But at the same time, there is no reason whatever that smaller projects can't get dumps every week consistently; they just need a thread that only serves them. Just like that "deposits only in 500's and '1000's bills" teller at the bank.
(Maybe a bit clearer comparison if you note that a shopkeeper can make 2 deposits a day because of the fast lines. If they weren't there, the main queue would take 6 hours to get through (it would get a lot longer, even though the net transaction rate is about the same), and the shopkeeper would be limited to perhaps one transaction a week. As it is, the queue can easily take an hour ;-)
(Maybe a bit clearer comparison if you note that a shopkeeper can make 2 deposits a day because of the fast lines. If they weren't there, the main queue would take 6 hours to get through (it would get a lot longer, even though the net transaction rate is about the same), and the shopkeeper would be limited to perhaps one transaction a week. As it is, the queue can easily take an hour ;-)
Ok, how about we abandon the use of analogies? This one is just as flawed. If the shopkeeper in your example was like our dumps, he could make 4 deposits a day, 24/6=4. Our dump threads don't need to go and tend a shop in between dumps; they can be in the queue constantly.
2008/10/11 Thomas Dalton thomas.dalton@gmail.com:
(Maybe a bit clearer comparison if you note that a shopkeeper can make 2 deposits a day because of the fast lines. If they weren't there, the main queue would take 6 hours to get through (it would get a lot longer, even though the net transaction rate is about the same), and the shopkeeper would be limited to perhaps one transaction a week. As it is, the queue can easily take an hour ;-)
Ok, how about we abandon the use of analogies? This one is just as flawed. If the shopkeeper in your example was like our dumps, he could make 4 deposits a day, 24/6=4. Our dump threads don't need to go and tend a shop in between dumps; they can be in the queue constantly.
I think what we really need here is a bad car analogy. See, the Internet is like a series of interstate highways going through tunnels. And when we take a dump of a Wikipedia, it might be an eighteen-wheeler or it might be a Matchbox car. MySQL is diesel, Postgres is premium unleaded. And the radio only does AM, but the music is better. Also, keep your tyre pressures up. I'm sure you can see where I'm going with this.
- d.
On Sat, Oct 11, 2008 at 1:11 PM, David Gerard dgerard@gmail.com wrote:
2008/10/11 Thomas Dalton thomas.dalton@gmail.com:
(Maybe a bit clearer comparison if you note that a shopkeeper can make 2 deposits a day because of the fast lines. If they weren't there, the main queue would take 6 hours to get through (it would get a lot longer, even though the net transaction rate is about the same), and the shopkeeper would be limited to perhaps one transaction a week. As it is, the queue can easily take an hour ;-)
Ok, how about we abandon the use of analogies? This one is just as flawed. If the shopkeeper in your example was like our dumps, he could make 4 deposits a day, 24/6=4. Our dump threads don't need to go and tend a shop in between dumps; they can be in the queue constantly.
I think what we really need here is a bad car analogy. See, the Internet is like a series of interstate highways going through tunnels. And when we take a dump of a Wikipedia, it might be an eighteen-wheeler or it might be a Matchbox car. MySQL is diesel, Postgres is premium unleaded. And the radio only does AM, but the music is better. Also, keep your tyre pressures up. I'm sure you can see where I'm going with this.
- d.
The only problem with the premium unleaded is that it fails to power the 18 wheelers. And diesel doesn't help a Matchbox car either.
Now, if we start talking hydrogen fuel cells, the entire argument is moot.
-Chad
On Sat, Oct 11, 2008 at 7:27 PM, Chad innocentkiller@gmail.com wrote:
On Sat, Oct 11, 2008 at 1:11 PM, David Gerard dgerard@gmail.com wrote:
I think what we really need here is a bad car analogy. See, the Internet is like a series of interstate highways going through tunnels. And when we take a dump of a Wikipedia, it might be an eighteen-wheeler or it might be a Matchbox car. MySQL is diesel, Postgres is premium unleaded. And the radio only does AM, but the music is better. Also, keep your tyre pressures up. I'm sure you can see where I'm going with this.
The only problem with the premium unleaded is that it fails to power the 18 wheelers. And diesel doesn't help a Matchbox car either.
Now, if we start talking hydrogen fuel cells, the entire argument is moot.
mod parent up
On Sun, Oct 12, 2008 at 5:11 AM, David Gerard dgerard@gmail.com wrote:
I think what we really need here is a bad car analogy. See, the Internet is like a series of interstate highways going through tunnels. And when we take a dump of a Wikipedia, it might be an eighteen-wheeler or it might be a Matchbox car. MySQL is diesel, Postgres is premium unleaded. And the radio only does AM, but the music is better. Also, keep your tyre pressures up. I'm sure you can see where I'm going with this.
This analogy is terrible. We all know that the internet is not a big truck.
On Sun, Oct 12, 2008 at 8:56 AM, Andrew Garrett andrew@epstone.net wrote:
On Sun, Oct 12, 2008 at 5:11 AM, David Gerard dgerard@gmail.com wrote:
I think what we really need here is a bad car analogy. See, the Internet is like a series of interstate highways going through tunnels. And when we take a dump of a Wikipedia, it might be an eighteen-wheeler or it might be a Matchbox car. MySQL is diesel, Postgres is premium unleaded. And the radio only does AM, but the music is better. Also, keep your tyre pressures up. I'm sure you can see where I'm going with this.
This analogy is terrible. We all know that the internet is not a big truck.
Concurrency with trucks is easy: you send 2 trucks and they will use different routes.
The backup system seems more like a train, which locks the rails for other trains while the slow big one is using them.
2008/10/12 Tei oscar.vives@gmail.com:
On Sun, Oct 12, 2008 at 8:56 AM, Andrew Garrett andrew@epstone.net wrote:
On Sun, Oct 12, 2008 at 5:11 AM, David Gerard dgerard@gmail.com wrote:
I think what we really need here is a bad car analogy. See, the Internet is like a series of interstate highways going through tunnels. And when we take a dump of a Wikipedia, it might be an eighteen-wheeler or it might be a Matchbox car. MySQL is diesel, Postgres is premium unleaded. And the radio only does AM, but the music is better. Also, keep your tyre pressures up. I'm sure you can see where I'm going with this.
This analogy is terrible. We all know that the internet is not a big truck.
Concurrency with trucks is easy: you send 2 trucks and they will use different routes.
The backup system seems more like a train, which locks the rails for other trains while the slow big one is using them.
So what we need to do is install some sidings so the big train can pull over and let the faster ones by.
On Sat, Oct 11, 2008 at 10:32 PM, Robert Ullmann rlullmann@gmail.com wrote:
No, the analogy is exactly correct; your statement of the problem is not. There is no reason whatever that a hundred other projects should have to wait six weeks to be "fair", just because the enwiki takes that long. Just as there is no reason for the person with the 30 second daily transaction to wait behind someone spending 30 minutes settling their monthly KRA (tax authority) accounts.
He's right, your analogy is misguided. You're placing importance on the time between the beginning of the round of dumps until the completion of the dump for a particular project, when what's really important is the time between dumps for that particular project.
Say that dumping project A takes 3 days, project B 8 days and project C one day. Your argument seems to be that since project C is the fastest to complete, it should be dumped first. But your argument doesn't take into account that these are not one-off transactions, but repeated ones.
If the dumping cycle is repeated monthly, starting on the first of the month, then project A will get their dump on the 4th, B on the 12th and C not until the 13th of the month. But C's previous dump will have been on the 13th of the preceding month, so just like A and B, C had to wait a month for their dump.
On Sat, Oct 11, 2008 at 9:25 AM, Stephen Bain stephen.bain@gmail.com wrote:
He's right, your analogy is misguided. You're placing importance on the time between the beginning of the round of dumps until the completion of the dump for a particular project, when what's really important is the time between dumps for that particular project.
Say that dumping project A takes 3 days, project B 8 days and project C one day. Your argument seems to be that since project C is the fastest to complete, it should be dumped first. But your argument doesn't take into account that these are not one-off transactions, but repeated ones.
If the dumping cycle is repeated monthly, starting on the first of the month, then project A will get their dump on the 4th, B on the 12th and C not until the 13th of the month. But C's previous dump will have been on the 13th of the preceding month, so just like A and B, C had to wait a month for their dump.
There is a difference, depending where you make the split:
Say Project A takes 1 month to dump, Project B takes 1 month to dump, and 30 other projects take 1 day each to dump. Say you're CPU bound and have 2 CPUs, so you run 2 threads.
Method 1: Project A and Project B take 1 month to complete. Then the rest of the projects take 1/2 month to complete. Time between successive dumps is 1.5 months.
Method 2: Project A takes 1 month to complete while the 30 other projects complete. Then Project B takes 1 month to complete while the other 30 projects complete again. Time between successive dumps is 2 months for projects A and B, and 1 month for the rest of the projects.
On the other hand, if the 30 "other projects" took 4 days to complete, the times would be 2 months by method 1, and 2 months/4 months by method 2. It all depends where you make the split, and it's not clear to me what split is the most "fair".
Anthony
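[A quick sanity check of those two methods, using the first set of numbers above and assuming work splits evenly across the two threads with no failures; this is an illustrative toy model, not the actual scheduler:]

    # Toy model of the two scheduling methods, durations in days (1 month ~= 30).
    big = [30, 30]        # projects A and B
    small = [1] * 30      # thirty small projects, 1 day each

    # Method 1: both threads share one queue, so each cycle is half the total work.
    method1_cycle = (sum(big) + sum(small)) / 2.0
    print(method1_cycle)            # 45.0 -> every project repeats every ~1.5 months

    # Method 2: one thread reserved for the big projects, one for the small ones.
    print(sum(big), sum(small))     # 60, 30 -> A and B every 2 months, small ones monthly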
There is a difference, depending where you make the split:
Say Project A takes 1 month to dump, Project B takes 1 month to dump, and 30 other projects take 1 day each to dump. Say you're CPU bound and have 2 CPUs, so you run 2 threads.
Method 1: Project A and Project B take 1 month to complete. Then the rest of the projects take 1/2 month to complete. Time between successive dumps is 1.5 months.
Method 2: Project A takes 1 month to complete while the 30 other projects complete. Then Project B takes 1 month to complete while the other 30 projects complete again. Time between successive dumps is 2 months for projects A and B, and 1 month for the rest of the projects.
Indeed, you can make some dumps more frequent at the expense of making others less frequent. No-one has yet explained why small dumps should be more frequent than large ones.
On the other hand, if the 30 "other projects" took 4 days to complete, the times would be 2 months by method 1, and 2 months/4 months by method 2. It all depends where you make the split, and it's not clear to me what split is the most "fair".
You've made a mistake in your maths (or a typo): method 2 should have one thread dumping more frequently than method 1 and the other less frequently; you have one thread the same and the other less frequent (which would be strictly worse than method 1).
On Sat, Oct 11, 2008 at 10:26 AM, Thomas Dalton thomas.dalton@gmail.com wrote:
Indeed, you can make some dumps more frequent at the expense of making others less frequent. No-one has yet explained why small dumps should be more frequent than large ones.
Or, for that matter, what is gained by more frequent dumps, period. 6 weeks isn't a massive amount of time...
-Chad
On Sat, Oct 11, 2008 at 6:30 PM, Chad innocentkiller@gmail.com wrote:
On Sat, Oct 11, 2008 at 10:26 AM, Thomas Dalton thomas.dalton@gmail.com wrote:
Indeed, you can make some dumps more frequent at the expense of making others less frequent. No-one has yet explained why small dumps should be more frequent than large ones.
Or, for that matter, what is gained by more frequent dumps, period. 6 weeks isn't a massive amount of time...
On en.wikt, we have several dozen reports and such that need updating to manage a lot of details. Six weeks is *interminable*, which is why we are running daily incrementals now.
Why should small dumps be more frequent than large ones? Because they should be weekly. The problem is that the large ones take much too long, and clog the queue.
This is not rocket science, people; it just needs one thread that doesn't get blocked. Simple to do. In the loop in findAndLockNextWiki (for db), do:
if '--small' in sys.argv[1:] and db.description() in ['enwiki', 'dewiki', 'frwiki', 'plwiki', 'jawiki', 'itwiki', 'nlwiki', 'ptwiki', 'eswiki', 'ruwiki']:
    continue
that is all. Then run one thread with --small.
(those are the 10 largest pedias, the ones with more than 10M edits)
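[Very roughly, something like the sketch below; the loop structure and the config/db helper methods are simplified stand-ins for whatever the backup scripts actually use, not a patch against the real findAndLockNextWiki:]

    # Hypothetical sketch of a "--small" worker filter; the objects and their
    # methods here are illustrative, not the real backup code.
    import sys

    LARGE_WIKIS = ['enwiki', 'dewiki', 'frwiki', 'plwiki', 'jawiki',
                   'itwiki', 'nlwiki', 'ptwiki', 'eswiki', 'ruwiki']

    def find_and_lock_next_wiki(config):
        small_only = '--small' in sys.argv[1:]
        for db in config.dbListByAge():       # least recently dumped first (assumed helper)
            if small_only and db.name() in LARGE_WIKIS:
                continue                      # a --small thread never picks up a big wiki
            if db.isLocked():
                continue                      # some other worker is already on it
            db.lock()
            return db
        return None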
So this increases the frequency of dumps for small wikis, great.
But this means that the time between two dumps of the big wikis is _at_least_ the sum of the times needed to dump each one of the big wikis... more than 10 or 12 weeks, not counting any failures? I don't think you really want to do this.
2008/10/12 Robert Ullmann rlullmann@gmail.com:
On Sat, Oct 11, 2008 at 6:30 PM, Chad innocentkiller@gmail.com wrote:
On Sat, Oct 11, 2008 at 10:26 AM, Thomas Dalton thomas.dalton@gmail.com wrote:
Indeed, you can make some dumps more frequent at the expense of making others less frequent. No-one has yet explained why small dumps should be more frequent than large ones.
Or, for that matter, what is gained by more frequent dumps, period. 6 weeks isn't a massive amount of time...
On en.wikt, we have several dozen reports and such that need updating to manage a lot of details. Six weeks is *interminable*, which is why we are running daily incrementals now.
Why should small dumps be more frequent than large ones? Because they should be weekly. The problem is that the large ones take much too long, and clog the queue.
This is not rocket science, people; it just needs one thread that doesn't get blocked. Simple to do. In the loop in findAndLockNextWiki (for db), do:
if '--small' in sys.argv[1:] and db.description() in ['enwiki', 'dewiki', 'frwiki', 'plwiki', 'jawiki', 'itwiki', 'nlwiki', 'ptwiki', 'eswiki', 'ruwiki']:
    continue
that is all. Then run one thread with --small.
(those are the 10 largest pedias, the ones with more than 10M edits)
2008/10/11 Nicolas Dumazet nicdumz@gmail.com:
So this increases the frequency of dumps for small wikis, great.
But this means that the time between two dumps of the big wikis is _at_least_ the sum of the times needed to dump each one of the big wikis... more than 10 or 12 weeks, not counting any failures? I don't think you really want to do this.
Exactly. The only way you can speed up the smaller dumps is to slow down the bigger ones (or throw more money at the problem), and no-one has given any reason why we should prioritise smaller dumps.
On Sat, Oct 11, 2008 at 6:32 PM, Thomas Dalton thomas.dalton@gmail.com wrote:
2008/10/11 Nicolas Dumazet nicdumz@gmail.com:
So this increases the frequency of dumps for small wikis, great.
But this means that the time between two dumps of the big wikis is _at_least_ the sum of the times needed to dump each one of the big wikis... more than 10 or 12 weeks, not counting any failures? I don't think you really want to do this.
Exactly. The only way you can speed up the smaller dumps is to slow down the bigger ones (or throw more money at the problem), and no-one has given any reason why we should prioritise smaller dumps.
Processing a huge wiki takes the bot owners longer, so not having a fresh dump as often would not be felt until the jobs run on the previous one are complete. On the other hand, jobs on smaller or medium wikis complete far faster, so their bot owners would be idle for much more time than a bot owner working on a larger wiki. Downloading a whole Wikipedia article by article becomes too slow to be a comfortable option beyond a certain size (it costs and wastes bandwidth and, more importantly, the valuable time of the editor overseeing the bot, which downloads and analyses articles doing nothing until it finds its target [as opposed to finding all the targets fast and then working on just them]).
Bence Damokos
My longest bot job on enwiki lasts one week, way less than... 12 weeks. Processing a few thousand pages on small wikis takes only a few hours. I don't know of any bot job that runs longer than the time between two dumps of the wiki in question.
The more time you put between two dumps, the more changes there are and the longer the bot jobs usually are. It also means that having a dump, say, every week for small wikis does not add much for bot jobs: if the job consists of fixing a single type of mistake, chances are that during one week only tens of these mistakes would have been introduced, and the bot job is likely to run really quickly.
For bot jobs, I really don't see any advantages in reducing the time between dumps for small wikis. There is not a lot of activity, meaning not a lot to do.
Other applications might require fresher updates of small wiki dumps, but I don't know of any bot task that needs a faster dump rate.
2008/10/12 Bence Damokos bdamokos@gmail.com:
On Sat, Oct 11, 2008 at 6:32 PM, Thomas Dalton thomas.dalton@gmail.comwrote:
2008/10/11 Nicolas Dumazet nicdumz@gmail.com:
So this increases the frequency of dumps for small wikis, great.
But this means that the time between two dumps of the big wikis is _at_least_ the sum of the times needed to dump each one of the big wikis... more than 10 or 12 weeks, not counting any failures? I don't think you really want to do this.
Exactly. The only way you can speed up the smaller dumps is to slow down the bigger ones (or throw more money at the problem), and no-one has given any reason why we should prioritise smaller dumps.
Processing a huge wiki takes the bot owners longer, so not having a fresh dump as often would not be felt until the jobs run on the previous one are complete. On the other hand, jobs on smaller or medium wikis complete far faster, so their bot owners would be idle for much more time than a bot owner working on a larger wiki. Downloading a whole Wikipedia article by article becomes too slow to be a comfortable option beyond a certain size (it costs and wastes bandwidth and, more importantly, the valuable time of the editor overseeing the bot, which downloads and analyses articles doing nothing until it finds its target [as opposed to finding all the targets fast and then working on just them]).
Bence Damokos
On Sat, Oct 11, 2008 at 12:32 PM, Thomas Dalton thomas.dalton@gmail.com wrote:
Exactly. The only way you can speed up the smaller dumps is to slow down the bigger ones (or throw more money at the problem), and no-one has given any reason why we should prioritise smaller dumps.
What about the original idea of doing the "current" dumps in a separate thread from the "history" dumps? With only one thread doing the "history" dumps they'd take a really long time, but at least the "current" dumps would finish in a reasonable amount of time.
2008/10/11 Anthony wikimail@inbox.org:
On Sat, Oct 11, 2008 at 12:32 PM, Thomas Dalton thomas.dalton@gmail.com wrote:
Exactly. The only way you can speed up the smaller dumps is to slow down the bigger ones (or throw more money at the problem), and no-one has given any reason why we should prioritise smaller dumps.
What about the original idea of doing the "current" dumps in a separate thread from the "history" dumps? With only one thread doing the "history" dumps they'd take a really long time, but at least the "current" dumps would finish in a reasonable amount of time.
There's something in that. Speed up the quick parts of all the dumps at the expense of the slow parts of all the dumps. It's fair to each project and probably correlates well with how people actually use the dumps (do we have download logs to see how often people actually download the whole thing?).
There seem to be problems with image loss again.
At least 12 from enwiki recently reported in http://en.wikipedia.org/wiki/Category:Missing_images_for_speedy_deletion
On 12/10/2008, at 7:39 AM, Alex mrzmanwiki@gmail.com wrote:
There seem to be problems with image loss again.
At least 12 from enwiki recently reported in http://en.wikipedia.org/wiki/Category:Missing_images_for_speedy_deletion
Speedily deleting missing images is an extraordinarily bad idea. I'm deleting that category and the associated template.
On Sat, Oct 11, 2008 at 8:32 AM, Robert Ullmann rlullmann@gmail.com wrote:
We aren't going to get enwiki dumps more often than 6 weeks. (Unless/until whatever rearrangement Brion is planning.) But at the same time, there is no reason whatever that smaller projects can't get dumps every week consistently; they just need a thread that only serves them. Just like that "deposits only in 500's and '1000's bills" teller at the bank.
Right now there are 2 threads, right? Adding 2 more threads might make that 6 weeks turn into 13 weeks, or even worse if the process is I/O bound.
The bank teller analogy is a good one, but I'm not sure how many tellers we have.
2008/10/11 Anthony wikimail@inbox.org:
On Sat, Oct 11, 2008 at 8:32 AM, Robert Ullmann rlullmann@gmail.com wrote:
We aren't going to get enwiki dumps more often than 6 weeks. (Unless/until whatever rearrangement Brion is planning.) But at the same time, there is no reason whatever that smaller projects can't get dumps every week consistently; they just need a thread that only serves them. Just like that "deposits only in 500's and '1000's bills" teller at the bank.
Right now there are 2 threads, right? Adding 2 more threads might make that 6 weeks turn into 13 weeks, or even worse if the process is I/O bound.
The bank teller analogy is a good one, but I'm not sure how many tellers we have.
If you want to add two more threads you would need to add more servers to run them. I can't see why they would be I/O bound; they aren't writing anything to the database, so they can just use slave database servers, of which there are plenty.
On Sat, Oct 11, 2008 at 9:39 AM, Thomas Dalton thomas.dalton@gmail.com wrote:
If you want to add two more threads you would need to add more servers to run them. I can't see why they would be I/O bound; they aren't writing anything to the database, so they can just use slave database servers, of which there are plenty.
My understanding is that the processes have to read from the stub dumps and write to the history dumps. In my experience, though admittedly with cheap hard drives and an untweaked Linux system running ext3, running two such processes (accessing 4 files on the same hard drive) on a dual-CPU system is I/O bound (presumably due to the disk seeks, since you can hear the difference on the poor hard drive). I'm sure some tweaks could be made to solve this, but I haven't figured them out yet (at least, not other than the obvious one of using 2 or even 4 hard drives, which I just can't afford).
2008/10/11 Anthony wikimail@inbox.org:
On Sat, Oct 11, 2008 at 9:39 AM, Thomas Dalton thomas.dalton@gmail.com wrote:
If you want to add two more threads you would need to add more servers to run them. I can't see why they would be I/O bound; they aren't writing anything to the database, so they can just use slave database servers, of which there are plenty.
My understanding is that the processes have to read from the stub dumps and write to the history dumps. In my experience, though admittedly with cheap hard drives and an untweaked Linux system running ext3, running two such processes (accessing 4 files on the same hard drive) on a dual-CPU system is I/O bound (presumably due to the disk seeks, since you can hear the difference on the poor hard drive). I'm sure some tweaks could be made to solve this, but I haven't figured them out yet (at least, not other than the obvious one of using 2 or even 4 hard drives, which I just can't afford).
Hard drives are pretty cheap; installing a second drive to take better advantage of a dual-CPU system wouldn't be too difficult.
On Sat, Oct 11, 2008 at 10:22 AM, Thomas Dalton thomas.dalton@gmail.com wrote:
Hard drives are pretty cheap; installing a second drive to take better advantage of a dual-CPU system wouldn't be too difficult.
Installing it wouldn't be difficult. Justifying the expenditure to my wife while our net worth is negative and we're making less than 150% of the poverty level in income, would be.
But enough about me. That wasn't my point, anyway. The point was, I don't know if the current process is CPU-bound or I/O-bound. There's certainly not enough information available in the SVN to determine this. I tend to think the process *is* I/O-bound simply because I don't see anything being done (to create the bz2 history dump) that is CPU-intensive, other than bzipping and bunzipping, and my back-of-the-envelope estimate is that this should take about 3 or 4 days (on one of my crappy several-year-old processors), not 6 weeks. On the other hand, I assume that if this *was* the problem, enough extra disks would have been added by now to mitigate it.
You also have to take demand into consideration; how many people are waiting for dumps of enwiki, dewiki, etc. vs. how many are waiting for the smaller wikis? (Not a rhetorical question, I'd be interested in the answer.) To use the bank analogy, if everyone is waiting for a loan, you don't move your loan officers to the teller windows just because they can process small transactions faster. Note also that several dozen of the smallest wikis have fewer than 5000 articles. If someone has a bot or sysop account, they can get the current revision of every article with a single API query. While a dump would be more efficient and probably slightly faster, getting the current revision for every article on a large wiki basically requires a dump.
Robert Ullmann wrote:
Look at it this way: you can't get enwiki dumps more than once every six weeks. Each one TAKES SIX WEEKS. (modulo lots of stuff, I'm simplifying a bit ;-)
The example I have used before is going into my bank: in the main Queensway office, there will be 50-100 people on the queue. When there are 8-10 tellers, it will go well; except that some transactions (depositing some cash) take a minute or so, and some take many, many minutes. If there are 8 tellers, and 8 people in front of you with 20-30 minute transactions, you are toast. (They handle this by having fast lines for deposits and such ;-)
In general, one queue feeding multiple servers/threads works very nicely if the tasks are about the same size.
But what we have here is projects that take less than a minute, in the same queue with projects that take weeks. That is 5 orders of magnitude: in the time it takes to do the enwiki dump, the same thread could do ONE HUNDRED THOUSAND small projects.
Imagine walking into your bank with a 30 second transaction, and being told it couldn't be completed for 6 weeks because there were 3 officers available, and 5 people who needed complicated loan approvals on the queue in front of you.
That's the way the dumps are set up right now.
On Sat, Oct 11, 2008 at 2:49 AM, Thomas Dalton thomas.dalton@gmail.com wrote:
I'm trying to work out if it is actually desirable to separate the larger projects onto one thread. The only way you can have a smaller project dumped more often is to have the larger ones dumped less often, but do we really want less frequent enwiki dumps? By separating them and sharing them fairly between the threads you can get more regular dumps, but the significant number is surely the amount of time between one dump of your favourite project and the next, which will only change if you share the projects unfairly. Why do we want small projects to be dumped more frequently than large projects?
I guess the answer, really, is to get more servers doing dumps - I'm sure that will come in time.
On Fri, Oct 10, 2008 at 7:49 PM, Thomas Dalton thomas.dalton@gmail.com wrote:
I guess the answer, really, is to get more servers doing dumps - I'm sure that will come in time.
No, the answer, really, is to do the dumps more efficiently. Brion says this should come in the next couple months.
Anthony
Hey !
May I mention that the scripts generating the dumps and handling the scheduling are written in Python and are available on the Wikimedia SVN? [1]
If you have some improvements to suggest on the task scheduling, I guess that patches are welcome :)
In May, following another wikitech-l discussion [2], some small improvements were made to the dump processing to prioritize the wikis that haven't been successfully dumped in a long time. Previously we did not take into account the fact that some dump attempts failed, and only ordered the dumps by "last dump try start time", leading to some inconsistencies.
If I'm right, you should also consider the fact that the XML dumping process relies on the previous dumps to run faster: in other words, if you have a recent XML dump, it is faster to work with that existing dump because you can fetch text records from the old dump instead of fetching them from external storage, which also requires normalizing and decompressing. Here, the latest dump available for enwiki is from July, meaning a lot of new text to fetch from external storage: this first dump *will* take a long time, but you should expect the next dumps to go faster.
[1] http://svn.wikimedia.org/viewvc/mediawiki/trunk/backup/ [2] http://thread.gmane.org/gmane.science.linguistics.wikipedia.technical/38401/...
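[In other words, something along these lines; the function and object names below are illustrative assumptions, not the actual dump code:]

    # Illustrative sketch of the prefetch behaviour described above.
    def revision_text(rev_id, previous_dump, external_storage):
        """Prefer the copy already in the last dump; fall back to the slower
        external storage cluster only for revisions the old dump doesn't have."""
        text = previous_dump.get(rev_id)        # cheap: local, already normalized
        if text is not None:
            return text
        return external_storage.fetch(rev_id)   # expensive: fetch + decompress + normalize

    # The longer the gap since the last dump, the more revisions miss the
    # prefetch and have to come from external storage -- hence a slow first run.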
2008/10/11 Anthony wikimail@inbox.org:
On Fri, Oct 10, 2008 at 7:49 PM, Thomas Dalton thomas.dalton@gmail.com wrote:
I guess the answer, really, is to get more servers doing dumps - I'm sure that will come in time.
No, the answer, really, is to do the dumps more efficiently. Brion says this should come in the next couple months.
Anthony
On Fri, Oct 10, 2008 at 10:09 PM, Nicolas Dumazet nicdumz@gmail.com wrote:
Hey !
May I mention that the scripts generating the dumps and handling the scheduling are written in Python and are available on the Wikimedia SVN? [1]
Well, you can, but I already knew that.
If you have some improvements to suggest on the task scheduling, I guess that patches are welcome :)
Well, I don't know Python, and I'd advocate rewriting the dump system from scratch anyway, but 1) I'd really need access to the SQL server in order to do that; and 2) if I put that much work into something, I need some sort of financial reward. Hiring me and/or paying for my family's health care is welcome as well.
I'm actually working on redoing the full history bz2 dump as a bunch of smaller bz2 files (of 900K or less uncompressed text each) so they can be accessed randomly without losing the compression. But it's going to take a while for me to complete it, since I don't have a very fast machine or hard drives, and I don't have a lot of time to spend on it since working on it has little potential to feed, clothe, or shelter my family. And when I finish it, I'm probably not going to give it away for free, on the off chance that maybe I can sell it to buy my daughter diapers or buy my son milk or something.
I'm a terrible person, aren't I?
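[For what it's worth, a minimal sketch of that chunking idea, not Anthony's actual tool; it splits on raw byte offsets for simplicity, where a real tool would split on page/revision boundaries:]

    # Minimal sketch: cut the uncompressed stream into blocks of <= 900 KB,
    # compress each block separately, and keep an index so any block can be
    # decompressed on its own without touching the rest.
    import bz2

    CHUNK = 900 * 1024   # bzip2's maximum block size for uncompressed input

    def write_chunks(uncompressed_path, out_prefix):
        index = []                       # (chunk filename, uncompressed offset)
        offset = 0
        n = 0
        with open(uncompressed_path, 'rb') as src:
            while True:
                block = src.read(CHUNK)
                if not block:
                    break
                name = '%s-%05d.bz2' % (out_prefix, n)
                with open(name, 'wb') as out:
                    out.write(bz2.compress(block))
                index.append((name, offset))
                offset += len(block)
                n += 1
        return index   # a reader can jump straight to the chunk covering a given offset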
Anthony wrote:
I'm actually working on redoing the full history bz2 dump as a bunch of smaller bz2 files (of 900K or less uncompressed text each) so they can be accessed randomly without losing the compression. But it's going to take a while for me to complete it, since I don't have a very fast machine or hard drives, and I don't have a lot of time to spend on it since working on it has little potential to feed, clothe, or shelter my family. And when I finish it, I'm probably not going to give it away for free, on the off chance that maybe I can sell it to buy my daughter diapers or buy my son milk or something.
Then you're lucky. I have a program for doing exactly that :)