Since today, the dump process has not been working correctly. It is running, but without any success.
Best regards
Andim
Brion Vibber wrote:
On 5/1/09 5:51 PM, Andreas Meier wrote:
Since today, the dump process has not been working correctly. It is running, but without any success.
Tomasz is on it... we've upgraded the machine they run on and it needs some more tweaking. :)
-- brion
Indeed. The backup job was missing the php normalize library. Putting that into place now. Then I'll see if there is any db weirdness.
--tomasz
"Tomasz Finc" tfinc@wikimedia.org wrote in message news:49FB3CA6.90602@wikimedia.org... Brion Vibber wrote:
On 5/1/09 5:51 PM, Andreas Meier wrote:
Since today, the dump process has not been working correctly. It is running, but without any success.
Tomasz is on it... we've upgraded the machine they run on and it needs some more tweaking. :)
Indeed. The backup job was missing the php normalize library. Putting that into place now. Then I'll see if there is any db weirdness.
But, on the bright side, every database in the system now has a dump that was completed within the last nine hours (roughly). When's the last time you could say *that*? :-)
Russell Blau wrote:
"Tomasz Finc" tfinc@wikimedia.org wrote in message news:49FB3CA6.90602@wikimedia.org... Brion Vibber wrote:
On 5/1/09 5:51 PM, Andreas Meier wrote:
Since today, the dump process has not been working correctly. It is running, but without any success.
Tomasz is on it... we've upgraded the machine they run on and it needs some more tweaking. :)
Indeed. The backup job was missing the php normalize library. Putting that into place now. Then I'll see if there is any db weirdness.
But, on the bright side, every database in the system now has a dump that was completed within the last nine hours (roughly). When's the last time you could say *that*? :-)
Mwahaha .. that would be awesome if it were actually useful data. The libs, binaries and configs have all been fixed. I've run a couple of batch jobs for the small wikis [tokiponawiktionary, emlwiki] and am running [afwiki] right now to try a bigger data set. No issues so far, other than the main page not noticing when they finish.
After afwiki finishes up I'll remove the failed runs as they don't provide us with any useful data. Will set the worker to begin processing after that. Plus I'll actually document the setup.
--tomasz
Tomasz Finc wrote:
Russell Blau wrote:
"Tomasz Finc" tfinc@wikimedia.org wrote in message news:49FB3CA6.90602@wikimedia.org... Brion Vibber wrote:
On 5/1/09 5:51 PM, Andreas Meier wrote:
Since today, the dump process has not been working correctly. It is running, but without any success.
Tomasz is on it... we've upgraded the machine they run on and it needs some more tweaking. :)
Indeed. The backup job was missing the php normalize library. Putting that into place now. Then I'll see if there is any db weirdness.
But, on the bright side, every database in the system now has a dump that was completed within the last nine hours (roughly). When's the last time you could say *that*? :-)
Mwahaha .. that would be awesome if it were actually useful data. The libs, binaries and configs have all been fixed. I've run a couple of batch jobs for the small wikis [tokiponawiktionary, emlwiki] and am running [afwiki] right now to try a bigger data set. No issues so far, other than the main page not noticing when they finish.
After afwiki finishes up I'll remove the failed runs as they don't provide us with any useful data. Will set the worker to begin processing after that. Plus I'll actually document the setup.
afwiki finished just fine and all subsequent wikis have been happy. The only issue left is that the version of 7za on Ubuntu 8.04 ignores the system umask and decides that 600 is good enough for everyone. This is fixed in 4.58 and I've requested a backport from the Ubuntu folks at
https://bugs.edge.launchpad.net/hardy-backports/+bug/370618
In the meantime I've forced a chmod of 644 into the dumps script.
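For reference, the workaround amounts to something like the following (a minimal sketch in Python rather than the literal code in the dumps script; the 7za invocation shown is only illustrative):

    import os
    import stat
    import subprocess

    def compress_dump(source_path, archive_path):
        # 7za "a" adds source_path to archive_path; the real dump script's
        # invocation and flags may differ.
        subprocess.check_call(["7za", "a", archive_path, source_path])
        # 7za 4.57 on hardy ignores the umask and leaves the archive mode 600,
        # so force it world-readable (0644) until the 4.58 backport is in.
        os.chmod(archive_path,
                 stat.S_IRUSR | stat.S_IWUSR | stat.S_IRGRP | stat.S_IROTH)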
--tomasz
Tomasz Finc wrote:
Tomasz Finc wrote:
After afwiki finishes up I'll remove the failed runs as they don't provide us with any useful data. Will set the worker to begin processing after that. Plus I'll actually document the setup.
afwiki finished just fine and all subsequent wikis have been happy. The only issue left is that the version of 7za on Ubuntu 8.04 ignores the system umask and decides that 600 is good enough for everyone. This is fixed in 4.58 and I've requested a backport from the Ubuntu folks at
https://bugs.edge.launchpad.net/hardy-backports/+bug/370618
In the meantime I've forced a chmod of 644 into the dumps script.
--tomasz
Last year and at the beginning of this year there were 5 dump processes running at the same time. Now there are only two. With 5 running processes it was possible to have a dump of each project once a month, but with 2 this is not possible. The system seems to be stable now, so can you increase the number of running jobs to 5?
Best regards
Andim
Andreas Meier wrote:
Tomasz Finc wrote:
Tomasz Finc wrote:
After afwiki finishes up I'll remove the failed runs as they don't provide us with any useful data. Will set the worker to begin processing after that. Plus I'll actually document the setup.
afwiki finished just fine and all subsequent wikis have been happy. The only issue left is that the version of 7za on Ubuntu 8.04 ignores the system umask and decides that 600 is good enough for everyone. This is fixed in 4.58 and I've requested a backport from the Ubuntu folks at
https://bugs.edge.launchpad.net/hardy-backports/+bug/370618
In the meantime I've forced a chmod of 644 into the dumps script.
--tomasz
Last year and at the beginning of this year there were 5 dump processes running at the same time. Now there are only two. With 5 running processes it was possible to have a dump of each project once a month, but with 2 this is not possible. The system seems to be stable now, so can you increase the number of running jobs to 5?
This has now been upped to 12 jobs running concurrently in order to catch up. No outstanding issues have surfaced yet.
-tomasz
Tomasz, the amount of dump power that you managed to activate is impressive. 136 dumps yesterday, today already 110 :-) Out of 760 total. Of course there are small and large dumps, but this is very encouraging.
Erik Zachte
"Erik Zachte" erikzachte@infodisiac.com wrote in message news:002d01c9cd8d$3355beb0$9a013c10$@com...
Tomasz, the amount of dump power that you managed to activate is impressive. 136 dumps yesterday, today already 110 :-) Out of 760 total. Of course there are small and large dumps, but this is very encouraging.
Yes, thank you Tomasz for your attention to this. The commonswiki process looks like it *might* be dead, by the way.
Russ
Russell Blau wrote:
"Erik Zachte" erikzachte@infodisiac.com wrote in message news:002d01c9cd8d$3355beb0$9a013c10$@com...
Tomasz, the amount of dump power that you managed to activate is impressive. 136 dumps yesterday, today already 110 :-) Out of 760 total. Of course there are small and large dumps, but this is very encouraging.
Yes, thank you Tomasz for your attention to this. The commonswiki process looks like it *might* be dead, by the way.
Don't think so, as I actively see it being updated. It's currently set to finish its second-to-last step on 2009-05-06 02:53:21.
No one touch anything while it's still going ;)
--tomasz
Tomasz Finc wrote:
Russell Blau wrote:
"Erik Zachte" erikzachte@infodisiac.com wrote in message news:002d01c9cd8d$3355beb0$9a013c10$@com...
Tomasz, the amount of dump power that you managed to activate is impressive. 136 dumps yesterday, today already 110 :-) Out of 760 total. Of course there are small and large dumps, but this is very encouraging.
Yes, thank you Tomasz for your attention to this. The commonswiki process looks like it *might* be dead, by the way.
Don't think so, as I actively see it being updated. It's currently set to finish its second-to-last step on 2009-05-06 02:53:21.
No one touch anything while it's still going ;)
Commons finished just fine, along with every single one of the other small and mid-size wikis waiting to be picked up. Now we're just left with the big wikis to finish.
--tomasz
"Tomasz Finc" tfinc@wikimedia.org wrote in message news:4A032BE3.60600@wikimedia.org...
Commons finished just fine, along with every single one of the other small and mid-size wikis waiting to be picked up. Now we're just left with the big wikis to finish.
This is probably a stupid question (because it depends on umpteen different variables), but would the remaining "big wikis" finish any faster if you stopped the dump processes for the smaller wikis that have already had a dump complete within the past week and are now starting on their second rounds?
Russ
Russell Blau wrote:
"Tomasz Finc" tfinc@wikimedia.org wrote in message news:4A032BE3.60600@wikimedia.org...
Commons finished just fine, along with every single one of the other small and mid-size wikis waiting to be picked up. Now we're just left with the big wikis to finish.
This is probably a stupid question (because it depends on umpteen different variables), but would the remaining "big wikis" finish any faster if you stopped the dump processes for the smaller wikis that have already had a dump complete within the past week and are now starting on their second rounds?
Not a bad question at all. I've actually been turning down the amount of work to see if it improves any of the larger ones. No increase in processing just yet.
--tomasz
Tomasz Finc wrote:
Commons finished just fine, along with every single one of the other small and mid-size wikis waiting to be picked up. Now we're just left with the big wikis to finish.
The new dump processes started on May 1 and sped up to twelve processes on May 4. As of yesterday, May 7, dumps have started on all databases. While the big ones (enwiki, dewiki, ...) are still running, tokiponawiktionary is the first to have its second dump in this round; they were produced on May 1 and 7. Soon, all small and medium-sized databases will have multiple dumps, at roughly 4-day intervals. This is a real improvement over the previous 12 months, and I really hope we don't fall down again.
Now, to be even more useful, database dumps should be produced at *regular* intervals. That way, we can compare various measures such as article growth, link counts or usage of certain words, without having to factor the exact dump time into the count.
An easy way to implement this is to delay the next dump of a database to exactly one week after the previous dump started.
For example, the last dump of svwiki (Swedish Wikipedia) started at 20:48 (UTC) on Tuesday May 5. So let this time of week (20:48 on Tuesdays) be the timeslot for svwiki. If its turn comes up any earlier, the next dump should be delayed until 20:48 on May 12.
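For illustration, the slot calculation itself is simple (a rough Python sketch; next_weekly_slot is a hypothetical helper, not part of the actual dump scheduler):

    from datetime import datetime, timedelta

    def next_weekly_slot(last_start, now):
        # Keep the weekday and time-of-day of the previous run and return the
        # first such weekly slot that has not yet passed.
        slot = last_start + timedelta(weeks=1)
        while slot < now:
            slot += timedelta(weeks=1)
        return slot

    # svwiki last started at 20:48 UTC on Tuesday, May 5; if its turn comes up
    # earlier than a week later, it waits until 20:48 on Tuesday, May 12.
    print(next_weekly_slot(datetime(2009, 5, 5, 20, 48), datetime(2009, 5, 9, 0, 0)))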
That way, the number of mentions of "EU parliament" (elections are due on June 7) can be compared on a weekly (7-day) basis, rather than on a 5-and-a-half-day basis. The 7-day interval removes any measurement bias from weekday/weekend variations.
Another advantage is that we can expect new dumps of svwiki by Wednesday lunch, and can plan our weekly projects accordingly.
This plan does not help the larger projects, which take many days to dump. They would still benefit from optimizations of the dump process itself. Right now enwiki is extracting "page abstracts for Yahoo" and will continue to do so until May 21. I really hope Yahoo appreciates this, or else the current dump should be advanced to its next stage to save days and weeks. Maybe the pages-articles.xml part of the dump can be produced on a regular weekly (or fortnightly) basis even for the larger projects, while the other parts are produced less often.
Lars wrote:
Now, to be even more useful, database dumps should be produced at *regular* intervals. That way, we can compare various measures such as article growth, link counts or usage of certain words, without having to factor the exact dump time into the count.
That would complicate matters further, though. Also, as each new dump of a given wiki takes a little longer than the last, this would mean lots of slack in the schedule and would force the servers to sit idle part of the time.
I hate to distract Tomasz from optimizing the dump process, so I'll postpone new feature requests, but an option to order gift wrappings for the dumps would be neat :-)
Erik Zachte
Now, to be even more useful, database dumps should be produced at *regular* intervals. That way, we can compare various measures such as article growth, link counts or usage of certain words, without having to factor the exact dump time into the count.
On a related note: I noticed that the meta-info dumps like stub-meta-history.xml.gz etc appear to be generated from the full history dump - and thus fail if the full history dump fails, and get delayed if the full history dump gets delayed.
There are a lot of things that can be done with the meta-info alone, and it seems that dump should be easy and fast to generate. So I propose generating it from the database directly, instead of making it depend on the full history dump, which is slow and the most likely to break.
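Roughly along these lines (a sketch only, assuming the standard page/revision tables and a read-only connection; it writes tab-separated metadata rather than the real stub XML, and the connection handling is illustrative):

    import MySQLdb  # assumes the MySQLdb driver; any DB API driver would do

    def dump_revision_metadata(host, dbname, out):
        conn = MySQLdb.connect(host=host, db=dbname, read_default_file="~/.my.cnf")
        cur = conn.cursor()
        # Metadata only, no revision text, so this stays cheap even where the
        # full-history dump takes days.
        cur.execute(
            "SELECT p.page_namespace, p.page_title, r.rev_id, r.rev_timestamp"
            " FROM revision r JOIN page p ON r.rev_page = p.page_id"
            " ORDER BY r.rev_page, r.rev_id")
        for row in cur:
            out.write("\t".join(str(col) for col in row) + "\n")
        conn.close()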
-- daniel
On Wed, May 13, 2009 at 10:13 AM, Daniel Kinzler daniel@brightbyte.de wrote:
Now, to be even more useful, database dumps should be produced at *regular* intervals. That way, we can compare various measures such as article growth, link counts or usage of certain words, without having to factor the exact dump time into the count.
On a related note: I noticed that the meta-info dumps like stub-meta-history.xml.gz etc appear to be generated from the full history dump - and thus fail if the full history dump fails, and get delayed if the full history dump gets delayed.
Is that something that changed? It used to be the other way around. pages-meta-history.xml.bz2 was generated from stub-meta-history.xml.gz
On May 13, 2009, at 7:13, Daniel Kinzler daniel@brightbyte.de wrote:
Now, to be even more useful, database dumps should be produced at *regular* intervals. That way, we can compare various measures such as article growth, link counts or usage of certain words, without having to factor the exact dump time into the count.
On a related note: I noticed that the meta-info dumps like stub-meta-history.xml.gz etc appear to be generated from the full history dump - and thus fail if the full history dump fails, and get delayed if the full history dump gets delayed.
Quite the opposite; the full history dump is generated from the stub skeleton.
-- brion
There are a lot of things that can be done with the meta-info alone, and it seems that dump should be easy and fast to generate. So I propose generating it from the database directly, instead of making it depend on the full history dump, which is slow and the most likely to break.
-- daniel
There is a thread on enwiki WP:VPT [1] speculating that the sluggish server performance that some people are seeing is being caused by the dumper working on enwiki.
This strikes me as implausible, but I thought I'd mention it here in case it could be true. I suppose it is at least possible to expand the dumper enough to have a noticeable effect on other aspects of site performance, but I wouldn't expect it to be likely.
-Robert Rohde
[1] http://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Slow_Server
On 5/13/09 9:06 PM, Robert Rohde wrote:
There is a thread on enwiki WP:VPT [1] speculating that the sluggish server performance that some people are seeing is being caused by the dumper working on enwiki.
This strikes me as implausible, but I thought I'd mention it here in case it could be true. I suppose it is at least possible to expand the dumper enough to have a noticeable effect on other aspects of site performance, but I wouldn't expect it to be likely.
I believe that refers to yesterday's replication lag on the machine running watchlist queries; the abstract dump process that was hitting that particular server was aborted yesterday.
-- brion
-Robert Rohde
[1] http://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Slow_Server
Brion Vibber wrote:
I believe that refers to yesterday's replication lag on the machine running watchlist queries; the abstract dump process that was hitting that particular server was aborted yesterday.
Is Yahoo still using those? Looking at the last successful one for enwiki, it looks like it took a little more than a day to generate. Combined with all the other projects, that seems like an awful lot of processing time spent for something of questionable utility to anyone but Yahoo.
On 5/14/09 2:17 PM, Alex wrote:
Brion Vibber wrote:
I believe that refers to yesterday's replication lag on the machine running watchlist queries; the abstract dump process that was hitting that particular server was aborted yesterday.
Is Yahoo still using those? Looking at the last successful one for enwiki, it looks like it took a little more than a day to generate. Combined with all the other projects, that seems like an awful lot of processing time spent for something of questionable utility to anyone but Yahoo.
Actually, yes. :) I've occasionally heard from other folks using them for stuff, but indeed Yahoo is still grabbing them.
The script needs some cleaning up, and it wouldn't hurt to rearchitect how it's generated in general. (First-sentence summary extraction is also being done in OpenSearchXml for IE 8's search support, and I've improved the implementation there. It should get merged back, and probably merged into core so we can make the extracts more generally available for other uses.)
-- brion
Brion Vibber wrote:
On a related note: I noticed that the meta-info dumps like stub-meta-history.xml.gz etc appear to be generated from the full history dump - and thus fail if the full history dump fails, and get delayed if the full history dump gets delayed.
Quite the opposite; the full history dump is generated from the stub skeleton.
Good to know, thanks for clarifying.
-- daniel
Tomasz Finc wrote:
Tomasz Finc wrote:
Russell Blau wrote:
"Erik Zachte" erikzachte@infodisiac.com wrote in message news:002d01c9cd8d$3355beb0$9a013c10$@com...
Tomasz, the amount of dump power that you managed to activate is impressive. 136 dumps yesterday, today already 110 :-) Out of 760 total. Of course there are small and large dumps, but this is very encouraging.
Yes, thank you Tomasz for your attention to this. The commonswiki process looks like it *might* be dead, by the way.
Don't think so, as I actively see it being updated. It's currently set to finish its second-to-last step on 2009-05-06 02:53:21.
No one touch anything while it's still going ;)
Commons finished just fine, along with every single one of the other small and mid-size wikis waiting to be picked up. Now we're just left with the big wikis to finish.
First, many thanks to Tomasz for the now-running system.
Now there are two dumps of frwiki running at the same time: http://download.wikipedia.org/frwiki/20090509/ and http://download.wikipedia.org/frwiki/20090506/ I don't know if this was intended. Usually this should not happen.
Best regards
Andim
Andreas Meier wrote:
Tomasz Finc wrote:
Tomasz Finc wrote:
Russell Blau wrote:
"Erik Zachte" erikzachte@infodisiac.com wrote in message news:002d01c9cd8d$3355beb0$9a013c10$@com...
Tomasz, the amount of dump power that you managed to activate is impressive. 136 dumps yesterday, today already 110 :-) Out of 760 total. Of course there are small and large dumps, but this is very encouraging.
Yes, thank you Tomasz for your attention to this. The commonswiki process looks like it *might* be dead, by the way.
Don't think so, as I actively see it being updated. It's currently set to finish its second-to-last step on 2009-05-06 02:53:21.
No one touch anything while it's still going ;)
Commons finished just fine, along with every single one of the other small and mid-size wikis waiting to be picked up. Now we're just left with the big wikis to finish.
First, many thanks to Tomasz for the now-running system.
Now there are two dumps of frwiki running at the same time: http://download.wikipedia.org/frwiki/20090509/ and http://download.wikipedia.org/frwiki/20090506/ I don't know if this was intended. Usually this should not happen.
This has been dealt with. Let's take any further operations conversations over to Xmldatadumps-admin-l@lists.wikimedia.org
--tomasz
In my opinion, fragmentation of conversations onto ever more mailing lists discourages contribution.
On Mon, May 11, 2009 at 1:04 PM, Tomasz Finc tfinc@wikimedia.org wrote:
Andreas Meier wrote:
Tomasz Finc wrote:
Tomasz Finc wrote:
Russell Blau wrote:
"Erik Zachte" erikzachte@infodisiac.com wrote in message news:002d01c9cd8d$3355beb0$9a013c10$@com...
Tomasz, the amount of dump power that you managed to activate is impressive. 136 dumps yesterday, today already 110 :-) Out of 760 total. Of course there are small and large dumps, but this is very encouraging.
Yes, thank you Tomasz for your attention to this. The commonswiki process looks like it *might* be dead, by the way.
Don't think so, as I actively see it being updated. It's currently set to finish its second-to-last step on 2009-05-06 02:53:21.
No one touch anything while it's still going ;)
Commons finished just fine, along with every single one of the other small and mid-size wikis waiting to be picked up. Now we're just left with the big wikis to finish.
First, many thanks to Tomasz for the now-running system.
Now there are two dumps of frwiki running at the same time: http://download.wikipedia.org/frwiki/20090509/ and http://download.wikipedia.org/frwiki/20090506/ I don't know if this was intended. Usually this should not happen.
This has been dealt with. Let's take any further operations conversations over to Xmldatadumps-admin-l@lists.wikimedia.org
--tomasz
On Mon, May 11, 2009 at 3:27 PM, Brian Brian.Mingus@colorado.edu wrote:
In my opinion, fragmentation of conversations onto ever more mailing lists discourages contribution.
I have to agree that I don't think the dump discussion traffic seemed large enough to warrant a whole new mailing list.
Aryeh Gregor wrote:
On Mon, May 11, 2009 at 3:27 PM, Brian Brian.Mingus@colorado.edu wrote:
In my opinion, fragmentation of conversations onto ever more mailing lists discourages contribution.
I have to agree that I don't think the dump discussion traffic seemed large enough to warrant a whole new mailing list.
If we find that doesn't work then we'll steer the conversation back to wikitech.
But here is my reasoning:
The admin list was meant to receive any and all automated mails from the backup system, and I didn't want to busy the readers of wikitech-l with that noise. Previously there was a single recipient of any failures, which was not very scalable or transparent.
The discussion list was meant to capture consumers who have approached me who are active users of the dumps but have no direct involvement with MediaWiki and are not regular participants of this list. This includes researchers, search engines, etc., who are not concerned with all the other conversations on wikitech and simply want updates on any changes within the dumps system.
--tomasz
Hoi, This is the kind of news that will make many people happy. Obviously what everyone is waiting for is the en.wp dump to finish .. :) But it is great to have many moments to be happy. Thanks, GerardM
2009/5/5 Erik Zachte erikzachte@infodisiac.com
Tomasz, the amount of dump power that you managed to activate is impressive. 136 dumps yesterday, today already 110 :-) Out of 760 total. Of course there are small and large dumps, but this is very encouraging.
Erik Zachte
Hi Tomasz, Any ideas about a fresher dump of enwiki-meta-pages-history?
bilal
2009/5/5 Erik Zachte erikzachte@infodisiac.com
Tomasz, the amount of dump power that you managed to activate is impressive. 136 dumps yesterday, today already 110 :-) Out of 760 total. Of course there are small and large dumps, but this is very encouraging.
Erik Zachte
2009/5/1 Russell Blau russblau@hotmail.com:
But, on the bright side, every database in the system now has a dump that was completed within the last nine hours (roughly). When's the last time you could say *that*? :-)
Yes, and those dumps are 20-byte .gz files. Oops.
Roan Kattouw (Catrope)
Roan Kattouw wrote:
2009/5/1 Russell Blau russblau@hotmail.com:
But, on the bright side, every database in the system now has a dump that was completed within the last nine hours (roughly). When's the last time you could say *that*? :-)
Yes, and those dumps are 20-byte .gz files. Oops.
Roan Kattouw (Catrope)
Completed and with an outstanding compression rate! ;)
2009/5/1 Russell Blau russblau@hotmail.com:
But, on the bright side, every database in the system now has a dump that was completed within the last nine hours (roughly). When's the last time you could say *that*? :-)
A good en:wp dump is the sort of thing warranting announcement on the tech blog. Heck, the *main* blog.
- d.
On Mon, May 4, 2009 at 8:52 PM, David Gerard dgerard@gmail.com wrote:
2009/5/1 Russell Blau russblau@hotmail.com:
But, on the bright side, every database in the system now has a dump that was completed within the last nine hours (roughly). When's the last time you could say *that*? :-)
A good en:wp dump is the sort of thing warranting announcement on the tech blog. Heck, the *main* blog.
Why don't you make up the press release, David?
Might wait for them to actually complete properly.
-Chad
On May 4, 2009 9:05 PM, "Anthony" wikimail@inbox.org wrote:
On Mon, May 4, 2009 at 8:52 PM, David Gerard dgerard@gmail.com wrote:
2009/5/1 Russell Blau <ru...
Why don't you make up the press release, David?
That wouldn't be nearly as fun to watch.
On Mon, May 4, 2009 at 9:56 PM, Chad innocentkiller@gmail.com wrote:
Might wait for them to actually complete properly.
-Chad
On May 4, 2009 9:05 PM, "Anthony" wikimail@inbox.org wrote:
On Mon, May 4, 2009 at 8:52 PM, David Gerard dgerard@gmail.com wrote:
2009/5/1 Russell Blau <ru...
Why don't you make up the press release, David?