Dear all, especially Anthony and Platonides,
I'm not techy, so why hasn't it been possible to have a non-corrupt dump (one that includes history) in a long time? A professor of mine asked whether the problem could be man(person)-power, and whether it would be interesting/useful to have the university provide a programmer to help the dump happen.
Also, now I've got a file from 2006, but I still wonder whether there is any place where one can access old dumps; these will/could be very important research-wise.
And last but not least: if the dumps don't work, then it is very important to be able to dump some articles with their full histories in other ways. I make my plea again: does anyone know who put in the block so that export only allows 100 revisions? Any way to work around that? Would it be possible to make an exception to get the data for a research study?
Thanks! Rut
I'm not techy, so why hasn't it been possible to have a non-corrupt dump (one that includes history) in a long time? A professor of mine asked whether the problem could be man(person)-power, and whether it would be interesting/useful to have the university provide a programmer to help the dump happen.
I've been wondering why we've been having such trouble with dumps myself... anyone?
Also, now I've got a file from 2006, but I still wonder whether there is any place where one can access old dumps; these will/could be very important research-wise.
Only the most recent full dump is worth keeping, since it includes everything that would be in an older dump. The other benefit of removing old dumps is that it makes oversight (the ability to remove revisions from the history) more effective.
And last but not least: if the dumps don't work, then it is very important to be able to dump some articles with their full histories in other ways. I make my plea again: does anyone know who put in the block so that export only allows 100 revisions? Any way to work around that? Would it be possible to make an exception to get the data for a research study?
I'm sure something can be arranged. This mailing list is probably the best way to contact someone able to arrange it.
I was just thinking to myself: it would probably be a good idea to have a notification system, so that a user can find out when a dump completes. I don't know if this is already in place, but it would probably just involve another mailing list where users can be notified by a representative or a developer about a completed dump, with a link to the directory and/or download.
Does anyone agree with this? I'd certainly subscribe to it if available.
Kind regards,
E English Wikipedia e.wikipedia@gmail.com
-------------------------------------------------- From: "Thomas Dalton" thomas.dalton@gmail.com Sent: Sunday, November 18, 2007 9:20 AM To: "Wikimedia developers" wikitech-l@lists.wikimedia.org Subject: Re: [Wikitech-l] Why is difficult to have a non-corrupt dump -other ways of getting the information
I'm not techy, so why hasn't it been possible to have a non-corrupt dump (one that includes history) in a long time? A professor of mine asked whether the problem could be man(person)-power, and whether it would be interesting/useful to have the university provide a programmer to help the dump happen.
I've been wondering why we've been having such trouble with dumps myself... anyone?
Also, now I've got a file from 2006, but I still wonder whether there is any place where one can access old dumps; these will/could be very important research-wise.
Only the most recent full dump is worth keeping, since it includes everything that would be in an older dump. The other benefit of removing old dumps is that it makes oversight (the ability to remove revisions from the history) more effective.
And last but not least: if the dumps don't work, then it is very important to be able to dump some articles with their full histories in other ways. I make my plea again: does anyone know who put in the block so that export only allows 100 revisions? Any way to work around that? Would it be possible to make an exception to get the data for a research study?
I'm sure something can be arranged. This mailing list is probably the best way to contact someone able to arrange it.
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/wikitech-l
E wrote:
I was just thinking to myself: it would probably be a good idea to have a notification system, so that a user can find out when a dump completes. I don't know if this is already in place, but it would probably just involve another mailing list where users can be notified by a representative or a developer about a completed dump, with a link to the directory and/or download.
Does anyone agree with this? I'd certainly subscribe to it if available.
There is already an RSS feed, IIRC; however, the problem is not notification of a completed dump, it is identifying a complete, non-corrupt dump, which is more difficult.
MinuteElectron.
On 11/17/07, Di (rut) vulpeto@gmail.com wrote:
Dear all, especially Anthony and Platonides,
I'm not techy, so why hasn't it been possible to have a non-corrupt dump (one that includes history) in a long time? A professor of mine asked whether the problem could be man(person)-power, and whether it would be interesting/useful to have the university provide a programmer to help the dump happen.
AFAIK, basically, it's a manpower problem. It probably needs a root to set it up, or at least a shell user, and there aren't so many of those. I think Brion is the one who's tended to deal with it in the past, and he has no time. I don't know if outside help would be appreciated or not, you'd have to ask . . . uh, Brion, probably.
On Nov 17, 2007 6:10 PM, Di (rut) vulpeto@gmail.com wrote:
Dear all, especially Anthony and Platonides,
I'm not techy, so why hasn't it been possible to have a non-corrupt dump (one that includes history) in a long time? A professor of mine asked whether the problem could be man(person)-power, and whether it would be interesting/useful to have the university provide a programmer to help the dump happen.
I believe Brion Vibber is the only one who can possibly answer this question. Does anyone else even have access to the dump server and permission to fix the problem?
I've about given up on there being a valid dump any time soon.
And last but not least: if the dumps don't work, then it is very important to be able to dump some articles with their full histories in other ways. I make my plea again: does anyone know who put in the block so that export only allows 100 revisions? Any way to work around that? Would it be possible to make an exception to get the data for a research study?
Email Brion Vibber. If he doesn't respond within a few days, email Sue Gardner. If she doesn't respond in a few days, let us know on here once again. I don't know their email addresses offhand, but they're probably not hard to find. If you do get a response, please let me or the list know.
In the meantime, why not just write a script to automate exporting 100 revisions at a time?
Anthony wrote:
In the meantime, why not just write a script to automate exporting 100 revisions at a time?
This is not possible due to the way Special:Export works: you can only specify a page, not a revision of that page, so you will always get the first revision (unless there have been some changes since I last used it). Alternatively, there is the toolserver, which has every revision on it, so you could always do a small data dump of a limited subset from there, import it into your own wiki, then maybe export it as XML, depending on what format your research application takes.
MinuteElectron.
On Nov 18, 2007 8:39 PM, MinuteElectron minuteelectron@googlemail.com wrote:
Anthony wrote:
In the meantime, why not just write a script to automate exporting 100 revisions at a time?
This is not possible due to the way Special:Export works: you can only specify a page, not a revision of that page, so you will always get the first revision (unless there have been some changes since I last used it). Alternatively, there is the toolserver, which has every revision on it, so you could always do a small data dump of a limited subset from there, import it into your own wiki, then maybe export it as XML, depending on what format your research application takes.
Or use the API:
http://www.mediawiki.org/wiki/API
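For instance, a script paging through a page's full history via api.php might look like the following sketch. The parameter and continuation names (rvlimit, rvcontinue, and friends) are assumptions from the query module's revisions submodule; double-check them against the live API documentation before relying on them.

```python
# Sketch: fetch a page's full revision history through api.php in batches,
# instead of Special:Export. Parameter names are assumptions to verify.
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

def revision_query(title, limit=100, cont=None):
    """Build the query parameters for one batch of revisions."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvlimit": str(limit),
        "rvprop": "ids|timestamp|user|content",
        "format": "json",
    }
    if cont is not None:
        params["rvcontinue"] = cont  # continuation token from the previous reply
    return params

def fetch_all_revisions(title):
    """Yield revision dicts, following the continuation until exhausted."""
    cont = None
    while True:
        url = API + "?" + urllib.parse.urlencode(revision_query(title, cont=cont))
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        for page in data["query"]["pages"].values():
            yield from page.get("revisions", [])
        cont = data.get("continue", {}).get("rvcontinue")
        if cont is None:
            return
```

Run in a single thread, as requested later in this thread, and this stays well within polite usage.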
Di (rut) wrote:
Dear all, especially Anthony and Platonides,
I hope the heroic duo will give their blessing to this post.
I'm not techy, so why hasn't it been possible to have a non-corrupt dump (one that includes history) in a long time? A professor of mine asked whether the problem could be man(person)-power, and whether it would be interesting/useful to have the university provide a programmer to help the dump happen.
In my opinion, it would be a lot easier to generate a full dump if it were split into multiple XML files for each wiki. Then the job could be checkpointed at the file level. Checkpoint/resume is quite difficult with the current single-file architecture.
Tolerant parsers on the client side would help a bit. A dump shouldn't be considered "failed" just because it has a region of garbage and some unclosed tags in the middle of the file.
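A tolerant reader along those lines could, as a rough sketch, scan for individual <page> spans and drop only the ones that fail to parse. The element names follow the dump schema; the recovery heuristic itself is deliberately simplistic.

```python
# A minimal "tolerant" reader: instead of aborting on corrupt XML, scan for
# <page>...</page> spans and parse each independently, skipping any span
# that fails to parse.
import re
import xml.etree.ElementTree as ET

PAGE_RE = re.compile(rb"<page>.*?</page>", re.DOTALL)

def salvage_pages(raw: bytes):
    """Yield parsed <page> elements, ignoring garbage between or inside them."""
    for match in PAGE_RE.finditer(raw):
        try:
            yield ET.fromstring(match.group(0))
        except ET.ParseError:
            continue  # corrupt region: skip this page rather than fail the dump
```

A real implementation would stream instead of holding the file in memory, but the principle is the same: a region of garbage costs you some pages, not the whole dump.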
Also, now I've got a file from 2006, but I still wonder whether there is any place where one can access old dumps; these will/could be very important research-wise.
The Foundation does not host old dumps. Maybe someone else has one.
And last but not least: if the dumps don't work, then it is very important to be able to dump some articles with their full histories in other ways. I make my plea again: does anyone know who put in the block so that export only allows 100 revisions? Any way to work around that? Would it be possible to make an exception to get the data for a research study?
There's an offset parameter which allows you to get specified revisions or revision ranges. Read the relevant code in includes/SpecialExport.php before use, it's a bit counterintuitive (buggy?).
-- Tim Starling
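If the offset parameter behaves as Tim describes, a 100-revisions-at-a-time pager might be sketched as below. The form field names (pages, offset, limit, action) are assumptions; read includes/SpecialExport.php before using any of this, given the counterintuitive behaviour noted above.

```python
# Hypothetical pager over Special:Export's offset/limit parameters. The
# field names below are assumptions to verify against SpecialExport.php.
import re
import time
import urllib.parse
import urllib.request

EXPORT = "https://en.wikipedia.org/w/index.php?title=Special:Export"

def export_form(title, offset, limit=100):
    """Form data for one slice of a page's history."""
    return urllib.parse.urlencode({
        "pages": title,
        "offset": offset,  # e.g. timestamp of the last revision already fetched
        "limit": str(limit),
        "action": "submit",
    }).encode()

def export_history(title, start="1", limit=100):
    """Yield XML slices until a batch comes back smaller than `limit`."""
    offset = start
    while True:
        req = urllib.request.Request(EXPORT, data=export_form(title, offset, limit))
        xml = urllib.request.urlopen(req).read()
        yield xml
        stamps = re.findall(rb"<timestamp>([^<]+)</timestamp>", xml)
        if len(stamps) < limit:
            return
        offset = stamps[-1].decode()
        time.sleep(1)  # single thread, politely paced
```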
On Nov 18, 2007 8:52 AM, Tim Starling tstarling@wikimedia.org wrote:
Di (rut) wrote:
Dear all, especially Anthony and Platonides,
I hope the heroic duo will give their blessing to this post.
My name is Anthony DiPierro, and I approve this message :).
I'm not techy, so why hasn't it been possible to have a non-corrupt dump (one that includes history) in a long time? A professor of mine asked whether the problem could be man(person)-power, and whether it would be interesting/useful to have the university provide a programmer to help the dump happen.
In my opinion, it would be a lot easier to generate a full dump if it were split into multiple XML files for each wiki. Then the job could be checkpointed at the file level. Checkpoint/resume is quite difficult with the current single-file architecture.
Tolerant parsers on the client side would help a bit. A dump shouldn't be considered "failed" just because it has a region of garbage and some unclosed tags in the middle of the file.
There are a ton of possible solutions. Do you have access to the dump server and permission to implement any of them? That seems to be the bottleneck.
Personally, I think a good backward-compatible improvement would be to regenerate only the parts of the bzip2 file which have changed. Bzip2 resets its compression every 900K or so of uncompressed text, and the specification treats the concatenation of two bzip2 files as decompressing to the same result as the bzip2 of the concatenation of the two uncompressed files (I hope that made sense). So if the files are ordered by title and then by revision time, there should be a whole lot of chunks which don't need to be uncompressed/recompressed every dump, and from what I've read, compression is the current bottleneck.
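The concatenation property being relied on here is easy to verify; Python's bz2 module stands in for the command-line tool:

```python
# Demonstrate that two independently compressed bzip2 streams, concatenated,
# decompress to the concatenation of the two original texts -- the property
# that would let unchanged chunks be reused verbatim between dumps.
import bz2

old_chunk = b"<revision>unchanged history</revision>\n" * 200
new_chunk = b"<revision>freshly edited page</revision>\n" * 200

combined = bz2.compress(old_chunk) + bz2.compress(new_chunk)
assert bz2.decompress(combined) == old_chunk + new_chunk
```

(Note that not every decompressor handles multi-stream files; the reference bzip2 tool and Python's bz2.decompress do.)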
But, well, I don't have access to the dump server, or even to the toolserver, so I couldn't implement it even if I did have the time.
And last but not least: if the dumps don't work, then it is very important to be able to dump some articles with their full histories in other ways. I make my plea again: does anyone know who put in the block so that export only allows 100 revisions? Any way to work around that? Would it be possible to make an exception to get the data for a research study?
There's an offset parameter which allows you to get specified revisions or revision ranges. Read the relevant code in includes/SpecialExport.php before use, it's a bit counterintuitive (buggy?).
How much are we allowed to use this without getting blocked?
Anthony wrote:
There are a ton of possible solutions. Do you have access to the dump server and permission to implement any of them? That seems to be the bottleneck.
Yes, I have access, but I don't have time. You don't need access to the dump server to implement improvements; it's all open source:
http://svn.wikimedia.org/viewvc/mediawiki/trunk/backup/
You can just submit a patch.
Personally, I think a good backward-compatible improvement would be to regenerate only the parts of the bzip2 file which have changed. Bzip2 resets its compression every 900K or so of uncompressed text, and the specification treats the concatenation of two bzip2 files as decompressing to the same result as the bzip2 of the concatenation of the two uncompressed files (I hope that made sense). So if the files are ordered by title and then by revision time, there should be a whole lot of chunks which don't need to be uncompressed/recompressed every dump, and from what I've read, compression is the current bottleneck.
That's an interesting theory.
But, well, I don't have access to the dump server, or even to the toolserver, so I couldn't implement it even if I did have the time.
Couldn't you just set up a test server at home, operating on a reduced data set?
And last but not least: if the dumps don't work, then it is very important to be able to dump some articles with their full histories in other ways. I make my plea again: does anyone know who put in the block so that export only allows 100 revisions? Any way to work around that? Would it be possible to make an exception to get the data for a research study?
There's an offset parameter which allows you to get specified revisions or revision ranges. Read the relevant code in includes/SpecialExport.php before use, it's a bit counterintuitive (buggy?).
How much are we allowed to use this without getting blocked?
Please don't walk that line, if you stop when a sysadmin notices that you're slowing down the servers, you've gone way too far. Stick to a single thread.
-- Tim Starling
On Nov 18, 2007 10:16 AM, Tim Starling tstarling@wikimedia.org wrote:
Anthony wrote:
There are a ton of possible solutions. Do you have access to the dump server and permission to implement any of them? That seems to be the bottleneck.
Yes, I have access, but I don't have time. You don't need access to the dump server to implement improvements; it's all open source:
http://svn.wikimedia.org/viewvc/mediawiki/trunk/backup/
You can just submit a patch.
I wasn't aware that the source code to the backup server was open. Now I guess I can't complain unless I have code :).
Personally, I think a good backward-compatible improvement would be to regenerate only the parts of the bzip2 file which have changed. Bzip2 resets its compression every 900K or so of uncompressed text, and the specification treats the concatenation of two bzip2 files as decompressing to the same result as the bzip2 of the concatenation of the two uncompressed files (I hope that made sense). So if the files are ordered by title and then by revision time, there should be a whole lot of chunks which don't need to be uncompressed/recompressed every dump, and from what I've read, compression is the current bottleneck.
That's an interesting theory.
But, well, I don't have access to the dump server, or even to the toolserver, so I couldn't implement it even if I did have the time.
Couldn't you just set up a test server at home, operating on a reduced data set?
Yes, I could. One thing stopping me has been that I didn't have much of a clue how the dumps were actually being made. Now that I know about the source code, maybe I can do a little better.
I already have most of the random-access *reading* completed. It was a simple hack to bzip2recover (which has a very small source code file). Don't credit me with the idea, though; I stole it from Thanassis Tsiodras (http://www.softlab.ntua.gr/~ttsiod/buildWikipediaOffline.html).
Read the relevant code in includes/SpecialExport.php before use, it's a bit counterintuitive (buggy?).
How much are we allowed to use this without getting blocked?
Please don't walk that line, if you stop when a sysadmin notices that you're slowing down the servers, you've gone way too far. Stick to a single thread.
I can handle an awful lot in a single thread, using the API. I have no idea if it'd hurt the server to do so, though.
On Nov 18, 2007 11:20 AM, Anthony wikimail@inbox.org wrote:
On Nov 18, 2007 10:16 AM, Tim Starling tstarling@wikimedia.org wrote:
Anthony wrote:
There are a ton of possible solutions. Do you have access to the dump server and permission to implement any of them? That seems to be the bottleneck.
Yes, I have access, but I don't have time. You don't need access to the dump server to implement improvements; it's all open source:
http://svn.wikimedia.org/viewvc/mediawiki/trunk/backup/
You can just submit a patch.
I wasn't aware that the source code to the backup server was open. Now I guess I can't complain unless I have code :).
Oh, God, it's in python? Nevermind.
On 11/18/07, Anthony wikimail@inbox.org wrote:
Oh, God, it's in python? Nevermind.
What, you prefer PHP?
On Nov 18, 2007 1:32 PM, Simetrical Simetrical+wikilist@gmail.com wrote:
On 11/18/07, Anthony wikimail@inbox.org wrote:
Oh, God, it's in python? Nevermind.
What, you prefer PHP?
At least I can understand PHP. But, it turns out most of it *is* in PHP, mostly these two files:
*http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/maintenance/dumpBacku... *http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/maintenance/dumpTextP...
I read through dumpBackup.php, which seems pretty straightforward: it just does a "SELECT /*! STRAIGHT_JOIN */ * FROM page, revision, text WHERE page_id=rev_page ORDER by page_id" and puts it into stub-meta-history. dumpTextPass will then go through stub-meta-history and fill in the actual text, but I haven't read that yet.
For the immediate future, a way to restart a broken dump is probably the most important. Find the last ~900K segment of the bz2 file, remove it, add the bzip2 end-of-file information, then concatenate the rest of the dump? Sound reasonable?
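Assuming the dump were written as a sequence of complete bzip2 streams (as proposed elsewhere in this thread, rather than the current single-stream layout), trimming a broken dump back to its last complete stream could be sketched like this:

```python
# Sketch: given a file made of concatenated bzip2 streams, keep only the
# prefix of complete streams, dropping a truncated or corrupt tail. Assumes
# the multi-stream layout proposed in this thread, not the current format.
import bz2

def recoverable_prefix(data: bytes) -> bytes:
    """Return the longest prefix of `data` made of complete bz2 streams."""
    good_end = 0
    rest = data
    while rest:
        dec = bz2.BZ2Decompressor()
        try:
            dec.decompress(rest)
        except OSError:
            break  # corrupt stream: stop before it
        if not dec.eof:
            break  # stream truncated mid-way: drop it
        good_end += len(rest) - len(dec.unused_data)
        rest = dec.unused_data
    return data[:good_end]
```

The dump process could then resume from the last page/revision id contained in the recovered prefix.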
On Sun, 2007-11-18 at 13:53 -0500, Anthony wrote:
For the immediate future, a way to restart a broken dump is probably the most important. Find the last ~900K segment of the bz2 file, remove it, add the bzip2 end-of-file information, then concatenate the rest of the dump? Sound reasonable?
I raised something similar about 3-4 years ago, regarding the --rsyncable option of gzip, back when the Wikipedia dump servers allowed us to rsync the dumps instead of fetching them over straight HTTP.
Perhaps something similar should be revisited on the server-side, while preparing and compressing those dumps?
(waits for brion to counter with something valid to pull out the rug on my idea =)
On Sun, Nov 18, 2007 at 02:11:49PM -0500, David A. Desrosiers wrote:
On Sun, 2007-11-18 at 13:53 -0500, Anthony wrote:
For the immediate future, a way to restart a broken dump is probably the most important. Find the last ~900K segment of the bz2 file, remove it, add the bzip2 end-of-file information, then concatenate the rest of the dump? Sound reasonable?
I raised something similar about 3-4 years ago, regarding the --rsyncable option of gzip, back when the Wikipedia dump servers allowed us to rsync the dumps instead of fetching them over straight HTTP.
Perhaps something similar should be revisited on the server-side, while preparing and compressing those dumps?
(waits for brion to counter with something valid to pull out the rug on my idea =)
Well, I gather the new version of rsync is *much* smarter than the old versions were about rilly, rilly big files, so perhaps this is worth revisiting.
Cheers, -- jra
On 11/18/07, Jay R. Ashworth jra@baylink.com wrote:
Well, I gather the new version of rsync is *much* smarter than the old versions were about rilly, rilly big files, so perhaps this is worth revisiting.
Really big files aren't the issue; it's a really large number of files that's the issue. rsync < 3.0 will first create a list, in memory, of all files it's going to transfer. Only once it's made the list will it start the transfer. When I moved my server and used rsync to copy the entire contents of the old filesystem to the new server, it used several hundred MB of memory before it even started transferring files. The same happens for *image* files, since there are so many.
For a single large file, rsync's clever rolling diff algorithm might or might not be entirely optimal, but I haven't heard of either complaints against it or improvements in recent versions.
On Sun, 2007-11-18 at 14:54 -0500, Simetrical wrote:
For a single large file, rsync's clever rolling diff algorithm might or might not be entirely optimal, but I haven't heard of either complaints against it or improvements in recent versions.
You probably want the -S option, as well as -P for those files.
On Sun, 2007-11-18 at 11:20 -0500, Anthony wrote:
Don't credit me with the idea, though; I stole it from Thanassis Tsiodras (http://www.softlab.ntua.gr/~ttsiod/buildWikipediaOffline.html)
Looks interesting, though I'm skeptical of his assertion here:
"Orders of magnitude faster to install (a matter of hours) compared to loading the "dump" into MySQL"
I've posted my results here many times, and the largest wiki out there (enwiki) takes ~40 minutes, _max_ to load up into a clean, cold-booted MySQL instance, from the XML source, using mysql and redirection (not using mwdumper in a pipe).
The target machine is a dual-core AMD64/2.4 machine with 2gb RAM using a single SATA drive. Not a powerhouse by any means. Give me a faster machine with a lot more RAM and faster disks, and I bet I could cut that down to less than 20 minutes.
So I think his logic is backwards. If it takes "a matter of hours" to install his version, that is SIGNIFICANTLY slower than using mwdumper and mysql directly, on the largest wiki dump available.
I can handle an awful lot in a single thread, using the API. I have no idea if it'd hurt the server to do so, though.
Ideally, this should work _on_ the dump, not on the live server(s).
I can handle an awful lot in a single thread, using the API. I have no idea if it'd hurt the server to do so, though.
Ideally, this should work _on_ the dump, not on the live server(s).
The problem is, there is no dump, at least not a full history one in 2007.
Tim Starling wrote:
Di (rut) wrote:
Dear all, especially Anthony and Platonides,
I hope the heroic duo will give their blessing to this post.
You're welcome to kick in. :)
I'm not techy, so why hasn't it been possible to have a non-corrupt dump (one that includes history) in a long time? A professor of mine asked whether the problem could be man(person)-power, and whether it would be interesting/useful to have the university provide a programmer to help the dump happen.
In my opinion, it would be a lot easier to generate a full dump if it were split into multiple XML files for each wiki. Then the job could be checkpointed at the file level. Checkpoint/resume is quite difficult with the current single-file architecture.
I made a proposal along those lines last month: http://thread.gmane.org/gmane.science.linguistics.wikipedia.technical/34547 You're also welcome to comment on it ;) Although the main question seems to be whether the compression of the files is good enough... The acceptable compression level varies with things like the WMF disk space available for dumps and the need for a better dump system.
Anthony wrote:
So if the files are ordered by title then by revision time there should be a whole lot of chunks which don't need to be uncompressed/recompressed every dump, and from what I've read compression is the current bottleneck.
The backup is based on having it sorted by id. Moreover, even changing that (i.e. rewriting most of the code), you'd need to insert in the middle whenever a page gets a new revision.
On Nov 18, 2007 3:33 PM, Platonides Platonides@gmail.com wrote:
Anthony wrote:
So if the files are ordered by title then by revision time there should be a whole lot of chunks which don't need to be uncompressed/recompressed every dump, and from what I've read compression is the current bottleneck.
The backup is based on having it sorted by id. Moreover, even changing that (i.e. rewriting most of the code), you'd need to insert in the middle whenever a page gets a new revision.
It's sorted by page_id, so it's fine. Probably would benefit from rewriting the code though, at least porting it to C.
You'd rewrite the entire file, just not recompress all of it. Partial chunks (like at the end of a page_id) would have to be uncompressed and recompressed, but fortunately the bzip2 spec allows for small chunks.
If I have free time some weekend I'll throw together a proof of concept. But for now I think the more pressing issue is allowing resumption of broken dumps.
As for rsync, I don't see the point. The HTTP protocol allows random file access.
On 11/18/07, Anthony wikimail@inbox.org wrote:
As for rsync, I don't see the point. The HTTP protocol allows random file access.
rsync can more or less efficiently diff two remote files, without transmitting anywhere near the entire file over the network. I suppose this may have been why it was suggested.
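The trick behind that efficiency is rsync's rolling weak checksum: the sum over a window can be slid one byte in O(1), so block matches can be found at any offset without rescanning. A minimal sketch of the arithmetic (not rsync's exact constants or its strong-hash confirmation step):

```python
# Toy version of a rolling weak checksum in the spirit of rsync's: two 16-bit
# sums that can be updated in O(1) as the window slides one byte.
def weak_sum(block):
    a = sum(block) & 0xFFFF
    b = sum((len(block) - i) * x for i, x in enumerate(block)) & 0xFFFF
    return (b << 16) | a

def roll(old_sum, out_byte, in_byte, size):
    """Slide the window one byte: drop out_byte on the left, add in_byte."""
    a = old_sum & 0xFFFF
    b = (old_sum >> 16) & 0xFFFF
    a = (a - out_byte + in_byte) & 0xFFFF
    b = (b - size * out_byte + a) & 0xFFFF
    return (b << 16) | a
```

Rolling the sum across a window always matches recomputing it from scratch, which is what makes scanning a large file for matching blocks cheap.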
On Nov 18, 2007 4:58 PM, Simetrical Simetrical+wikilist@gmail.com wrote:
On 11/18/07, Anthony wikimail@inbox.org wrote:
As for rsync, I don't see the point. The HTTP protocol allows random file access.
rsync can more or less efficiently diff two remote files, without transmitting anywhere near the entire file over the network. I suppose this may have been why it was suggested.
There's no need to use rsync to create a diff over the network when both are already available locally.
On 11/18/07, Anthony wikimail@inbox.org wrote:
On Nov 18, 2007 4:58 PM, Simetrical Simetrical+wikilist@gmail.com wrote:
On 11/18/07, Anthony wikimail@inbox.org wrote:
As for rsync, I don't see the point. The HTTP protocol allows random file access.
rsync can more or less efficiently diff two remote files, without transmitting anywhere near the entire file over the network. I suppose this may have been why it was suggested.
There's no need to use rsync to create a diff over the network when both are already available locally.
I don't think anyone was talking about using rsync locally? Actually I'm not sure what anyone was talking about using rsync for. Looking back, I'm not totally sure anyone *was* talking about using rsync as anything other than a historical matter.
On Nov 18, 2007 4:51 PM, Anthony wikimail@inbox.org wrote:
But for now I think the more pressing issue is allowing resumption of broken dumps.
Hmm. It looks to me like only 1% (93383/10633249) were dumped before the most recent failure. Is that number anywhere close to true? If so, maybe resumption of broken dumps isn't the most pressing issue. How long do these history dumps actually take? What are the uncompressed and compressed sizes?
Di (rut) wrote:
Dear all, especially Anthony and Platonides,
I'm not techy, so why hasn't it been possible to have a non-corrupt dump (one that includes history) in a long time? A professor of mine asked whether the problem could be man(person)-power, and whether it would be interesting/useful to have the university provide a programmer to help the dump happen.
See my blog posts discussing this matter:
http://leuksman.com/log/2007/10/02/wiki-data-dumps/
http://leuksman.com/log/2007/10/14/incremental-dumps/
http://leuksman.com/log/2007/10/29/wiki-dumps-in-dump-revision-diffs/
The general problem is that there's a lot of data and compressing it takes an ungodly amount of time. When it takes forever to run, you're more likely to hit some cute little error in the middle which causes the process to fail.
Either we need to make the process more resistant to problems or we need to speed it up a lot, or both.
Splitting up the dump into smaller pieces which can be checkpointed (Tim's suggestion), or a recoverable version of the grab-text-from-the-database subprocess (my suggestion) would allow a dump run broken by a lost database connection to continue to completion. (These are not mutually exclusive options.)
The cost of splitting the dump is complication for users -- more files to fetch, more difficulty for automation, possibly changes to client scripts required. But it's also a popular idea to have smaller files to work with in batch.
Replacing thousands-of-revisions-bzipped-or-7zipped-together with a smarter diff to reduce the amount of slow general-purpose compression needed to get a decent download size should also reduce the amount of time it takes to run, making it more likely that a history dump will continue without hitting an error.
This would involve changing the format, necessitating even more changes to client software for compatibility.
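To make the diff idea concrete, here is a toy sketch of storing a page's history as deltas; difflib stands in for whatever real diff format a dump would actually use:

```python
# Toy illustration of the dump-as-revision-diffs idea: store each revision
# as a delta against the previous one (usually far smaller than the full
# text), then reconstruct every revision on the way out.
import difflib

def make_deltas(revisions):
    """First revision in full, then one ndiff delta per subsequent revision."""
    deltas = [revisions[0]]
    for prev, cur in zip(revisions, revisions[1:]):
        deltas.append(list(difflib.ndiff(prev.splitlines(True),
                                         cur.splitlines(True))))
    return deltas

def restore(deltas):
    """Replay the deltas to recover the full text of every revision."""
    texts = [deltas[0]]
    for delta in deltas[1:]:
        texts.append("".join(difflib.restore(delta, 2)))
    return texts
```

Only the deltas would then need general-purpose compression, which is where the idea's speedup would come from.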
Alas, this hasn't yet seen all the work done on it that it needs. Currently we have a programming staff of two (me and Tim) jumping back and forth between too many projects and our own relocations, and neither of us has gotten to the finish line on this project yet. Neither has any other interested party so far.
(Note that the foundation will be hiring a couple more programmers for 2008, as we get the San Francisco office set up.)
Also, now I've got a file from 2006, but I still wonder whether there is any place where one can access old dumps; these will/could be very important research-wise.
I have a fair number of *old* dumps sitting around at the office, but I'm not sure if I have any medium-depth ones. We don't generally keep old dumps up for download, but I could possibly provide an individual one if needed for research purposes.
And last but not least: if the dumps don't work, then it is very important to be able to dump some articles with their full histories in other ways. I make my plea again: does anyone know who put in the block so that export only allows 100 revisions? Any way to work around that? Would it be possible to make an exception to get the data for a research study?
That was originally done because buffering would cause a longer export to fail. The export has since been changed so it should skip buffering, so this possibly could be lifted. I'll take a peek.
-- brion vibber (brion @ wikimedia.org)
Brion Vibber wrote:
And last but not least: if the dumps don't work, then it is very important to be able to dump some articles with their full histories in other ways. I make my plea again: does anyone know who put in the block so that export only allows 100 revisions? Any way to work around that? Would it be possible to make an exception to get the data for a research study?
That was originally done because buffering would cause a longer export to fail. The export has since been changed so it should skip buffering, so this possibly could be lifted. I'll take a peek.
Currently the limit isn't applied to GET requests -- you either get only current version or the full history. Interesting. :)
I'm not entirely sure it's supposed to do that, the code for handling input is a little funky atm. :)
wget 'http://en.wikipedia.org/wiki/Special:Export/Bay_Area_Rapid_Transit?history=1'
-- brion
Brion Vibber wrote:
Either we need to make the process more resistant to problems or we need to speed it up a lot, or both.
[snip] a recoverable version of the grab-text-from-the-database subprocess (my suggestion) would allow a dump run broken by a lost database connection to continue to completion.
Finally got round to this part; should be active on the next run.
-- brion
Thanks, Brion, for your time improving this. Let's cross our fingers for the next run to succeed... ;)
Felipe.
Brion Vibber brion@wikimedia.org wrote: Brion Vibber wrote:
Either we need to make the process more resistant to problems or we need to speed it up a lot, or both.
[snip] a recoverable version of the grab-text-from-the-database subprocess (my suggestion) would allow a dump run broken by a lost database connection to continue to completion.
Finally got round to this part; should be active on the next run.
-- brion