Hi, data dumps people.
I've been working on a tool to generate smaller update files from XML dumps by just saving the diff between the previous dump and the latest revision. It doesn't aim to do everything the new dump format does, and indeed it looks like the dumps project will make it obsolete at some point, but I'm putting this out there to see if it's useful in the interim.
You can get it at https://github.com/twotwotwo/dltp
After downloading one of the binaries, you can put it in a directory with enwiki-20130805-pages-articles.xml.bz2 and run:
./dltp http://www.rfarmer.net/dltp/en/enwiki-20130805-20130812-cut-merged.xml.dltp....
That download is ~50 MB, but it expands to about 4GB of XML, consisting of the latest revision's text for every page that was updated 8/5-8/12. (A diff for 7/9-8/5 was 440MB.) You can pipe the XML output to something, rather than saving it to a file, using the -c flag.
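For example, to send the rebuilt XML straight into gzip instead of saving it uncompressed, something like this should work (the -c flag just switches output to stdout as described above, and the output filename is only an example):

./dltp -c http://www.rfarmer.net/dltp/en/enwiki-20130805-20130812-cut-merged.xml.dltp.... | gzip > enwiki-20130812-updated-pages.xml.gz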
This all sounds great, and maybe it is if bandwidth is your bottleneck. There are lots of important caveats, though:
- *It looks likely to be obsoleted by official WMF diff dumps.* I started on this a while back, and at the time, from what I'd read, I thought diff dumps weren't a high priority for the official project, so they might be worth implementing unofficially. More recently it sounds like official diff dumps might not actually be all that far off, so if you invest in using this, it might not pay off for very long.
- These diffs give you *only the latest revision* of each page in namespace 0. No full history.
- You have to keep the old dump file around.
- The tool reads and *unzips* the old dump each time. bunzip2 is slow, so *you may want to keep reference XML in uncompressed form*, or (better) compressed with gzip or lzop (there's an example command right after this list).
- On Windows, dltp unzips at slower-than-native speed, so there's even more reason to store your source file uncompressed.
- No matter what, the old file can take a while to read through; *5-25 minutes to expand the file, depending on how your reference XML is stored, is entirely plausible.*
- It uses adds-changes dumps, so it doesn't do anything to account for deletions or oversighting.
- *It's new, non-battle-tested software, so caveat emptor.*
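As a concrete example of the gzip suggestion above, you could convert the reference dump once up front with standard tools (nothing dltp-specific here; the .gz filename is just an example):

bunzip2 -c enwiki-20130805-pages-articles.xml.bz2 | gzip > enwiki-20130805-pages-articles.xml.gz

gzip decompresses much faster than bzip2, so later runs against the .gz copy should spend a lot less time expanding the reference file.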
So, I'm curious:
- Does this help you, given the caveats? Would you start using, e.g., weekly deltas if I posted them?
- If you actually ran the dltp command above, did you have any trouble? How long did it take?
- If you'd use this, what sort of project are you thinking of, on what sort of machine? (A public site hosting wiki content or something else? Big server, VPS, desktop? Linux or another OS?)
- Which wiki (enwiki, etc.) would you want diffs for?
Beyond the fact that the source is out there and anyone can use it, I'm not promising to post deltas or anything right now (since I don't know whether folks would rather just wait for official delta dumps, etc.). But I'm interested to see if there's any potential use here.
Best, Randall
Adding wikitech-l, with some edits.
Hi, everyone.
I've been working on a tool to generate smaller update files from XML dumps by just saving the diff between the previous dump and the latest revision. It doesn't aim to do everything the new dump format does, and indeed it looks like the dumps project will make it obsolete at some point, but I'm putting this out there to see if it's useful in the interim.
You can get it at https://github.com/twotwotwo/dltp
After downloading one of the binaries, you can put it in a directory with enwiki-20130805-pages-articles.xml.bz2 and run:
./dltp http://www.rfarmer.net/dltp/en/enwiki-20130805-20130812-cut-merged.xml.dltp....
That download is ~50 MB, but it expands to about 4GB of XML, consisting of the latest revision's text for every page that was updated 8/5-8/12. (A diff for 7/9-8/5 was 440MB.) You can pipe the XML output to something, rather than saving it to a file, using the -c flag.
This all sounds great, and maybe it is if bandwidth is your bottleneck. There are lots of important caveats, though:
- *It looks likely to be obsoleted by official WMF diff dumps.* I started on this a while back, and at the time, from what I'd read, I thought diff dumps weren't a high priority for the official project, so they might be worth implementing unofficially [ed: looking back, I probably should've been more cautious about this]. More recently it sounds like official diff dumps might not actually be all that far off, so if you invest in using this, it might not pay off for very long.
- These diffs give you *only the latest revision* of each page in namespace 0. No full history.
- You have to keep the old dump file around.
- The tool reads and *unzips* the old dump each time. bunzip2 is slow, so *you may want to keep reference XML in uncompressed form*, or (better) compressed with gzip or lzop.
- On Windows, dltp unzips at slower-than-native speed, so there's even more reason to store your source file uncompressed.
- No matter what, the old file can take a while to read through; *5-25 minutes to expand the file, depending on how your reference XML is stored, is entirely plausible.*
- It uses adds-changes dumps, so it doesn't do anything to account for deletions or oversighting.
- *It's new, non-battle-tested software, so caveat emptor.*
So, I'm curious:
- Does this help you, given the caveats? Would you start using, e.g., weekly deltas if I posted them?
- If you actually ran the dltp command above, did you have any trouble? How long did it take?
- If you'd use this, what sort of project are you thinking of, on what sort of machine? (A public site hosting wiki content or something else? Big server, VPS, desktop? Linux or another OS?)
- Which wiki (enwiki, etc.) would you want diffs for?
Beyond the fact that the source is out there and anyone can use it, I'm not promising to post deltas or anything right now (since perhaps many folks would rather just wait for official delta dumps, etc.). But I'm interested to see if there's any potential use here.
Best, Randall
[Now really adding wikitech-l (last attempt failed, wasn't subscribed).]
[Adding wikitech once more with feeling. Sorry for all of the copies, Xmldatadumps-l.]
It seems an excellent contribution. Official diffs may be released tomorrow or may not come for another four years - that is my experience with software promises around WP - so I think it is well worth releasing.
On 13/08/2013 23:48, Randall Farmer wrote:
[Adding wikitech once more with feeling. Sorry for all of the copies, Xmldatadumps-l.]