Nice work on the program! To answer your questions:
- Does this help you, given the caveats? Would you
start using, e.g., weekly deltas if I posted them?
It is helpful in that it is ready for use now, and is very easy to
use. I would probably need some time to automate it for my
application, but it is possible that I could consume weekly deltas
within a week or two. I'm not sure if it'd be worth your while. See
more below.
- If you actually ran the dltp command above, did you
have any trouble? How long did it take?
I gave it a run with the two latest pages-articles dumps from
simplewiki (
http://dumps.wikimedia.org/simplewiki/) and it worked well
with pack [1] and unpack [2]. Simplewiki is about 89 MB, and dltp took
about 10 seconds to generate a 4.2 MB file (the dump files were
already unzipped). Unpacking took about 12 seconds to generate the
450 MB XML file. For reference, this is a Windows 7 machine with a
dual-core 2.2 GHz processor and 2 GB of memory. Running against the
zipped dumps took longer: about 2 minutes. I'll try the English
Wikipedia counterparts tomorrow, but I'd guess it wouldn't take more
than 25 minutes for the unzipped files.
I also tried cut / merge, but didn't really have anything meaningful
to use them on, so I can only testify that they didn't fail.
[1] dltp386.exe simplewiki-20130813-pages-articles.xml
simplewiki-20130724-pages-articles.xml
[2] dltp386.exe simplewiki-20130813-pages-articles.xml.dltp.gz
- If you'd use this, what sort of project are you
thinking of on what sort of machine?
I'd use it for XOWA:
https://sourceforge.net/projects/xowa/. This is a
desktop app that runs on an individual user's machine for reading
Wikipedia offline. I've had a few users ask about not having to
download the entire dump every time, so you may see some delta usage
there. Unfortunately, the user base is currently small (700 total
downloads per month = ??? unique users), so it would not be worth your
time to post the deltas solely for my benefit.
- Which wiki (enwiki, etc.) would you want diffs for?
Certainly, the 3 largest ones: enwiki, dewiki, frwiki. If possible,
the 8 wikis with more than 1 million articles. XOWA can read any wiki,
but it wouldn't make sense to post diffs for a small Wikipedia (e.g.,
Latin) when there would be no one to download them.
Also, this is probably outside the scope of dltp, but is there any way
for the diff file to be self-contained? For example, if the old dump
had 10 articles and the new dump had 11 articles (1 new article and 1
changed article), then the diff file would contain only those 2
articles: the new one and the changed one, each with its full text.
This may not save as much space, but it'd be easier for users to work
with the delta file than to have to remember to keep the original dump
file around.
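To be concrete about what I mean: this is not dltp's actual format, just a rough sketch of how such a self-contained delta could be applied on the consumer's side, using hypothetical dicts keyed by page id (the real thing would stream XML, of course):

```python
# Hypothetical sketch only: a "self-contained" delta carries the full
# text of new and changed pages, so applying it needs no original dump.
# old_pages / delta_pages map page id -> page XML text (toy example).

def apply_delta(old_pages, delta_pages):
    """Changed pages overwrite their old versions; new pages are added."""
    merged = dict(old_pages)
    merged.update(delta_pages)
    return merged

old = {1: "<page>A v1</page>", 2: "<page>B v1</page>"}
delta = {2: "<page>B v2</page>", 3: "<page>C v1</page>"}  # 1 changed, 1 new
merged = apply_delta(old, delta)  # 3 pages total
```

The trade-off is as stated above: changed pages ship whole rather than as diffs, so the delta is bigger, but the consumer only ever needs the delta plus whatever they already have installed.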
Hope my feedback is useful. Good luck.
On Tue, Aug 13, 2013 at 6:48 PM, Randall Farmer <randall(a)wawd.com> wrote:
[Adding wikitech once more with feeling. Sorry for all
of the copies,
Xmldatadumps-l.]
Hi, everyone.
I've been working on a tool to generate smaller update files from XML dumps
by just saving the diff between the previous dump and the latest revision.
It doesn't aim to do everything the new dump format does, and indeed it
looks like the dumps project will make it obsolete at some point, but I'm
putting this out there to see if it's useful in the interim.
You can get it at
https://github.com/twotwotwo/dltp
After downloading one of the binaries, you can put it in a directory
with enwiki-20130805-pages-articles.xml.bz2 and run:
./dltp
http://www.rfarmer.net/dltp/en/enwiki-20130805-20130812-cut-merged.xml.dltp…
That download is ~50 MB, but it expands to about 4GB of XML, consisting of
the latest revision's text for every page that was updated 8/5-8/12. (A
diff for 7/9-8/5 was 440MB.) You can pipe the XML output to something,
rather than saving it to a file, using the -c flag.
This all sounds great, and maybe it is if bandwidth is your bottleneck.
There are lots of important caveats, though:
- *It looks likely to be obsoleted by official WMF diff dumps.* I started
on this a while back, and at the time, from what I'd read, I thought diff
dumps weren't a high priority for the official project, so they might be
worth implementing unofficially [ed: looking back, I probably should've
been more cautious about this]. More recently it sounds like official diff
dumps might not be all that far off, so if you invest in using this, it
might not pay off for all that long.
- These diffs give you *only the latest revision* of each page in namespace
0. No full history.
- You have to keep the old dump file around.
- The tool reads and *unzips* the old dump each time. bunzip2 is slow, so *you
may want to keep reference XML in uncompressed form*, or (better)
compressed with gzip or lzop.
- On Windows, dltp unzips at slower than native speed, so there's even more
reason to store your source file uncompressed.
- No matter what, the old file can take a while to read through; *5-25
minutes to expand the file, depending on how your reference XML is stored,
is entirely plausible.*
- It uses adds-changes dumps, so it doesn't do anything to account for
deletions or oversighting.
- *It's new, non-battle-tested software, so caveat emptor.*
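On the recompression point above: a reference dump only needs to be converted once up front. A minimal Python sketch (filenames are just examples) that streams a bz2 dump into gzip without loading it all into memory:

```python
import bz2
import gzip
import shutil

def bz2_to_gz(src, dst, chunk=1 << 20):
    """Stream-recompress src (.bz2) into dst (.gz) in 1 MiB chunks,
    so even a multi-GB dump needs only constant memory."""
    with bz2.open(src, "rb") as fin, \
         gzip.open(dst, "wb", compresslevel=6) as fout:
        shutil.copyfileobj(fin, fout, length=chunk)

# e.g. bz2_to_gz("enwiki-20130805-pages-articles.xml.bz2",
#                "enwiki-20130805-pages-articles.xml.gz")
```

Reading the gzipped copy back is much faster than bunzip2, at the cost of a somewhat larger file on disk.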
So, I'm curious:
- Does this help you, given the caveats? Would you start using, e.g.,
weekly deltas if I posted them?
- If you actually ran the dltp command above, did you have any trouble? How
long did it take?
- If you'd use this, what sort of project are you thinking of on what sort
of machine? (A public site hosting Wiki content or something else? Big
server, VPS, desktop? Linux or another OS?)
- Which wiki (enwiki, etc.) would you want diffs for?
The source is out there and anyone can use it, but beyond that I'm not
promising to post deltas or anything right now (since perhaps many folks
would rather wait on official delta dumps, etc.). I'm just interested to
see if there's any potential use here.
Best,
Randall
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l