Really the XML is pretty much that anyway. What would be neat (and a Perl one-liner, I suppose) is an indexing program that generates a file index giving the offset of each page in an XML file, along with the major/desired keys (revision, page name, and date, for example) and maybe the length.
I have a PHP script (run from the command line) that does pretty much that. It generates an XML index file with the following entry for each page it identifies in the XML dump:
<page id="%d" revision="%d" datetime="%s" length="%d" start="%d" end="%d" title="%s" />
Where:
  id       = page id
  revision = revision id
  datetime = revision date/time
  length   = length of the content of the revision's <text> element
  start    = line number of the <page> opening tag
  end      = line number of the </page> closing tag
  title    = page title
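For anyone who wants to try the idea without the PHP script, here is a rough Python sketch of an indexer that emits the same entry format. It is not the script itself. It assumes the usual dump layout with one tag per line, exactly one revision per page, and every field present; the name index_pages is made up for illustration. The length is counted in raw characters as they appear in the file (entities still escaped), which may differ from what the PHP script measures.

```python
import re

ENTRY = ('<page id="%d" revision="%d" datetime="%s" length="%d" '
         'start="%d" end="%d" title="%s" />')

# Simple one-line elements we care about, e.g. <id>12</id>
TAG = re.compile(r"<(title|id|timestamp)>(.*?)</\1>")
# Opening (or self-closing) <text ...> tag
TEXT_OPEN = re.compile(r"<text\b[^>]*>")


def index_pages(lines):
    """Yield one index entry per <page> in an iterable of dump lines.

    Assumption: one revision per page and one tag per line.
    """
    page, in_revision, in_text = None, False, False
    for lineno, line in enumerate(lines, start=1):
        if in_text:
            # Inside a multi-line <text> body: keep counting characters.
            end = line.find("</text>")
            page["length"] += len(line) if end == -1 else end
            in_text = end == -1
            continue
        s = line.strip()
        if s == "<page>":
            page, in_revision = {"start": lineno, "length": 0}, False
            continue
        if page is None:
            continue  # header or trailer lines outside any page
        if s == "<revision>":
            in_revision = True
            continue
        if s == "</page>":
            yield ENTRY % (page["id"], page["revision"], page["datetime"],
                           page["length"], page["start"], lineno,
                           page["title"])
            page = None
            continue
        m = TAG.match(s)
        if m:
            name, value = m.groups()
            if name == "title":
                # Titles are already entity-escaped; only quotes need care.
                page["title"] = value.replace('"', "&quot;")
            elif name == "timestamp":
                page["datetime"] = value
            else:
                # First <id> before <revision> is the page id; the first one
                # inside it is the revision id (later ones are contributors).
                page.setdefault("revision" if in_revision else "id", int(value))
            continue
        m = TEXT_OPEN.search(line)
        if m and not m.group().endswith("/>"):
            rest = line[m.end():]
            end = rest.find("</text>")
            page["length"] = len(rest) if end == -1 else end
            in_text = end == -1
```

Typical use would be something like opening the dump with open(path, encoding="utf-8") and printing each entry that index_pages yields. Since it reads one line at a time, memory use should stay flat even on a large dump.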
Example of first three entries:
<page id="10" revision="381202555" datetime="2010-08-26T22:38:36Z" length="57" start="32" end="47" title="AccessibleComputing" />
<page id="12" revision="408067712" datetime="2011-01-15T19:28:25Z" length="96718" start="48" end="453" title="Anarchism" />
<page id="13" revision="74466652" datetime="2006-09-08T04:15:52Z" length="57" start="454" end="468" title="AfghanistanHistory" />
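As a rough illustration of how such an index might be used, the Python sketch below loads the entries into a dictionary keyed by title and then pulls one page out of the dump by its line range. The function names load_index and extract_page are made up for this example, and it assumes the index file holds one <page ... /> entry per line with start and end being 1-based, inclusive line numbers.

```python
import xml.etree.ElementTree as ET
from itertools import islice


def load_index(index_lines):
    """Map page title -> (start, end) line numbers from index entries."""
    index = {}
    for line in index_lines:
        line = line.strip()
        if not line.startswith("<page "):
            continue  # skip anything that is not an index entry
        attrs = ET.fromstring(line).attrib
        index[attrs["title"]] = (int(attrs["start"]), int(attrs["end"]))
    return index


def extract_page(dump_lines, start, end):
    """Return lines start..end (1-based, inclusive) of the dump as one string."""
    return "".join(islice(dump_lines, start - 1, end))
```

Because start and end are line numbers rather than byte offsets, extract_page still has to read the dump from the top to reach the page. Recording a byte offset per page would let a reader seek() straight to it instead.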
If this is of any use to anyone, I can put it up...
-- James