Really the XML is pretty much that anyway. What would be neat (and a
perl one-liner I suppose) is an indexing program that generates a file
index giving the offset and major/desired keys in an XML file (revision,
page name, date for example) and maybe length.
I have a PHP script (run from the command line) that does pretty much
that: it generates an XML index file with the following entry for each
page identified in the XML dump:
<page id="%d" revision="%d" datetime="%s"
length="%d" start="%d"
end="%d" title="%s" />
Where
id = page id
revision = revision id
datetime = revision date/time
length = length of the character data in the revision's <text> element
start = line number of the <page> start tag
end = line number of the </page> end tag
title = page title
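
For illustration only, here is a minimal Python sketch of how an index with
these fields could be built. It is not the PHP script itself. It assumes the
usual dump layout where <page> and </page> each sit on their own line, takes
only the first <revision> of each page, and counts length in characters after
XML unescaping (the real script may count differently). The names index_pages
and format_entry are made up for this example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def index_pages(lines):
    """Yield one dict per <page> element found in an iterable of dump lines."""
    buffer, start = None, None
    for lineno, line in enumerate(lines, start=1):
        stripped = line.strip()
        if stripped == "<page>":
            # Remember where the page starts and begin collecting its lines.
            buffer, start = [], lineno
        if buffer is not None:
            buffer.append(line)
        if stripped == "</page>" and buffer is not None:
            # Parse just this page, so memory use is bounded by one page.
            page = ET.fromstring("".join(buffer))
            rev = page.find("revision")  # assumption: first revision only
            text = rev.findtext("text") or ""
            yield {
                "id": int(page.findtext("id")),
                "revision": int(rev.findtext("id")),
                "datetime": rev.findtext("timestamp"),
                "length": len(text),  # assumption: characters, not bytes
                "start": start,
                "end": lineno,
                "title": page.findtext("title"),
            }
            buffer = None


def format_entry(entry):
    """Render one index entry in the <page ... /> form shown above."""
    return (
        '<page id="%d" revision="%d" datetime=%s length="%d" '
        'start="%d" end="%d" title=%s />'
        % (
            entry["id"],
            entry["revision"],
            quoteattr(entry["datetime"]),
            entry["length"],
            entry["start"],
            entry["end"],
            quoteattr(entry["title"]),  # escapes &, <, > and quotes in titles
        )
    )
```

Passing an open file object for the dump as `lines` and printing
format_entry(e) for each result would produce one entry per page.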
Example of first three entries:
<page id="10" revision="381202555"
datetime="2010-08-26T22:38:36Z"
length="57" start="32" end="47"
title="AccessibleComputing" />
<page id="12" revision="408067712"
datetime="2011-01-15T19:28:25Z"
length="96718" start="48" end="453"
title="Anarchism" />
<page id="13" revision="74466652"
datetime="2006-09-08T04:15:52Z"
length="57" start="454" end="468"
title="AfghanistanHistory" />
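
As a rough sketch of how entries like these could be used, the Python below
loads an index into a title lookup and pulls one page out of the dump by its
line range. The helper names load_index and extract_page are hypothetical,
and the regular expression assumes that no attribute value contains the
literal sequence "/>".

```python
import re
import xml.etree.ElementTree as ET
from itertools import islice

# Matches one self-closing <page ... /> entry, even if wrapped across lines.
ENTRY_RE = re.compile(r"<page\b.*?/>", re.DOTALL)


def load_index(index_text):
    """Map each page title to its (start, end) line numbers."""
    index = {}
    for match in ENTRY_RE.finditer(index_text):
        # Let the XML parser handle attribute quoting and escaping.
        attrs = ET.fromstring(match.group(0)).attrib
        index[attrs["title"]] = (int(attrs["start"]), int(attrs["end"]))
    return index


def extract_page(dump_lines, start, end):
    """Return lines start..end (1-based, inclusive) joined into one string."""
    return "".join(islice(dump_lines, start - 1, end))
```

Because start and end are line numbers, extract_page still has to read the
dump from the top to reach the page. An index of byte offsets would allow a
direct seek instead.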
If this is of any use to anyone, I can put it up...
-- James