Re: [Xmldatadumps-l] MediaWiki dumps in sqlite format

13 Apr 2011


      ...
Really the XML is pretty much that anyway.  What would be neat (and a
perl one-liner I suppose) is an indexing program that generates a file
index giving the offset and major/desired keys in an XML file (revision,
page name, date for example) and maybe length.
I have a PHP script (run from the command line) that does pretty much
that: it generates an XML index file with the following entry for each
page it identifies in the XML dump:
<page id="%d" revision="%d" datetime="%s" length="%d" start="%d" end="%d" title="%s" />
Where
     id = page id
     revision = revision id
     datetime = revision date/time
     length = length of the content of the revision's <text> element
     start = line number of the opening <page> tag
     end = line number of the closing </page> tag
     title = page title
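
For anyone who wants to experiment before the PHP script is posted, here is a minimal Python sketch of an indexer that emits entries in the format above. It is not the PHP script, only an illustration under stated assumptions: the name index_dump is hypothetical, the dump is assumed to put <page>, <revision> and </page> tags on their own lines (as the pages-articles dumps do), length is counted in raw characters exactly as they appear in the file (whether the original script counts escaped or unescaped text is not stated), and if a page carries several revisions only the last one is recorded.

```python
import re
import sys
from xml.sax.saxutils import quoteattr, unescape

SIMPLE = re.compile(r"<(title|id|timestamp)>(.*?)</\1>")
TEXT_OPEN = re.compile(r"<text\b[^>]*?(/?)>")
ENTRY = ('<page id="%d" revision="%d" datetime="%s" length="%d" '
         'start="%d" end="%d" title=%s />')


def index_dump(lines):
    """Yield one index entry per <page> found in an iterable of dump lines."""
    page, in_revision, in_text = None, False, False
    for lineno, line in enumerate(lines, start=1):
        tag = line.strip()
        if in_text:
            # Continuation of a multi-line <text> body.
            end = line.find("</text>")
            page["length"] += len(line) if end == -1 else end
            in_text = end == -1
        elif tag == "<page>":
            page, in_revision = {"start": lineno, "length": 0}, False
        elif page is None:
            continue  # header material outside any <page>
        elif tag == "<revision>":
            # Assumption: a later revision replaces an earlier one.
            in_revision = True
            page["length"] = 0
            page.pop("revision", None)
            page.pop("datetime", None)
        elif tag == "</page>":
            title = unescape(page["title"], {"&quot;": '"'})
            yield ENTRY % (int(page["id"]), int(page["revision"]),
                           page["datetime"], page["length"],
                           page["start"], lineno, quoteattr(title))
            page = None
        elif (m := TEXT_OPEN.search(line)):
            if not m.group(1):  # skip self-closing <text ... />
                body = line[m.end():]
                end = body.find("</text>")
                page["length"] = len(body) if end == -1 else end
                in_text = end == -1
        elif (m := SIMPLE.search(line)):
            name, value = m.groups()
            if name == "title":
                page["title"] = value
            elif name == "timestamp":
                page.setdefault("datetime", value)
            elif in_revision:
                # First <id> inside <revision>; the contributor <id> is ignored.
                page.setdefault("revision", value)
            else:
                page.setdefault("id", value)


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8") as dump:
        for entry in index_dump(dump):
            print(entry)
```

Reading line by line keeps memory use flat regardless of dump size, which is the main reason for not loading the file into an XML parser here.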
Example of first three entries:
<page id="10" revision="381202555" datetime="2010-08-26T22:38:36Z" 
length="57" start="32" end="47" title="AccessibleComputing" />
<page id="12" revision="408067712" datetime="2011-01-15T19:28:25Z" 
length="96718" start="48" end="453" title="Anarchism" />
<page id="13" revision="74466652" datetime="2006-09-08T04:15:52Z" 
length="57" start="454" end="468" title="AfghanistanHistory" />
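
To show how entries like these might be used, here is a small Python sketch that parses one index entry and pulls the matching page out of the dump by line number. The function names parse_entry and extract_page are hypothetical, and it assumes each entry is a well-formed, self-closing <page /> element as in the examples.

```python
import itertools
import xml.etree.ElementTree as ET

INT_FIELDS = {"id", "revision", "length", "start", "end"}


def parse_entry(entry):
    """Turn one <page ... /> index entry into a dict with typed values."""
    attrs = ET.fromstring(entry).attrib
    return {k: int(v) if k in INT_FIELDS else v for k, v in attrs.items()}


def extract_page(lines, start, end):
    """Return dump lines start..end (1-based, inclusive) as one string."""
    return "".join(itertools.islice(lines, start - 1, end))


entry = parse_entry(
    '<page id="10" revision="381202555" datetime="2010-08-26T22:38:36Z" '
    'length="57" start="32" end="47" title="AccessibleComputing" />'
)
# With a real dump file (the path is a placeholder):
#   with open("dump.xml", encoding="utf-8") as dump:
#       print(extract_page(dump, entry["start"], entry["end"]))
```

Because the index stores line numbers, extraction still has to read the dump from the top down to the start line. Storing byte offsets as well, as suggested in the quoted message, would allow seeking straight to the page.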
If this is of any use to anyone, I can put it up...
-- James
