On 2011-03-29, at 10:45 AM, Yuvi Panda wrote:
Heya Diederik,
On Tue, Mar 29, 2011 at 7:58 PM, Diederik van Liere <dvanliere(a)gmail.com> wrote:
I do not think that getting the data in sqlite
format is going to be very valuable. People can already get the data in MySQL databases
(although that is not that easy either), so getting it in sqlite will not give
additional benefits in terms of querying capabilities. I am also not sure whether sqlite
can handle such large databases.
True, but it's not one large sqlite database -
it'll be split across
multiple smaller ones, and explicit pointers will be maintained so
that random access is as efficient as possible.
Basically you are going to develop a sharding solution using sqlite. I think you
are overstretching sqlite's intended use case (IMHO).
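For concreteness, the split-plus-pointers scheme described above might look something like the sketch below: a small "directory" sqlite database maps each page id to the shard holding it, so a lookup is one directory query plus one query into the right shard. The shard count, placement rule, and schema are all invented for illustration; nothing here is from the actual proposal.

```python
# Sketch (assumed design, not the actual proposal): shard revisions
# across several small sqlite databases, with an explicit pointer
# table so random access stays cheap.
import sqlite3

NUM_SHARDS = 4  # invented; a real split would be sized to the dump

def shard_for(page_id):
    return page_id % NUM_SHARDS  # invented placement rule

# The "explicit pointers": a tiny directory mapping page -> shard.
directory = sqlite3.connect(":memory:")
directory.execute(
    "CREATE TABLE pointers (page_id INTEGER PRIMARY KEY, shard INTEGER)")

shards = [sqlite3.connect(":memory:") for _ in range(NUM_SHARDS)]
for db in shards:
    db.execute(
        "CREATE TABLE revisions (page_id INTEGER, rev_id INTEGER, text TEXT)")

def insert_revision(page_id, rev_id, text):
    s = shard_for(page_id)
    shards[s].execute(
        "INSERT INTO revisions VALUES (?, ?, ?)", (page_id, rev_id, text))
    directory.execute(
        "INSERT OR REPLACE INTO pointers VALUES (?, ?)", (page_id, s))

def lookup(page_id):
    """One directory hit, then one query against the right shard."""
    row = directory.execute(
        "SELECT shard FROM pointers WHERE page_id = ?", (page_id,)).fetchone()
    if row is None:
        return []
    return shards[row[0]].execute(
        "SELECT rev_id, text FROM revisions WHERE page_id = ?",
        (page_id,)).fetchall()

insert_revision(7, 1001, "First revision.")
insert_revision(7, 1002, "Second revision.")
print(lookup(7))
```

Whether this beats the existing MySQL route is exactly the point under dispute; the sketch only shows what "sharding with explicit pointers" would mean mechanically.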
What I do
think might be valuable is to work on a text format (JSON, CSV) for storing the dumps. The
reason is that we are looking at a NoSQL datastore solution (for example Hadoop), and
storing the data in a non-XML but still text-based format is going to be really useful.
Interesting. How exactly would having JSON/CSV be better than XML from
an import-into-NoSQL-datastore perspective?
When each revision is on a separate row,
it will be much easier to run map/reduce jobs; otherwise you have to figure out where a revision
starts and ends. And each row should contain all variables (IMHO).
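The one-record-per-row idea can be sketched as a small converter that flattens dump XML into JSON Lines, repeating the page title on every row so each line is self-contained for a map/reduce worker. The element names below loosely follow the MediaWiki dump layout but are assumptions, not the exact export schema.

```python
# Sketch: flatten revisions from dump-style XML into one JSON object
# per line (JSON Lines), so a map task can process each line on its
# own without tracking where a revision starts and ends.
import io
import json
import xml.etree.ElementTree as ET

SAMPLE = """<mediawiki>
  <page>
    <title>Example</title>
    <revision>
      <id>1001</id>
      <timestamp>2011-03-29T10:45:00Z</timestamp>
      <text>First revision.</text>
    </revision>
    <revision>
      <id>1002</id>
      <timestamp>2011-03-29T11:00:00Z</timestamp>
      <text>Second revision.</text>
    </revision>
  </page>
</mediawiki>"""

def revisions_as_jsonl(xml_stream):
    """Yield one JSON line per revision; every row carries all
    variables (including the page title) so rows are independent."""
    title = None
    for event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag == "title":
            title = elem.text
        elif elem.tag == "revision":
            yield json.dumps({
                "page": title,
                "rev_id": elem.find("id").text,
                "timestamp": elem.find("timestamp").text,
                "text": elem.find("text").text,
            })
            elem.clear()  # keep memory flat when streaming a large dump

for line in revisions_as_jsonl(io.StringIO(SAMPLE)):
    print(line)
```

On a real dump you would stream from the (compressed) file rather than a string, but the shape of the output is the point: one complete revision per line.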
Really, the XML is
pretty much that anyway. What would be neat (and a
Perl one-liner, I suppose) is an indexing program that generates a file
index giving the offset and the major/desired keys in an XML file (revision,
page name, date, for example), and maybe the length.
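The indexing idea above could be sketched as a single pass over the raw dump bytes that records, for each revision, its byte offset, length, page title, and timestamp; a reader can then seek() straight to any revision. The email suggests a Perl one-liner; this is an assumed Python equivalent using a toy regex scan over an in-memory sample (a real dump would call for a streaming parser and per-page title tracking).

```python
# Sketch: build a (offset, length, title, timestamp) index over raw
# dump bytes so revisions can be random-accessed later via seek().
# The regex scan and single-title assumption are simplifications.
import re

SAMPLE = (
    b"<page><title>Example</title>"
    b"<revision><id>1001</id>"
    b"<timestamp>2011-03-29T10:45:00Z</timestamp>"
    b"<text>First revision.</text></revision>"
    b"<revision><id>1002</id>"
    b"<timestamp>2011-03-29T11:00:00Z</timestamp>"
    b"<text>Second revision.</text></revision></page>"
)

def build_index(data):
    """Return one index entry per <revision> block in `data`."""
    # Naive: one title for the whole sample; a real indexer would
    # track the enclosing <page> as it streams.
    title_m = re.search(rb"<title>(.*?)</title>", data)
    title = title_m.group(1).decode() if title_m else None

    index = []
    for m in re.finditer(rb"<revision>.*?</revision>", data, re.DOTALL):
        ts = re.search(rb"<timestamp>(.*?)</timestamp>", m.group(0))
        index.append({
            "offset": m.start(),          # where to seek()
            "length": m.end() - m.start(),  # how many bytes to read
            "title": title,
            "timestamp": ts.group(1).decode() if ts else None,
        })
    return index

for entry in build_index(SAMPLE):
    print(entry)
```

With such an index on disk, pulling one revision out of a multi-gigabyte dump is a seek plus a bounded read instead of a scan from the top of the file.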