Hi there,
I am using the Wiki dumps for work, and I have been following this list for some time now.

Since you bring it up, I also think that “the simpler the better” applies to data formats for batch processing with Hadoop.

Working with table-like data formats is easy, as line-oriented delimited text is the default input format for Hadoop and most of the tools built on it.

Working with XML, JSON, and other formats that allow complex nested structures (e.g. maps, lists, and combinations of them) is slightly more difficult, as it involves serialization/deserialization steps, for which libraries need to be available in your framework of choice: Hadoop Java M/R, Hadoop Streaming, Pig, Hive, etc.

Nothing very difficult though...
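
To make the difference concrete, here is a minimal sketch of a Hadoop Streaming-style mapper in Python, which turns each input line into a tab-separated key/value pair. The field names (id, title, revision.text) and the three-column table layout are assumptions made up for illustration, not the actual dump schema.

```python
import json


def parse_tsv_line(line):
    """Table-like record: a plain split on tabs is enough.

    Assumes three hypothetical columns: id, title, length.
    """
    page_id, title, length = line.rstrip("\n").split("\t")
    return {"id": int(page_id), "title": title, "length": int(length)}


def parse_json_line(line):
    """Nested record: needs a deserialization step (here the stdlib json module).

    Assumes a hypothetical layout {"id": ..., "title": ..., "revision": {"text": ...}}.
    """
    record = json.loads(line)
    return {
        "id": record["id"],
        "title": record["title"],
        "length": len(record["revision"]["text"]),
    }


def map_records(lines, parse):
    """Streaming-style mapper: yield one 'key<TAB>value' string per input line."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        rec = parse(line)
        yield "%s\t%d" % (rec["title"], rec["length"])
```

In an actual Streaming job the mapper would iterate over sys.stdin and print each yielded string. The only difference between the two cases is the parse function, which is the extra (de)serialization step mentioned above. Note that the table-like version assumes titles never contain a tab.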

I am glad to see some thinking around Hadoop (and NoSQL), as this is the way to go when working with large datasets such as Wikipedia.


Best,
Nicolas Torzec.




On 3/29/11 7:45 AM, "Yuvi Panda" <yuvipanda@gmail.com> wrote:

Heya Diederik,

On Tue, Mar 29, 2011 at 7:58 PM, Diederik van Liere <dvanliere@gmail.com> wrote:
> I do not think that getting the data in SQLite format is going to be very valuable. People can already get the data in MySQL databases (although that is not that easy either), so getting it in SQLite will not give additional benefits in terms of querying capabilities. I am also not sure if SQLite can handle such large databases.

True, but it's not one large sqlite database - it'll be split across
multiple smaller ones, and explicit pointers will be maintained so
that random access is as efficient as possible.

> What I do think might be valuable is to work on a text format (JSON, CSV) to store the dumps. The reason is that we are looking at a NoSQL datastore solution (for example Hadoop), and storing the data in a non-XML but still text format is going to be really useful.

Interesting. How exactly would having JSON/CSV be better than XML from
an import-into-NoSQL-datastore perspective?
--
Yuvi Panda T
http://yuvi.in

_______________________________________________
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l