I am using the Wiki dumps for work, and I have been following this list for some time.
Since you bring it up, I also think that "the simpler the better" holds for
data formats when batch-processing with Hadoop.
Working with table-like (delimited) data is easy, as that is the default format.
Working with XML, JSON and other formats that allow complex nested structures (e.g. maps,
lists, and combinations of them) is slightly more difficult, as it involves
serialization/de-serialization steps for which libraries need to be available for your
favorite framework: Hadoop Java M/R, Hadoop Streaming, Pig, Hive, etc.
Nothing very difficult though...
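To make the serialization point concrete, here is a minimal Hadoop Streaming mapper sketch, assuming one JSON object per input line; the "title" and "length" field names are hypothetical, not from any actual dump schema:

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming mapper sketch: Streaming feeds records on stdin,
# one per line, and expects tab-separated key/value output on stdout.
# The "title" and "length" fields are illustrative assumptions.
import json
import sys

def map_line(line):
    """Parse one JSON record and emit a tab-separated key/value pair."""
    record = json.loads(line)
    return "%s\t%d" % (record["title"], record["length"])

if __name__ == "__main__" and not sys.stdin.isatty():
    for line in sys.stdin:
        line = line.strip()
        if line:
            print(map_line(line))
```

With line-delimited JSON like this, the de-serialization step is just one json.loads call per record, which is what makes the "simple text format" approach attractive.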
I am glad to see some thinking around Hadoop (and NoSQL), as this is the way to go when
working with large datasets such as Wikipedia.
On 3/29/11 7:45 AM, "Yuvi Panda" <yuvipanda(a)gmail.com> wrote:
On Tue, Mar 29, 2011 at 7:58 PM, Diederik van Liere <dvanliere(a)gmail.com> wrote:
I do not think that getting the data in SQLite format
is going to be very valuable. People can already get the data in MySQL databases (although
that is not that easy either), so getting it in SQLite will not give additional
benefits in terms of querying capabilities. I am also not sure if SQLite can handle such
True, but it's not one large SQLite database - it'll be split across
multiple smaller ones, and explicit pointers will be maintained so
that random access is as efficient as possible.
What I do think might be valuable is to work on a text
format (JSON, CSV) to store the dumps. The reason is that we are looking at a NoSQL
datastore solution (for example Hadoop), and storing the data in a non-XML but still
text-based format is going to be really useful.
Interesting. How exactly would having JSON/CSV be better than XML from
an import-into-NoSQL-datastore perspective?
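One practical difference worth noting: line-delimited JSON records can be split on newlines without parsing, whereas a single XML document cannot. A conversion sketch (the <page>/<title>/<revision>/<text> element names follow the dump XML; the JSON field names are illustrative):

```python
# Sketch: turning a dump XML fragment into line-delimited JSON, which
# NoSQL stores and Hadoop jobs can split and parse record-by-record.
import json
import xml.etree.ElementTree as ET

def pages_to_json_lines(xml_text):
    """Emit one JSON object per <page> element, one record per line."""
    root = ET.fromstring(xml_text)
    lines = []
    for page in root.iter("page"):
        lines.append(json.dumps({
            "title": page.findtext("title"),
            "text": page.findtext("revision/text"),
        }))
    return lines
```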
Yuvi Panda
Xmldatadumps-l mailing list