I'm a student looking to work on MediaWiki during this year's Google Summer of Code, and one of the ideas I've been interested in is various formats for the data dumps (and dump work in general).
How useful would dumps from Wikipedia be if they were in SQLite databases? Would it be useful to have all the dumps as SQLite (history, stubs, current, etc.)? Or are there certain dumps (current, for example) which would be very useful as databases?
The dumps wouldn't be direct dumps from the MySQL database (unlike the old SQL dumps) - they'll be in a format optimized for data processing and imports. I'll also write supporting code such as libraries for reading the databases, etc.
What do you folks think?
Dear Yuvi,
I do not think that getting the data in SQLite format is going to be very valuable. People can already get the data into MySQL databases (although that is not that easy either), so getting it in SQLite will not give additional benefits in terms of querying capabilities. I am also not sure whether SQLite can handle such large databases.
What I do think might be valuable is to work on a text format (JSON, CSV) for storing the dumps. The reason is that we are looking at a NoSQL datastore solution (for example Hadoop), and storing the data in a non-XML but still text format is going to be really useful.
Just my 2 cents.
Best, Diederik
On 2011-03-28, at 6:10 PM, Yuvi Panda wrote:
> I'm a student looking to work on MediaWiki during this year's Google Summer of Code, and one of the ideas I've been interested in is various formats for the data dumps (and dump work in general).
> How useful would dumps from Wikipedia be if they were in SQLite databases? Would it be useful to have all the dumps as SQLite (history, stubs, current, etc.)? Or are there certain dumps (current, for example) which would be very useful as databases?
> The dumps wouldn't be direct dumps from the MySQL database (unlike the old SQL dumps) - they'll be in a format optimized for data processing and imports. I'll also write supporting code such as libraries for reading the databases, etc.
> What do you folks think?
> -- Yuvi Panda T http://yuvi.in/
Heya Diederik,
On Tue, Mar 29, 2011 at 7:58 PM, Diederik van Liere dvanliere@gmail.com wrote:
> I do not think that getting the data in SQLite format is going to be very valuable. People can already get the data into MySQL databases (although that is not that easy either), so getting it in SQLite will not give additional benefits in terms of querying capabilities. I am also not sure whether SQLite can handle such large databases.
True, but it's not one large SQLite database - it'll be split across multiple smaller ones, and explicit pointers will be maintained so that random access is as efficient as possible.
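To make the "explicit pointers" idea a bit more concrete, here's a minimal sketch of how a reader library might resolve a page to its shard. The manifest and shard schema below are entirely hypothetical, just to illustrate the lookup:

    # Hypothetical sketch: a manifest database maps page-id ranges to
    # shard files, so a reader opens only the shard it needs.
    import sqlite3

    def open_shard_for_page(manifest_path, page_id):
        """Look up which shard file holds a page, then open that shard."""
        manifest = sqlite3.connect(manifest_path)
        row = manifest.execute(
            "SELECT shard_file FROM shards "
            "WHERE first_page_id <= ? AND ? <= last_page_id",
            (page_id, page_id),
        ).fetchone()
        manifest.close()
        if row is None:
            raise KeyError("page %d is not in any shard" % page_id)
        return sqlite3.connect(row[0])

    # Random access to one page without touching the other shards:
    db = open_shard_for_page("manifest.db", 12)
    title, text = db.execute(
        "SELECT title, text FROM pages WHERE page_id = ?", (12,)
    ).fetchone()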
> What I do think might be valuable is to work on a text format (JSON, CSV) for storing the dumps. The reason is that we are looking at a NoSQL datastore solution (for example Hadoop), and storing the data in a non-XML but still text format is going to be really useful.
Interesting. How exactly would having JSON/CSV be better than XML from an import-into-NoSQL-datastore perspective?
On 2011-03-29, at 10:45 AM, Yuvi Panda wrote:
> Heya Diederik,
> On Tue, Mar 29, 2011 at 7:58 PM, Diederik van Liere dvanliere@gmail.com wrote:
>> I do not think that getting the data in SQLite format is going to be very valuable. People can already get the data into MySQL databases (although that is not that easy either), so getting it in SQLite will not give additional benefits in terms of querying capabilities. I am also not sure whether SQLite can handle such large databases.
> True, but it's not one large SQLite database - it'll be split across multiple smaller ones, and explicit pointers will be maintained so that random access is as efficient as possible.
Basically you are going to develop a sharding solution using SQLite. I think you are overstretching the use case of SQLite (IMHO).
>> What I do think might be valuable is to work on a text format (JSON, CSV) for storing the dumps. The reason is that we are looking at a NoSQL datastore solution (for example Hadoop), and storing the data in a non-XML but still text format is going to be really useful.
> Interesting. How exactly would having JSON/CSV be better than XML from an import-into-NoSQL-datastore perspective?
When each revision is on a separate row, it will be way easier to run map/reduce jobs; otherwise you have to figure out where a revision starts and ends. Each row should contain all variables (IMHO).
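As a rough illustration of that: if every input line is one self-contained record, a Hadoop Streaming mapper never has to look for element boundaries. The field names below are made up for the example:

    # Sketch of the "one revision per row" idea: each line is a complete
    # JSON record, so the mapper is a single parse call per revision.
    # Field names are illustrative, not a proposed schema.
    import json
    import sys

    # Example input line:
    # {"page_id": 12, "rev_id": 408067712,
    #  "timestamp": "2011-01-15T19:28:25Z", "title": "Anarchism"}

    for line in sys.stdin:
        rev = json.loads(line)
        # Emit (title, 1) pairs, e.g. to count revisions per page.
        print("%s\t%d" % (rev["title"], 1))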
On Tue, Mar 29, 2011 at 8:39 PM, Diederik van Liere dvanliere@gmail.com wrote:
> Basically you are going to develop a sharding solution using SQLite. I think you are overstretching the use case of SQLite (IMHO).
Yes, kind of sharding.
> When each revision is on a separate row, it will be way easier to run map/reduce jobs; otherwise you have to figure out where a revision starts and ends. Each row should contain all variables (IMHO).
Really? Importing from CSV/JSON is easier than importing from SQLite?
Note that I'm not really 'fixated' on SQLite - if the community feels that JSON (I don't think CSV is very suitable for full-fledged dumps) would be good, I'll be glad to work on that...
On 29/03/2011 16:09, Diederik van Liere wrote:
> On 2011-03-29, at 10:45 AM, Yuvi Panda wrote:
>> Heya Diederik,
>> On Tue, Mar 29, 2011 at 7:58 PM, Diederik van Liere dvanliere@gmail.com wrote:
>>> I do not think that getting the data in SQLite format is going to be very valuable. People can already get the data into MySQL databases (although that is not that easy either), so getting it in SQLite will not give additional benefits in terms of querying capabilities. I am also not sure whether SQLite can handle such large databases.
>> True, but it's not one large SQLite database - it'll be split across multiple smaller ones, and explicit pointers will be maintained so that random access is as efficient as possible.
> Basically you are going to develop a sharding solution using SQLite. I think you are overstretching the use case of SQLite (IMHO).
>>> What I do think might be valuable is to work on a text format (JSON, CSV) for storing the dumps. The reason is that we are looking at a NoSQL datastore solution (for example Hadoop), and storing the data in a non-XML but still text format is going to be really useful.
>> Interesting. How exactly would having JSON/CSV be better than XML from an import-into-NoSQL-datastore perspective?
> When each revision is on a separate row, it will be way easier to run map/reduce jobs; otherwise you have to figure out where a revision starts and ends. Each row should contain all variables (IMHO).
Really, the XML is pretty much that anyway. What would be neat (and a Perl one-liner, I suppose) is an indexing program that generates a file index giving the offset and major/desired keys in an XML file (revision, page name, and date, for example) and maybe length.
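Not quite a one-liner, but close. A rough sketch of such an indexer (in Python rather than Perl, and assuming an uncompressed dump in which <page>, <title>, and </page> start on their own lines, as in the standard dumps):

    # Scan an XML dump once, printing byte offset, length, and title for
    # each <page> element. Tag handling is deliberately naive; a real
    # tool should use a streaming XML parser.
    import re
    import sys

    offset = 0
    page_start = None
    title = None
    with open(sys.argv[1], "rb") as dump:
        for line in dump:
            if b"<page>" in line:
                page_start = offset
                title = None
            elif b"<title>" in line and title is None:
                m = re.search(rb"<title>(.*?)</title>", line)
                title = m.group(1).decode("utf-8") if m else "?"
            elif b"</page>" in line:
                length = offset + len(line) - page_start
                print("%d\t%d\t%s" % (page_start, length, title))
            offset += len(line)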
> Really, the XML is pretty much that anyway. What would be neat (and a Perl one-liner, I suppose) is an indexing program that generates a file index giving the offset and major/desired keys in an XML file (revision, page name, and date, for example) and maybe length.
I have a PHP script (that runs on the command line) that pretty much does that... it generates an XML index file with the following entry for each identified page from the XML dump:
<page id="%d" revision="%d" datetime="%s" length="%d" start="%d" end="%d" title="%s" />
Where:
  id       = page id
  revision = revision id
  datetime = revision date/time
  length   = length of the revision <text> XML entity CDATA
  start    = line number of the <page> entity
  end      = line number of the </page> entity
  title    = page title
Example of first three entries:
<page id="10" revision="381202555" datetime="2010-08-26T22:38:36Z" length="57" start="32" end="47" title="AccessibleComputing" /> <page id="12" revision="408067712" datetime="2011-01-15T19:28:25Z" length="96718" start="48" end="453" title="Anarchism" /> <page id="13" revision="74466652" datetime="2006-09-08T04:15:52Z" length="57" start="454" end="468" title="AfghanistanHistory" />
If this is of any use to anyone, I can put it up...
-- James
Richard Farmbrough wrote:
> What would be neat (and a Perl one-liner, I suppose) is an indexing program that generates a file index giving the offset and major/desired keys in an XML file (revision, page name, and date, for example) and maybe length.
I also have some programs for doing that kind of thing. What exactly do you want to do?
Hi there, I am using the Wiki dumps for work, and I have been following this list for some time now.
Since you bring it up, I do think too that "the simpler the better" holds for data formats meant for batch processing with Hadoop.
Working with table-like data formats is easy, as this is the default format.
Working with XML, JSON, and other formats allowing complex nested structures (e.g. maps, lists, and combinations of them) is slightly more difficult, as it involves serialization/deserialization steps for which libraries need to be available for your favorite framework: Hadoop Java M/R, Hadoop Streaming, Pig, Hive, etc.
Nothing very difficult though...
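For example (with a made-up field layout), the same revision record consumed from a tab-separated line versus a JSON line:

    # Toy contrast: table-like (TSV) record vs. JSON record.
    import json

    tsv_line = "12\t408067712\t2011-01-15T19:28:25Z\tAnarchism"
    page_id, rev_id, timestamp, title = tsv_line.split("\t")  # a plain split

    json_line = ('{"page_id": 12, "rev_id": 408067712, '
                 '"timestamp": "2011-01-15T19:28:25Z", "title": "Anarchism"}')
    rev = json.loads(json_line)  # needs a parser, but handles nesting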
I am glad to see some thinking around Hadoop (and NoSQL), as this is the way to go when working with large datasets such as Wikipedia.
Best, Nicolas Torzec.
On 3/29/11 7:45 AM, "Yuvi Panda" yuvipanda@gmail.com wrote:
> Heya Diederik,
> On Tue, Mar 29, 2011 at 7:58 PM, Diederik van Liere dvanliere@gmail.com wrote:
>> I do not think that getting the data in SQLite format is going to be very valuable. People can already get the data into MySQL databases (although that is not that easy either), so getting it in SQLite will not give additional benefits in terms of querying capabilities. I am also not sure whether SQLite can handle such large databases.
> True, but it's not one large SQLite database - it'll be split across multiple smaller ones, and explicit pointers will be maintained so that random access is as efficient as possible.
>> What I do think might be valuable is to work on a text format (JSON, CSV) for storing the dumps. The reason is that we are looking at a NoSQL datastore solution (for example Hadoop), and storing the data in a non-XML but still text format is going to be really useful.
> Interesting. How exactly would having JSON/CSV be better than XML from an import-into-NoSQL-datastore perspective?
> -- Yuvi Panda T http://yuvi.in