Re: [Wikitech-l] Suggested file format of new incremental dumps

1 Jul 2013


      On Mon, 01-07-2013, at 16:00 +0200, Petr Onderka wrote:
...
For my GSoC project Incremental data dumps [1], I'm creating a new file
format to replace Wikimedia's XML data dumps.
A sketch of how I imagine the file format will look is at
http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps/File_format.
What do you think? Does it make sense? Would it work for your use case?
Any comments or suggestions are welcome.
Petr Onderka
[[User:Svick]]
Dumps v 2.0 finally on the horizon!
A few comments/questions:
I was envisioning that we would produce "diff dumps" in one pass
(presumably in a much shorter time than the fulls we generate now) and
would apply those against previous fulls (in the new format) to produce
new fulls, hopefully also in less time.  What do you have in mind for
the production of the new fulls?
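
To make the question concrete, here is a minimal Python sketch of what
"apply a diff dump against the previous full" could mean. Everything in
it is an assumption for illustration: the `DiffDump` layout, the
`apply_diff` name, and the in-memory model of a full dump as
page id -> {revision id: text} are hypothetical, not part of the
proposed format.

```python
from dataclasses import dataclass, field


@dataclass
class DiffDump:
    """Hypothetical diff layout; the real format is still under discussion."""
    # page_id -> {rev_id: text} for revisions created since the last full
    new_revisions: dict = field(default_factory=dict)
    # page_id -> set of rev_ids removed since the last full
    deleted_revisions: dict = field(default_factory=dict)
    # ids of pages removed entirely since the last full
    deleted_pages: set = field(default_factory=set)


def apply_diff(full, diff):
    """Return a new full dump; the previous full is left untouched."""
    # Copy every page that survives, so the old full is not modified.
    new_full = {
        page_id: dict(revs)
        for page_id, revs in full.items()
        if page_id not in diff.deleted_pages
    }
    # Drop individually deleted revisions from surviving pages.
    for page_id, rev_ids in diff.deleted_revisions.items():
        revs = new_full.get(page_id)
        if revs is not None:
            for rev_id in rev_ids:
                revs.pop(rev_id, None)
    # Add new revisions, creating pages that did not exist before.
    for page_id, revs in diff.new_revisions.items():
        new_full.setdefault(page_id, {}).update(revs)
    return new_full
```

In this model the cost of producing a new full is one pass over the old
full plus the size of the diff, which is the property I am hoping for.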
It might be worth seeing how large the resulting en wp history files are
going to be if you compress each revision separately for version 1 of
this project.  My fear is that even with 7z it's going to make the size
unwieldy.  If the thought is that it's a first round prototype, not
meant to be run on large projects, that's another story.
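
A rough way to get a feel for the size question is to compress a sample
of revisions both ways and compare. The sketch below is only an
approximation under stated assumptions: it uses Python's standard `lzma`
module (xz container, same LZMA family as 7z, but not the 7z format
itself), and `compressed_sizes` is a hypothetical helper name.

```python
import lzma


def compressed_sizes(revisions):
    """Return (bytes if each revision is compressed alone,
               bytes if all revisions are compressed as one stream)."""
    separate = sum(
        len(lzma.compress(text.encode("utf-8"))) for text in revisions
    )
    together = len(lzma.compress("\n".join(revisions).encode("utf-8")))
    return separate, together


if __name__ == "__main__":
    # Synthetic sample: many revisions that differ only slightly,
    # which is typical of page history.
    base = "The quick brown fox jumps over the lazy dog. " * 20
    sample = [base + f"Edit number {i}." for i in range(50)]
    separate, together = compressed_sizes(sample)
    print(f"separately: {separate} bytes, together: {together} bytes")
```

Running this over a real sample of page history would give a better
estimate than the synthetic input used here.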
I'm not sure about removing the restrictions data; someone must have
wanted it, like the other various fields that have crept in over time.
And we should expect there will be more such fields over time...
We need to get some of the wikidata users in on the model/format
discussion, to see what use they plan to make of those fields and what
would be most convenient for them.
It's quite likely that these new fulls will need to be split into chunks
much as we do with the current en wp files.  I don't know what that
would mean for the diff files.  Currently we split in an arbitrary way
based on sequences of page numbers, writing out separate stub files and
using those for the content dumps.  Any thoughts?
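
As one possible starting point, and purely as an assumption on my part,
the diff entries could be routed to chunks by the same page-number
ranges used to split the fulls. The Python sketch below illustrates
that idea; `chunk_starts` and `chunk_for_page` are hypothetical names,
and sending pages with ids below the first range to the first chunk is
just one arbitrary choice.

```python
from bisect import bisect_right


def chunk_starts(page_ids, pages_per_chunk):
    """Return the first page id of each chunk, splitting by page order."""
    ordered = sorted(page_ids)
    return [ordered[i] for i in range(0, len(ordered), pages_per_chunk)]


def chunk_for_page(starts, page_id):
    """Return the index of the chunk whose id range covers page_id.

    Pages created after the split (ids past the last start) fall into
    the last chunk; ids below the first start fall into the first one.
    """
    return max(bisect_right(starts, page_id) - 1, 0)
```

With a scheme like this, a diff could be split into per-chunk diffs
before being applied, but the last chunk would keep growing as new
pages are created, so the ranges would need to be recomputed at some
point.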
Ariel
