On 01/07/13 23:21, Nicolas Torzec wrote:
Hi there,
In principle, I understand the need for binary formats and compression in a context with limited resources. On the other hand, plain text formats are easy to work with, especially for third-party users and organizations.
Playing devil's advocate, I could even argue that you should keep the data dumps in plain text and keep your processing dead simple, then let distributed processing systems such as Hadoop MapReduce (or Storm, Spark, etc.) handle the scale and compute diffs whenever needed, or on the fly.
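For illustration, a minimal sketch of that approach, assuming line-oriented plain-text dumps and hypothetical file paths; a real job would diff at the page or revision level rather than raw lines:

# Minimal PySpark sketch: diff two plain-text dump snapshots.
# Paths and the line-per-record layout are assumptions for illustration only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dump-diff").getOrCreate()
sc = spark.sparkContext

old = sc.textFile("dumps/enwiki-20130601-pages.txt")  # hypothetical path
new = sc.textFile("dumps/enwiki-20130701-pages.txt")  # hypothetical path

added = new.subtract(old)    # lines present only in the newer dump
removed = old.subtract(new)  # lines present only in the older dump

added.saveAsTextFile("diffs/20130601-20130701/added")
removed.saveAsTextFile("diffs/20130601-20130701/removed")

The point being that the diff computation can live in whatever framework suits the consumer, while the published dumps stay plain and simple.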
Reading the wiki page mentioned at the beginning of this thread, it is not clear to me what the requirements for this new incremental update format are, or why they exist. That makes it hard to provide useful input and help.
Cheers.
- Nicolas Torzec.
+1
The simplest possible dump format is the best, and there's already a thriving ecosystem around the current XML dumps, which would be broken by moving to a binary format. Binary file formats and APIs defined by code are not the way to go if you want long-term archival that can endure through decades of technological change.
If dump processing needs more money, that should be budgeted for in the IT budget, rather than over-optimized away with a potentially fragile, and therefore risky, binary format.
Archival in a stable format is not a luxury or an optional extra; it's a core part of the Foundation's mission. The value is in the data, which is priceless. Computers and storage are (relatively) cheap by comparison, and Wikipedia is growing significantly more slowly than the year-on-year improvements in storage, processing and communication links. Moreover, regenerating the dumps from scratch each time provides defence in depth against subtle database corruption slowly creeping into the dumps.
Please keep the dumps themselves simple and their format stable, and, as Nicolas says, do the clever stuff elsewhere, where you can use whatever efficient representation you like for the processing.
Neil