I know it's been mentioned on this list before,
but it would be incredibly useful to have incremental dumps of
Wikidata, as downloading the current dumps can now take several
hours over a poor-bandwidth Internet connection.
Here's my proposal:
* the incremental dumps should have exactly the same format as the current JSON dumps, with two exceptions:
** entries which are unchanged since the previous dump (as determined by their "modified" timestamp) should be omitted
** entries which have been deleted since the previous dump should have stub entries of the form {"id": "Q123", "deleted": true}
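To make the two rules above concrete, here is a rough sketch of the kind of script I have in mind. It assumes, for illustration, one JSON entity per line (as in the standard Wikidata JSON dumps, where each line inside the array is one entity); the function names are just placeholders:

```python
import json

def load_dump(path):
    """Load a dump as {entity id: entity dict}, assuming one JSON
    entity per line; the enclosing '[' and ']' lines are skipped
    and trailing commas stripped."""
    entities = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if line in ("[", "]", ""):
                continue
            entity = json.loads(line)
            entities[entity["id"]] = entity
    return entities

def diff_dumps(old, new):
    """Yield every entity that is new or whose "modified" timestamp
    has changed, plus a stub for each deleted entity."""
    for eid, entity in new.items():
        prev = old.get(eid)
        # Unchanged entries (same "modified" timestamp) are omitted.
        if prev is None or prev.get("modified") != entity.get("modified"):
            yield entity
    for eid in old:
        if eid not in new:
            # Deleted since the previous dump: emit a stub entry.
            yield {"id": eid, "deleted": True}
```

The output of diff_dumps() can then be written out in exactly the same JSON-array layout as the input dumps, so existing tooling should parse it unchanged.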
I would imagine that these incremental dumps would be vastly
smaller than the standard dumps. For the many re-users who only
care about changed data they would be just as useful, at a
fraction of the download time, and in many cases without
significant modification to their tools. Generating them would
need only a small amount of processing time and an insignificant
amount of extra disk storage on the servers, yet could save
considerable amounts of Internet bandwidth.
This difference-file format should be easy to
generate using slight tweaks to the existing dump code, but, if
needed, I can easily write a simple Python script to take two
existing dump files and generate the differences between them in
the format above. Please drop me an email, or reply here, if you
would like me to write this.
Kind regards,
Neil