I know it's been mentioned on this list before, but it would be
incredibly useful to have incremental dumps of Wikidata, as downloading
the current dumps can now take several hours over a low-bandwidth
Internet connection.
Here's my proposal:
* the incremental dumps should have exactly the same format as the
current JSON dumps, with two exceptions:
** entries which are unchanged since the previous dump (as determined by
their "modified" timestamp) should be omitted
** entries which have been deleted since the previous dump should have
stub entries of the form {"id": "Q123", "deleted": true}
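To illustrate, assuming the one-entity-per-line layout of the current JSON dumps, an incremental dump under this scheme might look like the following (hypothetical entities, with most fields omitted for brevity) — here Q64 has changed since the previous dump and Q123 has been deleted:

```
[
{"id": "Q64", "type": "item", "modified": "2015-03-02T11:04:05Z"},
{"id": "Q123", "deleted": true}
]
```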
I would imagine that these dumps would be vastly smaller than the
standard dumps and, for the many re-users who only need to know about
changed data, just as useful, at a fraction of the download time and
often without significant modification to their tools. Generating them
would require only a small amount of processing time and a negligible
amount of extra disk storage on the servers, yet could save
considerable Internet bandwidth.
This difference-file format should be easy to generate with slight
tweaks to the existing dump code, but, if needed, I can easily write a
simple Python script that takes two existing dump files and generates
the differences between them in the format above. Please drop me an
email, or reply here, if you would like me to write this.
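As a rough sketch of what that script could look like — assuming, for simplicity, that each dump is a JSON array with one entity object per line, and that entity objects carry "id" and "modified" fields (the function and file names here are just placeholders):

```python
import json

def diff_dumps(old_path, new_path, out_path):
    """Write an incremental dump: entities whose "modified" timestamp
    changed (or that are new) since old_path, plus
    {"id": ..., "deleted": true} stubs for entities that disappeared."""

    def load(path):
        # Parse a dump laid out as one JSON entity per line, wrapped in
        # "[" ... "]" and with trailing commas, into an id -> entity map.
        entities = {}
        with open(path) as f:
            for line in f:
                line = line.strip().rstrip(",")
                if line in ("[", "]", ""):
                    continue
                entity = json.loads(line)
                entities[entity["id"]] = entity
        return entities

    old = load(old_path)
    new = load(new_path)
    with open(out_path, "w") as out:
        # New or modified entities are emitted in full.
        for eid, entity in new.items():
            if eid not in old or entity.get("modified") != old[eid].get("modified"):
                out.write(json.dumps(entity) + "\n")
        # Deleted entities become minimal stubs.
        for eid in old:
            if eid not in new:
                out.write(json.dumps({"id": eid, "deleted": True}) + "\n")
```

This is only a proof of concept — a production version would want to stream rather than hold both dumps in memory — but it shows how little logic the format actually needs.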
Kind regards,
Neil