Currently, the abstracts dump for Wikidata consists of 62 million entries, all of which contain <abstract 'not-applicable' /> instead of any real abstract. Instead, I am considering producing abstract files that contain only the mediawiki header and footer and the usual siteinfo contents. What do people think about this?
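For concreteness, here is a rough sketch of what one of these slimmed-down abstract files would contain, assuming the standard XML export wrapper; the exact siteinfo fields would simply be whatever the dump scripts already write for Wikidata:

    <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xml:lang="en">
      <siteinfo>
        <sitename>Wikidata</sitename>
        <dbname>wikidatawiki</dbname>
        <base>...</base>
        <generator>...</generator>
        <case>...</case>
        <namespaces>...</namespaces>
      </siteinfo>
    </mediawiki>

In other words: header, siteinfo, footer, and no entries at all.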
Rationale:
- It takes 36 hours to produce these files, which contain nothing useful.
- It places an extra burden on the db servers for no good reason.
- They require more bandwidth to download, and more time to process, than files with no entries would.
- Wikidata's main namespace will only ever contain Q-entities and other entities, which are not text or wikitext and so are not suitable for abstracts.
Please comment here or on the task: https://phabricator.wikimedia.org/T236006
If there are no comments or blockers after a week, I'll start implementing this, and it will likely go into effect for the November 20th run.
Your faithful dumps wrangler,
Ariel Glenn