Hi,
Just wanted to express my belated support for such dumps:
- We encounter the same problem in research, and for efficiency, reproducibility, and authoritativeness alike, a centralized solution would be great.
- Besides the filtering for existence in Wikipedia, I'd see much potential in removing labels. In most of our use cases, labels are not needed in the computations, and where introspection is needed, one can selectively add them back post-hoc. Alternatively, retaining only English labels would also save a lot of space (and I don't see concerns of cultural bias, as long as we only use them as decoration, not inside computations).
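For what it's worth, since the dumps are line-based N-Triples, an English-only label filter can be a single streaming pass. Here is a minimal sketch in Python, assuming the rdfs:label / skos:altLabel / skos:prefLabel / schema:description / schema:name predicates used in the RDF exports; the file names are just placeholders:

    import gzip
    import re

    # Predicates whose objects are human-readable text (labels, aliases,
    # descriptions) in the Wikidata RDF exports.
    TEXT_PREDICATES = {
        "<http://www.w3.org/2000/01/rdf-schema#label>",
        "<http://www.w3.org/2004/02/skos/core#altLabel>",
        "<http://www.w3.org/2004/02/skos/core#prefLabel>",
        "<http://schema.org/description>",
        "<http://schema.org/name>",
    }
    EN_TAG = re.compile(r'"@en\s+\.\s*$')  # object literal tagged @en

    def keep(line: str) -> bool:
        """Keep every triple except labels/descriptions in languages other than English."""
        parts = line.split(" ", 2)
        if len(parts) < 3 or parts[1] not in TEXT_PREDICATES:
            return True
        return bool(EN_TAG.search(line))

    with gzip.open("latest-truthy.nt.gz", "rt", encoding="utf-8") as src, \
         gzip.open("truthy-en-labels.nt.gz", "wt", encoding="utf-8") as dst:
        for line in src:
            if keep(line):
                dst.write(line)

(Dropping labels entirely, rather than keeping the @en ones, is the same sketch with keep() returning False whenever the predicate matches.)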
Thanks also for pointing out the WDumper tool; it looks great. Maybe it would be worth highlighting selected dumps prominently on its start page? (The names under "recent dumps" alone are not always informative, so one has to inspect the specs one by one, and arguably the sheer number of dumps also means some loss of authoritativeness.)
Cheers, Simon
Hey all,
As someone who likes to use Wikidata in their research, and likes to give students projects relating to Wikidata, I am finding it more and more difficult to work with (or recommend working with) recent versions of Wikidata due to the increasing dump sizes; even the truthy version now costs considerable time and machine resources to process and handle. In some cases we just grin and bear the costs, while in other cases we apply ad hoc sampling so that we can play around with the data and try things quickly.
More generally, I think the growing data volumes might inadvertently scare people away from downloading the dumps and using them in their research.
One idea we had recently to reduce the data size for a student project, while keeping the most notable parts of Wikidata, was to keep only claims that involve an item linked to Wikipedia; in other words, if a statement involves a Q item (in the "subject" or "object" position) not linked to Wikipedia, the statement is removed.
I wonder whether it would be possible for Wikidata to provide such a dump for download (e.g., in RDF) for people who prefer to work with a more concise sub-graph that still maintains the most "notable" parts? Of course one could compute this locally from the full dump, but making such a version available directly would save clients some resources, potentially encourage more research using/on Wikidata, and having it "rubber-stamped" by Wikidata would also help to justify the use of such a dataset for research purposes.
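For concreteness, such a filter could be computed locally with a rough two-pass script over the N-Triples dumps, along the lines of the sketch below; it assumes that sitelinks appear in the full dump as schema:about triples with *.wikipedia.org article URLs as subjects (as in the current RDF exports), and the file names are just placeholders:

    import gzip
    import re

    QID = re.compile(r"<http://www\.wikidata\.org/entity/(Q\d+)>")
    ABOUT = "<http://schema.org/about>"
    WIKIPEDIA = re.compile(r"^<https?://[^\s/]*\.wikipedia\.org/")

    # Pass 1: collect QIDs with at least one Wikipedia sitelink, i.e. QIDs that
    # occur as the object of  <article-URL> schema:about wd:Qxxx .
    # (Holding this set in memory is the main cost of the script.)
    linked = set()
    with gzip.open("latest-all.nt.gz", "rt", encoding="utf-8") as src:
        for line in src:
            parts = line.split(" ", 2)
            if len(parts) < 3:
                continue
            s, p, rest = parts
            if p == ABOUT and WIKIPEDIA.match(s):
                m = QID.search(rest)
                if m:
                    linked.add(m.group(1))

    # Pass 2: keep a triple only if every Q item it mentions (subject or
    # object) has a Wikipedia sitelink; triples mentioning no Q item pass through.
    def keep(line: str) -> bool:
        return all(q in linked for q in QID.findall(line))

    with gzip.open("latest-truthy.nt.gz", "rt", encoding="utf-8") as src, \
         gzip.open("truthy-wikipedia-linked.nt.gz", "wt", encoding="utf-8") as dst:
        for line in src:
            if keep(line):
                dst.write(line)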
... just an idea I thought I would float out there. Perhaps there is another (better) way to define a concise dump.
Best, Aidan