On Tue, Dec 17, 2019 at 7:16 PM Aidan Hogan <aidhog(a)gmail.com> wrote:
Hey all,
As someone who likes to use Wikidata in their research, and likes to
give students projects relating to Wikidata, I am finding it more and
more difficult to (recommend to) work with recent versions of Wikidata
due to the increasing dump sizes, where even the truthy version now
costs considerable time and machine resources to process and handle. In
some cases we just grin and bear the costs, while in other cases we
apply an ad hoc sampling to be able to play around with the data and try
things quickly.
More generally, I think the growing data volumes might inadvertently
scare people off taking the dumps and using them in their research.
One idea we had recently to reduce the data size for a student project
while keeping the most notable parts of Wikidata was to only keep claims
that involve an item linked to Wikipedia; in other words, if the
statement involves a Q item (in the "subject" or "object") not linked
to
Wikipedia, the statement is removed.
I wonder would it be possible for Wikidata to provide such a dump to
download (e.g., in RDF) for people who prefer to work with a more
concise sub-graph that still maintains the most "notable" parts? While
of course one could compute this from the full-dump locally, making such
a version available as a dump directly would save clients some
resources, potentially encourage more research using/on Wikidata, and
having such a version "rubber-stamped" by Wikidata would also help to
justify the use of such a dataset for research purposes.
... just an idea I thought I would float out there. Perhaps there is
another (better) way to define a concise dump.
Best,
Aidan
Hi Aiden,
That the dumps are becoming too big is an issue I've heard a number of
times now. It's something we need to tackle. My biggest issue is
deciding how to slice and dice it though in a way that works for many
use cases. We have
https://phabricator.wikimedia.org/T46581 to
brainstorm about that and figure it out. Input from several people
very welcome. I also added a link to Benno's tool there.
As for the specific suggestion: I fear relying on the existence of
sitelinks will kick out a lot of important things you would care about
like professions so I'm not sure that's a good thing to offer
officially for a larger audience.
Cheers
Lydia
--
Lydia Pintscher -
http://about.me/lydia.pintscher
Product Manager for Wikidata
Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
10963 Berlin
www.wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg
unter der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das
Finanzamt für Körperschaften I Berlin, Steuernummer 27/029/42207.