Hi all,
Yes, Benno's WDumper could be used for this purpose. The motivation for
the whole project was very similar to what Aidan describes. We realised,
though, that there won't be a single good way to build a smaller dump
that would serve every conceivable use in research, which is why the UI
lets users make custom dumps.
In general, we are happy to hear more ideas on how to build useful
smaller dumps. We are also accepting pull requests.
Benno, could you add a feature to include only items with a Wikipedia
page (in some language, or in any language)?
Edgard, I don't think making this more "official" will be very important
for most researchers. Benno spent quite some time on aligning the RDF
export with the official dumps, so in practice, WDumper mostly produces
a subset of the triples of the official dump (which one could also have
extracted manually). If there are differences left between the formats,
we will be happy to hear about them (a GitHub issue would be the best
way to report them).
As Benno already wrote, WDumper connects to Zenodo to ensure that
exported datasets are archived in a permanent and citable fashion. This
is very important for research. As far as I know, none of the existing
dumps (official or not) guarantee long-term availability at the moment.
Cheers,
Markus
On 18/12/2019 13:37, Edgard Marx wrote:
It certainly helps. However, I think Aidan's suggestion goes in the
direction of having an official dump distribution.
Imagine how much CO2 could be spared just by avoiding the computational
resources needed to recreate this dump every time someone needs it.
Besides, it standardises the dataset used for research purposes.
On Wed, Dec 18, 2019, 11:26 Marco Fossati <fossati@spaziodati.eu> wrote:
Hi everyone,
Benno (in CC) has recently announced this tool:
https://tools.wmflabs.org/wdumps/
I haven't checked it out yet, but it sounds related to Aidan's inquiry.
Hope this helps.
Cheers,
Marco
On 12/18/19 8:01 AM, Edgard Marx wrote:
+1
On Tue, Dec 17, 2019, 19:14 Aidan Hogan <aidhog@gmail.com> wrote:
Hey all,
As someone who likes to use Wikidata in their research, and likes to
give students projects relating to Wikidata, I am finding it more and
more difficult to (recommend to) work with recent versions of Wikidata
due to the increasing dump sizes, where even the truthy version now
costs considerable time and machine resources to process and handle. In
some cases we just grin and bear the costs, while in other cases we
apply an ad hoc sampling to be able to play around with the data and
try things quickly.
More generally, I think the growing data volumes might inadvertently
scare people off taking the dumps and using them in their research.
One idea we had recently to reduce the data size for a student project,
while keeping the most notable parts of Wikidata, was to only keep
claims that involve an item linked to Wikipedia; in other words, if the
statement involves a Q item (in the "subject" or "object") not linked
to Wikipedia, the statement is removed.
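For what it's worth, a rough sketch of that filter over an N-Triples
dump might look like the following. This is my own illustration, not
anything from WDumper: the function names are made up, and it assumes
sitelinks appear as schema:about triples with the Q item as object
(as in the full RDF dump; the truthy dump alone does not carry
sitelinks, so you would need both inputs):

```python
import re

# Matches a Wikidata entity IRI and captures the Q-id.
WD_ENTITY = re.compile(r'<http://www\.wikidata\.org/entity/(Q\d+)>')

def sitelinked_items(lines):
    """First pass: collect Q-ids that occur as the object of a
    schema:about triple, i.e. items with at least one sitelink."""
    linked = set()
    for line in lines:
        if '<http://schema.org/about>' in line:
            m = WD_ENTITY.search(line)
            if m:
                linked.add(m.group(1))
    return linked

def filter_statements(lines, linked):
    """Second pass: keep a triple only if every wd:Q item it
    mentions (subject or object) is in the sitelinked set."""
    for line in lines:
        qids = WD_ENTITY.findall(line)
        if qids and all(q in linked for q in qids):
            yield line
```

In practice one would stream both passes from the compressed dump files
rather than hold anything but the Q-id set in memory, and that set for
sitelinked items is still large but manageable.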
I wonder whether it would be possible for Wikidata to provide such a
dump to download (e.g., in RDF) for people who prefer to work with a
more concise sub-graph that still maintains the most "notable" parts.
While of course one could compute this from the full dump locally,
making such a version available as a dump directly would save clients
some resources, potentially encourage more research using/on Wikidata,
and having such a version "rubber-stamped" by Wikidata would also help
to justify the use of such a dataset for research purposes.
... just an idea I thought I would float out there. Perhaps there is
another (better) way to define a concise dump.
Best,
Aidan
_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata