Hello all,
Regarding the limiting of dumps, I fear it nullifies one of the huge advantages of
Wikidata, which is to expand structured, referenced data beyond the often too narrow
confines of Wikipedia. Women and marginalized communities who are frequently eliminated
for lack of “notability” by overzealous or misguided Wikipedia editors risk being
accidentally re-eliminated by confining dumps to items with wikilinks. (Remember the
female researcher whose Wikipedia page was rejected for “lack of notability” - just before
she won a Nobel Prize?)
I think Wikidata dumps should be complete, with a possibility of user-controlled selection
by topic or period or other query, but not by what amounts to a kind of “hidden” filter
of approval by a Wikipedia editor somewhere outside of Wikidata in a widely disseminated
dump marked, misleadingly, as “notable”.
Selection is very powerful in the digital world, where people assume (wrongly) that what
they see is what exists.
On Dec 20, 2019, at 13:00,
wikidata-request(a)lists.wikimedia.org wrote:
Send Wikidata mailing list submissions to
wikidata(a)lists.wikimedia.org
To subscribe or unsubscribe via the World Wide Web, visit
https://lists.wikimedia.org/mailman/listinfo/wikidata
or, via email, send a message with subject or body 'help' to
wikidata-request(a)lists.wikimedia.org
You can reach the person managing the list at
wikidata-owner(a)lists.wikimedia.org
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Wikidata digest..."
Today's Topics:
1. Re: Concise/Notable Wikidata Dump (Aidan Hogan)
----------------------------------------------------------------------
Message: 1
Date: Thu, 19 Dec 2019 19:15:09 -0300
From: Aidan Hogan <aidhog(a)gmail.com>
To: wikidata(a)lists.wikimedia.org
Subject: Re: [Wikidata] Concise/Notable Wikidata Dump
Message-ID: <dc03559e-a670-b1dc-c88f-b73b9902fb30(a)gmail.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Hey all,
Just a general response to all the comments thus far.
- @Marco et al., regarding the WDumper by Benno, this is a very cool
initiative! In fact I spotted it just *after* posting so I think this
goes quite some ways towards addressing the general issue raised.
- @Markus, I partially disagree regarding the importance of
rubber-stamping a "notable dump" on the Wikidata side. I would see its
value as being something like the "truthy dump", which I believe has
been widely used in research for working with a concise sub-set of
Wikidata. Perhaps a middle ground is for a sporadic "notable dump" to be
generated by WDumper and published on Zenodo. This may be sufficient in
terms of making the dump available and reusable for research purposes
(or even better than the current dumps, given the permanence you
mention). Also it would reduce costs on the Wikidata side (I don't think
it would be necessary to generate a notable dump on a weekly basis, for
example).
- @Lydia, good point! I was thinking that filtering by wikilinks will
just drop some more obscure nodes (like Q51366847 for example), but had
not considered that there are some more general "concepts" that do not
have a corresponding Wikipedia article. All the same, in a lot of the
research we use Wikidata for, we are not particularly interested in one
thing or another, but more interested in facilitating what other people
are interested in. Examples would be query performance, finding paths,
versioning, finding references, etc. But point taken! Maybe there is a
way to identify "general entities" that do not have wikilinks, but do
have a high degree or centrality, for example? Would a degree-based or
centrality-based filter be possible in something like WDumper (perhaps
it goes beyond the original purpose; certainly it does not seem trivial
in terms of resources used)? Would it be a good idea?
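To make the degree-based idea above concrete, here is a rough sketch of how items without sitelinks could still pass the filter when enough statements mention them. It rests on several assumptions: statements are simplified to (subject, property, value) tuples, the degree is just a count of mentions, and the names item_degrees, notable_items and min_degree are hypothetical. This is not something WDumper is known to offer, and the item IDs are toy values.

```python
from collections import Counter


def is_item(term):
    """True for strings shaped like a Wikidata item ID, e.g. "Q42"."""
    return term.startswith("Q") and term[1:].isdigit()


def item_degrees(statements):
    """Count how many times each item appears as subject or value."""
    degree = Counter()
    for subject, _prop, value in statements:
        for term in (subject, value):
            if is_item(term):
                degree[term] += 1
    return degree


def notable_items(statements, linked_items, min_degree):
    """Items with a sitelink, plus any item mentioned at least min_degree times."""
    degree = item_degrees(statements)
    hubs = {item for item, count in degree.items() if count >= min_degree}
    return set(linked_items) | hubs


# Toy data: Q100 stands in for a general concept (such as a profession)
# that has no sitelink but is referenced by many statements.
statements = [
    ("Q1", "P106", "Q100"),
    ("Q2", "P106", "Q100"),
    ("Q3", "P106", "Q100"),
    ("Q3", "P31", "Q5"),
]
linked_items = {"Q1", "Q2", "Q3", "Q5"}

print(sorted(notable_items(statements, linked_items, min_degree=3)))
```

On the full graph, a single counting pass like this needs memory proportional to the number of distinct items, and a centrality measure such as PageRank would cost considerably more.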
In summary, I like the idea of using WDumper to sporadically generate --
and publish on Zenodo -- a "notable version" of Wikidata filtered by
sitelinks (perhaps also allowing other high-degree or high-PageRank
nodes to pass the filter). At least I know I would use such a dump.
Best,
Aidan
On 2019-12-19 6:46, Lydia Pintscher wrote:
On Tue, Dec 17, 2019 at 7:16 PM Aidan Hogan
<aidhog(a)gmail.com> wrote:
Hey all,
As someone who likes to use Wikidata in their research, and likes to
give students projects relating to Wikidata, I am finding it more and
more difficult to (recommend to) work with recent versions of Wikidata
due to the increasing dump sizes, where even the truthy version now
costs considerable time and machine resources to process and handle. In
some cases we just grin and bear the costs, while in other cases we
apply an ad hoc sampling to be able to play around with the data and try
things quickly.
More generally, I think the growing data volumes might inadvertently
scare people off taking the dumps and using them in their research.
One idea we had recently to reduce the data size for a student project
while keeping the most notable parts of Wikidata was to only keep claims
that involve an item linked to Wikipedia; in other words, if the
statement involves a Q item (in the "subject" or "object") not linked to
Wikipedia, the statement is removed.
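The filtering rule described above can be sketched roughly as follows. This is only a minimal sketch under assumptions: statements are simplified to (subject, property, value) tuples rather than real RDF, the set of sitelinked items is supplied up front, and the names filter_by_sitelinks and linked_items are hypothetical. The toy sitelink set is invented for illustration and does not reflect the actual state of Wikidata.

```python
import re

# Matches item IDs such as "Q42"; property IDs ("P31") and literals do not match.
ITEM_ID = re.compile(r"Q[1-9][0-9]*")


def is_item(term):
    """True if the term is shaped like a Wikidata item ID."""
    return ITEM_ID.fullmatch(term) is not None


def filter_by_sitelinks(statements, linked_items):
    """Keep a statement only if every Q item in its subject or value
    position belongs to linked_items (items assumed to have a sitelink)."""
    kept = []
    for subject, prop, value in statements:
        items = [term for term in (subject, value) if is_item(term)]
        if all(term in linked_items for term in items):
            kept.append((subject, prop, value))
    return kept


# Toy statements; the sitelink set below is made up for the example.
statements = [
    ("Q42", "P31", "Q5"),           # both items linked: kept
    ("Q42", "P106", "Q36180"),      # value item not in the set: removed
    ("Q51366847", "P31", "Q5"),     # subject item not in the set: removed
    ("Q42", "P569", "1952-03-11"),  # literal value, linked subject: kept
]
linked_items = {"Q42", "Q5"}

result = filter_by_sitelinks(statements, linked_items)
print(result)
```

Note that the second statement is dropped even though its subject is linked, which is the kind of loss that would affect general concepts lacking a Wikipedia article.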
I wonder would it be possible for Wikidata to provide such a dump to
download (e.g., in RDF) for people who prefer to work with a more
concise sub-graph that still maintains the most "notable" parts? While
of course one could compute this from the full-dump locally, making such
a version available as a dump directly would save clients some
resources, potentially encourage more research using/on Wikidata, and
having such a version "rubber-stamped" by Wikidata would also help to
justify the use of such a dataset for research purposes.
... just an idea I thought I would float out there. Perhaps there is
another (better) way to define a concise dump.
Best,
Aidan
Hi Aidan,
That the dumps are becoming too big is an issue I've heard a number of
times now. It's something we need to tackle. My biggest issue is
deciding how to slice and dice it though in a way that works for many
use cases. We have https://phabricator.wikimedia.org/T46581 to
brainstorm about that and figure it out. Input from several people
very welcome. I also added a link to Benno's tool there.
As for the specific suggestion: I fear relying on the existence of
sitelinks will kick out a lot of important things you would care about,
like professions, so I'm not sure that's a good thing to offer
officially for a larger audience.
Cheers
Lydia
End of Wikidata Digest, Vol 97, Issue 13
****************************************