I started looking into how to produce an RDF dump of MediaInfo entities,
and I've hit some roadblocks that I am not sure how to get around. I'd
like to hear suggestions on this, here, directly on Phabricator, or on
IRC:
Right now, when we enumerate entities of certain types, we just look at
pages in the namespace associated with each entity type and assume the
page title parses directly into an entity ID. With slot-based entities
like MediaInfo, however, that is not the case. So we need a generic
service that takes a page and a set of entity types, and:
a. Determines which of those entity types are "regular" entities with
dedicated pages and which ones live in slots
b. For the regular entities, does $this->entityIdParser->parse(
$row->page_title ) as before
c. For slot entities, checks that the slot is present and, if so,
produces the entity ID specific to that slot. Preferably this is also
done without a separate DB access (which may not be easy), since
SqlEntityIdPager needs to have good performance.
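A rough sketch of what such a service could do, as illustrative Python
pseudocode - the type tables, the "M" + page ID format and all names
below are my assumptions, not an existing Wikibase API:

```python
# Illustrative sketch only, not real Wikibase code.

REGULAR_TYPES = {"item", "property"}     # (a) entities with dedicated pages
SLOT_TYPES = {"mediainfo": "mediainfo"}  # (a) entity type -> slot role name

def entity_ids_for_page(page, entity_types, parse_id, slots_of):
    """Return the entity IDs a single page row contributes."""
    ids = []
    for etype in entity_types:
        if etype in REGULAR_TYPES:
            # (b) regular entity: the page title is the serialized ID
            ids.append(parse_id(page["title"]))
        elif etype in SLOT_TYPES:
            # (c) slot entity: only emit an ID if the slot actually exists;
            # ideally slots_of() is answered from the same query rather than
            # a separate DB round trip
            if SLOT_TYPES[etype] in slots_of(page):
                ids.append("M%d" % page["id"])
    return ids
```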
I am not sure whether there's an existing API that does this.
EntityByLinkedTitleLookup comes very close and even has a hook that does
the right thing, but it does DB access even for local Wikidata IDs
(which can be fixed) and does not support batching. Any other
suggestions for how the above can properly be done?
There's also the complication that the page-to-slot mapping is no longer
one-to-one, so a fetch() operation can return not exactly $limit entity
IDs but anywhere from 0 to (number of slots) * $limit. Probably not a
huge deal, but it might need some careful handling.
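The careful handling might look roughly like this (a hedged Python
sketch; fetch_page_batch and ids_for_page are hypothetical stand-ins for
the pager internals):

```python
# Hedged sketch of fetch() once one page can yield several entity IDs.

def fetch_ids(limit, fetch_page_batch, ids_for_page):
    """Collect up to `limit` entity IDs, pulling page batches as needed.
    A page can yield 0..(number of slots) IDs, so the number of pages
    consumed no longer matches the number of IDs returned."""
    out = []
    while len(out) < limit:
        pages = fetch_page_batch(limit - len(out))
        if not pages:
            break  # no more pages at all
        for page in pages:
            out.extend(ids_for_page(page))
    # A real pager would also have to track its continuation position so
    # that surplus IDs beyond `limit` are carried over, not dropped.
    return out[:limit]
```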
The entities in SDC are not local entities - IDs like Q83043 do not come
from Commons, they come from Wikidata. However, they carry no prefixes,
so the RDF builder thinks they are local and assigns them Commons-based
namespaces, which is obviously wrong, since they are Wikidata entities.
While Commons has a bunch of redirects set up, RDF identifies data by
the literal URL and knows nothing about redirects, so querying the data
would be problematic if the Wikidata dataset were combined with the
Commons dataset. It would, for example, make it next to impossible to
run federated queries between Wikidata and Commons, as the two stores
would use different URIs for the same Wikidata entities.
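To make the collision concrete, a tiny sketch - the Commons base comes
from the behavior described above, and concept_uri is a hypothetical
helper, not the real RdfVocabulary code:

```python
# RDF identifies a resource by the literal URI string; redirects don't help.
WIKIDATA_BASE = "http://www.wikidata.org/entity/"
COMMONS_BASE = "https://commons.wikimedia.org/entity/"

def concept_uri(base, entity_id):
    # hypothetical helper: how each store mints a URI for an entity ID
    return base + entity_id

# The same Wikidata item ends up with two different, unrelated URIs,
# so a federated join on the entity would silently match nothing:
wd_uri = concept_uri(WIKIDATA_BASE, "Q83043")
sdc_uri = concept_uri(COMMONS_BASE, "Q83043")
```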
Additionally, the current RDF generation process assumes the wd: prefix
always belongs to the local wiki, so on Commons wd: is
<https://commons.wikimedia.org/entity/> while on Wikidata it is, of
course, the Wikidata URL. This may be very confusing: if wd: means
different things on Commons and on Wikidata, federated queries become
hard to follow, since it is unclear which wd: means what where. Ideally
we would not use the wd: prefix on Commons at all, but that goes against
the assumption, hardcoded in RdfVocabulary, that local wiki entities are
wd:.
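The hardcoded assumption boils down to something like this
(local_prefixes is a made-up stand-in for what RdfVocabulary effectively
does, not its real API):

```python
def local_prefixes(local_concept_base):
    # wd: is unconditionally bound to the *local* wiki's concept URI
    return {"wd": local_concept_base}

on_wikidata = local_prefixes("http://www.wikidata.org/entity/")
on_commons = local_prefixes("https://commons.wikimedia.org/entity/")
# Same prefix, different expansion on each wiki - the source of the
# confusion for anyone reading queries against both stores.
```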
So again, I am not sure what the best way to handle this is, since I am
not sure how the federation model in SDC works - the code suggests there
should be some kind of prefixes for entity IDs, but SDC does not seem to
use any.
Any suggestions about the above are welcome.