Hi!
I started looking into how to produce an RDF dump of MediaInfo entities, and I've encountered some roadblocks that I am not sure how to get around. I would like to hear suggestions on this - here, on Phabricator directly, or on IRC:
1. https://phabricator.wikimedia.org/T222299 Right now, when we enumerate entities of certain types, we just look at pages in the namespaces associated with those entity types and assume the page title parses directly into an entity ID. With slot-based entities like MediaInfo, however, this is not the case. So we need a generic service there that takes a page and a set of entity types and figures out:
a. which of those entity types are "regular" entities with dedicated pages, and which ones live in slots;
b. for the regular entities, do $this->entityIdParser->parse( $row->page_title ) as before;
c. for slot entities, check that the slot is present and, if so, produce the entity ID specific to that slot.
Preferably this is also done without a separate DB access (which may not be easy), since SqlEntityIdPager needs to have good performance.
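To make the steps above concrete, here is a rough sketch, in Python for brevity (the real code would be PHP inside Wikibase). The type sets, the page_has_slot callback, and the title-to-type mapping are my assumptions for illustration, not an existing API:

```python
# Sketch of a per-page entity ID resolver. Hypothetical names throughout;
# only the M<page_id> scheme for MediaInfo IDs reflects the actual convention.

SLOT_ENTITY_TYPES = {"mediainfo"}  # entity types living in a slot of another page

def entity_ids_for_page(page_id, page_title, wanted_types, page_has_slot):
    """Return the entity IDs of the wanted types that this page carries.

    `page_has_slot(page_id, type)` stands in for a check that should ideally
    reuse data already fetched with the page row (no extra DB round-trip).
    """
    ids = []
    # Regular entities: the page title parses directly into an entity ID.
    entity_type = {"Q": "item", "P": "property"}.get(page_title[:1])
    if entity_type in wanted_types:
        ids.append(page_title)
    # Slot entities: the ID derives from the page ID (MediaInfo: M<page_id>),
    # but only if the slot is actually present on the page.
    for slot_type in wanted_types & SLOT_ENTITY_TYPES:
        if page_has_slot(page_id, slot_type):
            ids.append("M%d" % page_id)
    return ids
```

For example, a File page whose MediaInfo slot is populated would yield ["M9103972"] when enumerating the mediainfo type, while an item page would yield its Q-ID as before.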
I am not sure whether there's an API that does that. EntityByLinkedTitleLookup comes very close and even has a hook that does the right thing, but it does DB access even for local IDs on Wikidata (which can be fixed) and does not support batching. Any other suggestions for how the above can be done properly?
There's also the complication that the mapping from pages to entities is no longer one-to-one, so a fetch() operation can return not just $limit but anywhere from 0 to (number of slots)*$limit entity IDs. Probably not a huge deal, but it might need some careful handling.
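If the variable-sized results turn out to be awkward for callers, one way to handle it is a small re-batching wrapper around fetch(). This is just an illustration of the "careful handling" - the fetch signature and the None-when-exhausted convention are assumptions, not the actual SqlEntityIdPager interface:

```python
def rebatch(fetch, limit):
    """Re-chunk a fetch() that may return anywhere from 0 to
    (number of slots) * limit IDs per call into batches of exactly
    `limit` IDs (the last batch may be shorter).

    `fetch(limit)` is assumed to return a list of entity IDs, or
    None once the pager is exhausted.
    """
    buffer = []
    while True:
        chunk = fetch(limit)
        if chunk is None:
            break
        buffer.extend(chunk)
        # Drain full batches; leftovers carry over to the next fetch.
        while len(buffer) >= limit:
            yield buffer[:limit]
            buffer = buffer[limit:]
    if buffer:
        yield buffer
```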
2. https://phabricator.wikimedia.org/T222306 The entities in SDC are not local entities - e.g., if I am looking at https://commons.wikimedia.org/wiki/Special:EntityData/M9103972.json, P180 and Q83043 do not come from Commons; they come from Wikidata. However, they do not have prefixes, which means the RDF builder thinks they are local and assigns them Commons-based namespaces. That is obviously wrong, since they are Wikidata entities. While Commons has a bunch of redirects set up, RDF identifies data by literal URL and knows nothing about redirects, so querying the data would be problematic if the Wikidata dataset is combined with the Commons dataset. It would, for example, make it next to impossible to run federated queries between Wikidata and Commons, as the two stores would use different URIs for Wikidata entities.
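A sketch of what per-entity-type concept URI bases could look like, so that Q/P entities keep their Wikidata URIs even when they appear in a Commons dump. The table and the type-from-ID mapping are assumptions for illustration, not the current RdfVocabulary behaviour:

```python
# Hypothetical mapping: items and properties are Wikidata entities and keep
# Wikidata concept URIs; only MediaInfo entities are Commons-local.
CONCEPT_URI_BASE = {
    "item": "http://www.wikidata.org/entity/",
    "property": "http://www.wikidata.org/entity/",
    "mediainfo": "https://commons.wikimedia.org/entity/",
}

def concept_uri(entity_id):
    """Build the concept URI for an entity from its ID's type marker."""
    entity_type = {"Q": "item", "P": "property", "M": "mediainfo"}[entity_id[0]]
    return CONCEPT_URI_BASE[entity_type] + entity_id
```

With such a mapping, both stores would use http://www.wikidata.org/entity/Q83043 for Q83043, and federated queries could join on the same URI.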
Additionally, the current RDF generation process assumes the wd: prefix always belongs to the local wiki, so on Commons wd: is https://commons.wikimedia.org/entity/ while on Wikidata it is, of course, the Wikidata URL. This may be very confusing to people: if wd: means different things on Commons and on Wikidata, then in federated queries it would be unclear which wd: means what, where. Ideally we would not use the wd: prefix for Commons at all, but this goes against the assumption hardcoded in RdfVocabulary that local wiki entities are wd:. So again, I am not sure what the best way to handle this is, since I am not sure how the federation model in SDC works - the code suggests there should be some kind of prefixes for entity IDs, but SDC does not seem to use any.
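One possible convention, sketched below purely as a strawman (the sdc: name and the rule are my assumptions, not anything RdfVocabulary does today): pin wd: to Wikidata everywhere, and give a non-Wikidata wiki its own prefix for its local entities.

```python
WIKIDATA_ENTITY_BASE = "http://www.wikidata.org/entity/"

def prefix_declarations(local_prefix, local_entity_base):
    """Emit Turtle @prefix lines, with wd: always meaning Wikidata.

    `local_prefix` is a hypothetical per-wiki prefix name (e.g. "sdc"
    on Commons); on Wikidata itself, no extra prefix is emitted.
    """
    prefixes = {"wd": WIKIDATA_ENTITY_BASE}
    if local_entity_base != WIKIDATA_ENTITY_BASE:
        prefixes[local_prefix] = local_entity_base
    return "\n".join(
        "@prefix %s: <%s> ." % (name, uri) for name, uri in prefixes.items()
    )
```

Under this convention a Commons dump would declare both wd: and sdc:, and a Wikidata dump only wd:, so the same prefix never means two different things across the two stores.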
Any suggestions about the above are welcome. Thanks,