[Apologies for length]

I hope to build a knowledge graph of current and past public officials in Wisconsin, both elected and appointed. This dataset as a whole is not appropriate for Wikidata, since much of it will not meet the notability guidelines - very few people care about who was on the planning commission of the Town of Westport here in Dane County, Wisconsin back in 2006.

I would like to use Wikibase as the software to manage this hyper-specific knowledge graph. Wikibase makes it easy to display and edit data, it is type-aware (if a value is richer than a string or an integer, Wikibase can give it a richer display, like showing a coordinate on a map), and it can export data to a SPARQL query service. With upcoming projects like WBStack, Wikibase could be available as a service and might be a great place for me to host my hyper-specific knowledge graph, provided there is a very clear path for exporting my data and seamlessly importing it into a new Wikibase install with minimal disruption, should WBStack ever shut down.

I would like to use data from Wikidata (or federate back to Wikidata) as much as possible, for two main reasons.

First, while much of my data is not notable, there are items that will overlap with Wikidata. The Town of Westport is Q3505274 in Wikidata, and I would prefer not to have to duplicate the work that has already been done collecting statements about Westport, such as population or geographical footprint. Those statements, however, would be useful in queries in my hyper-specific knowledge graph - a typical query might be to find members of planning commissions of towns with populations between 2500 and 5000 people.

Second, and possibly more important, a very useful ontology has developed through Wikidata, defining a set of classes for the world, the properties and datatypes that describe those classes, and the relationships that connect those classes together. For example, ‘member of’, P463 on Wikidata, is a property that my hyper-specific knowledge graph would also use quite heavily. Equally important is that Wikidata has talk pages, mailing lists, and wiki pages that have hashed out over the years exactly how to use this ontology and what the best practices are. Does P463 apply only to people, or can other things be ‘members of’? Five years ago Wikidata editors thought about it and agreed that it’s probably OK for things other than humans to be ‘members of’ other things. The knowledge and best practices from Wikidata can be reused in other knowledge graphs even if the data isn’t stored in Wikidata, and people who know how to model and query data in Wikidata will feel at home in non-Wikidata knowledge graphs.

I’ve been thinking about what my options are to mix data between Wikidata and my hyper-specific knowledge graph. I am not a Wikidata expert, so parts of this might be incorrect.

The simplest possible thing I can do is to federate through SPARQL engines. I can dump my hyper-specific graph into a local triple store and use SERVICE statements in SPARQL to reach out to Wikidata and pull its data into my queries, combining the two datasets at query time.
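As a rough sketch of what that might look like (the local prefix and property IRIs here are hypothetical placeholders, and I'm assuming my local triples refer to the town by its Wikidata IRI so the two datasets join cleanly), the planning-commission query from above could be written something like:

    # Hypothetical local properties: local:P1 = 'member of',
    # local:P5 = 'planning commission of'.
    PREFIX wdt:   <http://www.wikidata.org/prop/direct/>
    PREFIX local: <https://kg.example.org/prop/direct/>

    SELECT ?person ?town ?population WHERE {
      # Local graph: people who sit on a town's planning commission
      ?person local:P1 ?commission .
      ?commission local:P5 ?town .

      # Remote graph: fetch the town's population from Wikidata
      SERVICE <https://query.wikidata.org/sparql> {
        ?town wdt:P1082 ?population .    # P1082 = population
      }
      FILTER(?population >= 2500 && ?population <= 5000)
    }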

As an improvement, I could load the Wikidata triples into my triplestore and query the unified dataset directly, with fully namespaced IRIs in every triple. As an added bonus, any query that worked on query.wikidata.org would work unchanged against my local query service. Queries that use both datasets should be faster than federating through SERVICE, but the challenge is that the Wikidata dumps are huge and take days to load. Worse, it’s difficult to get incremental changes, so to stay current I’d have to reload many gigabytes every few days.
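With the Wikidata triples loaded locally, the sketch above collapses into a single local query - same hypothetical local properties, no SERVICE clause, and the wdt: patterns are answered from the local copy:

    PREFIX wdt:   <http://www.wikidata.org/prop/direct/>
    PREFIX local: <https://kg.example.org/prop/direct/>

    SELECT ?person ?town ?population WHERE {
      ?person local:P1 ?commission .
      ?commission local:P5 ?town .
      ?town wdt:P1082 ?population .    # answered from the local copy of Wikidata
      FILTER(?population >= 2500 && ?population <= 5000)
    }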

However, what I really want is to have as much relevant Wikidata data as possible in my local Wikibase itself, so that all the great Wikibase features work: full-text autocomplete when editing entities, property constraint checks, clean handling of qualifiers, and so on. One red flag is that I don’t think I’ve ever seen anyone with a good import of Wikidata into a local Wikibase, or even a dump of a smaller Wikibase instance imported into a new Wikibase instance. It’s probably even harder than importing into the query service. I am also pretty sure that I don’t want to try to import the entire Wikidata dataset into a local Wikibase, because Wikidata is huge and there’s no good way to get an incremental set of changes.

One idea that probably isn’t possible today is to store a statement in a local Wikibase that references a remote Wikibase. I wish that I could create a statement in my local Wikibase of Q2 P1 <http://www.wikidata.org/entity/Q3505274>, where Q2 and P1 are entities created in my local Wikibase. As a workaround, I could put a statement of Q2 P1 Q1 in my local Wikibase, and then add a property on my local Q1 to say that it has an external identifier of Q3505274 on Wikidata. A fancier option would be for Wikibase to know when an entity references a remote Wikibase and mirror that entity from the remote Wikibase into the local one.
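At query time that workaround can still be joined back to Wikidata by rebuilding the entity IRI from the stored identifier. A sketch, where local:P10 is a hypothetical "Wikidata QID" external-identifier property on my local items:

    PREFIX wdt:   <http://www.wikidata.org/prop/direct/>
    PREFIX local: <https://kg.example.org/prop/direct/>

    SELECT ?localItem ?wdItem ?population WHERE {
      ?localItem local:P10 ?qid .                   # e.g. "Q3505274"
      BIND(IRI(CONCAT("http://www.wikidata.org/entity/", ?qid)) AS ?wdItem)
      SERVICE <https://query.wikidata.org/sparql> {
        ?wdItem wdt:P1082 ?population .
      }
    }

It works, but every query that wants Wikidata data has to carry that BIND/CONCAT boilerplate, which is part of why native support for remote references would be nicer.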

I don’t think that I would care that my local Q and P numbers don’t match up with any data federated into my local Wikibase, though to reduce cognitive load I would probably try to load common properties to matching P numbers at least (I think I will always want P31 and P279 to mean the same thing as Wikidata’s P31 and P279, even if I have to create a bunch of empty properties to ensure the numbering is right).

Being able to customize the prefixes of the IRIs in my RDF exports is a must-have, though. It looks like that is possible today, though a bit clunky.

When editing items through the Wikibase UI, the P and Q numbers already don’t matter much, because autocomplete hides all of that. If I’m adding a statement about ‘educated at’, I don’t need to know the P number for ‘educated at’ nor the Q number for the school involved. It would be nice if that were easier to do in the query service, so I could use labels in place of P and Q numbers.
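One way to approximate that today, against either the Wikidata query service or a local one, is to resolve the IDs from labels inside the query itself, using the wikibase:directClaim link between a property entity and its wdt: predicate (the label strings here are just illustrative):

    PREFIX wikibase: <http://wikiba.se/ontology#>
    PREFIX rdfs:     <http://www.w3.org/2000/01/rdf-schema#>

    SELECT ?person ?school WHERE {
      # Look up the property by its English label instead of hard-coding its P number
      ?prop rdfs:label "educated at"@en ;
            wikibase:directClaim ?directProp .
      # Look up the item by its English label instead of hard-coding its Q number
      ?school rdfs:label "Harvard University"@en .
      ?person ?directProp ?school .
    }

It is slower and depends on labels being unambiguous, but it keeps mismatched local and remote P and Q numbers out of the query text.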

What is probably most realistic today for my use case is just manually “federating” to Wikidata by copying Wikidata entities into my Wikibase as needed, with new local Q and P numbers. For my hyper-specific knowledge graph, it’s probably not that many entities that I need to pull from Wikidata, so as I discover that I need to create an entity, I can check whether Wikidata already has one and import it first.

My subset is probably much smaller than any of the subsets envisioned in the recent discussions about “concise” or “notable” dumps of Wikidata. I assume that my hyper-specific knowledge graph would want a few hundred properties and, at most, a few thousand items from Wikidata.

I will treat the entities that I pull from Wikidata as “read-only” copies - most of them would be anyway. If there is a new statement I need to make about one of these, like updating something about the Town of Westport, it is probably of interest to other Wikidata consumers and the edit should be made on Wikidata and federated back. I can track the (much smaller) set that I mirrored from Wikidata and periodically refresh them, so I don’t need to try to process an entire dump.
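A periodic refresh could be as simple as a query that walks the mirrored set and asks Wikidata when each entity last changed - a sketch, again assuming a hypothetical local:P10 "Wikidata QID" property:

    PREFIX schema: <http://schema.org/>
    PREFIX local:  <https://kg.example.org/prop/direct/>

    SELECT ?localItem ?wdItem ?wdModified WHERE {
      ?localItem local:P10 ?qid .
      BIND(IRI(CONCAT("http://www.wikidata.org/entity/", ?qid)) AS ?wdItem)
      SERVICE <https://query.wikidata.org/sparql> {
        ?wdItem schema:dateModified ?wdModified .   # last-modified timestamp in Wikidata's RDF
      }
    }
    ORDER BY DESC(?wdModified)

Anything modified since the last sync is a candidate to re-import.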

A nice potential feature of Wikibase would be the ability to explicitly flag an entity as a mirrored entity, so that Wikibase prevents it from being edited locally and includes provenance information pointing back to the Wikibase (here, Wikidata) it came from. Another feature that might be nice (but I might be wrong and maybe this is a bad feature) is that, if Wikibase knew an entity was an import, it could also dump the triples using the same IRIs as the source system when exporting to RDF.

So, in summary, I would like to be able to reuse some data from Wikidata in my local Wikibase, but I am concerned about the following:
- I want a subset of Wikidata, but only a very small one
- How do I track the entities copied from Wikidata into my local Wikibase so I can update them as new statements are added or updated on Wikidata?
- How can I make it easy for people who know the ontology and data already in Wikidata to be able to edit the hyper-specific knowledge graph in my local Wikibase?
- How can I make it easy to query my hyper-specific knowledge graph using SPARQL while maximizing query reuse between the Wikidata query service and my local query service, potentially to the point of having the same query work on both if the query only involves the Wikidata subset?  

Thanks,

-Erik