Some datatypes already support a limited form of federation. The globe used with globecoordinate can
be set to any URL. While it's usually
http://www.wikidata.org/entity/Q2 (Earth) on
Wikidata, it could refer to a foreign entity on a different Wikibase instance, allowing
for some federation. The same goes for the unit in a quantity, and the calendarmodel in a
date/time.
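Concretely, the JSON for such a coordinate value might look like this (a sketch; the key names are the real Wikibase JSON fields, while the coordinates and the foreign Wikibase URL are made up):

```python
# A Wikibase "globecoordinate" datavalue as it appears in entity JSON.
datavalue = {
    "type": "globecoordinate",
    "value": {
        "latitude": 43.2181,
        "longitude": -89.3279,
        "precision": 0.0001,
        # usually http://www.wikidata.org/entity/Q2 (Earth), but any
        # entity URI is accepted, including one on a foreign Wikibase:
        "globe": "https://example-wikibase.org/entity/Q7",
    },
}

assert datavalue["value"]["globe"] != "http://www.wikidata.org/entity/Q2"
```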
---- On Sat, 22 Feb 2020 20:05:57 -0500 Erik Paulson <epaulson(a)unit1127.com> wrote ----
[Apologies for length]
I hope to build a knowledge graph of current and past public officials in Wisconsin, both
elected and appointed. This entire dataset is not appropriate for Wikidata, as much of it
will not meet the notability guidelines - very few people care about who was on the
planning commission of the Town of Westport here in Dane County, Wisconsin back in 2006.
I would like to use Wikibase as the software to manage this hyper-specific knowledge
graph. Wikibase is easy to use for displaying and editing data. Wikibase is type-aware,
e.g. if a type is richer than a string or an integer, Wikibase can give a richer display,
like showing a coordinate on a map. Wikibase can export data to a SPARQL query service.
With upcoming projects like WBStack, Wikibase could be available as a service and might be
a great place for me to host my hyper-specific knowledge graph, provided there is a very
clear path for how I could export my data and seamlessly import it into a new Wikibase
install with minimal disruption, should WBStack ever shut down.
I would like to use data from Wikidata (or federate back to Wikidata) as much as possible,
for two main reasons.
First, while much of my data is not notable, there are items that will overlap with
Wikidata. The Town of Westport is Q3505274 in Wikidata, and I would prefer not to have to
duplicate the work that has already been done collecting statements about Westport, such
as population or geographical footprint. Those statements, however, would be useful in
queries in my hyper-specific knowledge graph - a typical query might be to find members of
planning commissions of towns with populations between 2500 and 5000 people.
Second, and possibly more important, there is a very useful ontology that has developed
through Wikidata, defining a set of classes for the world, and the properties and
datatypes to describe those classes, and the relationships that connect those classes
together. For example, ‘member of’, P463 on Wikidata, is a property that my hyper-specific
knowledge graph would also use quite heavily. Equally important is that Wikidata has talk
pages, mailing lists, and wiki pages that have hashed out over the years exactly how to
use this ontology and what are best practices. Does P463 apply only to people or can other
things be ‘members of’? Five years ago Wikidata editors thought about it and agreed that
it’s probably OK for things other than humans to be ‘members of’ other things. The
knowledge and best practices from Wikidata can be reused in other knowledge graphs even if
they’re not stored in Wikidata, and people who know how to model and query data in
Wikidata will feel at home in non-Wikidata knowledge graphs.
I’ve been thinking about what my options are to mix data between Wikidata and my
hyper-specific knowledge graph. I am not a Wikidata expert so parts of this might be
incorrect.
The simplest possible thing I can do is to federate through SPARQL engines. I can dump my
hyper-specific graph into a local triple store, and use SERVICE statements in SPARQL to
reach out into Wikidata and fetch data in my queries to combine the two datasets.
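The SERVICE-based approach might look roughly like this sketch. The local prefixes and properties (lwdt:P5, lwdt:P7) are hypothetical, as is the assumption that local items carry an owl:sameAs link to their Wikidata twin; wdt:P1082 (population) and the endpoint URL are real Wikidata terms:

```python
# A federated SPARQL query, held as a string for illustration.
FEDERATED_QUERY = """
PREFIX wdt:  <http://www.wikidata.org/prop/direct/>
PREFIX lwdt: <https://my-wikibase.example/prop/direct/>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>

SELECT ?person ?town ?pop WHERE {
  ?person     lwdt:P5 ?commission .   # 'member of' (hypothetical local P5)
  ?commission lwdt:P7 ?town .         # 'commission of' (hypothetical local P7)
  ?town       owl:sameAs ?wdTown .    # link local item to its Wikidata twin
  SERVICE <https://query.wikidata.org/sparql> {
    ?wdTown wdt:P1082 ?pop .          # P1082 = population (real)
    FILTER(?pop >= 2500 && ?pop <= 5000)
  }
}
"""
```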
As an improvement, I can load the Wikidata triples into my triplestore and query the
unified dataset directly, with all triples using fully namespaced IRIs. As an added
bonus, any query that worked on http://query.wikidata.org would work unchanged against
my local query service.
Queries that used both datasets should be faster than federating through SERVICE, but the
challenge is that the Wikidata dumps are huge and take days to load. Worse, it’s difficult to
get incremental changes, so to stay current I’d have to reload many gigabytes every few
days.
However, what I really want is to have as much relevant Wikidata data as possible in my
local Wikibase, so all the great Wikibase features work, like fulltext autocomplete to
edit entities, or ensuring that property constraints can be applied, or that qualifiers
are cleanly handled. One red flag is that I don’t think I’ve ever seen anyone with a good
import of Wikidata into a local Wikibase, or even a dump of a smaller Wikibase instance
into a new Wikibase instance. It’s probably even harder than importing into the query
service. I am also pretty sure that I don’t want to try to import an entire Wikidata
dataset into a local Wikibase, because Wikidata is huge and there’s no good way to get an
incremental set of changes.
One idea that probably isn’t possible today is to store a statement in a local Wikibase
that references a remote Wikibase. I wish that I could create a statement in my local
Wikibase: Q2 P1 <http://www.wikidata.org/entity/Q3505274>, where Q2 and P1 are
entities created in my local Wikibase. As a workaround, I could put a statement
Q2 P1 Q1 in my local Wikibase, and then add a property on my local Q1 saying that the
local Q1 has an external identifier of Q3505274 on Wikidata. A fancier option would be
for Wikibase to recognize that an entity references a remote Wikibase and automatically
mirror that entity into the local Wikibase.
I don’t think that I would care that my local Q and P numbers don’t match up with any data
federated into my local Wikibase, though to reduce cognitive load I would probably try to
load common properties to matching P numbers at least (I think I will always want P31 and
P279 to mean the same thing as Wikidata’s P31 and P279, even if I have to create a bunch
of empty properties to ensure the numbering is right.)
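As a back-of-the-envelope check on that numbering idea (assuming sequential ID assignment, which is how Wikibase allocates new property IDs):

```python
# Wikibase assigns property IDs sequentially, so making a local P31 and
# P279 line up with Wikidata's means first creating filler properties for
# every number in between. A quick count of the placeholders needed:
wanted = {31, 279}  # Wikidata property numbers to keep identical locally
fillers = [n for n in range(1, max(wanted) + 1) if n not in wanted]
print(len(fillers))  # 277 empty placeholder properties
```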
Being able to customize the prefixes of the IRI in my RDF exports is a must-have, though.
It looks like that is possible today, though a bit clunky.
P and Q numbers already don’t matter much when editing items through the Wikibase UI,
because autocomplete hides all of that. If I’m adding an ‘educated at’ statement, I
don’t need to know the P number for ‘educated at’ nor the Q number for the school
involved. It would be nice if the query service were similarly forgiving, so I could
use labels in place of P and Q numbers.
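The query service’s label service already covers part of this on the output side: result columns like ?schoolLabel get filled in automatically, though the query body itself still needs the raw IDs. For example (wd:Q42 and wdt:P69, ‘educated at’, are real Wikidata identifiers):

```python
# SERVICE wikibase:label is a real WDQS extension; held as a string here.
LABELED_QUERY = """
SELECT ?school ?schoolLabel WHERE {
  wd:Q42 wdt:P69 ?school .   # P69 = 'educated at'
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

assert "wikibase:label" in LABELED_QUERY
```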
What is probably most realistic today for my use case is just manually “federating” to
Wikidata by copying Wikidata entities into my Wikibase as needed, with new local Q and P
numbers. For my hyper-specific knowledge graph, there are probably not that many
entities that I need to pull from Wikidata, so as I discover that I need to create an
entity, I can check whether Wikidata already has one and import it first.
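Pulling one entity at a time is straightforward with the wbgetentities API module (a real Wikibase API action); a sketch that only builds the request URL, with no network call:

```python
from urllib.parse import urlencode

def wbgetentities_url(ids, api="https://www.wikidata.org/w/api.php"):
    """Build the API URL that returns full JSON for the given entity IDs."""
    return api + "?" + urlencode({
        "action": "wbgetentities",
        "ids": "|".join(ids),
        "format": "json",
    })

url = wbgetentities_url(["Q3505274", "P463"])  # Town of Westport, 'member of'
```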
My subset is probably much smaller than any of the subsets envisioned in the recent
discussions about “concise” or “notable” dumps of Wikidata. I assume that my
hyper-specific knowledge graph would want a few hundred properties from Wikidata and only
a few thousand items from Wikidata at most.
I will treat the entities that I pull from Wikidata as “read-only” copies - most of them
would be anyway. If there is a new statement I need to make about one of these, like
updating something about the Town of Westport, it is probably of interest to other
Wikidata consumers and the edit should be made on Wikidata and federated back. I can track
the (much smaller) set that I mirrored from Wikidata and periodically refresh them, so I
don’t need to try to process an entire dump.
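That tracking could be as simple as a table keyed by local ID. A sketch: the mapping structure and the second entity ID are made up, while lastrevid is the real revision field in Wikibase entity JSON:

```python
# Bookkeeping for mirrored entities: which local item mirrors which
# Wikidata entity, and at which revision it was last refreshed.
mirrored = {
    "Q1": {"source": "Q3505274", "lastrevid": 1100},  # Town of Westport
    "Q2": {"source": "Q999999",  "lastrevid": 2200},  # hypothetical
}

def stale(mirrored, current_revids):
    """Return local IDs whose Wikidata source has a newer revision."""
    return [local for local, m in mirrored.items()
            if current_revids.get(m["source"], 0) > m["lastrevid"]]

# current_revids would come from a wbgetentities call; faked here:
print(stale(mirrored, {"Q3505274": 1150, "Q999999": 2200}))  # ['Q1']
```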
A nice potential Wikibase feature would be the ability to explicitly flag an entity as
mirrored, so that Wikibase prevents it from being edited locally and includes
provenance information pointing back to the Wikibase (here, Wikidata) it came from.
Another feature that might be nice (though I might be wrong and maybe this is a bad
feature) is that if Wikibase knew an entity was an import, it could also dump that
entity’s triples to RDF using the same IRIs as the source system.
So, perhaps in summary, I would like to be able to reuse some data from Wikidata in my
local wikibase, but I am concerned about:
- I want a subset of Wikidata, but only a very small one
- How do I track the entities copied from Wikidata into my local Wikibase so I can update
them as new statements are added or updated on Wikidata?
- How can I make it easy for people who know the ontology and data already in Wikidata to
be able to edit the hyper-specific knowledge graph in my local Wikibase?
- How can I make it easy to query my hyper-specific knowledge graph using SPARQL while
maximizing query reuse between the Wikidata query service and my local query service,
potentially to the point of having the same query work on both if the query only involves
the Wikidata subset?
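On that last point: if the mirrored subset keeps Wikidata’s P and Q numbers, the only difference between the two query services is the base IRI, and retargeting a query becomes a plain substitution (the local base IRI below is hypothetical):

```python
WIKIDATA_BASE = "http://www.wikidata.org/"
LOCAL_BASE = "https://my-wikibase.example/"  # hypothetical local instance

def retarget(query, src=WIKIDATA_BASE, dst=LOCAL_BASE):
    """Point a SPARQL query's PREFIX declarations at another Wikibase."""
    return query.replace(src, dst)

q = ("PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
     "SELECT * WHERE { ?town wdt:P31 ?class }")
print(retarget(q))  # same query, now against the local prefixes
```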
Thanks,
-Erik
_______________________________________________
Wikidata mailing list
mailto:Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata