[Apologies for length]
I hope to build a knowledge graph of current and past public officials in
Wisconsin, both elected and appointed. This entire dataset is not
appropriate for Wikidata, as much of it will not meet the notability
guidelines - very few people care about who was on the planning commission
of the Town of Westport here in Dane County, Wisconsin back in 2006.
I would like to use Wikibase as the software to manage this hyper-specific
knowledge graph. Wikibase is easy to use for displaying and editing data,
and it is type-aware: if a datatype is richer than a string or an integer,
Wikibase can give a richer display, like showing a coordinate on a map.
Wikibase can also export data to a SPARQL query service. With upcoming
projects like WBStack, Wikibase could be available as a service and might
be a great place for me to host my hyper-specific knowledge graph, provided
there is a very clear path for how I could export my data and seamlessly
import it into a new Wikibase install with minimal disruption, should
WBStack ever shut down.
I would like to use data from Wikidata (or federate back to Wikidata) as
much as possible, for two main reasons.
First, while much of my data is not notable, there are items that will
overlap with Wikidata. The Town of Westport is Q3505274 in Wikidata, and I
would prefer not to have to duplicate the work that has already been done
collecting statements about Westport, such as population or geographical
footprint. Those statements, however, would be useful in queries in my
hyper-specific knowledge graph - a typical query might be to find members
of planning commissions of towns with populations between 2500 and 5000
people.
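As a rough sketch, that query could look something like this (assuming a
Wikidata-style model where ‘member of’ is P463, P1082 is population and
P131 ties the commission to its town - the exact modelling of commissions
is an assumption, and the commission items themselves would be my own
non-notable entities):

  PREFIX wdt: <http://www.wikidata.org/prop/direct/>

  SELECT ?person ?commission ?town ?population WHERE {
    ?person wdt:P463 ?commission .   # member of a planning commission
    ?commission wdt:P131 ?town .     # commission belongs to a town (modelling assumption)
    ?town wdt:P1082 ?population .    # population, ideally reused from Wikidata
    FILTER(?population >= 2500 && ?population <= 5000)
  }

An extra instance-of (P31) pattern on ?commission would narrow the results
to planning commissions specifically.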
Second, and possibly more important, there is a very useful ontology that
has developed through Wikidata, defining a set of classes for the world,
and the properties and datatypes to describe those classes, and the
relationships that connect those classes together. For example, ‘member
of’, P463 on Wikidata, is a property that my hyper-specific knowledge graph
would also use quite heavily. Equally important is that Wikidata has talk
pages, mailing lists, and wiki pages that have hashed out over the years
exactly how to use this ontology and what the best practices are. Does P463
apply only to people or can other things be ‘members of’? Five years ago
Wikidata editors thought about it and agreed that it’s probably OK for
things other than humans to be ‘members of’ other things. The knowledge and
best practices from Wikidata can be reused in other knowledge graphs even
if they’re not stored in Wikidata, and people who know how to model and
query data in Wikidata will feel at home in non-Wikidata knowledge graphs.
I’ve been thinking about what my options are to mix data between Wikidata
and my hyper-specific knowledge graph. I am not a Wikidata expert, so parts
of this might be incorrect.
The simplest possible thing I can do is to federate through SPARQL engines.
I can dump my hyper-specific graph into a local triple store, and use
SERVICE statements in SPARQL to reach out into Wikidata and fetch data in
my queries to combine the two datasets.
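A minimal sketch of that, assuming my local store reuses Wikidata item IRIs
for shared entities like towns and uses hypothetical ex: predicates for the
hyper-specific statements:

  PREFIX wdt: <http://www.wikidata.org/prop/direct/>
  PREFIX ex:  <https://example.org/hyper/>

  SELECT ?person ?town ?population WHERE {
    # local triples: who sits on which commission, and which town it serves
    ?person ex:memberOf ?commission .
    ?commission ex:commissionOf ?town .
    # remote triples: fetch each town's population from Wikidata
    SERVICE <https://query.wikidata.org/sparql> {
      ?town wdt:P1082 ?population .
    }
    FILTER(?population >= 2500 && ?population <= 5000)
  }

How well this performs depends on whether my engine pushes the bound ?town
values into the SERVICE call instead of pulling every population from
Wikidata first.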
As an improvement, I can load the Wikidata triples into my triplestore and
query the unified dataset directly, with fully namespaced IRIs in the
triples. As an added bonus, any query that worked on query.wikidata.org
would work unchanged against my local query service. Queries that use both
datasets should be
faster than federating through SERVICE, but the challenge is the Wikidata
dumps are huge and take days to load. Worse, it’s difficult to get
incremental changes, so to stay current I’d have to reload many gigabytes
every few days.
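For comparison, once the Wikidata triples are loaded locally, the same
query (same hypothetical prefixes as above) simply drops the SERVICE
wrapper, and Wikidata-only queries copied verbatim from query.wikidata.org
run as-is:

  PREFIX wdt: <http://www.wikidata.org/prop/direct/>
  PREFIX ex:  <https://example.org/hyper/>

  SELECT ?person ?town ?population WHERE {
    ?person ex:memberOf ?commission .
    ?commission ex:commissionOf ?town .
    ?town wdt:P1082 ?population .   # now answered from the locally loaded dump
    FILTER(?population >= 2500 && ?population <= 5000)
  }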
However, what I really want is to have as much relevant Wikidata data as
possible in my local Wikibase, so all the great Wikibase features work,
like fulltext autocomplete to edit entities, or ensuring that property
constraints can be applied, or that qualifiers are cleanly handled. One
red flag is that I don't think I've ever seen anyone do a good import of
Wikidata into a local Wikibase, or even a dump of a smaller Wikibase
instance into a new Wikibase instance. It’s probably even harder than
importing into the query service. I am also pretty sure that I don’t want
to try to import an entire Wikidata dataset into a local Wikibase, because
Wikidata is huge and there’s no good way to get an incremental set of
changes.
One idea that probably isn’t possible today is to store a statement in a
local Wikibase that references a remote Wikibase. I wish that I could
create a statement in my local Wikibase: Q2 P1
<http://www.wikidata.org/entity/Q3505274>, where Q2 and P1 are entities
created in my local Wikibase. As a workaround, I could put a statement of
Q2 P1 Q1 in my local Wikibase, and then put a property on my local Q1 to
say that the local Q1 has an external identifier of Q3505274 on Wikidata.
A fancier option would be for Wikibase to recognize when an entity
references a remote Wikibase and mirror that entity from the remote
Wikibase into the local one.
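Even with the workaround, SPARQL can still join out to Wikidata through the
stored identifier. A hedged sketch against my local query service, where
lwd:/lwdt: stand in for my local Wikibase namespaces, P1 is the
hypothetical local relation and P2 the hypothetical local ‘Wikidata QID’
external identifier:

  PREFIX lwd:  <https://hyper.example.org/entity/>
  PREFIX lwdt: <https://hyper.example.org/prop/direct/>
  PREFIX wdt:  <http://www.wikidata.org/prop/direct/>

  SELECT ?localTown ?population WHERE {
    lwd:Q2 lwdt:P1 ?localTown .       # the local statement Q2 P1 Q1
    ?localTown lwdt:P2 ?qid .         # local external identifier, e.g. "Q3505274"
    BIND(IRI(CONCAT("http://www.wikidata.org/entity/", ?qid)) AS ?wdTown)
    SERVICE <https://query.wikidata.org/sparql> {
      ?wdTown wdt:P1082 ?population . # then ask Wikidata about that entity
    }
  }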
I don’t think that I would care that my local Q and P numbers don’t match
up with any data federated into my local Wikibase, though to reduce
cognitive load I would probably try to load common properties to matching P
numbers at least (I think I will always want P31 and P279 to mean the same
thing as Wikidata’s P31 and P279, even if I have to create a bunch of empty
properties to ensure the numbering is right.)
Being able to customize the IRI prefixes in my RDF exports is a must-have.
It looks like that is possible today, though it is a bit clunky.
When editing items through the Wikibase UI, P and Q numbers already don't
matter much, because autocomplete hides all of that. If I'm adding a
statement about ‘educated at’, I don't need to know the P number for
‘educated at’ nor the Q number for the school involved. It would be nice if
that were easier in the query service too, so I could use labels in place
of P and Q numbers.
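One workaround that SPARQL already allows is resolving entities by their
labels inside the query - slower and potentially ambiguous, since several
entities can share a label, but it avoids hard-coding P and Q numbers. A
sketch that should run on query.wikidata.org today, with "Wikimedia
Foundation" as an arbitrary example value:

  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
  PREFIX wikibase: <http://wikiba.se/ontology#>

  SELECT ?person WHERE {
    ?prop rdfs:label "member of"@en ;       # find the property entity by label
          wikibase:directClaim ?memberOf .  # map it to the wdt: predicate (P463)
    ?person ?memberOf ?org .
    ?org rdfs:label "Wikimedia Foundation"@en .
  }
  LIMIT 20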
What is probably most realistic today for my use case is just manually
“federating” to Wikidata by copying Wikidata entities into my Wikibase as
needed, with new local Q and P numbers. For my hyper-specific knowledge
graph, there are probably not that many entities I need to pull from
Wikidata, so as I discover I need to create entities, I can check whether
Wikidata already has one and import it first.
My subset is probably much smaller than any of the subsets envisioned in
the recent discussions about “concise” or “notable” dumps of Wikidata. I
assume that my hyper-specific knowledge graph would want a few hundred
properties from Wikidata and only a few thousand items from Wikidata at
most.
I will treat the entities that I pull from Wikidata as “read-only” copies -
most of them would be anyway. If there is a new statement I need to make
about one of these, like updating something about the Town of Westport, it
is probably of interest to other Wikidata consumers and the edit should be
made on Wikidata and federated back. I can track the (much smaller) set of
entities that I mirrored from Wikidata and periodically refresh them, so I
don't need to try to process an entire dump.
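Tracking that mirrored set could be as simple as querying my local store
for everything carrying the hypothetical ‘Wikidata QID’ identifier (lwdt:P2
from the earlier sketch) and feeding the result to a periodic refresh job:

  PREFIX lwdt: <https://hyper.example.org/prop/direct/>

  SELECT ?localItem ?wikidataId WHERE {
    ?localItem lwdt:P2 ?wikidataId .  # every local entity mirrored from Wikidata
  }
  ORDER BY ?wikidataId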
A nice potential feature of Wikibase would be the ability to explicitly
flag an entity as a mirrored entity, so that Wikibase prevents it from
being edited locally and includes provenance information pointing back to
the Wikibase (here, Wikidata) it came from. Another feature that might be
nice (but I might be wrong and maybe this is a bad feature) is that if
Wikibase knew an entity was an import, it could also dump the triples using
the same IRIs as the source system when exporting to RDF.
So, perhaps in summary, I would like to be able to reuse some data from
Wikidata in my local Wikibase, but I am concerned about:
- I want a subset of Wikidata, but only a very small subset
- How do I track the entities copied from Wikidata into my local Wikibase
so I can update them as new statements are added or updated on Wikidata?
- How can I make it easy for people who know the ontology and data already
in Wikidata to be able to edit the hyper-specific knowledge graph in my
local Wikibase?
- How can I make it easy to query my hyper-specific knowledge graph using
SPARQL while maximizing query reuse between the Wikidata query service and
my local query service, potentially to the point of having the same query
work on both if the query only involves the Wikidata subset?
Thanks,
-Erik
Hi all,
join us for our monthly Analytics/Research Office hours on 2020-02-26 at
17.00-18.00 (UTC). Bring all your research questions and ideas to discuss
projects, data, analysis, etc…
To participate, please join the IRC channel: #wikimedia-research [1].
More detailed information can be found here [2] or on the etherpad [3] if
you would like to add items to the agenda or check notes from previous meetings.
Best,
Martin
[1] irc://chat.freenode.net:6667/wikimedia-research
[2] https://www.mediawiki.org/wiki/Wikimedia_Research/Office_hours
[3] https://etherpad.wikimedia.org/p/Research-Analytics-Office-hours
--
Martin Gerlach
Research Scientist
Wikimedia Foundation
Hi everyone,
---------------------------------------------------------------
TL;DR: soweego 2 is on its way.
Here's the Project Grant proposal:
https://meta.wikimedia.org/wiki/Grants:Project/Hjfocs/soweego_2
---------------------------------------------------------------
Does the name *soweego* ring a bell?
It's an artificial intelligence that links Wikidata to large catalogs [1].
It's a close friend of Mix'n'match [2], which mainly caters for small
catalogs.
The next big step is to check Wikidata content against third-party
trusted sources.
In a nutshell, we want to enable feedback loops between Wikidatans and
catalog maintainers.
The ultimate goal is to foster mutual benefits in the open knowledge
landscape.
I'd be really grateful if you could have a look at the proposal page [3].
Can't wait for your feedback.
Best,
Marco
[1] https://soweego.readthedocs.io/
[2] https://tools.wmflabs.org/mix-n-match/
[3] https://meta.wikimedia.org/wiki/Grants:Project/Hjfocs/soweego_2
Hello all,
Being able to use data from Wikidata in semantic wikis would be really cool
and open up a lot of possibilities.
We'd like to make this a reality and are looking both for feedback on our
initial idea and for funding.
https://professional.wiki/en/articles/semantic-wikibase
For those not familiar with Semantic MediaWiki, a very short and simplified
summary: Semantic MediaWiki was created by Denny and Markus (and several
contributors) and is the precursor software to Wikidata. While it was not a
good fit for Wikipedia itself, it gained a lot of popularity outside of
Wikimedia, and is used by many companies and organizations for managing
structured data. Later on, dedicated software for the Wikipedia/Wikimedia
use case was created: Wikibase. Hence creating a connection between the two
is very fitting.
Cheers
--
Jeroen De Dauw | Technical Director Professional.Wiki
<https://www.Professional.Wiki>
Enterprise wiki consulting and hosting. We are MediaWiki & SMW experts.
Professional.Wiki - Jeroen De Dauw & Karsten Hoffmeyer GbR
Tieckstraße 24-25, 10115 Berlin | +49 (30) 55 87 42 65 | USt-IdNr.
DE322440293
Dear all,
I thank you for your efforts. I have read with a lot of interest the report by Dr. David Abián about property constraints for Wikidata, available at https://www.wikidata.org/wiki/Wikidata:2020_report_on_Property_constraints. The report provides a significant overview of the usefulness and current status of Wikidata property constraints in improving the consistency of the Wikidata ontology. However, I think that property constraints suffer from critical limitations that currently harm the quality of Wikidata's linked data. That is why I propose a roadmap for adding support for ontological reasoning to Wikidata. This proposal is inspired by the work on using ShEx for ontology validation of Wikidata and by a work we sent for review to the World Wide Web Journal:
1. Link Shape Expressions to corresponding Wikidata classes and use them to validate the use of properties in Wikidata. This is possible through the acceptance of https://www.wikidata.org/wiki/Wikidata:Property_proposal/Shape_Expression_f….
2. Infer Shape Expressions for all major Wikidata classes. This is possible using https://wikitech.wikimedia.org/wiki/Tool:Wikidata_Shape_Expressions_Inferen….
3. Verify Shape Expressions for all Wikidata classes by hand.
4. Propose two new Wikidata properties: Valid Subject Class and Valid Object Class. These two properties will be used to define the valid classes for the subject and object of a Wikidata property. For example, the subject of a “Drug used for treatment” relation should be a disease or a symptom, and its object should be a drug or a chemical substance.
5. Define practical guidelines for adding Inverse property (P1696) statements. This property links a property to its inverse. For example, “Drug used for treatment” and “Medical condition treated” are two inverse properties. Where the relation is symmetric, the inverse of a given property is the property itself. For example, “Significant drug interaction” is its own inverse.
6. Add Valid Subject Classes and Valid Object Classes for all Wikidata properties when applicable. The Wikidata Query Service can be used to automate the process (a sketch of such a query follows this list). This is shown in the paper that was sent for review to the World Wide Web Journal.
7. Add Inverse property statements to each Wikidata property. “Valid Subject Class” and “Valid Object Class” are used as constraints for Inverse property statements. In fact, a property can have more than one inverse property, and each inverse property is used according to context. The Wikidata Query Service can be used to automate the process. This is shown in the paper that was sent for review to the World Wide Web Journal.
8. Develop a tool that combines statements about Wikidata properties with the Shape Expressions of Wikidata properties to validate the use of Wikidata properties and identify deficient statements.
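For example, step 6 could be bootstrapped with a query of the following kind on the Wikidata Query Service. This is only a sketch: it uses P2176 (“drug or therapy used for treatment”) as an example and relies on the prefixes predeclared on query.wikidata.org; the most frequent classes it returns are candidates for Valid Subject Class statements, to be verified by hand.

  SELECT ?class ?classLabel (COUNT(DISTINCT ?subject) AS ?uses) WHERE {
    ?subject wdt:P2176 ?drug ;  # subjects using "drug or therapy used for treatment"
             wdt:P31 ?class .   # the classes those subjects are instances of
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
  }
  GROUP BY ?class ?classLabel
  ORDER BY DESC(?uses)
  LIMIT 20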
I would like to know what you think about this detailed roadmap. I am currently convinced that this method will help us solve the linked data quality problem for Wikidata. I can help in applying this roadmap. Please reply soon.
Yours Sincerely,
Houcemeddine Turki
Hello all!
First of all, my apologies for the long silence. We need to do better in
terms of communication. I'll try my best to send a monthly update from now
on. Keep me honest, remind me if I fail.
First, we had a security incident at the end of December, which forced us
to move from our Kafka-based update stream back to the RecentChanges
poller. The details are still private, but you will be able to get the full
story soon on Phabricator [1]. The RecentChanges poller is less efficient
and this is leading to high update lag again (just when we thought we had
things slightly under control). We tried to mitigate this by improving the
parallelism in the updater [2], which helped a bit, but not as much as we
need.
Another attempt to get update lag under control is to apply back pressure
on edits, by adding the WDQS update lag to the Wikidata maxlag [6]. This is
obviously less than ideal (at least as long as WDQS updates are lagging as
often as they are), but does allow the service to recover from time to
time. We probably need to iterate on this, provide better granularity,
differentiate better between operations that have an impact on update lag
and those which don't.
On the slightly better news side, we now have a much better understanding
of the update process and of its shortcomings. The current process does a
full diff between each updated entity and what we have in Blazegraph. Even
if a single triple needs to change, we still read tons of data from
Blazegraph. While this approach is simple and robust, it is obviously not
efficient. We need to rewrite the updater to take a more event streaming /
reactive approach, and only work on the actual changes. This is a big chunk
of work, almost a complete rewrite of the updater, and we need a new
solution to stream changes with guaranteed ordering (something that our
Kafka queues don't offer). This is where we are focusing our energy at the
moment, this looks like the best option to improve the situation in the
medium term. This change will probably have some functional impacts [3].
Some misc things:
We have done some work to get better metrics and a better understanding of
what's going on, from collecting more metrics during the update [4] to
loading RDF dumps into Hadoop for further analysis [5] and better logging
of SPARQL requests. We are not focusing on this analysis until we are in a
more stable situation regarding update lag.
We have a new team member working on WDQS. He is still ramping up, but we
should have a bit more capacity from now on.
Some longer term thoughts:
Keeping all of Wikidata in a single graph is most probably not going to
work long term. We have not found examples of public SPARQL endpoints with
> 10 B triples and there is probably a good reason for that. We will
probably need to split the graphs at some point. We don't know how yet
(that's why we loaded the dumps into Hadoop, that might give us some more
insight). We might expose a subgraph with only truthy statements. Or have
language specific graphs, with only language specific labels. Or something
completely different.
Keeping WDQS / Wikidata as open as they are at the moment might not be
possible in the long term. We need to think if / how we want to implement
some form of authentication and quotas. Potentially increasing quotas for
some use cases, but keeping them strict for others. Again, we don't know
what this will look like, but we're thinking about it.
What you can do to help:
Again, we're not sure. Of course, reducing the load (both in terms of edits
on Wikidata and of reads on WDQS) will help. But not using those services
makes them useless.
We suspect that some use cases are more expensive than others (a single
property change to a large entity will require a comparatively insane
amount of work to update it on the WDQS side). We'd like to have real data
on the cost of various operations, but we only have guesses at this point.
If you've read this far, thanks a lot for your engagement!
Have fun!
Guillaume
[1] https://phabricator.wikimedia.org/T241410
[2] https://phabricator.wikimedia.org/T238045
[3] https://phabricator.wikimedia.org/T244341
[4] https://phabricator.wikimedia.org/T239908
[5] https://phabricator.wikimedia.org/T241125
[6] https://phabricator.wikimedia.org/T221774
--
Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+1 / CET
Dear Wikidata community,
As part of my PhD, I am currently developing a visualisation tool to help data producers, such as Wikidata contributors, manage incompleteness in Linked Data.
This is in line with tools such as Recoin. We will present our prototype at the Wikiworkshop in Taipei in April.
We are currently running an evaluation, and looking for participants.
Basically this consists of:
- a first video talk to understand specific problems the participant may have, and agree on a small set of data of interest for her/him to visualise in the tool (≈30 min)
- a second video talk once the data have been analysed (a few days later) to show how the tool works and get spontaneous feedback (≈45 min)
- the participant uses the tool on their own, giving us feedback such as:
- how much it was used: if not at all or a little, what were the problems encountered / if a lot, which specific problems it helped solve
- what could be improved
Feedback can be given through GitLab tickets, or more video talks, as the participant prefers.
If you are interested, please get in touch!
Best regards,
Marie Destandau
marie.destandau@inria.fr
https://www.lri.fr/~mdestandau