Hi all!
First of all, let me say that we all love the SPARQL endpoint, it's a
great service and it has become essential to how we interact with
Wikidata and run our bots. Great job by Stas and others!
I am also aware that it is still in beta mode. There is just one
issue that plagues us. I filed a bug report about it in Sep 2015
(https://phabricator.wikimedia.org/T112397) and the issue was
alleviated, but it turns out that it was not fully resolved:
-Occasionally, data written to an item in Wikidata via the API does
not make it into the triple store. (The frequency of the issue is hard
to determine.)
-It is a crucial issue because it can lead to data inconsistency by
creating duplicate items or incorrect properties/values on items.
-It seems to happen while the SPARQL endpoint is under high load (just
my impression)
How data is affected:
-New data does not make it into the triple store
-Updates to and merges of items do not make it into the triple store,
so queries return 'ghost items' that have actually been merged, or
show/miss results/items incorrectly because freshly added/deleted data
has not been completely serialized.
Example: item https://www.wikidata.org/wiki/Q416356, a protein,
recently got protein domains added via the 'has part' property. These
did not show up in SPARQL queries, and a DESCRIBE query for that item
confirmed that the triples were indeed missing. (The item has since
been modified, so it is fine now.)
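For anyone who wants to check a single item the same way, here is a
minimal sketch in Python. It compares the values of one property as
reported by Special:EntityData with what the SPARQL endpoint returns.
Assumptions on my side, not confirmed facts: that 'has part' is P527,
that the endpoint lives at https://query.wikidata.org/sparql, and that
the function names and the User-Agent string (all made up for this
example) suit your setup. It only looks at item-valued statements.

```python
import json
import urllib.parse
import urllib.request

# Assumed URLs; adjust if your setup differs.
SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
ENTITY_DATA = "https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
ENTITY_PREFIX = "http://www.wikidata.org/entity/"
USER_AGENT = "triple-store-sync-check/0.1 (hypothetical example script)"


def fetch_json(url):
    """GET a URL and parse the response body as JSON."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def api_item_values(entity_json, qid, pid):
    """Collect item-valued targets of property `pid` from the entity JSON."""
    claims = entity_json["entities"][qid].get("claims", {})
    values = set()
    for statement in claims.get(pid, []):
        # 'novalue' and 'somevalue' snaks carry no datavalue; skip them.
        datavalue = statement["mainsnak"].get("datavalue")
        if datavalue and datavalue.get("type") == "wikibase-entityid":
            values.add(datavalue["value"]["id"])
    return values


def build_query(qid, pid):
    """SPARQL query for all statement values (any rank) of `pid` on `qid`."""
    return (
        "PREFIX wd: <http://www.wikidata.org/entity/>\n"
        "PREFIX p: <http://www.wikidata.org/prop/>\n"
        "PREFIX ps: <http://www.wikidata.org/prop/statement/>\n"
        f"SELECT ?value WHERE {{ wd:{qid} p:{pid}/ps:{pid} ?value }}"
    )


def sparql_item_values(sparql_json):
    """Extract item IDs from a SPARQL JSON result with a ?value column."""
    values = set()
    for binding in sparql_json["results"]["bindings"]:
        uri = binding["value"]["value"]
        if uri.startswith(ENTITY_PREFIX):
            values.add(uri[len(ENTITY_PREFIX):])
    return values


def missing_from_triple_store(qid, pid):
    """Return values present in the API data but absent from the triple store."""
    entity_json = fetch_json(ENTITY_DATA.format(qid=qid))
    params = {"query": build_query(qid, pid), "format": "json"}
    sparql_json = fetch_json(SPARQL_ENDPOINT + "?" + urllib.parse.urlencode(params))
    return api_item_values(entity_json, qid, pid) - sparql_item_values(sparql_json)
```

Calling missing_from_triple_store("Q416356", "P527") should return an
empty set when the item is in sync. A non-empty set right after an edit
may just be normal update lag, so it only means something if it
persists.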
A workaround seems to be to modify the item, as this seems to trigger
re-serialization. But this is certainly not practical for larger
imports. Furthermore, as long as such an item does not get modified,
data could be missing from the triple store, or ghost data could
remain in it, for weeks or even months. It also turns out to be quite
difficult to determine how much of a given import effort actually made
it into the triple store, unless you iterate through all modified
items and check each one against the triple store, which would take a
significant amount of time.
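One way to make such a check cheaper might be to compare revision IDs
in batches instead of inspecting every statement. The sketch below is
only that, a sketch under assumptions: that the triple store exposes
each item's revision as schema:version, that wbgetentities with
props=info returns lastrevid, and that 50 IDs per API request is the
limit for the account used. All function names and the User-Agent
string are made up for illustration. Note that it can only detect
items whose stored revision lags behind; an item with a current
revision but incomplete triples would not be caught.

```python
import json
import urllib.parse
import urllib.request

# Assumed URLs; adjust if your setup differs.
API_URL = "https://www.wikidata.org/w/api.php"
SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
USER_AGENT = "triple-store-sync-check/0.1 (hypothetical example script)"


def chunks(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_json(url):
    """GET a URL and parse the response body as JSON."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def api_revisions(qids):
    """Map each item ID to its latest revision ID according to the API."""
    params = {"action": "wbgetentities", "ids": "|".join(qids),
              "props": "info", "format": "json"}
    data = fetch_json(API_URL + "?" + urllib.parse.urlencode(params))
    return {qid: entity["lastrevid"]
            for qid, entity in data["entities"].items()
            if "lastrevid" in entity}


def build_revision_query(qids):
    """SPARQL query returning the stored revision of each listed item."""
    values = " ".join(f"wd:{qid}" for qid in qids)
    return (
        "PREFIX wd: <http://www.wikidata.org/entity/>\n"
        "PREFIX schema: <http://schema.org/>\n"
        f"SELECT ?item ?rev WHERE {{ VALUES ?item {{ {values} }} "
        "?item schema:version ?rev }"
    )


def sparql_revisions(sparql_json):
    """Map each item ID to the revision ID found in the SPARQL result."""
    revisions = {}
    for binding in sparql_json["results"]["bindings"]:
        qid = binding["item"]["value"].rsplit("/", 1)[-1]
        revisions[qid] = int(binding["rev"]["value"])
    return revisions


def stale_items(api_revs, store_revs):
    """Items whose triple-store revision is missing or older than the API's."""
    return sorted(qid for qid, rev in api_revs.items()
                  if store_revs.get(qid, -1) < rev)


def find_stale(qids, batch_size=50):
    """Check a list of item IDs in batches; returns the lagging ones."""
    stale = []
    for batch in chunks(list(qids), batch_size):
        params = {"query": build_revision_query(batch), "format": "json"}
        store = sparql_revisions(
            fetch_json(SPARQL_ENDPOINT + "?" + urllib.parse.urlencode(params)))
        stale.extend(stale_items(api_revisions(batch), store))
    return stale
```

Running find_stale() over the item IDs a bot touched during an import
would give a list of candidates to re-check or re-edit. Items edited
only seconds ago may show up simply because of normal update lag.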
Could you maybe give us more info on the status of this issue, and
let us know whether we could do something to help alleviate it?
Thank you!
Sebastian
(sebotic)
--
Sebastian Burgstaller-Muehlbacher, PhD
Research Associate
Andrew Su Lab
MEM-216, Department of Molecular and Experimental Medicine
The Scripps Research Institute
10550 North Torrey Pines Road
La Jolla, CA 92037