Hi all!
First of all, let me say that we all love the SPARQL endpoint, it's a great service and it has become essential to how we interact with Wikidata and run our bots. Great job by Stas and others!
I am also aware that it is still in beta. There is just one issue that plagues us. I filed a bug report about it in Sep 2015 (https://phabricator.wikimedia.org/T112397), and the issue got alleviated, but it turned out not to be fully resolved:
- Occasionally, data written to an item in Wikidata via the API does not make it into the triple store. (The frequency of the issue is hard to determine.)
- It is a crucial issue because it can lead to data inconsistency by creating duplicate items or incorrect properties/values on items.
- It seems to happen while the SPARQL endpoint is under high load (just my impression).
How data is affected:
- New data does not make it into the triple store.
- Updates to and merges of items do not make it into the triple store, so 'ghost items' are returned which have actually been merged, or queries show/miss results/items incorrectly because freshly added/deleted data has not been completely serialized.
Example: item https://www.wikidata.org/wiki/Q416356, a protein, recently got protein domains added via the 'has part' property. These did not show up in SPARQL queries, and a DESCRIBE query for that item confirmed that the triples were indeed missing. (The item has since been modified, so it is fine now.)
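For a single suspect statement, an ASK query is a lighter-weight check than DESCRIBE. Here is a minimal sketch against the public endpoint (function names are my own; P527 is the 'has part' property mentioned above):

```python
# Sketch: check whether a single truthy statement is present in the WDQS
# triple store via an ASK query. Function names here are illustrative.
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"  # public endpoint

def build_ask_query(item_qid, prop_pid, value_qid):
    """Build an ASK query testing for one truthy (wdt:) statement."""
    return "ASK {{ wd:{} wdt:{} wd:{} }}".format(item_qid, prop_pid, value_qid)

def triple_in_store(item_qid, prop_pid, value_qid):
    """Return True if the triple store contains the statement."""
    query = build_ask_query(item_qid, prop_pid, value_qid)
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"})
    req = urllib.request.Request(
        url, headers={"User-Agent": "lag-check-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["boolean"]

# e.g. triple_in_store("Q416356", "P527", some_domain_qid)
```

If this returns False right after a write, the statement may simply not have been serialized into the store yet.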
A workaround seems to be to modify the item, as this apparently triggers re-serialization, but that is certainly not practical for larger imports. Furthermore, as long as such an item does not get modified, data could be missing from (or ghosting in) the triple store for weeks or even months. It also turns out to be quite difficult to determine how much of a given import effort finally made it into the triple store, short of iterating through all modified items and checking each one against the triple store, which would take a significant amount of time.
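One way to avoid iterating item by item, sketched below under the assumption that every item in the batch should carry the imported property: a single VALUES query can return the items the triple store has not caught up on yet. All function names are illustrative, not an official tool.

```python
# Sketch: after a bulk import that should have added property `prop_pid`
# to every item in `qids`, list the items still missing it in the store.
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

def build_missing_value_query(qids, prop_pid):
    """Build a SELECT returning items from `qids` with no `prop_pid` value."""
    values = " ".join("wd:{}".format(q) for q in qids)
    return ("SELECT ?item WHERE {{ VALUES ?item {{ {} }} "
            "FILTER NOT EXISTS {{ ?item wdt:{} ?v }} }}".format(values, prop_pid))

def items_missing_property(qids, prop_pid):
    """Return the QIDs from `qids` lacking a `prop_pid` statement in the store."""
    query = build_missing_value_query(qids, prop_pid)
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"})
    req = urllib.request.Request(
        url, headers={"User-Agent": "lag-check-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        rows = json.load(resp)["results"]["bindings"]
    return [row["item"]["value"].rsplit("/", 1)[-1] for row in rows]
```

For large imports the batch would need to be split into chunks of a few hundred QIDs to keep the query within endpoint limits.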
Could you maybe give us more info on the status of this issue and if we could do something to help alleviating it?
Thank you!
Sebastian (sebotic)
Hi!
Could you maybe give us more info on the status of this issue and if we could do something to help alleviating it?
I think the best way, if you discover this issue, is to tell me (by leaving a note on https://www.wikidata.org/wiki/User_talk:Smalyshev_(WMF) or by filing an issue in Phabricator and tagging it with wikidata-query-service), specifying:
- the affected ID
- the specific property that is wrong and which values are missing (test case queries welcome too :)
- when the property was updated, if known (sometimes I can see it in history, but bot updates are often massive and hard to search)
It's even better if you could hold off updating the item for a little while (I'd usually be able to look into it within a day or two, and then update it), since after it's updated it'd be hard for me to know which of the servers had wrong data and to research why it happened. Preserving the error state for a bit enables me to research in more detail.
Mostly unrelated to this: there might still be some deleted items lurking around in the DB from the deletion problem we had about a month ago. If you notice any, please tell me. Eventually I'll do a DB reload to sync them all, but that may take some time due to some other issues, so for now please just tell me about them.
Thanks,