We have a SPARQL query that we execute to locate as many of the "business shaped" entities as we can find. To do this, we execute a UNION query looking for instances of business (Q4830453) or any of its subclasses, things that have an OpenCorporates ID (P1320), and things that have a Legal Entity Identifier (P1278). We additionally fetch the version of each entity, so we know when entities we've previously seen have been updated.
We have had problems with this query returning duplicates. We figured this might be a concurrent update problem (the SPARQL query is relatively long running), so we added a GROUP BY on ?item and a MAX(?version) aggregate to ensure we only see the maximum version reported by WDQS:
SELECT ?item (MAX(?version) AS ?maxVersion) WHERE {
  {
    # Fetch all Wikidata records that are an instance of (P31), through zero or more
    # subclass of (P279), business (Q4830453).
    ?item (wdt:P31/(wdt:P279*)) wd:Q4830453;
          schema:version ?version.
  } UNION {
    # Fetch all Wikidata records that have an associated OpenCorporates ID.
    ?item wdt:P1320 ?ocid;
          schema:version ?version.
  } UNION {
    # Fetch all Wikidata records that have an associated Legal Entity Identifier.
    ?item wdt:P1278 ?lei;
          schema:version ?version.
  }
}
GROUP BY ?item
Despite the GROUP BY ?item, we occasionally encounter situations where the same entity is reported more than once. Here's an error reported by our processing code, which failed to insert a row into a database file because one of the entity IDs (Q134614000) had already been inserted.
time=2025-07-28T17:16:54.242Z level=INFO source=:0 msg="failed to load businesses and/or close database: failed to flush db: error: Duplicate key "id: Q134614000" violates primary key constraint."
The SPARQL results themselves were received at 16:59:
time=2025-07-28T16:59:13.984Z level=INFO msg="SPARQL results fetched and stored" path=/tmp/kgraph1255612088/sparql-results.json
Looking at the revision history for Q134614000 (sadly no longer viewable, as the entity was recently deleted), I can see there was a flurry of modifications to that record from 16:54 to 17:03. That points towards a potential concurrency issue within WDQS that prevented it from properly executing the GROUP BY portion of the query, which should have ensured that all ?item values are unique.
We re-ran our processor with the same SPARQL query at 17:30 and encountered no duplicate key problems, indicating that this time the query returned no duplicate entities.
We will likely add some additional safeguards to catch and handle these issues on our end. However, we want to understand whether this is a known issue/behavior in Wikibase.
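Concretely, the safeguard we have in mind is a client-side collapse of duplicate ?item rows before the database insert, keeping the highest version — i.e. redoing on our side what GROUP BY ?item / MAX(?version) should have done server-side. A minimal sketch (illustrative only, not our actual processor; the Binding type and the version values here are made up):

```go
package main

import "fmt"

// Binding is a simplified stand-in for one row of the SPARQL JSON results.
type Binding struct {
	Item    string
	Version int64
}

// dedupe keeps one row per item, preferring the largest version — mirroring
// what GROUP BY ?item with MAX(?version) should guarantee server-side.
func dedupe(rows []Binding) map[string]int64 {
	latest := make(map[string]int64)
	for _, r := range rows {
		if v, ok := latest[r.Item]; !ok || r.Version > v {
			latest[r.Item] = r.Version
		}
	}
	return latest
}

func main() {
	rows := []Binding{
		{"Q134614000", 2361094521},
		{"Q134614000", 2361094533}, // duplicate item, as seen in the racy result set
		{"Q42", 1},
	}
	out := dedupe(rows)
	fmt.Println(len(out), out["Q134614000"]) // prints: 2 2361094533
}
```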
Ryan Kennedy
Hello Ryan,
I think you can avoid this problem by using DISTINCT in the query:
SELECT DISTINCT ?item (MAX(?version) AS ?maxVersion) WHERE { [...]
Does this work for you?
Yours, Dragan
On Fri, Aug 1, 2025 at 7:17 PM Ryan Kennedy ryan.kennedy@groupargus.com wrote:
_______________________________________________
Wikibase Community User Group mailing list -- wikibaseug@lists.wikimedia.org
To unsubscribe send an email to wikibaseug-leave@lists.wikimedia.org
It's difficult to say for sure whether this would help, since I can't reproduce the problem on demand. I'd need to run the SPARQL query repeatedly over a lengthy period (long enough to hit what I believe is a race condition) to feel even somewhat confident that the issue is addressed — and not 100% confident even then, since the race might simply not have occurred while I was testing.
However, what would DISTINCT accomplish that the GROUP BY isn't already supposed to do? GROUP BY ?item should already guarantee there are no duplicate ?item values, but in our experience with WDQS that's not the case. If adding DISTINCT would fix the problem, I'd like to understand why GROUP BY ?item wasn't sufficient, so I can better understand SPARQL and/or Wikibase's internals.
I'm wondering if, internal to Wikibase, there's a potential race here that would explain things. Is Wikibase being run in some parallel fashion, where separate threads/processes can return duplicate items during a race where an item is being updated during the query? If so, is there a coordinator responsible for processing the GROUP BY, which could ensure an item is only emitted once? If so, is there a bug in the coordinator in this case?
On Fri, Aug 1, 2025 at 12:52 PM Dragan Espenschied via Wikibase Community User Group wikibaseug@lists.wikimedia.org wrote: