We have a SPARQL query that we execute to locate as many of the "business
shaped" entities as we can find. To do this, we execute a UNION query
looking for instances of business (Q4830453) or of any of its subclasses,
things that have an OpenCorporates ID (P1320), and things that have a Legal
Entity Identifier (P1278). We additionally fetch each entity's version
(schema:version), so we know when entities that we've previously seen have
been updated.
We have had problems with this query returning duplicates. We figured this
might be a concurrent update problem (the SPARQL query is relatively long
running), so we added a GROUP BY on ?item and use the MAX(?version)
aggregate to ensure we only see the highest version reported by WDQS for
each item:
SELECT ?item (MAX(?version) AS ?maxVersion) WHERE {
  {
    # Fetch all Wikidata records that are an instance of (P31), through
    # zero or more subclass of (P279), business (Q4830453).
    ?item (wdt:P31/(wdt:P279*)) wd:Q4830453;
          schema:version ?version.
  }
  UNION
  {
    # Fetch all Wikidata records that have an associated OpenCorporates ID.
    ?item wdt:P1320 ?ocid;
          schema:version ?version.
  }
  UNION
  {
    # Fetch all Wikidata records that have an associated Legal Entity Identifier.
    ?item wdt:P1278 ?lei;
          schema:version ?version.
  }
}
GROUP BY ?item
Despite the GROUP BY ?item, we occasionally encounter situations where the
same entity is reported more than once. Here's an error reported by our
processing code, which failed to insert a row into a database file because
one of the entity IDs (Q134614000) had already been inserted.
time=2025-07-28T17:16:54.242Z level=INFO source=:0 msg="failed to load
businesses and/or close database: failed to flush db: error: Duplicate key
\"id: Q134614000\" violates primary key constraint."
The SPARQL results themselves were received at 16:59:
time=2025-07-28T16:59:13.984Z level=INFO msg="SPARQL results fetched and
stored" path=/tmp/kgraph1255612088/sparql-results.json
Looking at the revision history for Q134614000 (sadly not viewable, as the
entity was recently deleted), I can see there was a flurry of modifications
to that record from 16:54 to 17:03. This points towards a potential
concurrency issue within WDQS that prevented it from properly applying the
GROUP BY, which should have ensured that all ?item values were unique.
We re-ran our processor using the same SPARQL query at 17:30 and
encountered no duplicate key problems, indicating that this time the query
did not return any duplicate entities.
We want to understand whether this is a known issue or expected behavior in
Wikibase. In the meantime, we will likely add some additional safeguards on
our end to catch and handle these duplicates.
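For illustration, a rough sketch of the kind of safeguard we have in mind:
deduplicate by QID, keeping the highest version, before flushing to the
database. The business type here is a hypothetical stand-in for our
loader's actual row type, not our real code:

package main

import "fmt"

// business is a stand-in for the row type our loader inserts; the real
// schema is larger, but only the ID and version matter for deduplication.
type business struct {
	ID      string // Wikidata QID, e.g. "Q134614000"
	Version int64  // schema:version reported by WDQS
}

// dedupe collapses duplicate QIDs, keeping the highest version seen, so a
// duplicate row from WDQS can no longer violate the primary key constraint.
func dedupe(rows []business) []business {
	best := make(map[string]business, len(rows))
	for _, r := range rows {
		if cur, ok := best[r.ID]; !ok || r.Version > cur.Version {
			best[r.ID] = r
		}
	}
	out := make([]business, 0, len(best))
	for _, r := range best {
		out = append(out, r)
	}
	return out
}

func main() {
	rows := []business{
		{ID: "Q134614000", Version: 2344000000},
		{ID: "Q134614000", Version: 2344000123}, // duplicate from WDQS
	}
	fmt.Println(dedupe(rows)) // only the higher-version row survives
}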
Ryan Kennedy