Dear Laurence,
Thank you for your prompt and insightful response!
Your answers will certainly be very helpful for our team.
Best Regards,
Elton F. de S. Soares
Advisory Software Engineer
Rio de Janeiro, RJ, Brazil
IBM Research
E-mail: eltons@ibm.com
From: Laurence Parry <greenreaper@hotmail.com>
Date: Thursday, 1 February 2024 at 20:43
To: Wikibase Community User Group <wikibaseug@lists.wikimedia.org>, wikidata-tech@lists.wikimedia.org <wikidata-tech@lists.wikimedia.org>
Cc: Renato F Maia <renato.maia@ibm.com>, Guilherme Lima <Guilherme.Lima@ibm.com>, Leonardo Guerreiro Azevedo <lga@br.ibm.com>, Marcelo O C Machado <mmachado@ibm.com>, Joao Marcello Bessa Rodrigues <joao.bessa@ibm.com>, Raphael Melo Thiago <raphaelt@br.ibm.com>,
Elton Figueiredo de Souza Soares <eltons@ibm.com>
Subject: [EXTERNAL] Re: Wikibase/Wikidata Database Technologies and Strategies
Dear Elton (and others),
Wikibase uses the main MediaWiki database (which is normally MySQL/MariaDB but may be PostgreSQL or SQLite - see Special:Version) to store data about entities. They are stored as JSON blobs in a custom slot type as the primary content of
the pages in certain namespaces. Examples of a particular entity:
https://furry.wikibase.cloud/wiki/Special:EntityData/Q4.json (similar
to what Wikibase might store)
https://furry.wikibase.cloud/wiki/Special:EntityData/Q4.ttl (close
to the triples that WDQS might consume - it might prefer the rdf form)
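For example (a rough, untested sketch in Python using the requests library; the Q4 entity and URLs are the ones above), both serialisations come from the same underlying entity page:

    import requests

    BASE = "https://furry.wikibase.cloud/wiki/Special:EntityData"

    # JSON serialisation - roughly the structure Wikibase keeps in the page's content slot
    entity = requests.get(f"{BASE}/Q4.json", timeout=30).json()
    labels = entity["entities"]["Q4"]["labels"]
    print(labels.get("en", {}).get("value"))

    # Turtle serialisation - close to the triples that WDQS ingests
    ttl = requests.get(f"{BASE}/Q4.ttl", timeout=30).text
    print(ttl[:200])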
The storage system is quite complicated, because MediaWiki has come to store many different types of content and many revisions of it; see the storage documentation linked below.
Secondary Wikibase storage takes the form of tables containing what might in some ways be seen as views or indexes on the JSON blobs in the pages that constitute primary storage. It is updated in a deferred manner. See the schemas at https://mediawiki.org/wiki/Wikibase/Schema and the explanation at https://doc.wikimedia.org/Wikibase/master/php/docs_topics_storage.html
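To give a flavour of what those secondary tables hold - this is only a sketch, with table and column names taken from my reading of the schema page above, so please double-check them against your Wikibase version - an English label lookup against the normalised term store could look like this (Python with PyMySQL; the connection details are placeholders):

    import pymysql

    # Placeholder connection details for a local Wikibase install
    conn = pymysql.connect(host="localhost", user="wikiuser",
                           password="secret", database="my_wiki")

    SQL = """
    SELECT wbx_text
    FROM wbt_item_terms
    JOIN wbt_term_in_lang ON wbit_term_in_lang_id = wbtl_id
    JOIN wbt_type         ON wbtl_type_id = wby_id
    JOIN wbt_text_in_lang ON wbtl_text_in_lang_id = wbxl_id
    JOIN wbt_text         ON wbxl_text_id = wbx_id
    WHERE wbit_item_id = %s AND wby_name = 'label' AND wbxl_language = %s
    """

    with conn.cursor() as cur:
        cur.execute(SQL, (4, "en"))   # 4 = numeric part of Q4
        row = cur.fetchone()
        print(row[0] if row else "no label found")

Bear in mind these tables are derived from the JSON blobs, not the other way round, so writing to them directly would be a mistake.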
(Wikimedia Commons uses a secondary slot to store a separate Mediainfo entity type:
https://mediawiki.org/wiki/Extension:WikibaseMediaInfo - https://mediawiki.org/wiki/Extension:WikibaseMediaInfo/Data_Model/JSON )
A separate graph database called Blazegraph is used as the storage component for the Wikidata Query Service (WDQS), a Java-centric system which consists of repository, updater and proxy components, plus a web front end.
The updater either reads the MediaWiki recent changes feed to identify changed pages and then retrieves their triples from the entity data endpoints (there is a link for this in the source of the main page as well; these triple-view formats are provided by the Wikibase extension), or it is fed changes through the Kafka-mediated Flink streaming updater used on Wikidata:
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater
(I am not sure if others have implemented this - the distribution's updater uses recent changes.)
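To illustrate the polling approach, here is a rough Python sketch of the same idea - not what the updater literally runs, and using the demo wiki above purely as an example:

    import requests

    API = "https://furry.wikibase.cloud/w/api.php"
    ENTITY_DATA = "https://furry.wikibase.cloud/wiki/Special:EntityData"

    # 1. Ask MediaWiki which pages changed recently.
    changes = requests.get(API, params={
        "action": "query", "list": "recentchanges",
        "rcprop": "title|timestamp", "rclimit": 50, "format": "json",
    }, timeout=30).json()["query"]["recentchanges"]

    # 2. For pages that look like entity pages, re-fetch their RDF serialisation,
    #    which is roughly what the distribution's updater pushes into Blazegraph.
    for change in changes:
        entity_id = change["title"].split(":")[-1]   # e.g. "Item:Q4" -> "Q4"
        if entity_id[:1] in ("Q", "P") and entity_id[1:].isdigit():
            ttl = requests.get(f"{ENTITY_DATA}/{entity_id}.ttl", timeout=30).text
            print(entity_id, len(ttl), "bytes of Turtle")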
It is possible that some other data is cached within Wikibase using MediaWiki's cache solution but the most common caching is that of rendered pages. Perhaps a developer who knows the details could chime in w.r.t. your question; failing
that, the source code is available, if complex.
Data may also be indexed in an attached ElasticSearch (Cirrus) index which, if provisioned, can hook into Wikibase and WDQS to enable search features that are in some cases more efficient than a Blazegraph query, or that reach data not stored as triples.
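For instance (a sketch only; I am using Wikidata's endpoint and its P31/Q5 identifiers as examples, and the haswbstatement keyword assumes WikibaseCirrusSearch is installed):

    import requests

    API = "https://www.wikidata.org/w/api.php"   # Wikidata used purely as an example

    # Label/prefix search - served from ElasticSearch when WikibaseCirrusSearch
    # is installed, otherwise from the SQL term store.
    hits = requests.get(API, params={
        "action": "wbsearchentities", "search": "Douglas Adams",
        "language": "en", "type": "item", "format": "json",
    }, timeout=30).json()["search"]
    print([h["id"] for h in hits])

    # Full-text search combined with the haswbstatement keyword, e.g. pages whose
    # item has P31=Q5 ("instance of: human" on Wikidata) and mention "Adams".
    pages = requests.get(API, params={
        "action": "query", "list": "search",
        "srsearch": "haswbstatement:P31=Q5 Adams", "format": "json",
    }, timeout=30).json()["query"]["search"]
    print([p["title"] for p in pages[:5]])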
Otherwise, yes, WDQS is the main complex query interface (though Blazegraph might not remain the store behind it in the future: its lack of ongoing external support is problematic - the team was bought out by AWS to create Neptune).
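Querying it is just SPARQL over HTTP; for instance (a sketch against the public Wikidata endpoint, with P31/Q146 as example identifiers):

    import requests

    WDQS = "https://query.wikidata.org/sparql"   # public Wikidata endpoint, as an example

    query = """
    SELECT ?item ?itemLabel WHERE {
      ?item wdt:P31 wd:Q146 .   # example: instances of "house cat" on Wikidata
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    } LIMIT 5
    """

    rows = requests.get(WDQS, params={"query": query, "format": "json"},
                        headers={"User-Agent": "wikibase-storage-question/0.1"},
                        timeout=60).json()["results"]["bindings"]
    for row in rows:
        print(row["item"]["value"], row["itemLabel"]["value"])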
There is to my knowledge no pagination of statements, in the sense that you access all statements at once even if they may be consumed one at a time within e.g. Lua. This is why accessing an entity is considered expensive.
For this reason it is an inefficient anti-pattern to have thousands of statements on a single entity, especially if editing them actively, as there will be a lot of serialisation going on - it is likely better to have thousands of finely-divided
entities.
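(As a side note, sketched with Wikidata's Action API and example identifiers: a client that only needs one property can at least avoid transferring the rest by filtering with wbgetclaims, although as far as I understand the full entity is still loaded server-side.)

    import requests

    API = "https://www.wikidata.org/w/api.php"   # example endpoint

    # Fetch only the P31 ("instance of") statements of Q42 rather than the whole entity.
    claims = requests.get(API, params={
        "action": "wbgetclaims", "entity": "Q42",
        "property": "P31", "format": "json",
    }, timeout=30).json()["claims"]
    print(len(claims.get("P31", [])), "P31 statements")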
The REST API is, as far as I know, just a different way to access the main MediaWiki component (versus the Action API) and isn't a standalone system.
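For completeness, a sketch of what using it looks like - the exact routes and version segment are an assumption on my part, so check the current REST API documentation for your deployment:

    import requests

    # Assumption: route names as documented for the Wikibase REST API on Wikidata;
    # adjust the version segment ("v0"/"v1") to whatever your deployment exposes.
    REST = "https://www.wikidata.org/w/rest.php/wikibase/v1"

    item = requests.get(f"{REST}/entities/items/Q42", timeout=30).json()
    print(item["labels"]["en"])

    statements = requests.get(f"{REST}/entities/items/Q42/statements", timeout=30).json()
    print(sum(len(group) for group in statements.values()), "statements in total")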
This maybe doesn't answer all your questions but hopefully it helps.
Best regards,
--
Laurence 'GreenReaper' Parry
Wikibase Community User Group
From: Elton Figueiredo de Souza Soares via Wikibase Community User Group <wikibaseug@lists.wikimedia.org>
Sent: Thursday, February 1, 2024 10:20:10 pm
To: wikidata-tech@lists.wikimedia.org <wikidata-tech@lists.wikimedia.org>; wikibaseug@lists.wikimedia.org <wikibaseug@lists.wikimedia.org>
Cc: Renato F Maia <renato.maia@ibm.com>; Guilherme Lima <Guilherme.Lima@ibm.com>; Leonardo Guerreiro Azevedo <lga@br.ibm.com>; Marcelo O C Machado <mmachado@ibm.com>; Joao Marcello Bessa
Rodrigues <joao.bessa@ibm.com>; Raphael Melo Thiago <raphaelt@br.ibm.com>; Elton Figueiredo de Souza Soares <eltons@ibm.com>
Subject: [Wikibase] Wikibase/Wikidata Database Technologies and Strategies
Dear Wikibase/Wikidata Community,
We are trying to understand which database technologies and strategies Wikibase/Wikidata uses for storing, updating, and querying the data (knowledge) it manipulates.
By looking at the documentation we
understood that RDF is only used for the Wikidata Query Service, but we could not find out exactly how Wikibase/Wikidata stores the information that is translated to RDF during the data dump.
More specifically, we understood that a MySQL (or is it MariaDB?) relational database is used as the key persistence component for most of Wikibase/Wikidata services and that the information that is maintained in this
database is periodically exported to multiple formats, including RDF.
In addition, looking at the relational database schema published in the documentation we could not locate tables that are easily mappable to the Wikibase Data Model.
Thus, we hypothesize that there is some software component (Wikibase Common Data Access?) that dynamically translates the data contained in those tables into Statements, Entities, etc. Is that hypothesis correct?
If yes, does this software component use any intermediate storage mechanism for caching those Statements, Entities, ...? Or are those translations always performed at
runtime on-the-fly (be it for querying, adding, or updating Statements, Entities, …)?
Finally, we would like to understand more about how Wikidata REST API is implemented:
· In which database are the statements added/retrieved through it stored? Are they stored in the central MySQL database or in another database?
· Does it have any support for pagination of statements? For example, if an item has many statements associated with a property, does the API assume that both the underlying database and the network will support the retrieval of all those statements?
· Are you currently considering implementing support for more flexible querying of statements, or has that requirement been fully delegated to the Wikidata Query Service?
If there is updated documentation that could help us answer these questions, could you kindly point us to it? Otherwise, would you be able to share this information with us?
Best Regards,
Elton F. de S. Soares
Advisory Software Engineer
Rio de Janeiro, RJ, Brazil
IBM Research
E-mail: eltons@ibm.com