Dear Elton (and others),
Wikibase uses the main MediaWiki database (normally MySQL/MariaDB, but possibly
PostgreSQL or SQLite - see Special:Version) to store data about entities. Entities are
stored as JSON blobs, in a custom slot type, as the primary content of pages in certain
namespaces. Examples for a particular entity:
https://furry.wikibase.cloud/wiki/Special:EntityData/Q4.json (similar to what Wikibase
might store)
https://furry.wikibase.cloud/wiki/Special:EntityData/Q4.ttl (close to the triples that
WDQS might consume - it might prefer the rdf form)
https://furry.wikibase.cloud/wiki/Item:Q4 (WebUI)
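To illustrate, here is a minimal Python sketch of how one might build those Special:EntityData URLs and pick apart the JSON shape they return. The embedded sample document is a trimmed-down, hypothetical stand-in for the real response (which nests entities under an "entities" key); the helper function name is my own.

```python
import json

def entity_data_url(base, entity_id, fmt="json"):
    """Build a Special:EntityData URL for a Wikibase instance.
    `base` is assumed to be the site root, with entity pages under /wiki/."""
    return f"{base}/wiki/Special:EntityData/{entity_id}.{fmt}"

# Trimmed-down, hypothetical sample of the JSON shape the endpoint returns.
sample = json.loads("""
{
  "entities": {
    "Q4": {
      "type": "item",
      "labels": {"en": {"language": "en", "value": "example item"}},
      "claims": {
        "P1": [{"mainsnak": {"snaktype": "value", "property": "P1"}}]
      }
    }
  }
}
""")

entity = sample["entities"]["Q4"]
label = entity["labels"]["en"]["value"]
# Statements are grouped per property; count them all.
statement_count = sum(len(v) for v in entity["claims"].values())
print(entity_data_url("https://furry.wikibase.cloud", "Q4"))
print(label, statement_count)
```

The same helper would produce the .ttl and .rdf variants by changing the `fmt` argument.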
The storage system is quite complicated because MediaWiki has come to store many different
types of content and many revisions of it, but see
https://mediawiki.org/wiki/Manual:Database_layout
https://mediawiki.org/wiki/Manual:Content_table
https://mediawiki.org/wiki/Multi-Content_Revisions
https://mediawiki.org/wiki/Multi-Content_Revisions/Database_Schema
https://mediawiki.org/wiki/Multi-Content_Revisions/Content_Meta-Data
Wikibase also uses secondary storage in the form of tables, which can in some ways be
seen as views or indexes over the JSON blobs in the pages that constitute primary
storage. It is updated in a deferred manner. See the schemas at
https://mediawiki.org/wiki/Wikibase/Schema
and
https://doc.wikimedia.org/Wikibase/master/php/docs_topics_storage.html and explanation
at
https://doc.wikimedia.org/Wikibase/master/php/docs_storage_terms.html
(Wikimedia Commons uses a secondary slot to store a separate Mediainfo entity type:
https://mediawiki.org/wiki/Extension:WikibaseMediaInfo -
https://mediawiki.org/wiki/Extension:WikibaseMediaInfo/Data_Model/JSON )
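As a rough mental model of that secondary storage, here is an illustrative Python sketch that flattens the labels and descriptions out of an entity's JSON blob into lookup rows, loosely in the spirit of the wbt_* term-store tables from the schema linked above. The function and the row shape are my own simplification, not the actual table layout.

```python
def flatten_terms(entity_id, entity_json):
    """Illustrative only: emit (entity_id, term_type, language, text) rows,
    roughly the kind of denormalised lookup data the term-store tables hold
    so labels can be found without parsing every JSON blob."""
    rows = []
    for term_type in ("labels", "descriptions"):
        for lang, term in entity_json.get(term_type, {}).items():
            # "labels" -> "label", "descriptions" -> "description"
            rows.append((entity_id, term_type.rstrip("s"), lang, term["value"]))
    return rows

entity = {
    "labels": {"en": {"language": "en", "value": "example"}},
    "descriptions": {"en": {"language": "en", "value": "an example item"}},
}
rows = flatten_terms("Q4", entity)
print(rows)
```

Because these rows are derived data, they can be rebuilt from the primary JSON at any time, which is why their updates can safely be deferred.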
A separate graph database called Blazegraph is used as the storage component for the
Wikidata Query Service (WDQS), a Java-centric system which consists of repository, updater
and proxy components, plus a web front end.
https://mediawiki.org/wiki/Wikidata_Query_Service/User_Manual
https://mediawiki.org/wiki/Wikidata_Query_Service/Implementation
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service
The updater either reads the MediaWiki RecentChanges feed to identify the pages it
needs to access, retrieving their triples from the entity data endpoints (there is a
link for this in the source of the main page as well; these triple view formats are
provided by the Wikibase extension), or it is fed changes through the Kafka-mediated
Flink streaming updater used on Wikidata:
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater
(I am not sure if others have implemented the latter - the distribution's updater uses
recent changes.)
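The recent-changes-driven cycle can be sketched in a few lines of Python: collect the distinct entity pages that changed, then plan one triples fetch per entity from Special:EntityData. The change dicts below are hypothetical sample data loosely mimicking the Action API's list=recentchanges output; the real updater is considerably more involved.

```python
def plan_updates(changes, base):
    """Sketch of a recent-changes-driven WDQS update cycle: dedupe the
    entity IDs that changed, then fetch fresh triples for each one."""
    seen, urls = set(), []
    for change in changes:
        title = change["title"]            # e.g. "Item:Q4"
        entity_id = title.split(":", 1)[-1]
        if entity_id not in seen:
            seen.add(entity_id)
            urls.append(f"{base}/wiki/Special:EntityData/{entity_id}.ttl")
    return urls

changes = [
    {"title": "Item:Q4", "revid": 101},
    {"title": "Item:Q4", "revid": 102},    # second edit to the same item: fetched once
    {"title": "Item:Q7", "revid": 103},
]
print(plan_updates(changes, "https://furry.wikibase.cloud"))
```

Deduplicating before fetching matters because re-serialising a whole entity is the expensive step, regardless of how many individual edits it received.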
It is possible that some other data is cached within Wikibase using MediaWiki's caching
infrastructure, but the most common caching is that of rendered pages. Perhaps a developer who
knows the details could chime in w.r.t. your question; failing that, the source code is
available, if complex.
Data may also be indexed in an attached Elasticsearch (Cirrus) index which, if
provisioned, can hook into Wikibase and WDQS to enable search features that are in some
cases more efficient than a Blazegraph query, or that access data not stored as triples:
https://mediawiki.org/wiki/Help:Extension:WikibaseCirrusSearch
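For example, WikibaseCirrusSearch adds a haswbstatement: search keyword, which can be used through the standard Action API search. Here is a small Python sketch that just constructs such a request URL (assuming the extension is installed and the API lives at the usual /w/api.php path); it does not perform the request.

```python
from urllib.parse import urlencode

def statement_search_url(base, prop, value):
    """Build an Action API fulltext search URL using WikibaseCirrusSearch's
    haswbstatement: keyword to find entities with a given statement."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": f"haswbstatement:{prop}={value}",
        "format": "json",
    }
    return f"{base}/w/api.php?{urlencode(params)}"

url = statement_search_url("https://furry.wikibase.cloud", "P31", "Q5")
print(url)
```

For "all entities where property P has value Q" lookups, this kind of indexed search can be much cheaper than the equivalent SPARQL query.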
Otherwise, yes, WDQS is the main complex query interface (though Blazegraph might not
remain its only backing store in the future - its lack of ongoing external support makes
relying on it problematic, as the development team was bought out by AWS to create Neptune).
There is to my knowledge no pagination of statements, in the sense that you access all
statements at once even if they may be consumed one at a time within e.g. Lua. This is why
accessing an entity is considered expensive.
For this reason it is an inefficient anti-pattern to have thousands of statements on a
single entity, especially if editing them actively, as there will be a lot of
serialisation going on - it is likely better to have thousands of finely-divided
entities.
As far as I know, the REST API is just a different way to access the main MediaWiki
component vs. the Action API, not a standalone system.
This maybe doesn't answer all your questions but hopefully it helps.
Best regards,
--
Laurence 'GreenReaper' Parry
Wikibase Community User Group
https://GreenReaper.co.uk -
https://wikifur.com
________________________________
From: Elton Figueiredo de Souza Soares via Wikibase Community User Group
<wikibaseug(a)lists.wikimedia.org>
Sent: Thursday, February 1, 2024 10:20:10 pm
To: wikidata-tech(a)lists.wikimedia.org <wikidata-tech(a)lists.wikimedia.org>;
wikibaseug(a)lists.wikimedia.org <wikibaseug(a)lists.wikimedia.org>
Cc: Renato F Maia <renato.maia(a)ibm.com>; Guilherme Lima
<Guilherme.Lima(a)ibm.com>; Leonardo Guerreiro Azevedo <lga(a)br.ibm.com>; Marcelo
O C Machado <mmachado(a)ibm.com>; Joao Marcello Bessa Rodrigues
<joao.bessa(a)ibm.com>; Raphael Melo Thiago <raphaelt(a)br.ibm.com>; Elton
Figueiredo de Souza Soares <eltons(a)ibm.com>
Subject: [Wikibase] Wikibase/Wikidata Database Technologies and Strategies
Dear Wikibase/Wikidata Community,
We are trying to understand which database technologies and strategies Wikibase/Wikidata
uses for storing, updating, and querying the data (knowledge) it manipulates.
By looking at the
documentation<https://wmde.github.io/wikidata-wikibase-architecture/assets/img/03-dataflow-out.drawio.17c12ee9.svg>
we understood that RDF is only used for the Wikidata Query Service, but we could not find
out exactly how Wikibase/Wikidata stores the information that is translated to RDF during
the data dump.
More specifically, we understood that a MySQL (or is it MariaDB?) relational database is
used as the key persistence component for most of Wikibase/Wikidata services and that the
information that is maintained in this database is periodically exported to multiple
formats, including RDF.
In addition, looking at the relational database schema published in the
documentation<https://www.mediawiki.org/wiki/Manual:Database_layout> we could not
locate tables that are easily mappable to the Wikibase Data
Model<https://www.mediawiki.org/wiki/Wikibase/DataModel>el>.
Thus, we hypothesize that there is some software component (Wikibase Common Data Access?)
that dynamically translates the data contained in those tables to Statements, Entities,
etc. Is that hypothesis correct?
If yes, does this software component use any intermediate storage mechanism for caching
those Statements, Entities, ...? Or are those translations always performed at runtime
on-the-fly (be it for querying, adding, or updating Statements, Entities, …)?
Finally, we would like to understand more about how Wikidata REST
API<https://www.wikidata.org/wiki/Wikidata:REST_API> is implemented:
• In which database are the statements that are added/retrieved through it stored? Are
they stored in the central MySQL database or in another database?
• Does it have any support for pagination of statements? For example, if an item
has many statements associated with a property, does the API assume that both the
underlying database and the network will support the retrieval of all those statements?
• Are you currently considering implementing the support for more flexible
querying of statements, or has such a requirement been fully delegated to the Wikidata Query
Service?
If there is updated documentation that could help us answer those questions, could you
kindly point us to it? Otherwise, would you be able to share this information with us?
Best Regards,
Elton F. de S. Soares
Advisory Software Engineer
Rio de Janeiro, RJ, Brazil
IBM Research
E-mail: eltons@ibm.com<mailto:eltons@ibm.com>