Dear Wikibase/Wikidata Community,
We are trying to understand which database technologies and strategies Wikibase/Wikidata uses for storing, updating, and querying the data (knowledge) it manipulates.
By looking at the documentation (https://wmde.github.io/wikidata-wikibase-architecture/assets/img/03-dataflow-out.drawio.17c12ee9.svg), we understood that RDF is only used for the Wikidata Query Service, but we could not find out exactly how Wikibase/Wikidata stores the information that is translated to RDF during the data dump.
More specifically, we understood that a MySQL (or is it MariaDB?) relational database is used as the key persistence component for most Wikibase/Wikidata services, and that the information maintained in this database is periodically exported to multiple formats, including RDF.
In addition, looking at the relational database schema published in the documentation (https://www.mediawiki.org/wiki/Manual:Database_layout), we could not locate tables that are easily mappable to the Wikibase Data Model (https://www.mediawiki.org/wiki/Wikibase/DataModel). Thus, we hypothesize that there is some software component (Wikibase Common Data Access?) that dynamically translates the data contained in those tables to Statements, Entities, etc. Is that hypothesis correct? If so, does this software component use any intermediate storage mechanism for caching those Statements, Entities, etc.? Or are those translations always performed at runtime, on the fly (be it for querying, adding, or updating Statements, Entities, etc.)?
Finally, we would like to understand more about how the Wikidata REST API (https://www.wikidata.org/wiki/Wikidata:REST_API) is implemented:
• In which database are the statements that are added/retrieved through it stored? Are they stored in the central MySQL database or in another database?
• Does it have any support for pagination of statements? For example, if an item has many statements associated with a property, does the API assume that both the underlying database and the network will support the retrieval of all those statements?
• Are you currently considering implementing support for more flexible querying of statements, or has that requirement been fully delegated to the Wikidata Query Service?
If there is updated documentation that could help us answer these questions, could you kindly point us to it? Otherwise, would you be able to share this information with us?
Best Regards,
Elton F. de S. Soares
Advisory Software Engineer, IBM Research
Rio de Janeiro, RJ, Brazil
E-mail: eltons@ibm.com
Dear Elton (and others),
Wikibase uses the main MediaWiki database (which is normally MySQL/MariaDB but may be PostgreSQL or SQLite - see Special:Version) to store data about entities. They are stored as JSON blobs in a custom slot type as the primary content of the pages in certain namespaces. Examples for a particular entity:
https://furry.wikibase.cloud/wiki/Special:EntityData/Q4.json (similar to what Wikibase might store)
https://furry.wikibase.cloud/wiki/Special:EntityData/Q4.ttl (close to the triples that WDQS might consume - it might prefer the RDF form)
https://furry.wikibase.cloud/wiki/Item:Q4 (web UI)
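To make that concrete, here is a minimal sketch of how the Special:EntityData endpoints above could be fetched programmatically. The furry.wikibase.cloud base URL and the Q4 item are just the examples from this thread; any Wikibase repository exposing Special:EntityData should behave similarly, and a real client should respect the API etiquette of the wiki in question.

```python
import requests

# Example Wikibase instance from this thread; any repo with Special:EntityData works similarly.
BASE = "https://furry.wikibase.cloud/wiki/Special:EntityData"

def fetch_entity(entity_id: str, fmt: str = "json"):
    """Fetch one entity in the requested serialization.

    fmt can be e.g. "json" (the canonical JSON blob Wikibase stores)
    or "ttl" (the RDF/Turtle view that WDQS-style consumers use).
    """
    resp = requests.get(f"{BASE}/{entity_id}.{fmt}", timeout=30)
    resp.raise_for_status()
    return resp.json() if fmt == "json" else resp.text

if __name__ == "__main__":
    data = fetch_entity("Q4", "json")
    # The JSON document keys entities by ID; statements live under "claims".
    entity = data["entities"]["Q4"]
    print(entity.get("labels", {}).get("en", {}).get("value"))
    print(len(entity.get("claims", {})), "properties with statements")
```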
The storage system is quite complicated because MediaWiki has come to store many different types of content and many revisions of it, but see:
https://mediawiki.org/wiki/Manual:Database_layout
https://mediawiki.org/wiki/Manual:Content_table
https://mediawiki.org/wiki/Multi-Content_Revisions
https://mediawiki.org/wiki/Multi-Content_Revisions/Database_Schema
https://mediawiki.org/wiki/Multi-Content_Revisions/Content_Meta-Data
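As a rough illustration of how the primary JSON blob can be located through the Multi-Content Revisions tables, the sketch below assumes a default standalone Wikibase where items live in the Item: namespace (ID 120) and revision text sits locally in the text table. Wikidata itself keeps items in namespace 0 and uses compressed/external storage, so treat this as a sketch of the table relationships rather than a production-ready query.

```python
import pymysql  # assumption: direct read access to a standalone Wikibase's MediaWiki DB

# Walk the MCR tables from a page to its primary JSON blob:
# page -> (page_latest revision's) slots -> content -> text.
QUERY = """
SELECT cm.model_name, t.old_text
FROM page p
JOIN slots s          ON s.slot_revision_id = p.page_latest
JOIN slot_roles sr    ON sr.role_id = s.slot_role_id AND sr.role_name = 'main'
JOIN content c        ON c.content_id = s.slot_content_id
JOIN content_models cm ON cm.model_id = c.content_model
JOIN text t           ON t.old_id = CAST(SUBSTRING(c.content_address, 4) AS UNSIGNED)
WHERE p.page_namespace = %s AND p.page_title = %s
"""

ITEM_NAMESPACE = 120  # default on a standalone Wikibase; Wikidata items are in namespace 0

def load_item_blob(conn, item_id: str) -> str:
    # conn = pymysql.connect(host=..., user=..., password=..., database=...)
    with conn.cursor() as cur:
        cur.execute(QUERY, (ITEM_NAMESPACE, item_id))
        model_name, old_text = cur.fetchone()
        # model_name should be e.g. 'wikibase-item'; old_text is the raw JSON blob.
        # Note: content_address like 'tt:123' points at the text table; large wikis
        # may use gzip flags or external storage, which this sketch does not handle.
        return old_text
```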
Secondary Wikibase storage takes the form of tables containing what might in some ways be seen as views or indexes over the JSON blobs in the pages which constitute primary storage. It is updated in a deferred manner. See the schemas at https://mediawiki.org/wiki/Wikibase/Schema and https://doc.wikimedia.org/Wikibase/master/php/docs_topics_storage.html and an explanation of the terminology at https://doc.wikimedia.org/Wikibase/master/php/docs_storage_terms.html
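As one illustration of what those secondary tables look like, the sketch below resolves an item's English label through the Wikibase term store (wbt_item_terms, wbt_term_in_lang, wbt_type, wbt_text_in_lang, wbt_text). The table and column names follow the published schema linked above, but treat the exact join as an approximation and verify it against the schema docs for your Wikibase version.

```python
# Sketch: resolve an item's English label from the Wikibase secondary term store.
# wbt_item_terms stores bare numeric IDs, so pass 4 for item Q4.
LABEL_QUERY = """
SELECT wbx_text
FROM wbt_item_terms
JOIN wbt_term_in_lang ON wbtl_id = wbit_term_in_lang_id
JOIN wbt_type         ON wby_id  = wbtl_type_id AND wby_name = 'label'
JOIN wbt_text_in_lang ON wbxl_id = wbtl_text_in_lang_id AND wbxl_language = 'en'
JOIN wbt_text         ON wbx_id  = wbxl_text_id
WHERE wbit_item_id = %s
"""

def english_label(conn, numeric_item_id: int) -> str | None:
    # "conn" is a DB-API connection to the same MediaWiki database as in the previous sketch.
    with conn.cursor() as cur:
        cur.execute(LABEL_QUERY, (numeric_item_id,))
        row = cur.fetchone()
        return row[0] if row else None
```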
(Wikimedia Commons uses a secondary slot to store a separate Mediainfo entity type: https://mediawiki.org/wiki/Extension:WikibaseMediaInfo - https://mediawiki.org/wiki/Extension:WikibaseMediaInfo/Data_Model/JSON )
A separate graph database called Blazegraph is used as the storage component for the Wikidata Query Service (WDQS), a Java-centric system which consists of repository, updater and proxy components, plus a web front end. See:
https://mediawiki.org/wiki/Wikidata_Query_Service/User_Manual
https://mediawiki.org/wiki/Wikidata_Query_Service/Implementation
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service
The updater either reads the MediaWiki recent changes feed to identify pages it needs to access, retrieving their triples from the entity data endpoints (there is a link for this in the source of the main page as well; these triple view formats are provided by the Wikibase extension), or it is fed changes through the Kafka-mediated Flink streaming updater used on Wikidata: https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater (I am not sure if others have implemented this - the distribution's updater uses recent changes.)
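Here is a highly simplified sketch of that recent-changes-driven flow, assuming only the standard MediaWiki Action API (list=recentchanges) and the Special:EntityData Turtle view. The furry.wikibase.cloud URLs are again just the example instance from this thread, and a real updater also handles continuation, deletions, batching and deduplication.

```python
import requests

# Hypothetical Wikibase instance; any repo with the Action API and EntityData works similarly.
API = "https://furry.wikibase.cloud/w/api.php"
ENTITY_DATA = "https://furry.wikibase.cloud/wiki/Special:EntityData"

def recent_entity_ids(limit: int = 50):
    """Yield IDs of recently changed pages that look like entities (Q.../P...)."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcprop": "title|ids|timestamp",
        "rclimit": limit,
        "format": "json",
    }
    changes = requests.get(API, params=params, timeout=30).json()
    for rc in changes["query"]["recentchanges"]:
        # Strip an "Item:"/"Property:" namespace prefix if present.
        title = rc["title"].split(":")[-1]
        if title[:1] in ("Q", "P") and title[1:].isdigit():
            yield title

def fetch_triples(entity_id: str) -> str:
    """Fetch the Turtle serialization that a WDQS-style updater would load."""
    resp = requests.get(f"{ENTITY_DATA}/{entity_id}.ttl", timeout=30)
    resp.raise_for_status()
    return resp.text

if __name__ == "__main__":
    for eid in recent_entity_ids(10):
        print(eid, len(fetch_triples(eid)), "bytes of Turtle")
```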
It is possible that some other data is cached within Wikibase using MediaWiki's cache solution but the most common caching is that of rendered pages. Perhaps a developer who knows the details could chime in w.r.t. your question; failing that, the source code is available, if complex.
Data may also be indexed within an attached Elasticsearch (Cirrus) index which, if provisioned, can hook into Wikibase and WDQS to enable search features that in some cases may be more efficient than a Blazegraph query, or that access data not stored as triples: https://mediawiki.org/wiki/Help:Extension:WikibaseCirrusSearch
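For instance, when WikibaseCirrusSearch is installed, statement-based search keywords such as haswbstatement: become available through the ordinary search API. The sketch below uses Wikidata's public Action API as the example endpoint; keyword availability depends on how the wiki is configured.

```python
import requests

# Wikidata's public Action API, used here only as an example endpoint.
API = "https://www.wikidata.org/w/api.php"

def items_with_statement(statement: str, limit: int = 5):
    """Search via CirrusSearch for pages carrying a given statement,
    e.g. haswbstatement:P31=Q146 for items that are instances of house cat."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": f"haswbstatement:{statement}",
        "srlimit": limit,
        "format": "json",
    }
    data = requests.get(API, params=params, timeout=30).json()
    return [hit["title"] for hit in data["query"]["search"]]

print(items_with_statement("P31=Q146"))
```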
Otherwise, yes, WDQS is the main complex query interface (though Blazegraph might not be the only store for it in the future - its lack of ongoing external support makes relying on it problematic, as the team was bought out by AWS to create Neptune).
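As a concrete example of that query interface, a SPARQL query can be sent to the public WDQS endpoint over plain HTTP. The endpoint URL and query below use Wikidata's public service and are only illustrative.

```python
import requests

# Public Wikidata Query Service SPARQL endpoint (Blazegraph behind the scenes).
WDQS = "https://query.wikidata.org/sparql"

# Example: a handful of items that are instances of house cat (Q146), with English labels.
QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q146 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 5
"""

resp = requests.get(
    WDQS,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "storage-questions-example/0.1"},
    timeout=60,
)
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row["item"]["value"], row["itemLabel"]["value"])
```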
There is to my knowledge no pagination of statements, in the sense that you access all statements at once even if they may be consumed one at a time within e.g. Lua. This is why accessing an entity is considered expensive.
For this reason it is an inefficient anti-pattern to have thousands of statements on a single entity, especially if editing them actively, as there will be a lot of serialisation going on - it is likely better to have thousands of finely-divided entities.
The REST API is, as far as I know, just a different way to access the main MediaWiki component (vs. the Action API) and isn't a standalone system.
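To illustrate that both APIs front the same underlying MediaWiki storage, the sketch below reads the same item once through the Action API (wbgetentities) and once through the Wikibase REST API. The REST route prefix shown (wikibase/v0) follows the documentation current at the time of writing and may change between versions.

```python
import requests

ACTION_API = "https://www.wikidata.org/w/api.php"
# Route prefix (wikibase/v0 here) depends on the deployed REST API version.
REST_API = "https://www.wikidata.org/w/rest.php/wikibase/v0"

ITEM = "Q42"

# 1) Action API: the long-standing wbgetentities module.
action = requests.get(ACTION_API, params={
    "action": "wbgetentities", "ids": ITEM, "format": "json",
}, timeout=30).json()
print("Action API label:", action["entities"][ITEM]["labels"]["en"]["value"])

# 2) REST API: the newer resource-oriented route over the same storage.
#    Note its labels map is flat (language -> string), unlike the Action API.
rest = requests.get(f"{REST_API}/entities/items/{ITEM}", timeout=30).json()
print("REST API label:", rest["labels"]["en"])
```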
This maybe doesn't answer all your questions but hopefully it helps.
Best regards,
--
Laurence 'GreenReaper' Parry
Wikibase Community User Group
https://GreenReaper.co.uk - https://wikifur.com
Dear Laurence,
Thank you for your prompt and insightful response! Your answers will certainly be very helpful for our team.
Best Regards,
Elton F. de S. Soares
Advisory Software Engineer, IBM Research
Rio de Janeiro, RJ, Brazil
E-mail: eltons@ibm.com