Dear Laurence,

 

Thank you for your prompt and insightful response!

Your answers will certainly be very helpful for our team.

  

Best Regards,

Elton F. de S. Soares

Advisory Software Engineer

Rio de Janeiro, RJ, Brazil

IBM Research

E-mail: eltons@ibm.com

 

 

From: Laurence Parry <greenreaper@hotmail.com>
Date: Thursday, 1 February 2024 at 20:43
To: Wikibase Community User Group <wikibaseug@lists.wikimedia.org>, wikidata-tech@lists.wikimedia.org <wikidata-tech@lists.wikimedia.org>
Cc: Renato F Maia <renato.maia@ibm.com>, Guilherme Lima <Guilherme.Lima@ibm.com>, Leonardo Guerreiro Azevedo <lga@br.ibm.com>, Marcelo O C Machado <mmachado@ibm.com>, Joao Marcello Bessa Rodrigues <joao.bessa@ibm.com>, Raphael Melo Thiago <raphaelt@br.ibm.com>, Elton Figueiredo de Souza Soares <eltons@ibm.com>
Subject: [EXTERNAL] Re: Wikibase/Wikidata Database Technologies and Strategies


Dear Elton (and others),

 

Wikibase uses the main MediaWiki database (which is normally MySQL/MariaDB but may be PostgreSQL or SQLite - see Special:Version) to store data about entities. They are stored as JSON blobs in a custom slot type as the primary content of the pages in certain namespaces. Examples of a particular entity:

https://furry.wikibase.cloud/wiki/Special:EntityData/Q4.json (similar to what Wikibase might store)

https://furry.wikibase.cloud/wiki/Special:EntityData/Q4.ttl (close to the triples that WDQS might consume - it might prefer the .rdf form)

https://furry.wikibase.cloud/wiki/Item:Q4 (WebUI)
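
For illustration, a minimal Python sketch (hypothetical client code, not from Wikibase itself) of fetching and inspecting that JSON via the first URL above - it assumes the item has an English label:

    import requests

    # Fetch the entity JSON that Wikibase serves via Special:EntityData.
    # Same URL as the first example above; any Wikibase exposes this route.
    url = "https://furry.wikibase.cloud/wiki/Special:EntityData/Q4.json"
    data = requests.get(url, timeout=30).json()

    entity = data["entities"]["Q4"]
    print(entity["labels"]["en"]["value"])   # the English label
    print(sorted(entity["claims"]))          # property IDs that have statements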

 

The storage system is quite complicated because MediaWiki has come to store many different types of content, and many revisions of it, but see:

https://mediawiki.org/wiki/Manual:Database_layout

https://mediawiki.org/wiki/Multi-Content_Revisions

https://mediawiki.org/wiki/Multi-Content_Revisions/Database_Schema

https://mediawiki.org/wiki/Multi-Content_Revisions/Content_Meta-Data

 

Secondary Wikibase storage takes the form of tables containing what might in some ways be seen as views or indexes on the JSON blobs in the pages which constitute primary storage. It is updated in a deferred manner. See the schemas at https://mediawiki.org/wiki/Wikibase/Schema and https://doc.wikimedia.org/Wikibase/master/php/docs_topics_storage.html, and the explanation at https://doc.wikimedia.org/Wikibase/master/php/docs_storage_terms.html
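
To make the "view on the blobs" idea concrete, here is a sketch of reading an item's label straight from the normalised term store (Python with pymysql as an assumed client; the table and column names are my reading of the schema page above, so verify them against your version):

    import pymysql  # assumed client library; direct DB access is illustrative only

    # Walk the normalised wbt_* term-store tables to find an item's English
    # label, without ever parsing the JSON blob that is the primary storage.
    SQL = """
    SELECT wbx_text
    FROM wbt_item_terms
    JOIN wbt_term_in_lang ON wbit_term_in_lang_id = wbtl_id
    JOIN wbt_type         ON wbtl_type_id = wby_id
    JOIN wbt_text_in_lang ON wbtl_text_in_lang_id = wbxl_id
    JOIN wbt_text         ON wbxl_text_id = wbx_id
    WHERE wbit_item_id = %s AND wby_name = 'label' AND wbxl_language = %s
    """

    conn = pymysql.connect(host="localhost", user="wiki",
                           password="...", database="wikidb")
    with conn.cursor() as cur:
        cur.execute(SQL, (4, "en"))  # 4 = the numeric part of Q4
        print(cur.fetchone())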

 

(Wikimedia Commons uses a secondary slot to store a separate Mediainfo entity type: https://mediawiki.org/wiki/Extension:WikibaseMediaInfo - https://mediawiki.org/wiki/Extension:WikibaseMediaInfo/Data_Model/JSON )

 

A separate graph database called Blazegraph is used as the storage component for the Wikidata Query Service (WDQS), a Java-centric system which consists of repository, updater and proxy components, plus a web front end.

https://mediawiki.org/wiki/Wikidata_Query_Service/User_Manual

https://mediawiki.org/wiki/Wikidata_Query_Service/Implementation

https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service
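
Querying it is ordinary SPARQL over HTTP; for instance, against Wikidata's public endpoint (a minimal sketch):

    import requests

    # Ask WDQS (Blazegraph underneath) for a few triples about an entity.
    query = "SELECT ?p ?o WHERE { wd:Q42 ?p ?o } LIMIT 5"
    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": query, "format": "json"},
        headers={"User-Agent": "wikibase-storage-example/0.1"},  # WDQS wants one
        timeout=60,
    )
    for row in resp.json()["results"]["bindings"]:
        print(row["p"]["value"], row["o"]["value"])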

 

The updater either reads the MediaWiki recent changes feed to identify pages it needs to access, retrieving their triples from the entity data endpoints (there is a link for this in the source of the main page as well; these triple view formats are provided by the Wikibase extension), or it is fed changes through the Kafka-mediated Flink streaming updater used on Wikidata: https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater

(I am not sure if others have implemented this - the distribution's updater uses recent changes.)
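
Conceptually, the recent-changes-based updater boils down to two API calls, roughly like this Python sketch (the real updater is Java; the item namespace number here is an assumption and varies per wiki):

    import requests

    API = "https://furry.wikibase.cloud/w/api.php"  # example instance from above

    # 1. Poll MediaWiki's recent changes feed for recently edited item pages.
    rc = requests.get(API, params={
        "action": "query", "list": "recentchanges",
        "rcnamespace": 120,   # assumption: 120 is this wiki's Item namespace
        "rclimit": 10, "format": "json",
    }, timeout=30).json()

    # 2. Re-fetch each changed entity's triples from the entity data endpoint.
    for change in rc["query"]["recentchanges"]:
        qid = change["title"].rsplit(":", 1)[-1]  # "Item:Q4" -> "Q4"
        ttl = requests.get(
            f"https://furry.wikibase.cloud/wiki/Special:EntityData/{qid}.ttl",
            timeout=30,
        ).text
        # ...diff against the triple store and apply the changes...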

 

It is possible that some other data is cached within Wikibase using MediaWiki's caching infrastructure, but the most common caching is that of rendered pages. Perhaps a developer who knows the details could chime in regarding your question; failing that, the source code is available, if complex.

 

Data may also be indexed within an attached Elasticsearch (Cirrus) index which, if provisioned, can hook into Wikibase and WDQS to enable search features that in some cases are more efficient than a Blazegraph query, or that access data not stored in triples:

https://mediawiki.org/wiki/Help:Extension:WikibaseCirrusSearch
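
For example, the haswbstatement: keyword that WikibaseCirrusSearch adds to the ordinary search API can answer "which items have this statement?" without touching Blazegraph (sketch; P31=Q5 is the stock Wikidata example):

    import requests

    # Ordinary MediaWiki search API, with a WikibaseCirrusSearch keyword.
    resp = requests.get("https://www.wikidata.org/w/api.php", params={
        "action": "query", "list": "search",
        "srsearch": "haswbstatement:P31=Q5",  # items with "instance of: human"
        "srlimit": 5, "format": "json",
    }, timeout=30).json()

    for hit in resp["query"]["search"]:
        print(hit["title"])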

 

Otherwise, yes, WDQS is the main complex query interface (though Blazegraph might not be the only store for it in the future: its lack of ongoing external support makes relying on it problematic, as the team behind it was hired by AWS to create Neptune).

 

There is, to my knowledge, no pagination of statements: you access all statements at once, even if they may then be consumed one at a time within e.g. Lua. This is why accessing an entity is considered expensive.
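
You can see this from the serialisation itself: one request returns the complete entity, claims and all (sketch):

    import requests

    # There is no "statements 100-200, please" parameter on Special:EntityData;
    # the whole entity arrives as a single payload.
    entity = requests.get(
        "https://www.wikidata.org/wiki/Special:EntityData/Q42.json", timeout=60
    ).json()["entities"]["Q42"]

    total = sum(len(group) for group in entity["claims"].values())
    print(f"{total} statements across {len(entity['claims'])} properties, one response")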

 

For this reason it is an inefficient anti-pattern to have thousands of statements on a single entity, especially if editing them actively, as there will be a lot of serialisation going on - it is likely better to have thousands of finely-divided entities.

 

The REST API is, as far as I know, just a different way to access the main MediaWiki component (vs. the Action API) and isn't a standalone system.
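
Both routes end up in the same PHP application reading the same blobs; roughly like this (the wikibase/v0 REST prefix is my understanding of the current route, but check the docs for your version):

    import requests

    WIKI = "https://www.wikidata.org"

    # Action API: the long-standing api.php entry point.
    action = requests.get(f"{WIKI}/w/api.php", params={
        "action": "wbgetentities", "ids": "Q42", "format": "json",
    }, timeout=30).json()

    # REST API: a different router (rest.php) into the same MediaWiki + Wikibase
    # code. The route prefix has changed across releases - verify before use.
    rest = requests.get(
        f"{WIKI}/w/rest.php/wikibase/v0/entities/items/Q42", timeout=30
    ).json()

    print(action["entities"]["Q42"]["labels"]["en"]["value"])  # nested form
    print(rest["labels"]["en"])  # the REST API flattens labels to plain strings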

 

This maybe doesn't answer all your questions but hopefully it helps.

 

Best regards,

-- 

Laurence 'GreenReaper' Parry

Wikibase Community User Group

https://GreenReaper.co.uk - https://wikifur.com

 


From: Elton Figueiredo de Souza Soares via Wikibase Community User Group <wikibaseug@lists.wikimedia.org>
Sent: Thursday, February 1, 2024 10:20:10 pm
To: wikidata-tech@lists.wikimedia.org <wikidata-tech@lists.wikimedia.org>; wikibaseug@lists.wikimedia.org <wikibaseug@lists.wikimedia.org>
Cc: Renato F Maia <renato.maia@ibm.com>; Guilherme Lima <Guilherme.Lima@ibm.com>; Leonardo Guerreiro Azevedo <lga@br.ibm.com>; Marcelo O C Machado <mmachado@ibm.com>; Joao Marcello Bessa Rodrigues <joao.bessa@ibm.com>; Raphael Melo Thiago <raphaelt@br.ibm.com>; Elton Figueiredo de Souza Soares <eltons@ibm.com>
Subject: [Wikibase] Wikibase/Wikidata Database Technologies and Strategies

 

Dear Wikibase/Wikidata Community,

 

We are trying to understand which database technologies and strategies Wikibase/Wikidata uses for storing, updating, and querying the data (knowledge) it manipulates. 

 

By looking at the documentation we understood that RDF is only used for the Wikidata Query Service, but we could not find out exactly how Wikibase/Wikidata stores the information that is translated to RDF during the data dump.

 

More specifically, we understood that a MySQL (or is it MariaDB?) relational database is used as the key persistence component for most Wikibase/Wikidata services, and that the information maintained in this database is periodically exported to multiple formats, including RDF.

 

In addition, looking at the relational database schema published in the documentation, we could not locate tables that are easily mappable to the Wikibase Data Model.

Thus, we hypothesize that there is some software component (Wikibase Common Data Access?) that dynamically translates the data contained in those tables to Statements, Entities, etc. Is that hypothesis correct?

If yes, does this software component use any intermediate storage mechanism for caching those Statements, Entities, etc.? Or are those translations always performed at runtime, on the fly (be it for querying, adding, or updating Statements, Entities, etc.)?

 

Finally, we would like to understand more about how Wikidata REST API is implemented:

·         In which database are the statements added/retrieved through it stored? Are they stored in the central MySQL database or in another database?

·         Does it have any support for pagination of statements? For example, if an item has many statements associated with a property, does the API assume that both the underlying database and the network will support the retrieval of all those statements?

·         Are you currently considering implementing support for more flexible querying of statements, or has that requirement been fully delegated to the Wikidata Query Service?

 

If there is updated documentation that could help us answer these questions, could you kindly point us to it? Otherwise, would you be able to share this information with us?

 

Best Regards,


 

Elton F. de S. Soares

Advisory Software Engineer

Rio de Janeiro, RJ, Brazil

IBM Research

E-mail: eltons@ibm.com