Hello all!
Here is a summary of what the Search Platform team is doing around WDQS:
* The database responsible for unit conversions [7] was updated on
Friday, Jan 29. This means that entities served from WDQS and updated since
that date use the new conversion data for normalized quantities. The
WDQS database will be fully reloaded this month [8] so that all entities
are consistent with the new conversion data.
* Now that we have full functional coverage on the Flink-based WDQS
Streaming Updater [1], we've done some more testing. As expected, we
found a few bugs and are correcting them.
* Exposing a test server [2] to gather feedback both on this new Flink-based
Streaming Updater and on the long-standing issue of skolemization of
blank nodes. We'll make an announcement when ready.
* Architecture review of the new Flink-based Streaming Updater
with Ververica (the company behind Flink). We will probably uncover a few
more things that need to be improved.
* Productionizing the new Flink-based Streaming Updater [8].
* Manual review of a sample of queries to WDQS. We learned a few things:
  * Human intuition is not good at predicting which queries are expensive.
  * We have a large scope of very different queries / use cases, larger
    than we expected.
  * Most of the requests we've seen seem to be useful and valuable.
* More in-depth analysis and categorization of WDQS traffic [6]:
  * Instead of focusing on a way to provide more performant solutions for
    expensive queries that we see on WDQS, this analysis focuses on the query
    groups that we see the most, even if they are already efficient.
  * One key finding is that the top 90 query groups represent more than
    80% of the queries we serve. Those queries are mostly "simple" queries:
    they only use the truthy graph, only do a very limited number of hops in
    the graph, etc. This opens the possibility of creating a service that is
    scalable and efficient for those classes of queries.
  * This is very early work; we don't know yet what this service could
    look like or if it is even feasible to create it. But it is an interesting
    new approach in our problem space.
  * The analysis is a bit raw. Feel free to ask clarifying questions;
    I'll route them to the appropriate person.
* Search Platform Office Hours are happening today (16:00-17:00 GMT /
08:00-09:00 PST / 11:00-12:00 EST / 17:00-18:00 CET) [9]. Feel free to join
if you have any additional questions, or just want to chat with the team!
Have fun!
Guillaume
[1] https://phabricator.wikimedia.org/T244590
[2] https://phabricator.wikimedia.org/T266470
[3] https://phabricator.wikimedia.org/T244341
[4] https://phabricator.wikimedia.org/T264006
[5] https://www.wikidata.org/wiki/Wikidata:REST_API_feedback_round
[6] https://wikitech.wikimedia.org/wiki/User:Joal/WDQS_Queries_Analysis
[7] https://phabricator.wikimedia.org/T267644
[8] https://phabricator.wikimedia.org/T267927
[9] https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
--
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>