On Mon, Jun 5, 2023 at 12:46 PM Vivian Rook vrook@wikimedia.org wrote:
On Mon, Jun 5, 2023 at 2:43 PM Hal Triedman htriedman@wikimedia.org wrote:
Hi cloud admins!
My name is Hal Triedman. I'm a Privacy Engineer at WMF, but in my spare time I do a lot of work on machine learning. One of the things we've been looking into is the creation of label-query datasets for MediaWiki database queries, with the goal of fine-tuning an AI model to help users write queries more easily, or of creating embeddings that make past queries easier to search.
Quarry is particularly interesting for this project because it has the following qualities:
- it runs entirely against MediaWiki databases
- it has been used to make hundreds of thousands of queries
- many of those queries have relatively descriptive titles about what is happening in the SQL
Is there any easy way of assembling a database of existing public title-query pairs (e.g. by running a database query that excludes things like "Untitled query", or by just pulling published queries)? Please let me know, and thanks.
I don't see a reason that you can't have access to the quarry db. Does anyone else?
It seems both reasonable and useful to me. See also backlog tasks like https://phabricator.wikimedia.org/T93907 (Database dump for analysis) and https://phabricator.wikimedia.org/T151158 (Support queries against Quarry's own database and ToolsDB).
Bryan
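
For anyone following the thread, the filtering step Hal describes (dropping placeholder titles such as "Untitled query" and keeping only published queries) could be sketched roughly as below. This is only an illustration over in-memory records: the `QueryRecord` type, its `title`, `sql`, and `published` fields, and the sample rows are all hypothetical and are not taken from Quarry's actual schema, which would need to be checked once access to the database or a dump is available.

```python
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple


class QueryRecord(NamedTuple):
    """Hypothetical record shape; NOT Quarry's real schema."""
    title: Optional[str]
    sql: str
    published: bool


# Titles treated as carrying no descriptive information (assumed list).
PLACEHOLDER_TITLES = {"", "untitled query"}


def title_query_pairs(records: Iterable[QueryRecord]) -> Iterator[Tuple[str, str]]:
    """Yield (title, sql) pairs for published queries with a real title."""
    for rec in records:
        title = (rec.title or "").strip()
        sql = rec.sql.strip()
        if not rec.published:
            continue  # keep only queries the author chose to publish
        if title.lower() in PLACEHOLDER_TITLES:
            continue  # skip missing or placeholder titles
        if not sql:
            continue  # skip empty query bodies
        yield title, sql


# Made-up sample rows, for illustration only.
sample = [
    QueryRecord("Most edited pages in 2022", "SELECT page_title FROM page LIMIT 10;", True),
    QueryRecord("Untitled query", "SELECT 1;", True),
    QueryRecord("Draft of user counts", "SELECT COUNT(*) FROM user;", False),
    QueryRecord(None, "SELECT 2;", True),
]

pairs = list(title_query_pairs(sample))
```

In practice the same conditions would more likely be expressed directly in the SQL used to export the data; the Python form is just easier to adapt once the real column names are known.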