I was checking sixdegreesofwikipedia.com [0] and I saw that it implements an application-driven breadth-first search [1], like many other gadgets for Wikipedia.
Very recently I have been experimenting with recursive Common Table Expressions [2], which are or will be available in the latest versions of MySQL and MariaDB.
With a single query one can obtain all titles directly or indirectly in a category:
WITH RECURSIVE cte (cl_from, cl_type) AS (
    SELECT cl_from, cl_type
    FROM categorylinks
    WHERE cl_to = 'Database_management_systems'  -- starting category
  UNION
    SELECT categorylinks.cl_from, categorylinks.cl_type
    FROM cte
    JOIN page ON cl_from = page_id
    JOIN categorylinks ON page_title = cl_to
    WHERE cte.cl_type = 'subcat'  -- add one level of subcategories per iteration
)
SELECT page_title
FROM cte
JOIN page ON cl_from = page_id
WHERE page_namespace = 0  -- keep only articles in the end
ORDER BY page_title;      -- ordered by title
(it is more complex than needed because of table denormalization; other examples would be much simpler)
Thanks to CTEs, we can traverse hierarchies in a single SQL query, without the need for external tools, and much more efficiently: no external application is needed.
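To make the idea concrete, here is a minimal runnable sketch using Python's sqlite3 module (SQLite also implements WITH RECURSIVE). The table contents are invented toy data mimicking the categorylinks shape; the real schema is more denormalized, as noted above:

```python
import sqlite3

# Toy hierarchy: each row links a page (possibly itself a subcategory)
# to its parent category.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE categorylinks (cl_from TEXT, cl_to TEXT, cl_type TEXT);
INSERT INTO categorylinks VALUES
  ('Databases',            'Software',              'subcat'),
  ('Relational_databases', 'Databases',             'subcat'),
  ('MySQL',                'Relational_databases',  'page'),
  ('SQLite',               'Relational_databases',  'page'),
  ('PostgreSQL',           'Databases',             'page');
""")

# Recursively collect everything directly or indirectly under 'Software',
# descending only through subcategories.
rows = conn.execute("""
WITH RECURSIVE cte(title, type) AS (
    SELECT cl_from, cl_type FROM categorylinks
    WHERE cl_to = 'Software'              -- starting category
    UNION
    SELECT cl.cl_from, cl.cl_type
    FROM cte JOIN categorylinks cl ON cl.cl_to = cte.title
    WHERE cte.type = 'subcat'             -- recurse only into subcategories
)
SELECT title FROM cte WHERE type = 'page' ORDER BY title;
""").fetchall()
print([r[0] for r in rows])  # ['MySQL', 'PostgreSQL', 'SQLite']
```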
None of these features are present in the minimum database versions that MediaWiki requires, or even in the latest version available on WMF servers, but I wonder if people (MediaWiki hackers and tool creators) would be interested in using them?
[0] https://www.sixdegreesofwikipedia.com
[1] https://github.com/jwngr/sdow/blob/master/sdow/breadth_first_search.py#L36%3...
[2] https://dbahire.com/mysql-8-0-new-features-in-real-life-applications-roles-a...
---------- Forwarded message ----------
From: mathieu stumpf guntz <psychoslave@culture-libre.org>
Date: Tue, Feb 27, 2018 at 11:17 AM
Subject: Re: [Wikitech-l] [Wikimedia-l] What's making you happy this week? (Week of 18 February 2018)
To: Wikimedia Mailing List <wikimedia-l@lists.wikimedia.org>, Pine W <wiki.pine@gmail.com>, "wikitech-l@lists.wikimedia.org" <wikitech-l@lists.wikimedia.org>
What's making me happy this week is joining the "Telegrafo" discussion for ELISo (https://t.me/joinchat/CQ8tET7pcCXQSBO1ERPJug) and I also just found Six Degrees of Wikipedia: https://www.sixdegreesofwikipedia.com/?source=Peace&target=Epistemology.
On 18/02/2018 at 23:12, Pine W wrote:
What's making me happy this week is Isarra's persistence in working on the Timeless skin. Timeless is based on Winter. [0] [1]
For anyone who would like to try Timeless, it's available in Preferences under Appearance / Skin.
What's making you happy this week?
Pine ( https://meta.wikimedia.org/wiki/User:Pine )
[0] https://www.mediawiki.org/wiki/Skin:Timeless
[1] https://www.mediawiki.org/wiki/Winter
_______________________________________________
Wikimedia-l mailing list, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l
New messages to: Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l, mailto:wikimedia-l-request@lists.wikimedia.org?subject=unsubscribe
_______________________________________________ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On Wed, Feb 28, 2018 at 8:47 AM, Jaime Crespo jcrespo@wikimedia.org wrote:
Very recently I have been experimenting with recursive Common Table Expressions [2], which are or will be available on the latest versions of MySQL and MariaDB.
Do the other databases MediaWiki tries to support have that feature?
With a single query one can obtain all titles directly or indirectly in a category:
WITH RECURSIVE cte (cl_from, cl_type) AS (
    SELECT cl_from, cl_type
    FROM categorylinks
    WHERE cl_to = 'Database_management_systems'  -- starting category
  UNION
    SELECT categorylinks.cl_from, categorylinks.cl_type
    FROM cte
    JOIN page ON cl_from = page_id
    JOIN categorylinks ON page_title = cl_to
    WHERE cte.cl_type = 'subcat'  -- add one level of subcategories per iteration
)
SELECT page_title
FROM cte
JOIN page ON cl_from = page_id
WHERE page_namespace = 0  -- keep only articles in the end
ORDER BY page_title;      -- ordered by title
Does that work efficiently on huge categories, or does it wind up fetching millions of rows and filesorting?
On Wed, Feb 28, 2018 at 5:26 PM, Brad Jorsch (Anomie) <bjorsch@wikimedia.org> wrote:
On Wed, Feb 28, 2018 at 8:47 AM, Jaime Crespo jcrespo@wikimedia.org wrote:
Very recently I have been experimenting with recursive Common Table Expressions [2], which are or will be available on the latest versions of MySQL and MariaDB.
Do the other databases MediaWiki tries to support have that feature?
Actually, MySQL/MariaDB is the *last* database to conform to the SQL:1999 WITH standard: https://modern-sql.com/feature/with#compatibility Even SQLite already supported a subset of it!
The good news is that, probably because it arrived last, it got a pretty feature-rich implementation: https://twitter.com/MarkusWinand/status/852862475699707904
With a single query one can obtain all titles directly or indirectly in a category:
WITH RECURSIVE cte (cl_from, cl_type) AS (
    SELECT cl_from, cl_type
    FROM categorylinks
    WHERE cl_to = 'Database_management_systems'  -- starting category
  UNION
    SELECT categorylinks.cl_from, categorylinks.cl_type
    FROM cte
    JOIN page ON cl_from = page_id
    JOIN categorylinks ON page_title = cl_to
    WHERE cte.cl_type = 'subcat'  -- add one level of subcategories per iteration
)
SELECT page_title
FROM cte
JOIN page ON cl_from = page_id
WHERE page_namespace = 0  -- keep only articles in the end
ORDER BY page_title;      -- ordered by title
Does that work efficiently on huge categories, or does it wind up fetching millions of rows and filesorting?
Needs more testing; that query worked well enough on my laptop to expose it directly on a web request (<0.1 s), but I had only imported the categorylinks and page tables, so I was working fully in memory. Obviously, the more complex the query and the more results it returns, the less likely it is that it can be exposed to, e.g., a public API. But at least there are configurable limits on recursion depth and maximum execution time.
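For what it's worth, a depth bound can also be enforced inside the query itself, independently of server-side settings (MySQL 8.0's cte_max_recursion_depth, MariaDB's max_recursive_iterations). A sketch in Python/SQLite with an invented edges table that deliberately contains a cycle:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE edges (child TEXT, parent TEXT);
INSERT INTO edges VALUES ('C', 'B'), ('B', 'A'), ('A', 'C');  -- note the cycle
""")

# Carry an explicit depth column so a cycle cannot recurse forever,
# mirroring what the server-side recursion limits do.
rows = conn.execute("""
WITH RECURSIVE reach(node, depth) AS (
    SELECT 'A', 0
    UNION
    SELECT edges.child, reach.depth + 1
    FROM reach JOIN edges ON edges.parent = reach.node
    WHERE reach.depth < 10            -- hard cap on recursion depth
)
SELECT DISTINCT node FROM reach ORDER BY node;
""").fetchall()
print([r[0] for r in rows])  # ['A', 'B', 'C'] -- terminates despite the cycle
```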
Honestly, given it is a new feature, I don't expect MediaWiki (which at the moment has to support MySQL 5.5) to embrace it any time soon. However, I wanted to ask if it was interesting enough to set up some test hosts for MediaWiki to "play" with it (e.g. evaluate performance), and maybe (?) some upgraded MariaDB/MySQL servers for WMF labsdb (for long-running analytics or gadgets that generate reports).
On Wed, Feb 28, 2018 at 5:07 PM, Jaime Crespo jcrespo@wikimedia.org wrote:
On Wed, Feb 28, 2018 at 5:26 PM, Brad Jorsch (Anomie) <bjorsch@wikimedia.org> wrote:
On Wed, Feb 28, 2018 at 8:47 AM, Jaime Crespo jcrespo@wikimedia.org wrote:
Very recently I have been experimenting with recursive Common Table Expressions [2], which are or will be available on the latest versions of MySQL and MariaDB.
Do the other databases MediaWiki tries to support have that feature?
Actually, MySQL/MariaDB is the *last* database to conform to the SQL:1999 WITH standard: https://modern-sql.com/feature/with#compatibility Even SQLite already supported a subset of it!
The good news is that, probably because it arrived last, it got a pretty feature-rich implementation: https://twitter.com/MarkusWinand/status/852862475699707904
With a single query one can obtain all titles directly or indirectly in a category:
WITH RECURSIVE cte (cl_from, cl_type) AS (
    SELECT cl_from, cl_type
    FROM categorylinks
    WHERE cl_to = 'Database_management_systems'  -- starting category
  UNION
    SELECT categorylinks.cl_from, categorylinks.cl_type
    FROM cte
    JOIN page ON cl_from = page_id
    JOIN categorylinks ON page_title = cl_to
    WHERE cte.cl_type = 'subcat'  -- add one level of subcategories per iteration
)
SELECT page_title
FROM cte
JOIN page ON cl_from = page_id
WHERE page_namespace = 0  -- keep only articles in the end
ORDER BY page_title;      -- ordered by title
Does that work efficiently on huge categories, or does it wind up fetching millions of rows and filesorting?
Needs more testing; that query worked well enough on my laptop to expose it directly on a web request (<0.1 s), but I had only imported the categorylinks and page tables, so I was working fully in memory. Obviously, the more complex the query and the more results it returns, the less likely it is that it can be exposed to, e.g., a public API. But at least there are configurable limits on recursion depth and maximum execution time.
Honestly, given it is a new feature, I don't expect MediaWiki (which at the moment has to support MySQL 5.5) to embrace it any time soon. However, I wanted to ask if it was interesting enough to set up some test hosts for MediaWiki to "play" with it (e.g. evaluate performance), and maybe (?) some upgraded MariaDB/MySQL servers for WMF labsdb (for long-running analytics or gadgets that generate reports).
I certainly think labsdb users would welcome this feature. Traversing category trees (or sometimes even the graph of page links) is a common thing that people want to do.
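A link-graph traversal like the six-degrees search can be sketched the same way; here is a hedged toy in Python/SQLite with an invented links(src, dst) table (the real pagelinks schema differs):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE links (src TEXT, dst TEXT);
INSERT INTO links VALUES
  ('Peace', 'Philosophy'), ('Philosophy', 'Epistemology'),
  ('Peace', 'War'), ('War', 'History');
""")

# Breadth-first expansion: each recursive step adds pages one more link
# away, so MIN(dist) per page is its degree of separation from the start.
rows = conn.execute("""
WITH RECURSIVE bfs(page, dist) AS (
    SELECT 'Peace', 0
    UNION
    SELECT links.dst, bfs.dist + 1
    FROM bfs JOIN links ON links.src = bfs.page
    WHERE bfs.dist < 6                -- six degrees is plenty here
)
SELECT page, MIN(dist) FROM bfs GROUP BY page ORDER BY page;
""").fetchall()
print(dict(rows))
# {'Epistemology': 2, 'History': 2, 'Peace': 0, 'Philosophy': 1, 'War': 1}
```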
-- Brian
Hi!
None of these features are present in the minimum database versions that MediaWiki requires, or even in the latest version available on WMF servers, but I wonder if people (MediaWiki hackers and tool creators) would be interested in using them?
It would be interesting to see how this could work for deepcategory searches: we now have a keyword for it (driven by SPARQL for now), and it would be worth seeing what happens if it is ported to SQL. If we get it on the Labs DB replicas, we could set up MediaWiki so that we could test how well it performs on real data. Thanks for posting about it!