Andre Engels wrote:
My first intuition would be that a gain could well be made in checking the existence and size of linked-to pages when reading a page, probably at the cost of making changes to pages slower. The problem is that I don't know what is currently done (this has been proposed before, so it may already be wholly or partly implemented), nor exactly what MySQL can and cannot do and at what speed (in particular, performing a binary search on a table that it knows to be ordered).
Size is only checked if you've got stub display on, which is fairly rare. 172 users on en have it on, 110 of whom have logged in this month (out of 4206 total accounts used this month).
When stub links are _disabled_: the first pass of the existence check grabs the entries in the links table marked as linking from the current page; this takes one query, an indexed join to cur. Anything that wasn't in the links table (for instance the talk page; new links when looking at a preview; linked user pages when looking at Recent changes etc.; and any actual broken links) is then double-checked individually as it comes along.
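A rough sketch of the shape of that first pass, written in Python against a DB-API cursor; the links/cur table and column names here are illustrative rather than the exact schema:

    # Sketch only: first-pass existence check when stub links are disabled.
    # One indexed join from links to cur pulls every recorded target of the
    # current page that still exists; anything else falls through to the
    # per-link check described above.
    def get_existing_links(cursor, from_page):
        # from_page: whatever links.l_from is keyed on (source page id or
        # title, depending on the schema version).
        cursor.execute(
            """SELECT cur_namespace, cur_title
                 FROM links
                 JOIN cur ON l_to = cur_id
                WHERE l_from = %s""",
            (from_page,),
        )
        return {(ns, title) for ns, title in cursor.fetchall()}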
With stub links _enabled_: we check every link individually for existence and size. The namespace/title pair is indexed, so lookup shouldn't be _too_ much of a burden in either case, but I don't know whether MySQL _knows_ the text blob size or has to count up the bytes. In any case, lots of little queries mean lots of network packets and lots of SQL parsing.
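The per-link check on the stub path would be shaped something like this (again, schema details are illustrative); whether LENGTH(cur_text) is cheap or forces MySQL to read the whole blob is exactly the open question above:

    # Sketch only: individual existence/size check when stub display is on.
    def check_link(cursor, namespace, title):
        cursor.execute(
            """SELECT LENGTH(cur_text)
                 FROM cur
                WHERE cur_namespace = %s AND cur_title = %s""",
            (namespace, title),
        )
        row = cursor.fetchone()
        if row is None:
            return None   # broken link
        return row[0]     # byte count, compared against the stub threshold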
Links that are checked are kept in an array, so if you've got a history list with a billion entries by one user, we only look up their user page's existence once during the run; but this information is lost when the script ends.
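That cache amounts to simple memoization on the (namespace, title) pair; a minimal sketch:

    # Sketch only: memoize link checks for the lifetime of one request.
    # Usage (with check_link from above):
    #   cached = make_cached_checker(lambda ns, t: check_link(cursor, ns, t))
    def make_cached_checker(check):
        cache = {}
        def cached(namespace, title):
            key = (namespace, title)
            if key not in cache:
                cache[key] = check(namespace, title)
            return cache[key]
        return cached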
(The testing code includes some limited support for keeping existence checks in memcached, so we retain this info across script calls and can reuse for multiple source pages.)
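Pushing the same information into memcached might look roughly like this, using the python-memcached client purely for illustration; the key layout and expiry are made up, and invalidating entries when pages are created or deleted is the part this sketch glosses over:

    # Sketch only: share existence/size results across requests via memcached.
    import memcache  # python-memcached client

    mc = memcache.Client(["127.0.0.1:11211"])

    def shared_check(namespace, title, check, ttl=3600):
        # DB titles use underscores, so the key contains no spaces.
        key = "linkcache:%d:%s" % (namespace, title.replace(" ", "_"))
        size = mc.get(key)
        if size is None:                    # cache miss
            size = check(namespace, title)  # byte count, or None if missing
            # Store -1 for a nonexistent page so it can be told apart from a miss.
            mc.set(key, -1 if size is None else size, time=ttl)
            return size
        return None if size == -1 else size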
Possible additional optimizations, which may or may not be good ideas:
* use the brokenlinks table as well as links to cut down queries for pages with many broken links
* for stubs, grab page sizes of known links through the links join
* store up all the links to check while parsing, then check them all in one big query and replace the final links (this would help with memcached as well, since we could do batch queries; see the sketch after this list)
* for recentchanges, history etc. use some sort of join to check the existence of users' talk pages during the main query
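For the third item, the batched check might be shaped like this (same illustrative schema as above): collect every (namespace, title) pair while parsing, then resolve them all with a single query.

    # Sketch only: resolve all collected links in one query instead of one
    # query per link; the same list could be handed to memcached as a batch.
    def batch_check(cursor, pairs):
        # pairs: iterable of (namespace, title); returns a dict mapping each
        # pair to its text length, or None if the page doesn't exist.
        pairs = list(set(pairs))
        if not pairs:
            return {}
        clause = " OR ".join(
            ["(cur_namespace = %s AND cur_title = %s)"] * len(pairs)
        )
        args = [value for pair in pairs for value in pair]
        cursor.execute(
            "SELECT cur_namespace, cur_title, LENGTH(cur_text)"
            " FROM cur WHERE " + clause,
            args,
        )
        found = {(ns, title): size for ns, title, size in cursor.fetchall()}
        return {pair: found.get(pair) for pair in pairs}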
-- brion vibber (brion @ pobox.com)