Andre Engels wrote:
> My first intuition would be that a gain can well be made in the area
> of checking the existence and size of linked-to pages when reading a
> page, probably at the cost of making changing pages slower. Problem
> is that I don't know what is currently done (this has been proposed
> before, so maybe it has been wholly or partly implemented already)
> nor what exactly MySQL can and cannot do and at what speed (in
> particular, performing a binary search on a table that it knows to
> be ordered).
Size is only checked if you've got stub display on, which is fairly
rare. 172 users on en have it on, 110 of whom have logged in this month
(out of 4206 total accounts used this month).
When stub links are _disabled_: a first pass for the existence check is
done by grabbing the entries for the links table marked as linking from
the current page; this takes one query, and is an indexed join to cur.
Anything that wasn't in the links table (for instance the talk page;
new links when looking at a preview; linked user pages when looking at
Recent changes, etc.; and any actual broken links) is then
double-checked individually as it comes along.
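The two-pass scheme above can be sketched roughly as follows, using
sqlite3 in place of MySQL. The table and column names (cur, links,
l_from, l_to, cur_namespace, cur_title) are approximations of the
schema of the era, not the exact MediaWiki code:

```python
# Hedged sketch of the two-step existence check: one indexed join for
# links already recorded in the links table, then individual lookups
# for anything else. Schema and names are illustrative assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cur (cur_id INTEGER PRIMARY KEY,
                      cur_namespace INTEGER, cur_title TEXT);
    CREATE INDEX name_title ON cur (cur_namespace, cur_title);
    CREATE TABLE links (l_from TEXT, l_to INTEGER);
""")
conn.executemany(
    "INSERT INTO cur (cur_namespace, cur_title) VALUES (?, ?)",
    [(0, "Foo"), (0, "Bar")])
# Record that the page "Main" is known to link to Foo (cur_id 1):
conn.execute("INSERT INTO links VALUES ('Main', 1)")

def known_links(from_page):
    """First pass: one query, an indexed join from links to cur,
    grabbing everything recorded as linked from this page."""
    rows = conn.execute(
        "SELECT cur_namespace, cur_title FROM links, cur "
        "WHERE l_from = ? AND l_to = cur_id", (from_page,))
    return {(ns, title) for ns, title in rows}

def exists(ns, title, cache):
    """Second pass: anything not covered by the first pass gets an
    individual indexed lookup on (cur_namespace, cur_title)."""
    if (ns, title) in cache:
        return True
    row = conn.execute(
        "SELECT 1 FROM cur WHERE cur_namespace = ? AND cur_title = ?",
        (ns, title)).fetchone()
    return row is not None

cache = known_links("Main")
# "Foo" is satisfied by the first-pass join; "Bar" falls back to an
# individual query; "Baz" is a genuinely broken link.
```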
With stub links _enabled_: we check every link individually for
existence and size. The namespace/title pair is indexed, so lookup
shouldn't be _too_ much of a burden in either case, but I don't know if
it _knows_ the text blob size or if it's got to count up bytes or
something. In any case lots of little queries means lots of network
packets and lots of SQL parsing.
Links that are checked are kept in an array, so if you've got a history
list with a billion entries by one user, we only look up their user
page's existence once during the run; but this information is lost when
the script ends.
(The testing code includes some limited support for keeping existence
checks in memcached, so we retain this info across script calls and can
reuse for multiple source pages.)
Possible additional optimizations, which may or may not be good ideas:
* use the brokenlinks table as well as links to cut down queries for
pages with many broken links
* for stubs, grab page sizes of known links through the links join
* store up all the links to check while parsing, then check them all in
one big query and replace the final links (this would help with
memcached as well, as we can do batch queries)
* for recentchanges, history etc use some sort of join to check
existence of users' talk pages during the main query
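The batching idea in the third bullet might look something like this
sketch (sqlite3 again standing in for MySQL; batch_exists and the
two-column cur table are assumptions for illustration). One IN (...)
query replaces N individual SELECTs, which also maps naturally onto a
memcached multi-get:

```python
# Hedged sketch of "store up all the links while parsing, then check
# them in one big query" instead of one round trip per link.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cur (cur_namespace INTEGER, cur_title TEXT)")
conn.execute("CREATE INDEX name_title ON cur (cur_namespace, cur_title)")
conn.executemany("INSERT INTO cur VALUES (0, ?)", [("Foo",), ("Bar",)])

def batch_exists(titles):
    """Resolve many titles with a single indexed IN (...) query,
    saving the network packets and SQL parsing of per-link queries."""
    placeholders = ",".join("?" * len(titles))
    found = {t for (t,) in conn.execute(
        "SELECT cur_title FROM cur WHERE cur_namespace = 0 "
        "AND cur_title IN (%s)" % placeholders, titles)}
    return {t: t in found for t in titles}

# Titles accumulated during parsing, checked all at once:
result = batch_exists(["Foo", "Bar", "Baz"])
```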
-- brion vibber (brion @ pobox.com)