On 1/31/13, Max Semenik maxsem.wiki@gmail.com wrote:
A month ago, the PageImages extension[1] was black-deployed, intended to automatically associate images with articles. It populates its data when LinksUpdate is run, i.e. when a page or a template it transcludes is edited or purged. Since then, most pages have been re-parsed; however, slightly fewer than a million English WP articles remain:
select count(*), avg(page_len) from page where page_namespace=0 and page_is_redirect=0 and page_touched < '20121229000000';
+----------+---------------+
| count(*) | avg(page_len) |
+----------+---------------+
|   977568 |     3172.0948 |
+----------+---------------+
1 row in set (5 min 59.55 sec)
[..]
You do realize that page_touched gets updated by a bunch of things, many of which do not cause a LinksUpdate to happen? So running the script as you proposed will miss pages that were touched after the cutoff without being re-parsed, and the table will not be populated for all of them.
Of course, there really isn't any way to figure out when the last LinksUpdate happened, so I suppose page_touched is as close as we can get. I guess in most cases, if something has had its page_touched updated by a non-LinksUpdate event, that probably means people actually look at the article, so someone has probably edited it already or will edit it soon.
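To make the gap concrete, here is a toy sketch in Python using an in-memory SQLite table. It is only an illustration under stated assumptions: the last_links_update column is hypothetical (as noted above, nothing records that timestamp), and the three rows are made-up examples, not real data.

```python
import sqlite3

CUTOFF = "20121229000000"  # cutoff timestamp from the quoted query


def build_toy_db():
    """Toy stand-in for the page table, with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE page ("
        " page_id INTEGER PRIMARY KEY,"
        " page_namespace INTEGER,"
        " page_is_redirect INTEGER,"
        " page_touched TEXT,"
        " last_links_update TEXT)"  # hypothetical column, for illustration only
    )
    conn.executemany(
        "INSERT INTO page VALUES (?, 0, 0, ?, ?)",
        [
            # Not touched and not re-parsed since the cutoff.
            (1, "20121201000000", "20121201000000"),
            # Edited after the cutoff, so LinksUpdate ran.
            (2, "20130115000000", "20130115000000"),
            # Touched after the cutoff by something that ran no LinksUpdate.
            (3, "20130110000000", "20121115000000"),
        ],
    )
    return conn


def stale_by_page_touched(conn, cutoff=CUTOFF):
    """Pages the page_touched filter would pick up."""
    rows = conn.execute(
        "SELECT page_id FROM page"
        " WHERE page_namespace = 0 AND page_is_redirect = 0"
        " AND page_touched < ? ORDER BY page_id",
        (cutoff,),
    )
    return [r[0] for r in rows]


def truly_stale(conn, cutoff=CUTOFF):
    """Pages that actually lack a LinksUpdate since the cutoff."""
    rows = conn.execute(
        "SELECT page_id FROM page"
        " WHERE page_namespace = 0 AND page_is_redirect = 0"
        " AND last_links_update < ? ORDER BY page_id",
        (cutoff,),
    )
    return [r[0] for r in rows]


if __name__ == "__main__":
    conn = build_toy_db()
    selected = stale_by_page_touched(conn)
    needed = truly_stale(conn)
    print("selected by page_touched:", selected)
    print("missed:", sorted(set(needed) - set(selected)))
```

Page 3 is the problem case: its page_touched is newer than the cutoff, so the filter skips it even though its links data predates the deployment.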
--bawolff