On Mon, Sep 14, 2015 at 4:49 PM, Platonides platonides@gmail.com wrote:
You know it will fail for all kinds of images included through templates (particularly infoboxes), right?
Indeed, it is not possible to find out which thumbnails a page uses without actually parsing it. Your best bet is to wait until Parsoid dumps become available (T17017, https://phabricator.wikimedia.org/T17017), then go through those with an XML parser and extract the thumbnail URLs. That is still slow, but not as slow as the MediaWiki parser. (Or you can try to find a regexp that matches thumbnail URLs, but we all know what happens when you use a regexp to parse HTML: http://stackoverflow.com/a/1732454/323407.) After that, just throw those URLs at the 404 handler.
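For illustration, here is a rough sketch of the "go through the HTML with a parser and pull out the thumbnail URLs" step. It is only a sketch under assumptions: the dump format is not known yet, so it takes the HTML of one page as a string; it uses Python's standard html.parser rather than a strict XML parser; and it assumes thumbnail URLs can be recognised by a "/thumb/" path segment in the src of an img tag. The names ThumbCollector and extract_thumb_urls are made up for this example.

```python
from html.parser import HTMLParser


class ThumbCollector(HTMLParser):
    """Collect <img src> values that look like thumbnail URLs."""

    def __init__(self):
        super().__init__()
        self.thumbs = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag != "img":
            return
        src = dict(attrs).get("src")
        # Assumption: thumbnail URLs contain a "/thumb/" path segment.
        if src and "/thumb/" in src:
            self.thumbs.append(src)


def extract_thumb_urls(html):
    """Return the thumbnail-looking image URLs found in one page's HTML."""
    parser = ThumbCollector()
    parser.feed(html)
    parser.close()
    return parser.thumbs
```

The returned list is what you would then request one by one, so that missing thumbnails go through the 404 handler.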