Dear Brian,
On 9/13/15, Brian Wolff <bawolff@gmail.com> wrote:
On 9/12/15, wp mirror <wpmirrordev@gmail.com> wrote:
- Context
I am currently developing new features for WP-MIRROR (see <https://www.mediawiki.org/wiki/Wp-mirror>).
- Objective
I would like WP-MIRROR to generate all image thumbs during the mirror build process. This is so that MediaWiki can render pages quickly using precomputed thumbs.
- Dump importation
maintenance/importDump.php - computes thumbs during importation, but is too slow.
mwxml2sql - loads databases quickly, but does not compute thumbs.
- Question
Is there a way to compute all the thumbs after loading databases quickly with mwxml2sql?
Sincerely Yours, Kent
Hi. My understanding is that wp-mirror sets up a MediaWiki instance for rendering the mirror. One solution would be to set up 404-thumb rendering. This makes it so that instead of pre-rendering the needed thumbs, MediaWiki will render the thumbs on demand whenever the web browser requests a thumb. There are instructions for how this works at https://www.mediawiki.org/wiki/Manual:Thumb.php. This is probably the best solution to your problem.
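For example, once the 404 handler is wired up, the first request for a missing thumb causes thumb.php to render and cache it. A sketch of what that looks like from the client side (the host, script path, file name, and width here are placeholders, not your actual setup):

  curl -s -o /dev/null -w '%{http_code}\n' \
    'http://localhost/w/thumb.php?f=Example.jpg&width=220'

thumb.php's f and width parameters are the file name (without the File: prefix) and the desired thumb width.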
Right. Currently, wp-mirror does set up MediaWiki to use 404-thumb rendering.
This works fine, but can cause a few seconds of latency when rendering pages. Also, it would be nice to be able to generate thumb dump tarballs, just as we used to generate original-size media dump tarballs. I would like wp-mirror to have such dump features.
Otherwise, MW needs to know what thumbs are needed for all pages, which involves parsing pages (e.g. via refreshLinks.php). This is a very slow process. If you already had all the thumbnails generated, you could perhaps just copy over the thumb directory, but I'm not sure where you would get a pre-generated thumb directory.
Wp-mirror does load the *links.sql.gz dump files into the *links tables, because this method is two orders of magnitude faster than maintenance/refreshLinks.php.
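For example (a sketch; the wiki name, dump date, and database name are placeholders for whatever wp-mirror is actually mirroring):

  # load the link tables directly, instead of re-parsing every page
  zcat enwiki-20150901-imagelinks.sql.gz | mysql --default-character-set=binary wikidb
  zcat enwiki-20150901-pagelinks.sql.gz | mysql --default-character-set=binary wikidb
  zcat enwiki-20150901-templatelinks.sql.gz | mysql --default-character-set=binary wikidb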
--
-bawolff
Idea. I am thinking of piping the *pages-articles.xml.bz2 dump file through an AWK script to write all unique [[File:*]] tags into a file. This can be done quickly. The question then is: given a file with all the media tags, how can I generate all the thumbs? What MediaWiki function shall I call? Can this be done using the web API? Any other ideas?
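Roughly, the extraction step I have in mind (a sketch; the dump file name is a placeholder, it relies on gawk's IGNORECASE extension, and it only sees files linked directly in the wikitext):

  bzcat enwiki-20150901-pages-articles.xml.bz2 |
  gawk 'BEGIN { IGNORECASE = 1 }
  {
    line = $0
    # collect every [[File:...]] / [[Image:...]] target, once each
    while (match(line, /\[\[(File|Image):[^]|]+/)) {
      tag = substr(line, RSTART + 2, RLENGTH - 2)   # e.g. "File:Example.jpg"
      if (!(tag in seen)) { seen[tag] = 1; print tag }
      line = substr(line, RSTART + RLENGTH)
    }
  }' > file-tags.txt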
Sincerely Yours, Kent
On 15/09/15 01:34, wp mirror wrote:
Idea. I am thinking of piping the *pages-articles.xml.bz2 dump file through an AWK script to write all unique [[File:*]] tags into a file. This can be done quickly. The question then is: given a file with all the media tags, how can I generate all the thumbs? What MediaWiki function shall I call? Can this be done using the web API? Any other ideas?
Sincerely Yours, Kent
You know it will fail for all kinds of images included through templates (particularly infoboxes), right?
On Mon, Sep 14, 2015 at 4:49 PM, Platonides <platonides@gmail.com> wrote:
You know it will fail for all kinds of images included through templates (particularly infoboxes), right?
Indeed, it is not possible to find out what thumbnails are used by a page without actually parsing it. Your best bet is to wait until Parsoid dumps become available (T17017: https://phabricator.wikimedia.org/T17017), then go through those with an XML parser and extract the thumb URLs. That's still slow, but not as slow as the MediaWiki parser. (Or you can try to find a regexp which matches thumbnail URLs, but we all know what happens (http://stackoverflow.com/a/1732454/323407) when you use a regexp to parse HTML.) After that, just throw those URLs at the 404 handler.
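For example (a sketch, assuming you have already extracted the thumb URLs into a file, one per line; the file name is a placeholder):

  # each request for a not-yet-rendered thumb makes the 404 handler
  # render and cache it
  xargs -n 1 curl -s -o /dev/null < thumb-urls.txt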