On 06.04.2011 09:15, Alex Brollo wrote:
I saved the HTML source of a typical Page: page from it.source; the resulting text file is about 28 kB. Then I saved the "core html" only, i.e. the content of <div class="pagetext">, and that file is 2.1 kB; so there's a more than tenfold ratio between "container" and "real content".
wow, really? that seems like a lot...
Is there a trick to download the "core html" only?
there are two ways:
a) the old style "render" action, like this: http://en.wikipedia.org/wiki/Foo?action=render
b) the api "parse" action, like this: http://en.wikipedia.org/w/api.php?action=parse&page=Foo&redirects=1&format=xml
To learn more about the web API, have a look at http://www.mediawiki.org/wiki/API
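For illustration, here is a minimal sketch of option (b) in Python, assuming the third-party "requests" library; the URL above uses format=xml, JSON is used here only because it is easier to unpack, and the page title is just an example:

# Minimal sketch: fetch only the parsed page body via the MediaWiki API.
# Assumes the third-party "requests" library; endpoint and page title are
# examples only.
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def fetch_core_html(title):
    """Return the rendered HTML of the page body (no skin, no site chrome)."""
    params = {
        "action": "parse",
        "page": title,
        "prop": "text",      # only the parsed HTML, not links/categories etc.
        "redirects": 1,
        "format": "json",
    }
    resp = requests.get(API_URL, params=params)
    resp.raise_for_status()
    data = resp.json()
    # The classic JSON format nests the HTML under parse.text["*"]
    return data["parse"]["text"]["*"]

if __name__ == "__main__":
    html = fetch_core_html("Foo")
    print(len(html), "bytes of core HTML")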
And, most importantly: could this save a little bit of server load/bandwidth?
No, quite the contrary. The full page HTML is heavily cached. If you pull the full page (without being logged in), it's quite likely that the page will be served from a front-tier reverse proxy (Squid or Varnish). API requests and render actions, however, always go through to the actual Apache servers and cause more load.
However, as long as you don't make several requests at once, you are not putting any serious strain on the servers. Wikimedia serves more than a hundred thousand requests per second. One more is not so terrible...
I humbly think that the "core html" alone could be useful as a means to obtain well-formed page content, and that this could be useful for deriving other formats of the page (e.g. ePub).
It is indeed frequently used for that.
cheers, daniel