On 06.04.2011 09:15, Alex Brollo wrote:
> I saved the HTML source of a typical Page: page from it.source; the
> resulting text file is ~28 kB. Then I saved the "core html" only, i.e.
> the content of <div class="pagetext">, and this file is 2.1 kB; so
> there's a more than tenfold ratio between "container" and "real
> content".
wow, really? that seems a lot...
> Is there a trick to download the "core html" only?
there are two ways:
a) the old style "render" action, like this:
<http://en.wikipedia.org/wiki/Foo?action=render>
b) the api "parse" action, like this:
<http://en.wikipedia.org/w/api.php?action=parse&page=Foo&redirects=1&format=xml>
To learn more about the web API, have a look at <http://www.mediawiki.org/wiki/API>
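For example, here is a minimal Python sketch of option (b). It assumes
Python 3, and it uses format=json instead of xml simply because it is
easier to parse; the page title "Foo" and the User-Agent string are
placeholders:

    import json
    import urllib.request

    # Placeholder endpoint and page title; adjust for your wiki
    # (e.g. it.wikisource.org for it.source).
    URL = ("https://en.wikipedia.org/w/api.php"
           "?action=parse&page=Foo&redirects=1&format=json")

    # Wikimedia asks clients to send a descriptive User-Agent.
    req = urllib.request.Request(
        URL, headers={"User-Agent": "core-html-example/0.1"})

    with urllib.request.urlopen(req) as response:
        data = json.load(response)

    # The rendered "core html" is under parse -> text -> *
    core_html = data["parse"]["text"]["*"]
    print(core_html[:200])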
> And, most importantly: could this save a little bit of server
> load/bandwidth?
No, quite the contrary. The full page HTML is heavily cached. If you pull the
full page (without being logged in), it's quite likely that the page will be
served from a front-tier reverse proxy (Squid or Varnish). API requests and
render actions, however, always go through to the actual Apache servers and
cause more load.
However, as long as you don't make several requests at once, you are not
putting any serious strain on the servers. Wikimedia serves more than a
hundred thousand requests per second. One more is not so terrible...
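If you do end up fetching many pages, it's polite to space the requests
out. A rough sketch along the same lines (the one-second pause is an
arbitrary choice of mine, not an official limit):

    import time
    import urllib.parse
    import urllib.request

    titles = ["Foo", "Bar", "Baz"]  # hypothetical list of page titles

    for title in titles:
        url = ("https://en.wikipedia.org/w/api.php"
               "?action=parse&format=json&redirects=1&page="
               + urllib.parse.quote(title))
        req = urllib.request.Request(
            url, headers={"User-Agent": "core-html-example/0.1"})
        with urllib.request.urlopen(req) as response:
            response.read()  # process the core html here
        time.sleep(1)  # one request per second keeps the load trivial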
> I humbly think that "core html" alone could be useful as a means to
> obtain "well-formed page content", and that this could be useful to
> obtain derived formats of the page (e.g. ePub).
It is indeed frequently used for that.
cheers,
daniel