On 06.04.2011 09:15, Alex Brollo wrote:
> I saved the HTML source of a typical Page: page from it.source; the
> resulting text file is ~28 kB. Then I saved the "core html" only, i.e.
> the content of <div class="pagetext">, and this file is 2.1 kB; so
> there's a more than tenfold ratio between "container" and "real
> content".
wow, really? that seems a lot...
> Is there a trick to download the "core html" only?
there are two ways:
a) the old style "render" action, like this:
<http://en.wikipedia.org/wiki/Foo?action=render>
b) the api "parse" action, like this:
<http://en.wikipedia.org/w/api.php?action=parse&page=Foo&redirects=1&format=xml>
To learn more about the web API, have a look at <http://www.mediawiki.org/wiki/API>
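For example, here is a minimal Python sketch of option (b). It assumes
Python 3, and it uses format=json instead of xml simply because it is
easier to parse; the page title "Foo" and the User-Agent string are
placeholders:

    import json
    import urllib.request

    # Placeholder endpoint and page title; adjust for your wiki
    # (e.g. it.wikisource.org for it.source).
    URL = ("https://en.wikipedia.org/w/api.php"
           "?action=parse&page=Foo&redirects=1&format=json")

    # Wikimedia asks clients to send a descriptive User-Agent.
    req = urllib.request.Request(
        URL, headers={"User-Agent": "core-html-example/0.1"})

    with urllib.request.urlopen(req) as response:
        data = json.load(response)

    # The rendered "core html" is under parse -> text -> *
    core_html = data["parse"]["text"]["*"]
    print(core_html[:200])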
> And, most importantly: could this save a little bit of server
> load/bandwidth?
No, quite the contrary. The full page HTML is heavily cached. If you pull the
full page (without being logged in), it's quite likely that the page will be
served from a front-tier reverse proxy (Squid or Varnish). API requests and
render actions, however, always go through to the actual Apache servers and
cause more load.
However, as long as you don't make several requests at once, you are not
putting any serious strain on the servers. Wikimedia serves more than a
hundred thousand requests per second. One more is not so terrible...
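If you do end up fetching many pages, it's polite to space the requests
out. A rough sketch along the same lines (the one-second pause is an
arbitrary choice of mine, not an official limit):

    import time
    import urllib.parse
    import urllib.request

    titles = ["Foo", "Bar", "Baz"]  # hypothetical list of page titles

    for title in titles:
        url = ("https://en.wikipedia.org/w/api.php"
               "?action=parse&format=json&redirects=1&page="
               + urllib.parse.quote(title))
        req = urllib.request.Request(
            url, headers={"User-Agent": "core-html-example/0.1"})
        with urllib.request.urlopen(req) as response:
            response.read()  # process the core html here
        time.sleep(1)  # one request per second keeps the load trivial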
> I humbly think that "core html" alone could be useful as a means to
> obtain "well-formed page content", and that this could be useful to
> obtain derived formats of the page (e.g. ePub).
It is indeed frequently used for that.
cheers,
daniel