I saved the HTML source of a typical Page: page from it.source; the resulting txt file is ~28 kB. Then I saved the "core html" only, i.e. the content of <div class="pagetext">, and this file is 2.1 kB; so there's a more than tenfold ratio between "container" and "real content".
Is there a trick to download the "core html" only? And, most important: could this save a little bit of server load/bandwidth? I humbly think that the "core html" alone could be useful as a means to obtain "well-formed page content", and that this could be useful to obtain derived formats of the page (e.g. ePub).
Alex Brollo
On 06.04.2011 09:15, Alex Brollo wrote:
I saved the HTML source of a typical Page: page from it.source; the resulting txt file is ~28 kB. Then I saved the "core html" only, i.e. the content of <div class="pagetext">, and this file is 2.1 kB; so there's a more than tenfold ratio between "container" and "real content".
wow, really? that seems a lot...
Is there a trick to download the "core html" only?
there are two ways:
a) the old style "render" action, like this: http://en.wikipedia.org/wiki/Foo?action=render
b) the api "parse" action, like this: http://en.wikipedia.org/w/api.php?action=parse&page=Foo&redirects=1&format=xml
To learn more about the web API, have a look at http://www.mediawiki.org/wiki/API
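In Python, that might look roughly like this (a sketch using only the standard library; the page title "Foo" and the ExampleBot User-Agent string are just placeholders):

# Rough sketch of both approaches, standard library only.
import urllib.parse
import urllib.request

def fetch(url):
    # A descriptive User-Agent with contact info is good manners for bots.
    req = urllib.request.Request(
        url, headers={"User-Agent": "ExampleBot/0.1 (contact: you@example.org)"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

title = urllib.parse.quote("Foo")

# a) old-style render action: just the parsed page HTML
render_html = fetch("http://en.wikipedia.org/wiki/%s?action=render" % title)

# b) API parse action: the same HTML, wrapped in API metadata (XML here)
api_xml = fetch(
    "http://en.wikipedia.org/w/api.php?action=parse&page=%s&redirects=1&format=xml"
    % title
)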
And, most important: could this save a little bit of server load/bandwidth?
No, quite to the contrary. The full page HTML is heavily cached. If you pull the full page (without being logged in), it's quite likely that the page will be served from a front tier reverse proxy (squid or varnish). API requests and render actions however always go through to the actual Apache servers and cause more load.
However, as long as you don't make several requests at once, you are not putting any serious strain on the servers. Wikimedia serves more than a hundred thousand requests per second. One more is not so terrible...
I humbly think that the "core html" alone could be useful as a means to obtain "well-formed page content", and that this could be useful to obtain derived formats of the page (e.g. ePub).
It is indeed frequently used for that.
cheers, daniel
2011/4/6 Daniel Kinzler daniel@brightbyte.de
On 06.04.2011 09:15, Alex Brollo wrote:
I saved the HTML source of a typical Page: page from it.source; the resulting txt file is ~28 kB. Then I saved the "core html" only, i.e. the content of <div class="pagetext">, and this file is 2.1 kB; so there's a more than tenfold ratio between "container" and "real content".
wow, really? that seems a lot...
Is there a trick to download the "core html" only?
there are two ways:
a) the old style "render" action, like this: http://en.wikipedia.org/wiki/Foo?action=render
b) the api "parse" action, like this: http://en.wikipedia.org/w/api.php?action=parse&page=Foo&redirects=1&format=xml
To learn more about the web API, have a look at http://www.mediawiki.org/wiki/API
Thanks Daniel, API stuff is a little hard for me: the more I study, the less I edit. :-)
Just to have a try, I called the same page: the "render" action gives a file of ~3.4 kB, the "api" action a file of ~5.6 kB. Obviously I'm thinking of bot downloads. Are you suggesting that it would be a good idea to use an *unlogged* bot to avoid page parsing, and to fetch the page code from some cache? I know that some thousands of calls are nothing for wiki servers, but... I always try to get good performance, even from the most banal template.
Alex
Hi Alex
Thanks Daniel, API stuff is a little hard for me: the more I study, the less I edit. :-)
Just to have a try, I called the same page: the "render" action gives a file of ~3.4 kB, the "api" action a file of ~5.6 kB.
That's because the render call returns just the HTML, while the API call includes some meta-info in the XML wrapper.
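If the wrapper is what bothers you, you can also ask the API for JSON and pull out just the HTML; roughly like this (a sketch: the "parse" -> "text" -> "*" layout is the usual shape of the JSON result, and ExampleBot is again a placeholder):

import json
import urllib.parse
import urllib.request

def core_html(title):
    params = urllib.parse.urlencode({
        "action": "parse",
        "page": title,
        "redirects": 1,
        "prop": "text",     # only the rendered HTML, no links/categories/etc.
        "format": "json",
    })
    req = urllib.request.Request(
        "http://en.wikipedia.org/w/api.php?" + params,
        headers={"User-Agent": "ExampleBot/0.1 (contact: you@example.org)"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read().decode("utf-8"))
    return data["parse"]["text"]["*"]   # the "core html" itself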
Obviously I'm thinking of bot downloads. Are you suggesting that it would be a good idea to use an *unlogged* bot to avoid page parsing, and to fetch the page code from some cache?
No. I'm saying that non-logged-in views of full pages are what causes the least server load. I'm not saying that this is what you should use. For one thing, it wastes bandwidth and causes additional work on your side (stripping the skin cruft).
I would recommend using action=render if you need just the plain old HTML, or the API if you need a bit more control, e.g. over whether templates are resolved or not, how redirects are handled, etc.
Whether your bot is logged in when fetching the pages would only matter if you requested full page HTML. Which, as I said, isn't the best option for what you are doing. So, log in or not, it doesn't matter. But do use a distinctive and descriptive User-Agent string for your bot, ideally containing some contact info (see http://meta.wikimedia.org/wiki/User-Agent_policy).
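If you happen to use the third-party requests library, for example, a Session with default headers is an easy way to make sure every call carries that info (the bot name, contact address, and page title below are placeholders):

import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "ExampleBot/0.1 (it.wikisource bot; contact: you@example.org)"
})

# Every call made through this session now carries the descriptive User-Agent.
r = session.get(
    "http://it.wikisource.org/w/api.php",
    params={"action": "parse", "page": "Foo", "redirects": 1,
            "prop": "text", "format": "json"},
)
r.raise_for_status()
html = r.json()["parse"]["text"]["*"]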
Note that as soon as the bot does any editing, it really should be logged in, and, depending on the wiki's rules, have a bot flag, or have some specific info on its user page.
I know that some thousands of calls are nothing for wiki servers, but... I always try to get good performance, even from the most banal template.
That's always a good idea :)
-- daniel
2011/4/6 Daniel Kinzler daniel@brightbyte.de
I know that some thousands of calls are nothing for wiki servers, but... I always try to get good performance, even from the most banal template.
That's always a good idea :)
-- daniel
Thanks Daniel. So, my edits will drop again. I'll put myself into "study mode" :-D
Alex
On Wed, Apr 6, 2011 at 3:15 AM, Alex Brollo alex.brollo@gmail.com wrote:
I saved the HTML source of a typical Page: page from it.source; the resulting txt file is ~28 kB. Then I saved the "core html" only, i.e. the content of <div class="pagetext">, and this file is 2.1 kB; so there's a more than tenfold ratio between "container" and "real content".
Is there a trick to download the "core html" only? And, most important: could this save a little bit of server load/bandwidth?
It could save a huge amount of bandwidth. This could be a big deal for mobile devices, in particular, but it could also reduce TCP round-trips and make things noticeably snappier for everyone. I recently read about an interesting technique on Steve Souders' blog, which he got by analyzing Google and Bing:
http://www.stevesouders.com/blog/2011/03/28/storager-case-study-bing-google/
The gist is that when the page loads, assuming script is enabled, you store static pieces of the page in localStorage (available on the large majority of browsers, including IE8) and set a cookie. Then if the cookie is present on a subsequent request, the server doesn't send the repetitive static parts, and instead sends a <script> that inserts the desired contents synchronously from localStorage. I guess this breaks horribly if you request a page with script enabled and then request another page with script disabled, but otherwise the basic idea seems powerful and reliable, if a bit tricky to get right.
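A very rough server-side sketch of the idea (a hypothetical Flask app; the "has_chrome" cookie and the CHROME_HTML markup are made up for illustration, not what Bing, Google, or MediaWiki actually do):

# Hypothetical sketch of the cookie + localStorage idea described above.
import json
from flask import Flask, make_response, request

app = Flask(__name__)

# The repetitive "container" markup (skin, navigation, footer) shared by every page.
CHROME_HTML = "<div id='sidebar'>...</div><div id='footer'>...</div>"

def core_html(title):
    # Stand-in for the per-page "core html".
    return "<div class='pagetext'>content of %s</div>" % title

@app.route("/wiki/<title>")
def page(title):
    if request.cookies.get("has_chrome") == "1":
        # Returning visitor with script: skip the static chrome and re-insert it
        # synchronously on the client from localStorage.
        return ("<script>document.write(localStorage.getItem('chrome') || '');</script>"
                + core_html(title))

    # First visit: send everything, stash the chrome client-side, set the cookie.
    body = (CHROME_HTML
            + core_html(title)
            + "<script>localStorage.setItem('chrome', %s);</script>"
              % json.dumps(CHROME_HTML))
    resp = make_response(body)
    resp.set_cookie("has_chrome", "1")
    return resp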
Of course, it would at least double the amount of space pages take in Squid cache, so for us it might not be a great idea. Still interesting, and worth keeping in mind.