Plain text of page?

Karlheinz Toni

23 Jul 2004

12:52 p.m.

Hi there,

What possibilities do I have to get the plain text of a page? On the one hand, I would like to have access to the source; on the other hand, I would like to be able to get the formatted text. Is there any way of doing so?

Sincerely Charly

Timwi

23 Jul

1:25 p.m.

Karlheinz Toni wrote:

...

what possibilities do I have to get the plain text of a page?

http://en.wikipedia.org/w/wiki.phtml?title=Main_Page&action=raw if you want the "correct" but uncommon and usually unsupported content-type of text/x-wiki; or

http://en.wikipedia.org/w/wiki.phtml?title=Main_Page&action=raw&ctyp... if you want the "hacky" but more useful text/css which makes it display as plain text in browsers.
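
A minimal Python sketch of the first variant, for anyone who wants to script it. It only uses the action=raw form quoted above; the helper names raw_url and fetch_wikitext are hypothetical, and the UTF-8 decoding is an assumption about what the server sends.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base script URL as quoted in this thread; other wikis will differ.
BASE_URL = "http://en.wikipedia.org/w/wiki.phtml"


def raw_url(title, base_url=BASE_URL):
    """Build the action=raw URL for a page title (hypothetical helper)."""
    return base_url + "?" + urlencode({"title": title, "action": "raw"})


def fetch_wikitext(title, base_url=BASE_URL):
    """Download the wiki source of a page; needs network access."""
    with urlopen(raw_url(title, base_url)) as response:
        # Assumption: the server sends UTF-8; adjust if it does not.
        return response.read().decode("utf-8")
```

Calling raw_url("Main_Page") rebuilds the first URL above; fetch_wikitext("Main_Page") would then return the page source as a string.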

Gutza

3 p.m.

Timwi wrote:

...

Karlheinz Toni wrote:

...
what possibilities do I have to get the plain text of a page?

http://en.wikipedia.org/w/wiki.phtml?title=Main_Page&action=raw if you want the "correct" but uncommon and usually unsupported content-type of text/x-wiki; or

http://en.wikipedia.org/w/wiki.phtml?title=Main_Page&action=raw&ctyp...

if you want the "hacky" but more useful text/css which makes it display as plain text in browsers.

Speaking of which (don't want to start a new thread for this), is there any way to retrieve only the status of a page? Meaning, "0" if it doesn't exist, "1" if it does or an URL if it's a redirect -- I guess that would be the logical way to do it. I think it would be nice to have for automated quick searches.

--Gutza

Brion Vibber

5:35 p.m.

Gutza wrote:

...

Speaking of which (don't want to start a new thread for this), is there any way to retrieve only the status of a page? Meaning, "0" if it doesn't exist, "1" if it does or an URL if it's a redirect -- I guess that would be the logical way to do it. I think it would be nice to have for automated quick searches.

No.

-- brion vibber (brion @ pobox.com)

Timwi

8:19 p.m.

Gutza wrote:

...

Speaking of which (don't want to start a new thread for this), is there any way to retrieve only the status of a page? Meaning, "0" if it doesn't exist, "1" if it does or an URL if it's a redirect -- I guess that would be the logical way to do it. I think it would be nice to have for automated quick searches.

Well, if you try this with a non-existent page:

http://en.wikipedia.org/w/wiki.phtml?title=Nonexistantpage&action=raw&am...

you get "<html><body></body></html>". Not the most professional of ways of saying "this page doesn't exist", but since it's unlikely to be the body of a real article, I suppose you can use this.

Redirects are ridiculously easy to parse with a regular expression.

Timwi

Brion Vibber

8:29 p.m.

Timwi wrote:

...

Well, if you try this with a non-existent page:

http://en.wikipedia.org/w/wiki.phtml?title=Nonexistantpage&action=raw&am...

you get "<html><body></body></html>".

That's a Mozilla bug, the returned data is actually empty.

-- brion vibber (brion @ pobox.com)

Rob Hooft

9:28 p.m.

Timwi wrote:

...

Redirects are ridiculously easy to parse with a regular expression.

Don't forget that the "#redirect" string can be translated, and is translated in the cy: wikipedia to "#ail-cyfeirio". The regular expression is a little more complicated than you might think at first sight....

Regards,

Rob Hooft

Timwi

9:59 p.m.

Rob Hooft wrote:

...

Timwi wrote:

...
Redirects are ridiculously easy to parse with a regular expression.

Don't forget that the "#redirect" string can be translated, and is translated in the cy: wikipedia to "#ail-cyfeirio". The regular expression is a little more complicated than you might think at first sight....

Not really. As far as I understand it, all redirects satisfy /^\s*#\S+\s*:?\s*\[\[([^\]]+)\]\]\s*$/s.

Obviously I'm aware that this can catch non-redirects, but it would have to be a really weird article, and in fact one that probably should be a redirect anyway.
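
A rough Python sketch of how the pieces in this thread could be combined into the "0" / "1" / redirect status Gutza asked about. It assumes the bracket escapes in the pattern were lost in the archive and writes them out, it treats an empty raw response as "page does not exist" (as Brion describes), and it returns the redirect target title rather than a URL. The name page_status is hypothetical.

```python
import re

# The pattern from this message, with the bracket escapes written out.
# The first group captures the redirect target between [[ and ]].
REDIRECT_RE = re.compile(r"^\s*#\S+\s*:?\s*\[\[([^\]]+)\]\]\s*$", re.S)


def page_status(wikitext):
    """Classify raw wikitext: "0" = missing, "1" = exists, else the redirect target."""
    if not wikitext.strip():
        # Empty raw output is taken to mean the page does not exist.
        return "0"
    match = REDIRECT_RE.match(wikitext)
    if match:
        # Any "#keyword [[Target]]" page counts, so translated keywords work too.
        return match.group(1)
    return "1"
```

Because the pattern accepts any non-space keyword after "#", page_status("#ail-cyfeirio [[Hafan]]") is handled the same way as an English "#REDIRECT" page. The page still has to be downloaded first, so this does not address the cost concern of fetching every page.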

Timwi

Gutza

24 Jul

12:01 p.m.

Timwi wrote:

...

Gutza wrote:

...
Speaking of which (don't want to start a new thread for this), is there any way to retrieve only the status of a page? Meaning, "0" if it doesn't exist, "1" if it does or an URL if it's a redirect -- I guess that would be the logical way to do it. I think it would be nice to have for automated quick searches.

Well, if you try this with a non-existent page:

http://en.wikipedia.org/w/wiki.phtml?title=Nonexistantpage&action=raw&am...

you get "<html><body></body></html>". Not the most professional of ways of saying "this page doesn't exist", but since it's unlikely to be the body of a real article, I suppose you can use this.

Redirects are ridiculously easy to parse with a regular expression.

Yes, but you see, if you want for instance to highlight words in a text which can be linked to articles in Wikipedia, then you would need to retrieve all pages for no real reason. Don't get me wrong, I'm not really saying that I *need* this functionality right now, nor am I pushing towards it in any way, I'm just saying that once it's in place, it might help others to write applications around Wikipedia, thus adding value to the project, at least due to the extra exposure, if nothing else.

--Gutza

Karlheinz Toni

26 Jul

8:34 a.m.

New subject: AW: Re: Plain text of page?

Hi there,

...

-----Original Message----- From: wikitech-l-bounces@wikimedia.org [mailto:wikitech-l-bounces@wikimedia.org] On behalf of Timwi Sent: Friday, 23 July 2004 15:25 To: wikitech-l@wikimedia.org Subject: [Wikitech-l] Re: Plain text of page?

Karlheinz Toni wrote:

...
what possibilities do I have to get the plain text of a page?

http://en.wikipedia.org/w/wiki.phtml?title=Main_Page&action=raw if you want the "correct" but uncommon and usually unsupported content-type of text/x-wiki; or

http://en.wikipedia.org/w/wiki.phtml?title=Main_Page&action=raw&ctyp... /css if you want the "hacky" but more useful text/css which makes it display as plain text in browsers.

Thank you for your answer. Is there any possibility of getting the formatted text of the pages (without navigation,... just the information on the page)...

Sincerely Charly

Timwi

9:57 a.m.

Karlheinz Toni wrote:

...

Thank you for your answer. Is there any possibility of getting the formatted text of the pages (without navigation,... just the information on the page)...

You mean the HTML?

Well, you could download the page (http://en.wikipedia.org/wiki/Article) and harvest everything between "<!-- start content -->" and "<!-- end content -->", but obviously this isn't very professional and works only because Monobook includes those comments.
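
A small Python sketch of that harvesting step. It assumes the two markers are the HTML comments "<!-- start content -->" and "<!-- end content -->" emitted by the Monobook skin; both are passed as parameters in case a skin uses different ones, and extract_content is a hypothetical helper name.

```python
# Assumed Monobook comment markers; override them for other skins.
START_MARKER = "<!-- start content -->"
END_MARKER = "<!-- end content -->"


def extract_content(html, start=START_MARKER, end=END_MARKER):
    """Return the HTML between the two markers, or None if either is missing."""
    begin = html.find(start)
    if begin == -1:
        return None
    begin += len(start)
    # Look for the end marker only after the start marker.
    finish = html.find(end, begin)
    if finish == -1:
        return None
    return html[begin:finish].strip()
```

Returning None when a marker is missing makes it obvious when the skin has changed, which is the fragility mentioned above.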

Timwi

wikitech-l@lists.wikimedia.org
