Hi there,
what possibilities do I have to get the plain text of a page? On the one hand, I would like to have access to the wiki source; on the other hand, I would like to be able to get the formatted text. Is there any way to do this?
Sincerely, Charly
Karlheinz Toni wrote:
what possibilities do I have to get the plain text of a page?
http://en.wikipedia.org/w/wiki.phtml?title=Main_Page&action=raw if you want the "correct" but uncommon and usually unsupported content-type of text/x-wiki; or
http://en.wikipedia.org/w/wiki.phtml?title=Main_Page&action=raw&ctyp... if you want the "hacky" but more useful text/css which makes it display as plain text in browsers.
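For scripted access, the first URL above can be built and fetched roughly like this. This is only a sketch: the helper names raw_url and fetch_raw are made up for illustration, the UTF-8 decoding is an assumption about the server, and only the title and action=raw parameters spelled out in full above are used.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint as quoted in this thread; other wikis will use a different base.
BASE_URL = "http://en.wikipedia.org/w/wiki.phtml"


def raw_url(title, base=BASE_URL):
    """Build the action=raw URL for a page title (hypothetical helper)."""
    return base + "?" + urlencode({"title": title, "action": "raw"})


def fetch_raw(title, base=BASE_URL, timeout=10):
    """Download the wiki source of a page.

    Assumption: the response body is UTF-8; adjust if the wiki differs.
    """
    with urlopen(raw_url(title, base), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling fetch_raw("Main_Page") would then return the wiki markup as a string, provided the server still answers at that address.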
Timwi wrote:
Karlheinz Toni wrote:
what possibilities do I have to get the plain text of a page?
http://en.wikipedia.org/w/wiki.phtml?title=Main_Page&action=raw if you want the "correct" but uncommon and usually unsupported content-type of text/x-wiki; or
http://en.wikipedia.org/w/wiki.phtml?title=Main_Page&action=raw&ctyp...
if you want the "hacky" but more useful text/css which makes it display as plain text in browsers.
Speaking of which (don't want to start a new thread for this), is there any way to retrieve only the status of a page? Meaning, "0" if it doesn't exist, "1" if it does, or a URL if it's a redirect -- I guess that would be the logical way to do it. I think it would be nice to have for automated quick searches.
--Gutza
Gutza wrote:
Speaking of which (don't want to start a new thread for this), is there any way to retrieve only the status of a page? Meaning, "0" if it doesn't exist, "1" if it does, or a URL if it's a redirect -- I guess that would be the logical way to do it. I think it would be nice to have for automated quick searches.
No.
-- brion vibber (brion @ pobox.com)
Gutza wrote:
Speaking of which (don't want to start a new thread for this), is there any way to retrieve only the status of a page? Meaning, "0" if it doesn't exist, "1" if it does, or a URL if it's a redirect -- I guess that would be the logical way to do it. I think it would be nice to have for automated quick searches.
Well, if you try this with a non-existent page:
http://en.wikipedia.org/w/wiki.phtml?title=Nonexistantpage&action=raw&am...
you get "<html><body></body></html>". Not the most professional of ways of saying "this page doesn't exist", but since it's unlikely to be the body of a real article, I suppose you can use this.
Redirects are ridiculously easy to parse with a regular expression.
Timwi
Timwi wrote:
Well, if you try this with a non-existent page:
http://en.wikipedia.org/w/wiki.phtml?title=Nonexistantpage&action=raw&am...
you get "<html><body></body></html>".
That's a Mozilla bug, the returned data is actually empty.
-- brion vibber (brion @ pobox.com)
Timwi wrote:
Redirects are ridiculously easy to parse with a regular expression.
Don't forget that the "#redirect" string can be translated, and is translated in the cy: wikipedia to "#ail-cyfeirio". The regular expression is a little more complicated than you might think at first sight....
Regards,
Rob Hooft
Rob Hooft wrote:
Timwi wrote:
Redirects are ridiculously easy to parse with a regular expression.
Don't forget that the "#redirect" string can be translated, and is translated in the cy: wikipedia to "#ail-cyfeirio". The regular expression is a little more complicated than you might think at first sight....
Not really. As far as I understand it, all redirects satisfy /^\s*#\S+\s*:?\s*\[\[([^\]]+)\]\]\s*$/s.
Obviously I'm aware that this can catch non-redirects, but it would have to be a really weird article, and in fact one that probably should be a redirect anyway.
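Put together with the empty response for missing pages mentioned earlier in the thread, a client-side status check could look roughly like the sketch below. The function name page_status is invented for illustration, the pattern is the one above written with the literal square brackets backslash-escaped, and the function returns the redirect target title rather than a full URL.

```python
import re

# The redirect pattern from this thread, with literal brackets escaped.
# re.S mirrors the /s flag of the Perl-style original.
REDIRECT_RE = re.compile(r"^\s*#\S+\s*:?\s*\[\[([^\]]+)\]\]\s*$", re.S)


def page_status(raw_text):
    """Classify the text returned by action=raw (hypothetical helper).

    Returns "0" for an empty response (page does not exist), the
    redirect target title if the text looks like a redirect, and
    "1" for any other page.
    """
    if not raw_text.strip():
        return "0"
    match = REDIRECT_RE.match(raw_text)
    if match:
        return match.group(1)
    return "1"
```

Because the pattern accepts any non-space keyword after the "#", it also covers translated keywords such as "#ail-cyfeirio", at the cost of the occasional false positive already mentioned.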
Timwi
Timwi wrote:
Gutza wrote:
Speaking of which (don't want to start a new thread for this), is there any way to retrieve only the status of a page? Meaning, "0" if it doesn't exist, "1" if it does, or a URL if it's a redirect -- I guess that would be the logical way to do it. I think it would be nice to have for automated quick searches.
Well, if you try this with a non-existent page:
http://en.wikipedia.org/w/wiki.phtml?title=Nonexistantpage&action=raw&am...
you get "<html><body></body></html>". Not the most professional of ways of saying "this page doesn't exist", but since it's unlikely to be the body of a real article, I suppose you can use this.
Redirects are ridiculously easy to parse with a regular expression.
Yes, but you see, if you want for instance to highlight words in a text which can be linked to articles in Wikipedia, then you would need to retrieve all pages for no real reason. Don't get me wrong, I'm not really saying that I *need* this functionality right now, nor am I pushing towards it in any way, I'm just saying that once it's in place, it might help others to write applications around Wikipedia, thus adding value to the project, at least due to the extra exposure, if nothing else.
--Gutza
Hi there,
-----Original Message----- From: wikitech-l-bounces@wikimedia.org [mailto:wikitech-l-bounces@wikimedia.org] On Behalf Of Timwi Sent: Friday, 23 July 2004 15:25 To: wikitech-l@wikimedia.org Subject: [Wikitech-l] Re: Plain text of page?
Karlheinz Toni wrote:
what possibilities do I have to get the plain text of a page?
http://en.wikipedia.org/w/wiki.phtml?title=Main_Page&action=raw if you want the "correct" but uncommon and usually unsupported content-type of text/x-wiki; or
http://en.wikipedia.org/w/wiki.phtml?title=Main_Page&action=raw&ctyp... /css if you want the "hacky" but more useful text/css which makes it display as plain text in browsers.
Thank you for your answer. Is there any way to get the formatted text of a page (without the navigation and so on, just the information on the page)?
Sincerely, Charly
Karlheinz Toni wrote:
Thank you for your answer. Is there any way to get the formatted text of a page (without the navigation and so on, just the information on the page)?
You mean the HTML?
Well, you could download the page (http://en.wikipedia.org/wiki/Article) and harvest everything between "<!-- start content -->" and "<!-- end content -->", but obviously this isn't very professional and works only because Monobook includes those comments.
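As a rough sketch of that harvesting step (the function name extract_content is made up here, and it only works while the skin emits exactly these two comments):

```python
START_MARKER = "<!-- start content -->"
END_MARKER = "<!-- end content -->"


def extract_content(html):
    """Return the HTML between the two content comments, or None.

    Assumes the markers appear literally, as Monobook writes them;
    any other skin or a changed template will yield None.
    """
    start = html.find(START_MARKER)
    if start == -1:
        return None
    start += len(START_MARKER)
    end = html.find(END_MARKER, start)
    if end == -1:
        return None
    return html[start:end]
```

Returning None instead of raising makes it easy to notice when the comments are missing and fall back to something else.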
Timwi
wikitech-l@lists.wikimedia.org