On 04/06/05, QuotationsBook.com Webmaster/Support <quotationsbook@gmail.com> wrote:
Thanks very much for your note. I had a good read of these links, but being new to Wikipedia's data access methods, I still couldn't work out the best course of action. Essentially, I need to pass a search query to Wikipedia, e.g. "Wilde, Oscar", and for the first article returned, display the article text on quotationsbook.com for that author.
Well, if you want to use a large amount of such content, but don't mind it lagging slightly behind the copy on Wikipedia itself, the "cleanest" solution is to download a copy of the database and extract the information yourself. However, that might seem a technically complex solution, and the content you want may actually only be a fraction of the Wikipedia database.
In that case, you have a further two options. Either ask for the "wikitext" source of an article, using Special:Export or en.wikipedia.org/w/index.php?title=<some article>&action=raw, and run a local copy of MediaWiki (or one of the programs at http://meta.wikimedia.org/wiki/Alternative_parsers) to turn that into HTML for you (less load on the Wikipedia servers, but more complex for you); or just request the rendered article and separate the content from the navigation stuff (easy enough to automate if you look at the source, though you may want to play around with some styling to make things look right).
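As a rough illustration of the first option, here is a minimal Python sketch that builds the action=raw URL described above and fetches the wikitext. The function names are mine; only the URL pattern comes from the message itself, and there is no error handling:

```python
import urllib.parse
import urllib.request

def raw_wikitext_url(title):
    """Build the index.php?title=<article>&action=raw URL for an article."""
    return ("http://en.wikipedia.org/w/index.php?"
            + urllib.parse.urlencode({"title": title, "action": "raw"}))

def fetch_wikitext(title):
    """Fetch the raw wikitext source of an article (network call)."""
    with urllib.request.urlopen(raw_wikitext_url(title)) as resp:
        return resp.read().decode("utf-8")
```

The wikitext you get back still needs a parser (a local MediaWiki install or one of the alternative parsers linked above) before it is presentable HTML.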
BTW, note that person articles in Wikipedia are generally titled "Firstname Lastname", not "Lastname, Firstname" (e.g. http://en.wikipedia.org/wiki/Oscar_Wilde). You could probably have your software guess the correct name in most cases, but http://en.wikipedia.org/wiki/Special:Search?search=<some terms> may also be useful: this will return the article with an exactly matching name if one exists, and search results otherwise.
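The name-guessing idea might look something like the following sketch. The conversion rule (swap around the first comma) and the fallback to Special:Search are assumptions on my part; plenty of names won't fit this pattern:

```python
import urllib.parse

def guess_article_title(name):
    """Turn 'Wilde, Oscar' into 'Oscar Wilde'; pass other names through."""
    if "," in name:
        last, first = (part.strip() for part in name.split(",", 1))
        return first + " " + last
    return name

def search_url(terms):
    """Fallback: Special:Search gives the exact match if it exists,
    otherwise a results page."""
    return ("http://en.wikipedia.org/wiki/Special:Search?"
            + urllib.parse.urlencode({"search": terms}))
```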
I don't know whether I should do this dynamically (request by request) or store a single cached copy of each article on my site, so that the request is only made once.
Well, that's almost entirely up to you - once you've downloaded data, you can do what you like with it; nothing Wikipedia does could allow or prevent a particular caching scheme at your end. However, out of consideration for the frequently overloaded servers maintained by the non-profit Wikimedia Foundation, some form of caching would be considered far preferable to making a fresh request every time. A standard HTTP "If-Modified-Since" header, like most browsers and proxies use, would let you stay up to date; but how you actually store the cached copies is entirely up to you.
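One possible shape for the If-Modified-Since approach, sketched in Python with a simple in-memory cache (a real site would persist this to disk or a database). The structure is my own assumption; the server answers 304 Not Modified when the stored copy is still current, and urllib surfaces that as an HTTPError:

```python
import urllib.request
from urllib.error import HTTPError

_cache = {}  # url -> (Last-Modified header value, body)

def fetch_cached(url, opener=urllib.request.urlopen):
    """Fetch url, revalidating any cached copy with If-Modified-Since."""
    cached = _cache.get(url)
    req = urllib.request.Request(url)
    if cached:
        req.add_header("If-Modified-Since", cached[0])
    try:
        with opener(req) as resp:
            body = resp.read().decode("utf-8")
            # Remember the server's timestamp for the next revalidation.
            _cache[url] = (resp.headers.get("Last-Modified", ""), body)
            return body
    except HTTPError as e:
        if e.code == 304 and cached:
            return cached[1]  # not modified: serve the stored copy
        raise
```

The `opener` parameter exists only so the HTTP layer can be swapped out for testing; in production you would call `fetch_cached(url)` and let it hit the live server.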
Also note that relying on Wikipedia responding before returning any of your own content could slow down your site *a lot*, as the servers often have heavy load or even go down for hours at a time. Not necessarily a problem, but worth bearing in mind when designing your caching solution.
wikitech-l@lists.wikimedia.org