Brion Vibber <brion <at> pobox.com> writes:
> Hi Eric. Our resources are provided by donations from our users and
> supporters to keep Wikipedia and our other projects available to human
> readers and contributors.
Just so; I have made some myself. I send a set percentage of any net
AdSense money I make to Wikipedia; and every article page of my site has
an explicit link to Wikipedia's fund-raising page.
> Please remember that you're using someone else's servers, paid for with
> someone else's money, to run your web site for you. As an uninvited
> guest, you need to be mindful about how you use your host's resources.
Again, just so.
> Limit the number of connections you make and cache resources locally
> once they've been retrieved. Understand and use HTTP caching headers
> when available (such as the Last-Modified and If-Modified-Since
> headers). In particular remember that a rush of connections to your
> site, such as a flash crowd (slashdot!) or a search engine spidering
> loops of links, can cause your site to pass a *huge* number of requests
> on to ours.
Dear me, if I ever had a flash crowd, I think I'd faint dead away. But
search engines are, I think, the very nut of the matter. Philosophically,
my visitors are, in effect, simply visitors to Wikipedia, albeit by proxy,
and the load they represent is effectively just that of visitors to
Wikipedia itself; the extra load from the searchbots is the real problem.
(Mind, I'd be pleased as punch if the SEs would actually visit enough of
my pages for it to matter.)
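
For what it is worth, here is a minimal sketch of the sort of local caching
I have in mind: a plain file cache in front of PHP's curl extension, sending
If-Modified-Since so that an unchanged page costs only a 304 response. The
cache directory, the one-hour freshness window, and the user-agent string
are all placeholders of my own devising, not anything Wikipedia prescribes:

<?php
// Sketch: fetch a URL through a local file cache. Recently cached
// pages are served with no request at all; older ones are refetched
// with If-Modified-Since so an unchanged page costs only a 304.
function fetch_cached($url, $cacheDir = '/tmp/wikicache') {
    if (!is_dir($cacheDir)) {
        mkdir($cacheDir, 0755, true);
    }
    $cacheFile = $cacheDir . '/' . md5($url);

    // Fresh enough? Serve from disk without touching the remote
    // server at all. One hour is an arbitrary choice of mine.
    if (file_exists($cacheFile) && time() - filemtime($cacheFile) < 3600) {
        return file_get_contents($cacheFile);
    }

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT,
                'MySiteProxy/0.1 (contact: me@example.org)');
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);

    if (file_exists($cacheFile)) {
        // Conditional GET: only send the body if it has changed.
        curl_setopt($ch, CURLOPT_TIMECONDITION, CURL_TIMECOND_IFMODSINCE);
        curl_setopt($ch, CURLOPT_TIMEVALUE, filemtime($cacheFile));
    }

    $body = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($code === 304 || $body === false || $body === '') {
        // Not modified (or network trouble): fall back to the cache.
        return file_exists($cacheFile) ? file_get_contents($cacheFile) : false;
    }
    file_put_contents($cacheFile, $body);
    return $body;
}
?>

The point of the freshness window is that a flash crowd, or a searchbot
looping over my pages, would hit my cache rather than your servers.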
> We make available public database dumps of the page databases for all
> our wikis for the express purpose of making it easy for people to reuse
> massive amounts of our content on their own sites as well as to perform
> private research, republishing in other formats, etc. Updates are
> somewhat intermittent while we're moving database servers around, but
> occur roughly every couple of weeks. The last dump was made on May 16.
> They're available at
> http://dumps.wikimedia.org/
Yes, so I have discovered. But I wonder about the trade-off between my
taking files "on demand" (especially if I can manage to use the XML feeds
effectively) versus my taking many gigabytes of data every two to four
weeks--which would average out to something like a gigabyte a day. Which--in
any sense of the word--"costs" more?
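
To put rough numbers on that comparison--the dump size and the per-article
transfer below are guesses of mine, though the 20,000 visitors a day is
what my logs currently show:

<?php
// Back-of-envelope: a periodic full dump versus on-demand fetching.
// 15 GB per dump and 50 KB per article served are assumed figures.
$dump_per_day     = 15 / 21;              // ~15 GB every ~3 weeks, GB/day
$ondemand_per_day = 20000 * 50000 / 1e9;  // visitors/day x bytes, GB/day
printf("dump: %.2f GB/day; on-demand: %.2f GB/day\n",
       $dump_per_day, $ondemand_per_day);
// Prints: dump: 0.71 GB/day; on-demand: 1.00 GB/day
?>

By that crude reckoning the two are of the same order, which is exactly
why I wonder which side of the ledger you would rather see me on.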
> I would strongly recommend that you make use of these database dumps if
> possible, and avoid hitting our servers at all. If you _really_ need the
> most up-to-date pages you can use the Special:Export interface to grab
> source text, and render it within your own MediaWiki installation.
Yes, that is what I would do--indeed, would have done long ago if I had
available a simple, reliable markup-to-HTML translator. But if need be,
I am willing to work at handling the translation (based in good part on
my very recent discovery of some extant translation work).
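
To make concrete what taking articles via Special:Export would look like
from my end, a sketch reusing the fetch_cached() routine above; the article
title here is only an example:

<?php
// Sketch: pull the current wikitext of one article via Special:Export,
// going through the local cache so repeat requests never leave my box.
function export_article($title) {
    $url = 'http://en.wikipedia.org/wiki/Special:Export/'
         . rawurlencode($title);
    $xml = fetch_cached($url);

    $doc = new DOMDocument();
    if ($xml === false || !@$doc->loadXML($xml)) {
        return null;
    }
    // The raw wikitext sits in the <text> element; the '*' wildcard
    // matches whichever export-format namespace the XML declares.
    $nodes = $doc->getElementsByTagNameNS('*', 'text');
    return $nodes->length ? $nodes->item(0)->textContent : null;
}

// For example: $wikitext = export_article('Ale');
?>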
> > I realize that there is no easy way to convert the marked-up text to
> > HTML, but I am prepared to cobble up some php to essay the task--but,
> > before going to that nontrivial effort,
> Our code's all open source and you should feel free to use it for this
> purpose:
> http://www.mediawiki.org/
True, save that the markup language seems not to be clearly specified
anywhere, though perhaps I simply have not yet found what I have been
looking for. But Magnus Manske's wiki2xml.php script looks like an
excellent running start on the problem, at least for a site on my scale.
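
And by way of showing the scale of "cobbling" I have in mind, here is a toy
translator for three of the commonest constructs. Real wikitext--templates,
tables, nesting--is of course far hairier; this is merely the shape of the
task, not a working translator:

<?php
// Toy sketch only: translate bold, italics, and internal links from
// wikitext to HTML. Everything else is left untouched.
function wikitext_to_html($text) {
    // Escape raw HTML, but leave single quotes alone, since the
    // wiki markup itself is built out of apostrophes.
    $html = htmlspecialchars($text, ENT_NOQUOTES);
    // '''bold''' must be handled before ''italic''.
    $html = preg_replace("/'''(.+?)'''/s", '<b>$1</b>', $html);
    $html = preg_replace("/''(.+?)''/s", '<i>$1</i>', $html);
    // [[Target|label]] links first, then plain [[Target]] links.
    $html = preg_replace_callback('/\[\[([^|\]]+)\|([^\]]+)\]\]/',
        function ($m) {
            return '<a href="/wiki/' . rawurlencode($m[1]) . '">'
                 . $m[2] . '</a>';
        }, $html);
    $html = preg_replace_callback('/\[\[([^\]]+)\]\]/',
        function ($m) {
            return '<a href="/wiki/' . rawurlencode($m[1]) . '">'
                 . $m[1] . '</a>';
        }, $html);
    return $html;
}
?>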
> > I would like to be sure that I will not again be blocked even if I am
> > accessing individual articles via Special:Export XML. (At present, I
> > seem to be getting perhaps 20,000 visitors a day.)
> We cannot guarantee that you will never be blocked; if your site becomes
> problematic it may very well be, but if the site is well-behaved it
> probably will not be.
"Well-behaved" is, I fear, in the eye of the beholder. I dearly want to be
well-behaved, and a good citizen in all ways, but I am still not sure I know
what the considerations are. Was I 403'ed because the actual volume of
transactions was too high, or simply because I was using "remote loading"?
I also am unclear as to whether there is a difference between bandwidth used
downloading a dump and bandwidth used taking files (HTML or the XML available)
"on demand"--different servers, or all just "bandwidth"?).
> Most of all, remember that if you use a complete database dump you can
> avoid any reliance on our site being up, down, unavailable, or blocking
> you at any given time. This will make your site more resilient against
> downtime, network troubles, and slow servers as well as the possibility
> that you might get blocked.
Yet again, just so, but . . . . I am already paying what is, for me as an
individual, anyway, a dearly high monthly fee for a "high-volume" account,
yet I get but 5 GB of storage, scarcely a fraction of even just the
English-language database.
I wonder that there is not in place some system for dealing with situations
like mine, which I can hardly think unique. I would be more than happy to
pay some plausible fee for accesses, either to XML or to HTML, rather than
try to re-invent the wheel. Installing the entire MediaWiki package is, to
put it mildly, daunting to a solo nonexpert, yet almost all of it would
really be useless to me: all I want is to be able to get at the current
content of the actual articles.
I realize that Wikipedia is "open source", which is usually held to be equal
to "free", but is there no place for mixing "free to public visitors"
and a
modest fee for serving, shall we say, on the wholesale rather than the
retail side?
Meanwhile, my question for the moment is: if the block is taken off, and I
take the XML Export files on an on-demand basis, will I be OK? (I am even
now at work on a converter, using the aforesaid wiki2xml.php script; there
goes my weekend, and I have already gotten an earful about the weeds that
need mowing.)