Brion Vibber <brion <at> pobox.com> writes:
> Hi Eric. Our resources are provided by donations from our users and
> supporters to keep Wikipedia and our other projects available to human
> readers and contributors.
Just so; I have made some myself. I send a set percentage of any net
AdSense money I make to Wikipedia; and every article page of my site has
an explicit link to Wikipedia's fund-raising page.
> Please remember that you're using someone else's servers, paid for with
> someone else's money, to run your web site for you. As an uninvited
> guest, you need to be mindful about how you use your host's resources.
Again, just so.
> Limit the number of connections you make and cache resources locally
> once they've been retrieved. Understand and use HTTP caching headers
> when available (such as the Last-Modified and If-Modified-Since
> headers). In particular remember that a rush of connections to your
> site, such as a flash crowd (slashdot!) or a search engine spidering
> loops of links, can cause your site to pass a *huge* number of requests
> on to ours.
Dear me, if I ever had a flash crowd, I think I'd faint dead away. But
search engines are, I think, the very nut of the matter. Philosophically,
my visitors are, in effect, simply visitors to Wikipedia, albeit by proxy,
and the load they represent is effectively just that of visitors to
Wikipedia itself; the extra load from the searchbots is the real problem.
(Mind, I'd be pleased as punch if the SEs would actually visit enough of
my pages for it to matter.)
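
For what it is worth, here is a minimal sketch of the sort of local caching
I have in mind: a plain file cache in front of PHP's curl extension, sending
If-Modified-Since so that an unchanged page costs only a 304 response. The
cache directory, the one-hour freshness window, and the user-agent string
are all placeholders of my own devising, not anything Wikipedia prescribes:

<?php
// Sketch: fetch a URL through a local file cache. Recently cached
// pages are served with no request at all; older ones are refetched
// with If-Modified-Since so an unchanged page costs only a 304.
function fetch_cached($url, $cacheDir = '/tmp/wikicache') {
    if (!is_dir($cacheDir)) {
        mkdir($cacheDir, 0755, true);
    }
    $cacheFile = $cacheDir . '/' . md5($url);

    // Fresh enough? Serve from disk without touching the remote
    // server at all. One hour is an arbitrary choice of mine.
    if (file_exists($cacheFile) && time() - filemtime($cacheFile) < 3600) {
        return file_get_contents($cacheFile);
    }

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT,
                'MySiteProxy/0.1 (contact: me@example.org)');
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);

    if (file_exists($cacheFile)) {
        // Conditional GET: only send the body if it has changed.
        curl_setopt($ch, CURLOPT_TIMECONDITION, CURL_TIMECOND_IFMODSINCE);
        curl_setopt($ch, CURLOPT_TIMEVALUE, filemtime($cacheFile));
    }

    $body = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($code === 304 || $body === false || $body === '') {
        // Not modified (or network trouble): fall back to the cache.
        return file_exists($cacheFile) ? file_get_contents($cacheFile) : false;
    }
    file_put_contents($cacheFile, $body);
    return $body;
}
?>

The point of the freshness window is that a flash crowd, or a searchbot
looping over my pages, would hit my cache rather than your servers.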
> We make available public database dumps of the page databases for all
> our wikis for the express purpose of making it easy for people to reuse
> massive amounts of our content on their own sites as well as to perform
> private research, republishing in other formats, etc. Updates are
> somewhat intermittent while we're moving database servers around, but
> occur roughly every couple of weeks. The last dump was made on May 16.
> They're available at
> http://dumps.wikimedia.org/
Yes, so I have discovered. But I wonder about the trade-off between my
taking files "on demand" (especially if I can manage to use the XML feeds
effectively) versus my taking many gigabytes of data every two to four
weeks--which would average out to something like a gigabyte a day. Which--in
any sense of the word--"costs" more?
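
To put rough numbers on that comparison--the dump size and the per-article
transfer below are guesses of mine, though the 20,000 visitors a day is
what my logs currently show:

<?php
// Back-of-envelope: a periodic full dump versus on-demand fetching.
// 15 GB per dump and 50 KB per article served are assumed figures.
$dump_per_day     = 15 / 21;              // ~15 GB every ~3 weeks, GB/day
$ondemand_per_day = 20000 * 50000 / 1e9;  // visitors/day x bytes, GB/day
printf("dump: %.2f GB/day; on-demand: %.2f GB/day\n",
       $dump_per_day, $ondemand_per_day);
// Prints: dump: 0.71 GB/day; on-demand: 1.00 GB/day
?>

By that crude reckoning the two are of the same order, which is exactly
why I wonder which side of the ledger you would rather see me on.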
> I would strongly recommend that you make use of these database dumps if
> possible, and avoid hitting our servers at all. If you _really_ need the
> most up-to-date pages you can use the Special:Export interface to grab
> source text, and render it within your own MediaWiki installation.
Yes, that is what I would do--indeed, would have done long ago if I had
available a simple, reliable markup-to-HTML translator. But if need be,
I am willing to work at handling the translation (based in good part on
my very recent discovery of some extant translation work).
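
To make concrete what taking articles via Special:Export would look like
from my end, a sketch reusing the fetch_cached() routine above; the article
title here is only an example:

<?php
// Sketch: pull the current wikitext of one article via Special:Export,
// going through the local cache so repeat requests never leave my box.
function export_article($title) {
    $url = 'http://en.wikipedia.org/wiki/Special:Export/'
         . rawurlencode($title);
    $xml = fetch_cached($url);

    $doc = new DOMDocument();
    if ($xml === false || !@$doc->loadXML($xml)) {
        return null;
    }
    // The raw wikitext sits in the <text> element; the '*' wildcard
    // matches whichever export-format namespace the XML declares.
    $nodes = $doc->getElementsByTagNameNS('*', 'text');
    return $nodes->length ? $nodes->item(0)->textContent : null;
}

// For example: $wikitext = export_article('Ale');
?>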
> > I realize that there is no easy way to convert the marked-up text to
> > HTML, but I am prepared to cobble up some php to essay the task--but,
> > before going to that nontrivial effort,
> Our code's all open source and you should feel free to use it for this
> purpose:
> http://www.mediawiki.org/
True, save that the markup language seems not to be clearly specified
anywhere, though perhaps I simply have not yet found what I have been
looking for. But Magnus Manske's wiki2xml.php script looks like an
excellent running start on the problem, at least for a site on my scale.
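
And by way of showing the scale of "cobbling" I have in mind, here is a toy
translator for three of the commonest constructs. Real wikitext--templates,
tables, nesting--is of course far hairier; this is merely the shape of the
task, not a working translator:

<?php
// Toy sketch only: translate bold, italics, and internal links from
// wikitext to HTML. Everything else is left untouched.
function wikitext_to_html($text) {
    // Escape raw HTML, but leave single quotes alone, since the
    // wiki markup itself is built out of apostrophes.
    $html = htmlspecialchars($text, ENT_NOQUOTES);
    // '''bold''' must be handled before ''italic''.
    $html = preg_replace("/'''(.+?)'''/s", '<b>$1</b>', $html);
    $html = preg_replace("/''(.+?)''/s", '<i>$1</i>', $html);
    // [[Target|label]] links first, then plain [[Target]] links.
    $html = preg_replace_callback('/\[\[([^|\]]+)\|([^\]]+)\]\]/',
        function ($m) {
            return '<a href="/wiki/' . rawurlencode($m[1]) . '">'
                 . $m[2] . '</a>';
        }, $html);
    $html = preg_replace_callback('/\[\[([^\]]+)\]\]/',
        function ($m) {
            return '<a href="/wiki/' . rawurlencode($m[1]) . '">'
                 . $m[1] . '</a>';
        }, $html);
    return $html;
}
?>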
> > I would like to be sure that I will not again be blocked even if I am
> > accessing individual articles via Special:Export XML. (At present, I
> > seem to be getting perhaps 20,000 visitors a day.)
> We cannot guarantee that you will never be blocked; if your site becomes
> problematic it may very well be, but if the site is well-behaved it
> probably will not be.
"Well-behaved" is, I fear, in the eye of the beholder. I dearly want to be
well-behaved, and a good citizen in all ways, but I am still not sure I know
what the considerations are. Was I 403'ed because the actual volume of
transactions was too high, or simply because I was using "remote loading"?
I also am unclear as to whether there is a difference between bandwidth used
downloading a dump and bandwidth used taking files (HTML or the XML available)
"on demand"--different servers, or all just "bandwidth"?).
> Most of all, remember that if you use a complete database dump you can
> avoid any reliance on our site being up, down, unavailable, or blocking
> you at any given time. This will make your site more resilient against
> downtime, network troubles, and slow servers as well as the possibility
> that you might get blocked.
Yet again, just so, but . . . . I am already paying what is, for me as an
individual, anyway, a dearly high monthly fee for a "high-volume" account,
yet I get but 5 GB of storage, scarcely a fraction of even just the
English-language database.
I wonder that there is not in place some system for dealing with situations
like mine, which I can hardly think unique. I would be more than happy to
pay some plausible fee for accesses, either to XML or to HTML, rather than
try to re-invent the wheel. Installing the entire MediaWiki package is, to
put it mildly, daunting to a solo nonexpert, yet almost all of it would
really be useless to me: all I want is to be able to get at the current
content of the actual articles.
I realize that Wikipedia is "open source", which is usually held to be equal
to "free", but is there no place for mixing "free to public visitors"
and a
modest fee for serving, shall we say, on the wholesale rather than the
retail side?
Meanwhile, my question for the moment is: if the block is taken off, and I
take the XML Export files on an on-demand basis, will I be OK? (I am even
now at work on a converter, using the aforesaid wiki2xml.php script; there
goes my weekend, and I have already gotten an earful about the weeds that
need mowing.)