In my discussion with Ilse (based on which I recently sent the request to reduce the put_throttle), we also got to the subject of XML feeds. I mentioned that Yahoo was already getting one, and my contact at Ilse said he would be interested in one as well. Hence my questions:
- Would sending out XML feeds to other parties with a good reason for interest be a good idea, or would it be detrimental? (For example, how does the server load of sending out an XML feed compare to that of being spidered by a search engine?)
- If the answer is positive, whom can I or my contact approach to discuss the possibility?
Andre Engels
Andre Engels wrote:
In my discussion with Ilse (based on which I recently sent the request to reduce the put_throttle), we also got to the subject of XML feeds. [...]
Andre Engels
I would imagine that an XML feed would allow a significant reduction in load, or a significant increase in freshness for the same level of load, if it was used widely by spider operators.
Each spider that simply spiders the site repetitively will have to hit every page with at least a last-modification-time request to get up to date, and then download all the pages that have changed. The more up-to-date a spider operator wants to be, the more load they need to generate.
For example, for 500,000 articles, keeping up to date to within a day makes for 500,000 hits per day, or nearly 6 hits per second 24/7, just to check the timestamps. Then the updated articles (about 1000 to 3000) would still need to be downloaded each day.
On the other hand, if they switch to using an XML feed, they can be up to date to within an hour or so, and only download the pages that have changed: perhaps 1000 to 3000 hits per day, or one hit every 30 to 90 seconds. The overhead of polling the XML feed every hour or so would be negligible.
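A quick back-of-the-envelope check of those figures (a Python sketch using only the estimates above; nothing here is measured data):

```python
ARTICLES = 500_000        # total article count
CHANGED_PER_DAY = 3_000   # upper estimate of articles edited per day
SECONDS_PER_DAY = 24 * 60 * 60

# Strategy 1: re-spider everything to stay fresh within a day:
# one conditional (last-modified) request per article per day,
# plus a download for each changed article.
spider_hits = ARTICLES + CHANGED_PER_DAY

# Strategy 2: poll an XML feed hourly, then fetch only the pages
# the feed reports as changed.
feed_hits = 24 + CHANGED_PER_DAY

print(f"spidering: {spider_hits / SECONDS_PER_DAY:.1f} hits/second")    # ~5.8
print(f"feed:      {feed_hits * 60 / SECONDS_PER_DAY:.1f} hits/minute") # ~2.1
```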
We should consider adding some filtering to the XML feed, so that users can select their degree of granularity: for example, being notified of every single change, or only of the last change in an hour, a day, or some other time period. Programmable hysteresis-based filtering would also be interesting, to suppress notification until the end of a "burst" of editing on a page: for example, "1 hour after the first edit since the last notification, or 10 minutes after the most recent edit, whichever comes first". With the right tuning, this could help ensure that articles are "stable" by the time notification is sent via the XML feed.
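As a rough illustration, that rule might look something like this (a Python sketch; the class, names, and thresholds are illustrative, not existing feed code):

```python
from dataclasses import dataclass

MAX_DELAY = 60 * 60   # notify at most 1 hour after the first unnotified edit
QUIET_GAP = 10 * 60   # ...or 10 minutes after the most recent edit

@dataclass
class PageState:
    first_edit: float | None = None  # first edit since the last notification
    last_edit: float | None = None   # most recent edit

    def record_edit(self, now: float) -> None:
        if self.first_edit is None:
            self.first_edit = now
        self.last_edit = now

    def notification_due(self, now: float) -> bool:
        """True once the earlier of the two deadlines has passed."""
        if self.first_edit is None:
            return False
        deadline = min(self.first_edit + MAX_DELAY,
                       self.last_edit + QUIET_GAP)
        return now >= deadline

    def mark_notified(self) -> None:
        self.first_edit = self.last_edit = None
```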
-- Neil
Neil Harris wrote:
I would imagine that an XML feed would allow a significant reduction in load, or a significant increase in freshness for the same level of load, if it was used widely by spider operators.
[...]
-- Neil
Is the XML feed compressed (gzipped)? That could save bandwidth too.
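For what it's worth, a consumer could request a gzip-compressed feed over plain HTTP like this (a Python sketch; the URL is a placeholder, and it assumes the server honours Accept-Encoding):

```python
import gzip
import urllib.request

req = urllib.request.Request(
    "https://example.org/feed.xml",          # placeholder feed URL
    headers={"Accept-Encoding": "gzip"},
)
with urllib.request.urlopen(req) as resp:
    body = resp.read()
    # Decompress only if the server actually sent gzip.
    if resp.headers.get("Content-Encoding") == "gzip":
        body = gzip.decompress(body)
print(body[:200])
```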
On Mar 17, 2004, at 11:58, Andre Engels wrote:
In my discussion with Ilse (based on which I recently sent the request to reduce the put_throttle), we also got to the subject of XML feeds. I mentioned that Yahoo was already getting one, and my contact at Ilse said he would be interested in one as well. Hence my questions:
As far as I know, the Yahoo XML feed is mythical; if it exists, I don't know about it.
The next major version of MediaWiki will have more pervasive ability to generate feeds in RSS and probably Atom formats (currently Special:Newpages in the CVS version can produce a simple RSS feed, but the code's there to do more).
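A feed of that general shape is easy to produce; here is a minimal sketch of an RSS 2.0 document built from (title, url, timestamp) tuples. This is illustrative only, not the actual MediaWiki feed code:

```python
from datetime import datetime, timezone
from email.utils import format_datetime   # RFC 2822 dates, as RSS expects
from xml.sax.saxutils import escape

def rss_feed(channel_title, items):
    """Render a minimal RSS 2.0 feed from (title, url, when) tuples."""
    entries = "".join(
        f"  <item><title>{escape(title)}</title><link>{escape(url)}</link>"
        f"<pubDate>{format_datetime(when)}</pubDate></item>\n"
        for title, url, when in items
    )
    return (
        '<?xml version="1.0"?>\n<rss version="2.0"><channel>\n'
        f"  <title>{escape(channel_title)}</title>\n{entries}"
        "</channel></rss>\n"
    )

print(rss_feed("Special:Newpages",
               [("Example", "https://example.org/wiki/Example",
                 datetime.now(timezone.utc))]))
```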
- Would sending out XML feeds to other parties with a good reason for interest be a good idea, or would it be detrimental? (For example, how does the server load of sending out an XML feed compare to that of being spidered by a search engine?)
Search engines will spider the site anyway, as they spider every other site.
A real RecentChanges feed could be more useful to someone mirroring or republishing the content, letting them grab updates live rather than checking on demand or downloading a huge backup dump every week.
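A mirror operator's update loop might then look roughly like this (a Python sketch; the feed URL, its RSS-like schema, and the raw-page URL are all assumptions for illustration):

```python
import time
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "https://example.org/w/recentchanges.xml"                 # placeholder
PAGE_URL = "https://example.org/w/index.php?title={title}&action=raw"  # placeholder

def store(title: str, text: bytes) -> None:
    # Stand-in for the mirror's own storage layer.
    print(f"updated {title}: {len(text)} bytes")

seen: set[tuple[str, str]] = set()
while True:
    with urllib.request.urlopen(FEED_URL) as resp:
        tree = ET.parse(resp)
    for item in tree.iter("item"):
        title = item.findtext("title") or ""
        stamp = item.findtext("pubDate") or ""
        if not title or (title, stamp) in seen:
            continue
        seen.add((title, stamp))
        # Re-fetch only the pages the feed says have changed.
        with urllib.request.urlopen(PAGE_URL.format(title=title)) as page:
            store(title, page.read())
    time.sleep(3600)  # poll once an hour, as suggested above
```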
-- brion vibber (brion @ pobox.com)
To me, making an XML feed available in a variety of formats is a sensible thing to do. One good candidate format would be RDF/RSS.
This is particularly true for parties who may have a serious interest. I mean, I'm not too keen on random individuals downloading a massive file every day just to update some internal search engine on their home machine or whatever, although if the load is small, it does no harm.
But I'm very keen on "getting the word out", so if anyone is either indexing our content for a search engine or re-using it in some fashion that gets it out to a larger audience, that's great.
--Jimbo
Jimmy Wales wrote:
To me, making available an XML feed in a variety of formats is a sensible thing to do. One good candidate format would be RDF/RSS.
An extended version of the XML export page would be nice as well. If we (one day, in a galaxy far, far away ;-) distribute a CD/DVD version, it would be nice to
- either check a page for changes compared to the CD version /if you're online/, so it will display the latest version of a page,
- or (slowly) upgrade your hard disk version to the online one (like auto-rsync).
So we'd have an XML page that returns which pages have changed since XX, or which of [list of titles] have changed since then, and of course the current XML export.
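The client side of such a page might look roughly like this (a Python sketch; the URL, parameter names, and response format are all hypothetical, since no such page exists yet):

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EXPORT_URL = "https://example.org/w/changed.php"   # hypothetical endpoint

def changed_since(timestamp: str, titles: list[str] | None = None) -> list[str]:
    """Ask the (hypothetical) server which pages changed after `timestamp`.

    With `titles` given (e.g. the articles on the CD/DVD edition), only
    those pages are checked; without it, all changed pages are listed.
    """
    params = {"since": timestamp}
    if titles:
        params["pages"] = "|".join(titles)
    url = EXPORT_URL + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        tree = ET.parse(resp)
    return [el.text or "" for el in tree.iter("title")]

# e.g. decide whether the offline copy of a page is stale before display:
# stale = changed_since("20040317000000", ["Amsterdam"])
```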
Magnus
On Wed, 17 Mar 2004 09:57:37 -0800, Jimmy Wales wrote: <snip>
I mean, I'm not too keen on random individuals downloading a massive file every day just to update some internal search engine on their home machine or whatever, although if the load is small, it does no harm.
<snip>
Maybe having a "patch dump" would save some bandwidth? Like getting only the articles that changed between one dump and the next? This would allow people to get just the data they don't already have.
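Producing such a dump could be as simple as selecting on a last-touched timestamp (a Python sketch; the `articles` table and its columns are invented for the example, not MediaWiki's real schema):

```python
import sqlite3
from xml.sax.saxutils import escape

def patch_dump(db_path: str, since: str, out_path: str) -> None:
    """Write only the articles touched since the previous dump's cutoff."""
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT title, text FROM articles WHERE touched > ?", (since,)
    )
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(f'<patchdump since="{since}">\n')
        for title, text in rows:
            out.write(f"  <page><title>{escape(title)}</title>"
                      f"<text>{escape(text)}</text></page>\n")
        out.write("</patchdump>\n")
    con.close()
```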
wikitech-l@lists.wikimedia.org