Warning: Long and depressing text follows. Don't read it at home, save it for work instead. Better spend a nice evening with your girlfriend. (Then again, this list is probably like slashdot, so forget about the imaginary girlfriend and continue reading ;-)
I thought I had it all figured out.
I created a demo version for data entry in a wiki-like fashion. It uses a "one-table-fits-all" SQL schema, which some of you had worries about. No problem. If someone else writes a better data entry mechanism, I'm all for it. As far as it concerns me, the WikiData site should be like a black box to the outside, serving data to wikipedias and everyone else who wants it. What's going on inside is only for those who enter the data.
Today, I finished creating a rough draft for the query (the wikipedia) side of the bargain. Instead of creating Yet Another Wikimarkup [{(like this)}], I figured that we should separate the query and the display part, and hide the query part within the template system. Goes like this:
{{speciesdata:Foobus Barus}}
in the article; [[Template:Speciesdata]] looks like this:
<data> <query database="wikispecies" result="r1">Some sort of XQuery or SQL query for wikispecies for {{{1}}}</query> Some species data table using <r1>latin_name</r1>, <r1>name_en</r1>, <r1>family</r1> etc. </data>
For creating lists (like "all species within the family 'Foobus'"), a <foreach> element could be used. The <data> thingy would be a plugin ("plugins GOOD!"), but one that returns wikitext to be parsed further. It would handle the <query> and <foreach> tags etc.
So, we'd have *one* ugly m..........r of a <data><query> kind of template, which, once created, would not be edited much. All the powerful, functional ugliness that could scare newbies away would be hidden behind the template. Yes, I got it all figured out.
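Just to make the mechanics concrete, here's a toy sketch (Python rather than actual parser code; the tag handling and the run_query stub are made up, not an existing interface) of what expanding such a <data> block could look like:

import re

def run_query(database, query):
    # Stand-in for the real XQuery/SQL call against the WikiData site;
    # returns one row as a dict. Purely illustrative.
    return {"latin_name": "Foobus barus", "name_en": "Common foob", "family": "Foobidae"}

def expand_data_block(template_text, page_arg):
    # Substitute the template parameter, run each <query>, then replace
    # the <rN>column</rN> placeholders with the query results.
    text = template_text.replace("{{{1}}}", page_arg)
    results = {}
    for m in re.finditer(r'<query database="([^"]+)" result="([^"]+)">(.*?)</query>', text, re.S):
        db, rid, query = m.groups()
        results[rid] = run_query(db, query)
    text = re.sub(r'<query.*?</query>', '', text, flags=re.S)
    def fill(m):
        rid, field = m.groups()
        return str(results.get(rid, {}).get(field, ""))
    text = re.sub(r'<(r\d+)>(\w+)</\1>', fill, text)
    return text.replace("<data>", "").replace("</data>", "").strip()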
Then it hit me.
As good "wiki-fiddlers" (thanks so much, Register!) we would like to see every change in WikiData on the wikipedia pages real soon. Like, now. So the information that something changes, and what changed, has to pass from the data site to the display site. There are two ways to do that: push or pull.
PUSH means the data site will notify the display site that something has changed, and the display needs to be updated. For that, the data site has to know which pages of the display site are affected by which change. Then, it has to notify the display site of this. Bad things:
- Needs basically a cache of *all* queries *ever* asked of the data site, as well as their results
- Has to recalculate *all* of these after *every* change to find which queries produce different results
- Won't work if the display site is offline
- Won't work well with non-wikipedias
That can't be it.
PULL means the display site asks the data site if anything has changed, which basically means rerunning a query. Which means, doing this for *every* pageview, even for anons. Which means, all caching variants, including squids, are going bye-bye. Additionally, for every page view, the display site has to wait for the data site to complete the query. Think wikipedia is slow today? Think again...
That can't be it, either.
Oh, sure, we can cache the queries with results on the display site, or only update the data once a day/week, but then we won't be wiki (=quick) anymore, no? Will this be the price to pay?
I think I'll have that autumn-depression now, please...
Magnus
Perhaps a way to address this dilemma is with a manual pull system. A page that incorporates WikiData would display a message indicating that the page uses data that was last refreshed *for this page* at such-and-such a date/time. (This information can be cached with the page, since it doesn't make a statement about the freshness of the underlying data.) The display would also give the user the ability to force a refresh if desired. (A DoS attack could be avoided by not allowing refresh before x amount of time has passed since the last refresh.)
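For what it's worth, the rate-limit part is only a few lines; a minimal sketch (Python, with a made-up interval and an in-memory map standing in for whatever the real site would use):

import time

MIN_REFRESH_INTERVAL = 15 * 60   # seconds; an arbitrary placeholder value
last_refresh = {}                # page title -> timestamp of last forced refresh

def try_refresh(page_title):
    # Allow a user-requested refresh only if enough time has passed since
    # the last one -- the simple DoS guard described above.
    now = time.time()
    last = last_refresh.get(page_title, 0)
    if now - last < MIN_REFRESH_INTERVAL:
        return False, last       # too soon: keep showing cached data and its date
    last_refresh[page_title] = now
    # ... re-run the page's WikiData queries here ...
    return True, now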
Wouldn't be quite as automatic as the pull system described in the original message, but it could avoid the severe performance penalty. Just a thought.
Alan
On Thu, 21 Oct 2004 21:07:45 +0200, Magnus Manske magnus.manske@web.de wrote: [...]
As good "wiki-fiddlers" (thanks so much, Register!) we would like to see every change in WikiData on the wikipedia pages real soon. Like, now. So the information that something changes, and what changed, has to pass from the data site to the display site. There are two ways to do that: push or pull.
PUSH means the data site will notify the display site that something has changed, and the display needs to be updated. For that, the data site has to know which pages of the display site are affected by which change. Then, it has to notify the display site of this. Bad things:
- Needs basically a cache of *all* queries *ever* asked of the data
site, as well as their results
- Has to recalculate *all* of these after *every* change to find which
queries produce different results
- Won't work if the display site is offline
- Won't work well with non-wikipedias
That can't be it.
PULL means the display site asks the data site if anything has changed, which basically means rerunning a query. Which means, doing this for *every* pageview, even for anons. Which means, all caching variants, including squids, are going bye-bye. Additionally, for every page view, the display site has to wait for the data site to complete the query. Think wikipedia is slow today? Think again...
That can't be it, either.
Oh, sure, we can cache the queries with results on the display site, or only update the data once a day/week, but then we won't be wiki (=quick) anymore, no? Will this be the price to pay?
Alan Wessman wrote:
Perhaps a way to address this dilemma is with a manual pull system. A page that incorporates WikiData would display a message indicating that the page uses data that was last refreshed *for this page* at such-and-such a date/time. (This information can be cached with the page, since it doesn't make a statement about the freshness of the underlying data.) The display would also give the user the ability to force a refresh if desired. (A DoS attack could be avoided by not allowing refresh before x amount of time has passed since the last refresh.)
Wouldn't be quite as automatic as the pull system described in the original message, but it could avoid the severe performance penalty. Just a thought.
That would be a way, and actually easy to code ;-)
From a usability standpoint, it would be a last resort, though.
Magnus
Good acumen, Magnus. A very incisive "rant".
Anyway, just musing and mulling: Could the PULL method be implemented w/ a checksum? Say, generate a fairly short checksum (we're talking versioning here, not security) for every article revision. Then, with each request hitting an Intarweb-facing (caching) webserver, have that cache box check whether there's a version stored in its cache for said article (if not, fetch the article from the actual DB, etc.). IF however there is a cached version, ask the DB server for its current checksum on its current version. If this matches the checksum the cache has for its version, just don't bother the DB any further and serve the page from the cache. If the checksums differ, then again fetch the article from the DB and serve that (and cache the new article and checksum for potential subsequent requests). This entire checksum thing will NOT be required for any cached non-current revisions, because they won't change. So, yes, for each request hitting the cache server, there'd be a short checksum PULL with the actual DB server, but other than that (and provided the article hasn't changed) it can just be served from the cache.
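In rough (Python) pseudocode -- and bear in mind the cache/db objects here are made up, not anything that exists in MediaWiki -- the cache box logic would be something like:

import hashlib

def checksum(text):
    # Short fingerprint -- versioning, not security.
    return hashlib.md5(text.encode("utf-8")).hexdigest()[:8]

def serve_article(title, cache, db):
    cached = cache.get(title)              # (text, checksum) or None
    if cached is None:
        text = db.fetch_current(title)     # full fetch only on a cache miss
        cache.put(title, (text, checksum(text)))
        return text
    text, cached_sum = cached
    if db.current_checksum(title) == cached_sum:
        return text                        # cheap: only a checksum crossed the wire
    text = db.fetch_current(title)         # article changed: refresh the cache
    cache.put(title, (text, checksum(text)))
    return text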
Does that make sense to people? Or am I reinventing the wheel or something? I'm just brainstorming, I'm not even a real programmer. (Translation: The above may be--or may not be--rubbish.)
-- ropers [[en:User:Ropers]] www.ropersonline.com
On 21 Oct 2004, at 21:07, Magnus Manske wrote:
Warning: Long and depressing text follows. Don't read it at home, save it for work instead. Better spend a nice evening with your girlfriend. (Then again, this list is probably like slashdot, so forget about the imaginary girlfriend and continue reading ;-)
I thought I had it all figured out.
I created a demo version for data entry in a wiki-like fashion. It uses a "one-table-fits-all" SQL schema, which some of you had worries about. No problem. If someone else writes a better data entry mechanism, I'm all for it. As far as it concerns me, the WikiData site should be like a black box to the outside, serving data to wikipedias and everyone else who wants it. What's going on inside is only for those who enter the data.
Today, I finished creating a rough draft for the query (the wikipedia) side of the bargain. Instead of creating Yet Another Wikimarkup [{(like this)}], I figured that we should separate the query and the display part, and hide the query part within the template system. Goes like this:
{{speciesdata:Foobus Barus}}
in the article; [[Template:Speciesdata]] looks like this:
<data> <query database="wikispecies" result="r1">Some sort of XQuery or SQL query for wikispecies for {{{1}}}</query> Some species data table using <r1>latin_name</r1>, <r1>name_en</r1>, <r1>family</r1> etc. </data>
For creating lists (like "all species within the family 'Foobus'"), a <foreach> element could be used. The <data> thingy would be a plugin ("plugins GOOD!"), but one that returns wikitext to be parsed further. It would handle the <query> and <foreach> tags etc.
So, we'd have *one* ugly m..........r of a <data><query> kind of template, which, once created, would not be edited much. All the powerful, functional ugliness that could scare newbies away would be hidden behind the template. Yes, I got it all figured out.
Then it hit me.
As good "wiki-fiddlers" (thanks so much, Register!) we would like to see every change in WikiData on the wikipedia pages real soon. Like, now. So the information that something changes, and what changed, has to pass from the data site to the display site. There are two ways to do that: push or pull.
PUSH means the data site will notify the display site that something has changed, and the display needs to be updated. For that, the data site has to know which pages of the display site are affected by which change. Then, it has to notify the display site of this. Bad things:
- Needs basically a cache of *all* queries *ever* asked of the data
site, as well as their results
- Has to recalculate *all* of these after *every* change to find which
queries produce different results
- Won't work if the display site is offline
- Won't work well with non-wikipedias
That can't be it.
PULL means the display site asks the data site if anything has changed, which basically means rerunning a query. Which means, doing this for *every* pageview, even for anons. Which means, all caching variants, including squids, are going bye-bye. Additionally, for every page view, the display site has to wait for the data site to complete the query. Think wikipedia is slow today? Think again...
That can't be it, either.
Oh, sure, we can cache the queries with results on the display site, or only update the data once a day/week, but then we won't be wiki (=quick) anymore, no? Will this be the price to pay?
I think I'll have that autumn-depression now, please...
Magnus
Jens Ropers wrote:
Good acumen Magnus. A very incisive "rant".
Thanks.
Anyway, just musing and mulling: Could the PULL method be implemented w/ a checksum? Say, generate a fairly short checksum (we're talking versioning here, not security) for every article revision. Then, with each request hitting an Intarweb-facing (caching) webserver, have that cache box look if there's a version stored in its cache for said article (if not, fetch the article from the actual DB , etc.). IF however there is a cached version, ask the DB server for its current checksum on its current version. If this matches the checksum the cache has for its version, just don't bother the DB any further and serve the page from the cache. If the checksums differ, then again fetch the article from the DB and serve that (and cache the new article and checksum for potential subsequent requests). This entire checksum thing will NOT be required for any cached non-current revisions, because they won't change. So, yes, for each request hitting the cache server, there'd be a short checksum PULL with the actual DB server, but other than that (and provided the article hasn't changed) it can just be served from the cache.
So, the DB server keeps a list with a checksum (or a version number; this is supposed to be wiki-like) for each data entry, and likewise does the article, right?

What if there is more than a single data entry in that article? Like the list of species I mentioned. Say a new species was added at wikidata; how do we handle that one? What if there are multiple queries in one article? What if (in my example) the actual query is in a template? What if that template includes other templates that contain queries?
Yes, I think that it could be done. But, and I say that as someone who started programming with "spaghetti code", it looks like a mess to me. A dependency nightmare. We are already suffering from such effects (think categories in templates) without wikidata to look out for. Also, you will have to query the DB server and wait for its answer on *every* page view, including cached/anons, to deliver the checksum(s). And, this will work only with the most rudimentary database structure, like "SELECT * from specieslist where name='Foo'". If wikidata is to become more complex than that (and I don't say it should, just speculating), if wikidata tables can be interlinked, then there will be no "simple" dependency on a single data entry anymore.
Does that make sense to people? Or am I reinventing the wheel or something? I'm just brainstorming, I'm not even a real programmer. (Translation: The above may be--or may not be--rubbish.)
Definitely not rubbish. But a lot more complicated than it looks at first glance, IMHO.
Magnus
On 21 Oct 2004, at 23:31, Magnus Manske wrote:
Jens Ropers wrote:
Anyway, just musing and mulling: Could the PULL method be implemented w/ a checksum? Say, generate a fairly short checksum (we're talking versioning here, not security) for every article revision. Then, with each request hitting an Intarweb-facing (caching) webserver, have that cache box look if there's a version stored in its cache for said article (if not, fetch the article from the actual DB , etc.). IF however there is a cached version, ask the DB server for its current checksum on its current version. If this matches the checksum the cache has for its version, just don't bother the DB any further and serve the page from the cache. If the checksums differ, then again fetch the article from the DB and serve that (and cache the new article and checksum for potential subsequent requests). This entire checksum thing will NOT be required for any cached non-current revisions, because they won't change. So, yes, for each request hitting the cache server, there'd be a short checksum PULL with the actual DB server, but other than that (and provided the article hasn't changed) it can just be served from the cache.
So, the DB server keeps a list with a checksum (or a version number; this is supposed to be wiki-like) for each data entry, and likewise does the article, right?
Yup.
What if there is more than a single data entry in that article? Like the list of species I mentioned. Say a new species was added at wikidata; how do we handle that one? What if there are multiple queries in one article? What if (in my example) the actual query is in a template? What if that template includes other templates that contain queries?
I wouldn't have a clue to be honest. I would, in my non-coder and possibly naive imagination reckon that maybe a "one article-one checksum" principle should work with most pages. As regards Wikidata et alia, well, I dunno -- counting templates (and possibly single data sets; but I don't really know a lot about what you're building/you've built there) as articles may work as well. Then again, just having (only) ordinary articles intelligently cached as per the above proposal might solve the biggest part of our problem.
Yes, I think that it could be done. But, and I say that as someone who started programming with "spaghetti code", it looks like a mess to me. A dependency nightmare. We are already suffering from such effects (think categories in templates) without wikidata to look out for. Also, you will have to query the DB server and wait for its answer on *every* page view, including cached/anons, to deliver the checksum(s).
True. I'm really not trying to deliberately complicate things, but "we" could also make the DB servers report all checksums to a group of separate dedicated checksum cache servers (which would replicate between each other like there's no tomorrow, to make damn sure that all checksum cache servers would ALWAYS have the identical set of checksums). One of these redundant checksum servers would then be checked (which should spread the load), and if there's so much as a delay with querying one of them, then the checksum query could fail over (round robin) to the next checksum cache box. There prolly should also remain a last-resort fail-over option of querying the DB server directly and doing away with the entire checksum thing (which, after all, is only there to save time and shield the DB server from excessive queries).
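Very roughly (Python again; the server objects and the timeout value are invented for the sake of the example):

def current_checksum(title, checksum_servers, db_server, timeout=0.2):
    # Ask the replicated checksum boxes in turn; if one is slow or down,
    # fail over to the next. Last resort: ask the DB server directly.
    for server in checksum_servers:
        try:
            return server.get_checksum(title, timeout=timeout)
        except TimeoutError:
            continue
    return db_server.get_checksum(title)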
And, this will work only with the most rudimentary database structure, like "SELECT * from specieslist where name='Foo'". If wikidata is to become more complex than that (and I don't say it should, just speculating), if wikidata tables can be interlinked, then there will be no "simple" dependency on a single data entry anymore.
Does that make sense to people? Or am I reinventing the wheel or something? I'm just brainstorming, I'm not even a real programmer. (Translation: The above may be--or may not be--rubbish.)
Definitely not rubbish. But a lot more complicated than it looks at first glance, IMHO.
Yea, that's probably true. All the worse as I won't be the one coding it, because I, uh, lack the requisite programming skills :-/
--ropers
Magnus
On 22 Oct 2004, at 00:32, Jens Ropers wrote:
On 21 Oct 2004, at 23:31, Magnus Manske wrote:
Jens Ropers wrote:
Anyway, just musing and mulling: Could the PULL method be implemented w/ a checksum? Say, generate a fairly short checksum (we're talking versioning here, not security) for every article revision. Then, with each request hitting an Intarweb-facing (caching) webserver, have that cache box look if there's a version stored in its cache for said article (if not, fetch the article from the actual DB , etc.). IF however there is a cached version, ask the DB server for its current checksum on its current version. If this matches the checksum the cache has for its version, just don't bother the DB any further and serve the page from the cache. If the checksums differ, then again fetch the article from the DB and serve that (and cache the new article and checksum for potential subsequent requests). This entire checksum thing will NOT be required for any cached non-current revisions, because they won't change. So, yes, for each request hitting the cache server, there'd be a short checksum PULL with the actual DB server, but other than that (and provided the article hasn't changed) it can just be served from the cache.
So, the DB server keeps a list with a checksum (or a version number; this is supposed to be wiki-like) for each data entry, and likewise does the article, right?
Yup.
To add:
We shouldn't however confuse these "checksum version numbers" with the existing Wikipedia "article revision version numbers", because past revisions never change, and the entire point of this "checksum version number" system is to determine whether the CURRENT version of the article has changed without bothering the actual DB server.
On second thought, once there are version numbers for CURRENT articles (see bug 181 -- http://bugzilla.wikipedia.org/show_bug.cgi?id=181), these could/should be used as our checksums: the Internet-facing cache server would check whether its article version number matches the version number the DB presently knows as the CURRENT one.
I hope this makes sense.
What if there is more than a single data entry in that article? Like the list of species I mentioned. Say a new species was added at wikidata; how do we handle that one? What if there are multiple queries in one article? What if (in my example) the actual query is in a template? What if that template includes other templates that contain queries?
I wouldn't have a clue to be honest. I would, in my non-coder and possibly naive imagination reckon that maybe a "one article-one checksum" principle should work with most pages. As regards Wikidata et alia, well, I dunno -- counting templates (and possibly single data sets; but I don't really know a lot about what you're building/you've built there) as articles may work as well. Then again, just having (only) ordinary articles intelligently cached as per the above proposal might solve the biggest part of our problem.
Yes, I think that it could be done. But, and I say that as someone who started programming with "spaghetti code", it looks like a mess to me. A dependency nightmare. We are already suffering from such effects (think categories in templates) without wikidata to look out for. Also, you will have to query the DB server and wait for its answer on *every* page view, including cached/anons, to deliver the checksum(s).
True. I'm really not trying to deliberately complicate things , but "we" could also make the DB servers report all checksums to a group of separate dedicated checksum cache servers (which would replicate between each other like there's no tomorrow, to make damn sure that all checksum cache servers would ALWAYS have the identical set of checksums). One of these redundant checksum servers would then be checked (which should spread the load) and if there's so much as a delay with querying one of them, then the checksum query could fail-over (round robin) to the next checksum cache box. There prolly should also remain a last resort fail-over option of querying the DB server direct and doing away with the entire checksum thing (which, after all is only there to save time and shield the DB server from excessive queries).
And, this will work only with the most rudimentary database structure, like "SELECT * from specieslist where name='Foo'". If wikidata is to become more complex than that (and I don't say it should, just speculating), if wikidata tables can be interlinked, then there will be no "simple" dependency on a single data entry anymore.
Does that make sense to people? Or am I reinventing the wheel or something? I'm just brainstorming, I'm not even a real programmer. (Translation: The above may be--or may not be--rubbish.)
Definitely not rubbish. But a lot more complicated than it looks at first glance, IMHO.
Yea, that's probably true. All the worse as I won't be the one coding it, because I, uh, lack the requisite programming skills :-/
--ropers
Magnus
Magnus Manske wrote:
<data> <query database="wikispecies" result="r1">Some sort of XQuery or SQL query for wikispecies for {{{1}}}</query> Some species data table using <r1>latin_name</r1>, <r1>name_en</r1>, <r1>family</r1> etc. </data>
My approach would be to not use SQL, or anything similar. Use a custom syntax with a greatly restricted feature set. Think in terms of applications. Only allow queries which can be cached and invalidated. Fetching single rows would be a good place to start, that's all I would have implemented if I followed my WikiDB idea.
Cache invalidation or purging is the standard solution here. Make a list of every article which fetches a particular row, and update it on edit. Then when the row changes, invalidate all the articles in the list. Make a list of every article which contains a list of species in the Foobus family. Invalidate all articles in the list every time a species is added or removed from that family.
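Roughly, in Python (the map layout and function names are only meant to illustrate the idea, not a worked-out design):

from collections import defaultdict

row_dependents = defaultdict(set)      # (table, key) -> articles fetching that row
family_dependents = defaultdict(set)   # family name  -> articles listing that family

def register_article(title, fetched_rows, listed_families):
    # Rebuilt whenever the article is edited/saved.
    for row in fetched_rows:
        row_dependents[row].add(title)
    for family in listed_families:
        family_dependents[family].add(title)

def articles_to_invalidate_on_row_change(row):
    # A single data row was edited: purge only the articles that fetch it.
    return row_dependents.get(row, set())

def articles_to_invalidate_on_family_change(family):
    # A species was added to or removed from the family: purge every
    # article that lists that family.
    return family_dependents.get(family, set())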
It's disappointing to give up on some of the dream, but at some stage of the development process, you have to be realistic. My advice would be to set a short term goal (a few months or so), code something useful, admire your work, then go from there.
-- Tim Starling
Tim Starling wrote:
Magnus Manske wrote:
<data> <query database="wikispecies" result="r1">Some sort of XQuery or SQL query for wikispecies for {{{1}}}</query> Some species data table using <r1>latin_name</r1>, <r1>name_en</r1>, <r1>family</r1> etc. </data>
My approach would be to not use SQL, or anything similar. Use a custom syntax with a greatly restricted feature set. Think in terms of applications. Only allow queries which can be cached and invalidated. Fetching single rows would be a good place to start, that's all I would have implemented if I followed my WikiDB idea.
Reducing the possible queries would simplify things. But, I was under the impression that more is demanded from a WikiData system, and we should probably get this right from the start. But, I am not insisting on SQL or anything. It just seemed the natural choice for me, besides XQueries.
Cache invalidation or purging is the standard solution here. Make a list of every article which fetches a particular row, and update it on edit. Then when the row changes, invalidate all the articles in the list.
No problem here.
Make a list of every article which contains a list of species in the Foobus family. Invalidate all articles in the list every time a species is added or removed from that family.
Now *that* requires that we:
- keep the original query for each article, and its results
- rerun that query every time any entry has changed or was added, and compare it to the original results
Also, that works only if we have, for example, all the species data in one table. Like, kingdom, phylum, class, order, family, genus. If we, instead, decide to have one table for species which contains only the genus, then another table for the genus which, apart from information about the genus in general, contains the order, etc., then this will become a problem. *Theoretically*, an order could be moved from one subclass to another. Now all species in a genus in a family in a suborder in that order needs to be updated. Good luck with that.
It's disappointing to give up on some of the dream, but at some stage of the development process, you have to be realistic. My advice would be to set a short term goal (a few months or so), code something useful, admire your work, then go from there.
If there were consensus to limit WikiData to only the most simple queries ("... WHERE name='Foobus'"), and to give up on instant updates and just clear the cache once in a while to update data in articles, something could be done. Otherwise, I'll steer clear of this one, unless it turns out there's something obvious I missed.
Magnus
Magnus Manske wrote: <snip stop depressing!>
As good "wiki-fiddlers" (thanks so much, Register!) we would like to see every change in WikiData on the wikipedia pages real soon. Like, now. So the information that something changes, and what changed, has to pass from the data site to the display site. There are two ways to do that: push or pull.
PUSH means the data site will notify the display site that something has changed, and the display needs to be updated. For that, the data site has to know which pages of the display site are affected by which change. Then, it has to notify the display site of this. Bad things:
- Needs basically a cache of *all* queries *ever* asked of the data
site, as well as their results
- Has to recalculate *all* of these after *every* change to find which
queries produce different results
- Won't work if the display site is offline
- Won't work well with non-wikipedias
That can't be it.
<snip PULL>
Hello,
I would personally PUSH data from the wikidata site to the content publishers (like wikipedia).
A lot of blog systems have a feature known as trackback. When someone publishes an article which contains references to other blogs, their blog system will send a ping (known as an XML-RPC ping) to the referenced blogs, alerting them that their news got reused somewhere.
Simple example: the blog slashdot publishes a news item about NASA discovering martians.
MartianFan001, who is part of a "Life on Mars foundation", decides to publish a news item about it and references slashdot.
JohnDoe, who likes things about Mars, decides to publish a news item on his personal blog, and his article is something like:
<<The Mars foundation [http://marsfoundation/newsid/113] reports news originally posted by [http://slashdot/?newsid=123912 slashdot] about life on Mars!>>
He submits that news to his blog engine, which parses the links and tries to send pings to marsfoundation and slashdot saying: johndoe.com/newsid=5 references your article!
When receiving this ping, the marsfoundation and slashdot blogs can update their trackback lists:
slashdot news #123912 referenced by: "GeekHideout", "Nerds.com", "Mars foundation"
Marsfoundation news #113 referenced by: "JohnDoe"
So when a site wants to use wikidata, it sends a query to the wikidata server together with its internal reference (e.g. the name of the wikipedia article and its language). Wikidata then sends back the requested data along with the wikidata internal reference.
When a wikidata entry is changed, the site sends a ping with the update to every site referencing that set of data. From there, the site using the data will answer wikidata with a code: 1/ data change acknowledged; 2/ no more need for this data, remove me; 3/ doesn't answer.
If it doesn't answer, there could be a system that queues the ping so it can be sent later (and eventually drops it after x days).
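A rough sketch of the ping side (Python; send_ping stands in for an XML-RPC call, and all the names are only illustration):

ACK_CHANGE = 1      # 1/ data change acknowledged
ACK_REMOVE_ME = 2   # 2/ no more need for this data, remove me
# 3/ doesn't answer: the call times out and the ping is queued for retry

def push_update(reference, subscribers, retry_queue, send_ping):
    still_interested = []
    for site in subscribers:
        try:
            code = send_ping(site, reference)
        except TimeoutError:
            retry_queue.append((site, reference))   # send later, drop after x days
            still_interested.append(site)
            continue
        if code == ACK_CHANGE:
            still_interested.append(site)
        # ACK_REMOVE_ME: drop the site from the subscriber list
    return still_interested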
I believe the PULL method will generate too much traffic for data which is probably not going to change between each view. Data about species is probably much more stable than NASDAQ stocks.
cheers,
Ashar Voultoiz wrote: <snip attempt to rescue the thing>
So when a site wants to use wikidata, it sends a query to the wikidata server together with its internal reference (e.g. the name of the wikipedia article and its language). Wikidata then sends back the requested data along with the wikidata internal reference.
When a wikidata entry is changed, the site sends a ping with the update to every site referencing that set of data. From there, the site using the data will answer wikidata with a code: 1/ data change acknowledged; 2/ no more need for this data, remove me; 3/ doesn't answer.
If it doesn't answer, there could be a system that queues the ping so it can be sent later (and eventually drops it after x days).
That will work nicely, if we restrict WikiData access to "show me that specific row from that specific table in that specific database". Which is fine for "Show me data on that species".
But as soon as we allow queries to return lists (e.g., "show me all species of that family"), we cannot do that anymore. Suppose someone adds a species to WikiData. How can we know that a wikipedia page needs to be updated?
Only one way to do that:
- Store the original query, the wikipedia page for that query, and its results
- On changing any WikiData, rerun *all* these queries, compare their results to the stored ones, and notify wikipedias if necessary
Rerunning a million queries for each data change will dwarf the possible traffic generated from pull (pull isn't really better either; that's the dilemma).
Also, pushing will require extensive infrastructure on the recipient's site, which is not necessarily a wikimedia project (the data should be available to everyone).
Magnus