Good acumen, Magnus. A very incisive "rant".
Anyway, just musing and mulling:
Could the PULL method be implemented w/ a checksum?
Say, generate a fairly short checksum (we're talking versioning here,
not security) for every article revision.
Then, with each request hitting an Intarweb-facing (caching) webserver,
have that cache box check whether there's a version stored in its cache
for said article (if not, fetch the article from the actual DB, etc.).
If, however, there IS a cached version, ask the DB server for the
checksum of its current version. If this matches the checksum the cache
has for its version, just don't bother the DB any further and serve the
page from the cache. If the checksums differ, then again fetch the
article from the DB and serve that (and cache the new article and
checksum for potential subsequent requests). This entire checksum thing
will NOT be required for any cached non-current revisions, because they
won't change. So, yes, for each request hitting the cache server,
there'd be a short checksum PULL to the actual DB server, but other
than that (and provided the article hasn't changed) the page can just
be served from the cache.
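To make the idea concrete, here's a minimal Python sketch of that
checksum PULL. Everything here is hypothetical illustration: the
fetch_current_checksum() and fetch_article() helpers stand in for
whatever round trips the cache box would actually make to the DB
server, and the dicts stand in for the real DB and cache.

```python
import hashlib

db = {}      # stand-in for the actual DB server: title -> article text
cache = {}   # stand-in for the cache box: title -> (checksum, page)

def checksum(text):
    # Fairly short checksum -- we're talking versioning here, not
    # security, so a truncated digest is plenty.
    return hashlib.md5(text.encode("utf-8")).hexdigest()[:8]

def fetch_current_checksum(title):
    # Hypothetical: one short round trip to the DB server
    return checksum(db[title])

def fetch_article(title):
    # Hypothetical: the full fetch from the actual DB
    return db[title]

def serve(title):
    if title in cache:
        cached_sum, page = cache[title]
        # The short checksum PULL to the DB server...
        if fetch_current_checksum(title) == cached_sum:
            return page  # ...matches, so don't bother the DB further
    # Cache miss, or checksums differ: fetch, cache, serve
    text = fetch_article(title)
    cache[title] = (checksum(text), text)
    return text
```

So a repeat request for an unchanged article costs only the checksum
round trip, while an edit makes the very next request pull the new
version.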
Does that make sense to people?
Or am I reinventing the wheel or something?
I'm just brainstorming, I'm not even a real programmer. (Translation:
The above may be--or may not be--rubbish.)
-- ropers [[en:User:Ropers]]
www.ropersonline.com
On 21 Oct 2004, at 21:07, Magnus Manske wrote:
Warning: Long and depressing text follows. Don't
read it at home, save
it for work instead. Better spend a nice evening with your girlfriend.
(Then again, this list is probably like slashdot, so forget about the
imaginary girlfriend and continue reading ;-)
I thought I had it all figured out.
I created a demo version for data entry in a wiki-like fashion. It
uses a "one-table-fits-all" SQL schema, which some of you had worries
about. No problem. If someone else writes a better data entry
mechanism, I'm all for it. As far as I'm concerned, the WikiData site
should be like a black box to the outside, serving data to wikipedias
and everyone else who wants it. What's going on inside is only for
those who enter the data.
Today, I finished creating a rough draft for the query (the wikipedia)
side of the bargain. Instead of creating Yet Another Wikimarkup [{(like
this)}] I figured out that we should separate the query and the
display part, and hide the query part within the template system. Goes
like this:
{{speciesdata:Foobus Barus}}
in the article; [[Template:Speciesdata]] looks like this:
<data>
  <query database="wikispecies" result="r1">Some sort of XQuery or SQL
  query for wikispecies for {{{1}}}</query>
  Some species data table using <r1>latin_name</r1>, <r1>name_en</r1>,
  <r1>family</r1> etc.
</data>
For creating lists (like "all species within the family 'Foobus'"), a
<foreach> element could be used.
The <data> thingy would be a plugin ("plugins GOOD!"), but one that
returns wikitext to be parsed further. It would handle the <query> and
<foreach> tags etc.
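As a rough sketch of what such a plugin might do (hypothetical names
throughout -- run_query() stands in for whatever XQuery/SQL backend the
data site would expose, and the regexes are just illustration): run
each <query>, stash the result under its result name, then substitute
the <r1>field</r1>-style placeholders, returning plain wikitext to be
parsed further.

```python
import re

def run_query(database, query):
    # Stand-in for the real query against the data site; returns a
    # pretend wikispecies result row for the example.
    return {"latin_name": "Foobus barus",
            "name_en": "Common foob",
            "family": "Foobidae"}

def expand_data(template_body):
    results = {}

    def do_query(m):
        # Run the query, remember the result under its "result" name;
        # the <query> element itself produces no output.
        results[m.group("name")] = run_query(m.group("db"), m.group("q"))
        return ""

    body = re.sub(
        r'<query database="(?P<db>[^"]+)" result="(?P<name>\w+)">'
        r'(?P<q>.*?)</query>',
        do_query, template_body, flags=re.S)

    def do_field(m):
        # <r1>latin_name</r1> -> the latin_name field of result r1
        return str(results[m.group(1)][m.group(2)])

    return re.sub(r"<(\w+)>(\w+)</\1>", do_field, body).strip()
```

A <foreach> would be one more substitution pass in the same spirit,
repeating its body once per result row.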
So, we'd have *one* ugly m..........r of a <data><query> kind of
template which, once created, would rarely be edited again. All
the powerful, functional ugliness that could scare newbies away would
be hidden inside the template.
Yes, I got it all figured out.
Then it hit me.
As good "wiki-fiddlers" (thanks so much, Register!) we would like to
see every change in WikiData on the wikipedia pages real soon. Like,
now.
So the information that something changes, and what changed, has to
pass from the data site to the display site. There are two ways to do
that: push or pull.
PUSH means the data site will notify the display site that something
has changed, and the display needs to be updated. For that, the data
site has to know which pages of the display site are affected by which
change. Then, it has to notify the display site of this. Bad things:
* Needs basically a cache of *all* queries *ever* asked of the data
site, as well as their results
* Has to recalculate *all* of these after *every* change to find which
queries produce different results
* Won't work if the display site is offline
* Won't work well with non-wikipedias
That can't be it.
PULL means the display site asks the data site if anything has
changed, which basically means rerunning a query. Which means, doing
this for *every* pageview, even for anons. Which means, all caching
variants, including squids, are going bye-bye. Additionally, for every
page view, the display site has to wait for the data site to complete
the query.
Think wikipedia is slow today? Think again...
That can't be it, either.
Oh, sure, we can cache the queries with results on the display site,
or only update the data once a day/week, but then we won't be wiki
(=quick) anymore, no? Will this be the price to pay?
I think I'll have that autumn-depression now, please...
Magnus
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)wikimedia.org
http://mail.wikipedia.org/mailman/listinfo/wikitech-l