Warning: Long and depressing text follows. Don't read it at home, save it for work instead. Better spend a nice evening with your girlfriend. (Then again, this list is probably like slashdot, so forget about the imaginary girlfriend and continue reading ;-)
I thought I had it all figured out.
I created a demo version for data entry in a wiki-like fashion. It uses a "one-table-fits-all" SQL schema, which some of you had worries about. No problem. If someone else writes a better data entry mechanism, I'm all for it. As far as it concerns me, the WikiData site should be like a black box to the outside, serving data to wikipedias and everyone else who wants it. What's going on inside is only for those who enter the data.
Today, I finished creating a rough draft for the query (the wikipedia) side of the bargain. Instead of creating Yet Another Wikimarkup [{(like this)}], I figured that we should separate the query and the display part, and hide the query part within the template system. Goes like this:
{{speciesdata:Foobus Barus}}
in the article; [[Template:Speciesdata]] looks like this:
<data> <query database="wikispecies" result="r1">Some sort of XQuery or SQL query for wikispecies for {{{1}}}</query> Some species data table using <r1>latin_name</r1>, <r1>name_en</r1>, <r1>family</r1> etc. </data>
For creating lists (like "all species within the family 'Foobus'"), a <foreach> element could be used. The <data> thingy would be a plugin ("plugins GOOD!"), but one that returns wikitext to be parsed further. It would handle the <query> and <foreach> tags etc.
So, we'd have *one* ugly m..........r of a <data><query> kind of template, which, once created, would not be edited much. All the powerful, functional ugliness that could scare newbies away would be hidden behind the template. Yes, I got it all figured out.
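Just to make the mechanics concrete, here's a toy sketch (Python rather than actual parser code; the tag handling and the run_query stub are made up, not an existing interface) of what expanding such a <data> block could look like:

import re

def run_query(database, query):
    # Stand-in for the real XQuery/SQL call against the WikiData site;
    # returns one row as a dict. Purely illustrative.
    return {"latin_name": "Foobus barus", "name_en": "Common foob", "family": "Foobidae"}

def expand_data_block(template_text, page_arg):
    # Substitute the template parameter, run each <query>, then replace
    # the <rN>column</rN> placeholders with the query results.
    text = template_text.replace("{{{1}}}", page_arg)
    results = {}
    for m in re.finditer(r'<query database="([^"]+)" result="([^"]+)">(.*?)</query>', text, re.S):
        db, rid, query = m.groups()
        results[rid] = run_query(db, query)
    text = re.sub(r'<query.*?</query>', '', text, flags=re.S)
    def fill(m):
        rid, field = m.groups()
        return str(results.get(rid, {}).get(field, ""))
    text = re.sub(r'<(r\d+)>(\w+)</\1>', fill, text)
    return text.replace("<data>", "").replace("</data>", "").strip()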
Then it hit me.
As good "wiki-fiddlers" (thanks so much, Register!) we would like to see every change in WikiData on the wikipedia pages real soon. Like, now. So the information that something changes, and what changed, has to pass from the data site to the display site. There are two ways to do that: push or pull.
PUSH means the data site will notify the display site that something has changed, and the display needs to be updated. For that, the data site has to know which pages of the display site are affected by which change. Then, it has to notify the display site of this. Bad things:
- Needs basically a cache of *all* queries *ever* asked of the data site, as well as their results
- Has to recalculate *all* of these after *every* change to find which queries produce different results
- Won't work if the display site is offline
- Won't work well with non-wikipedias
That can't be it.
PULL means the display site asks the data site if anything has changed, which basically means rerunning a query. Which means, doing this for *every* pageview, even for anons. Which means, all caching variants, including squids, are going bye-bye. Additionally, for every page view, the display site has to wait for the data site to complete the query. Think wikipedia is slow today? Think again...
That can't be it, either.
Oh, sure, we can cache the queries with results on the display site, or only update the data once a day/week, but then we won't be wiki (=quick) anymore, no? Will this be the price to pay?
I think I'll have that autumn-depression now, please...
Magnus
Perhaps a way to address this dilemma is with a manual pull system. A page that incorporates WikiData would display a message indicating that the page uses data that was last refreshed *for this page* at such-and-such a date/time. (This information can be cached with the page, since it doesn't make a statement about the freshness of the underlying data.) The display would also give the user the ability to force a refresh if desired. (A DoS attack could be avoided by not allowing refresh before x amount of time has passed since the last refresh.)
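For what it's worth, the rate-limit part is only a few lines; a minimal sketch (Python, with a made-up interval and an in-memory map standing in for whatever the real site would use):

import time

MIN_REFRESH_INTERVAL = 15 * 60   # seconds; an arbitrary placeholder value
last_refresh = {}                # page title -> timestamp of last forced refresh

def try_refresh(page_title):
    # Allow a user-requested refresh only if enough time has passed since
    # the last one -- the simple DoS guard described above.
    now = time.time()
    last = last_refresh.get(page_title, 0)
    if now - last < MIN_REFRESH_INTERVAL:
        return False, last       # too soon: keep showing cached data and its date
    last_refresh[page_title] = now
    # ... re-run the page's WikiData queries here ...
    return True, now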
Wouldn't be quite as automatic as the pull system described in the original message, but it could avoid the severe performance penalty. Just a thought.
Alan
On Thu, 21 Oct 2004 21:07:45 +0200, Magnus Manske magnus.manske@web.de wrote: [...]
As good "wiki-fiddlers" (thanks so much, Register!) we would like to see every change in WikiData on the wikipedia pages real soon. Like, now. So the information that something changes, and what changed, has to pass from the data site to the display site. There are two ways to do that: push or pull.
PUSH means the data site will notify the display site that something has changed, and the display needs to be updated. For that, the data site has to know which pages of the display site are affected by which change. Then, it has to notify the display site of this. Bad things:
- Needs basically a cache of *all* queries *ever* asked of the data
site, as well as their results
- Has to recalculate *all* of these after *every* change to find which
queries produce different results
- Won't work if the display site is offline
- Won't work well with non-wikipedias
That can't be it.
PULL means the display site asks the data site if anything has changed, which basically means rerunning a query. Which means, doing this for *every* pageview, even for anons. Which means, all caching variants, including squids, are going bye-bye. Additionally, for every page view, the display site has to wait for the data site to complete the query. Think wikipedia is slow today? Think again...
That can't be it, either.
Oh, sure, we can cache the queries with results on the display site, or only update the data once a day/week, but then we won't be wiki (=quick) anymore, no? Will this be the price to pay?
Alan Wessman wrote:
Perhaps a way to address this dilemma is with a manual pull system. A page that incorporates WikiData would display a message indicating that the page uses data that was last refreshed *for this page* at such-and-such a date/time. (This information can be cached with the page, since it doesn't make a statement about the freshness of the underlying data.) The display would also give the user the ability to force a refresh if desired. (A DoS attack could be avoided by not allowing refresh before x amount of time has passed since the last refresh.)
Wouldn't be quite as automatic as the pull system described in the original message, but it could avoid the severe performance penalty. Just a thought.
That would be a way, and actually easy to code ;-)
From a usability standpoint, it would be a last resort, though.
Magnus
Good acumen, Magnus. A very incisive "rant".
Anyway, just musing and mulling: Could the PULL method be implemented w/ a checksum? Say, generate a fairly short checksum (we're talking versioning here, not security) for every article revision. Then, with each request hitting an Intarweb-facing (caching) webserver, have that cache box check whether there's a version stored in its cache for said article (if not, fetch the article from the actual DB, etc.). IF however there is a cached version, ask the DB server for its current checksum on its current version. If this matches the checksum the cache has for its version, just don't bother the DB any further and serve the page from the cache. If the checksums differ, then again fetch the article from the DB and serve that (and cache the new article and checksum for potential subsequent requests). This entire checksum thing will NOT be required for any cached non-current revisions, because they won't change. So, yes, for each request hitting the cache server, there'd be a short checksum PULL with the actual DB server, but other than that (and provided the article hasn't changed) it can just be served from the cache.
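In rough (Python) pseudocode -- and bear in mind the cache/db objects here are made up, not anything that exists in MediaWiki -- the cache box logic would be something like:

import hashlib

def checksum(text):
    # Short fingerprint -- versioning, not security.
    return hashlib.md5(text.encode("utf-8")).hexdigest()[:8]

def serve_article(title, cache, db):
    cached = cache.get(title)              # (text, checksum) or None
    if cached is None:
        text = db.fetch_current(title)     # full fetch only on a cache miss
        cache.put(title, (text, checksum(text)))
        return text
    text, cached_sum = cached
    if db.current_checksum(title) == cached_sum:
        return text                        # cheap: only a checksum crossed the wire
    text = db.fetch_current(title)         # article changed: refresh the cache
    cache.put(title, (text, checksum(text)))
    return text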
Does that make sense to people? Or am I reinventing the wheel or something? I'm just brainstorming, I'm not even a real programmer. (Translation: The above may be--or may not be--rubbish.)
-- ropers [[en:User:Ropers]] www.ropersonline.com
On 21 Oct 2004, at 21:07, Magnus Manske wrote:
Warning: Long and depressing text follows. Don't read it at home, save it for work instead. Better spend a nice evening with your girlfriend. (Then again, this list is probably like slashdot, so forget about the imaginary girlfriend and continue reading ;-)
I thought I had it all figured out.
I created a demo version for data entry in a wiki-like fashion. It uses a "one-table-fits-all" SQL schema, which some of you had worries about. No problem. If someone else writes a better data entry mechanism, I'm all for it. As far as it concerns me, the WikiData site should be like a black box to the outside, serving data to wikipedias and everyone else who wants it. What's going on inside is only for those who enter the data.
Today, I finished creating a rough draft for the query (the wikipedia) side of the bargain. Instead of creating Yet Another Wikimarkup [{(like this)}], I figured that we should separate the query and the display part, and hide the query part within the template system. Goes like this:
{{speciesdata:Foobus Barus}}
in the article; [[Template:Speciesdata]] looks like this:
<data> <query database="wikispecies" result="r1">Some sort of XQuery or SQL query for wikispecies for {{{1}}}</query> Some species data table using <r1>latin_name</r1>, <r1>name_en</r1>, <r1>family</r1> etc. </data>
For creating lists (like "all species within the family 'Foobus'"), a <foreach> element could be used. The <data> thingy would be a plugin ("plugins GOOD!"), but one that returns wikitext to be parsed further. It would handle the <query> and <foreach> tags etc.
So, we'd have *one* ugly m..........r of a <data><query> kind of template, which, once created, would not be edited much. All the powerful, functional ugliness that could scare newbies away would be hidden behind the template. Yes, I got it all figured out.
Then it hit me.
As good "wiki-fiddlers" (thanks so much, Register!) we would like to see every change in WikiData on the wikipedia pages real soon. Like, now. So the information that something changes, and what changed, has to pass from the data site to the display site. There are two ways to do that: push or pull.
PUSH means the data site will notify the display site that something has changed, and the display needs to be updated. For that, the data site has to know which pages of the display site are affected by which change. Then, it has to notify the display site of this. Bad things:
- Needs basically a cache of *all* queries *ever* asked of the data
site, as well as their results
- Has to recalculate *all* of these after *every* change to find which
queries produce different results
- Won't work if the display site is offline
- Won't work well with non-wikipedias
That can't be it.
PULL means the display site asks the data site if anything has changed, which basically means rerunning a query. Which means, doing this for *every* pageview, even for anons. Which means, all caching variants, including squids, are going bye-bye. Additionally, for every page view, the display site has to wait for the data site to complete the query. Think wikipedia is slow today? Think again...
That can't be it, either.
Oh, sure, we can cache the queries with results on the display site, or only update the data once a day/week, but then we won't be wiki (=quick) anymore, no? Will this be the price to pay?
I think I'll have that autumn-depression now, please...
Magnus
Jens Ropers wrote:
Good acumen Magnus. A very incisive "rant".
Thanks.
Anyway, just musing and mulling: Could the PULL method be implemented w/ a checksum? Say, generate a fairly short checksum (we're talking versioning here, not security) for every article revision. Then, with each request hitting an Intarweb-facing (caching) webserver, have that cache box look if there's a version stored in its cache for said article (if not, fetch the article from the actual DB , etc.). IF however there is a cached version, ask the DB server for its current checksum on its current version. If this matches the checksum the cache has for its version, just don't bother the DB any further and serve the page from the cache. If the checksums differ, then again fetch the article from the DB and serve that (and cache the new article and checksum for potential subsequent requests). This entire checksum thing will NOT be required for any cached non-current revisions, because they won't change. So, yes, for each request hitting the cache server, there'd be a short checksum PULL with the actual DB server, but other than that (and provided the article hasn't changed) it can just be served from the cache.
So, the DB server keeps a list with a checksum (or a version number; this is supposed to be wiki-like) for each data entry, and likewise does the article, right?

What if there is more than a single data entry in that article? Like the list of species I mentioned. Say a new species was added at wikidata; how do we handle that one? What if there are multiple queries in one article? What if (in my example) the actual query is in a template? What if that template includes other templates that contain queries?
Yes, I think that it could be done. But, and I say that as someone who started programming with "spaghetti code", it looks like a mess to me. A dependency nightmare. We are already suffering from such effects (think categories in templates) without wikidata to look out for. Also, you will have to query the DB server and wait for its answer on *every* page view, including cached/anons, to deliver the checksum(s). And, this will work only with the most rudimentary database structure, like "SELECT * from specieslist where name='Foo'". If wikidata is to become more complex than that (and I don't say it should, just speculating), if wikidata tables can be interlinked, then there will be no "simple" dependency on a single data entry anymore.
Does that make sense to people? Or am I reinventing the wheel or something? I'm just brainstorming, I'm not even a real programmer. (Translation: The above may be--or may not be--rubbish.)
Definitely not rubbish. But a lot more complicated than it looks at first glance, IMHO.
Magnus
On 21 Oct 2004, at 23:31, Magnus Manske wrote:
Jens Ropers wrote:
Anyway, just musing and mulling: Could the PULL method be implemented w/ a checksum? Say, generate a fairly short checksum (we're talking versioning here, not security) for every article revision. Then, with each request hitting an Intarweb-facing (caching) webserver, have that cache box look if there's a version stored in its cache for said article (if not, fetch the article from the actual DB , etc.). IF however there is a cached version, ask the DB server for its current checksum on its current version. If this matches the checksum the cache has for its version, just don't bother the DB any further and serve the page from the cache. If the checksums differ, then again fetch the article from the DB and serve that (and cache the new article and checksum for potential subsequent requests). This entire checksum thing will NOT be required for any cached non-current revisions, because they won't change. So, yes, for each request hitting the cache server, there'd be a short checksum PULL with the actual DB server, but other than that (and provided the article hasn't changed) it can just be served from the cache.
So, the DB server keeps a list with a checksum (or a version number; this is supposed to be wiki-like) for each data entry, and likewise does the article, right?
Yup.
What if there is more than a single data entry in that article? Like the list of species I mentioned. Say a new species was added at wikidata; how do we handle that one? What if there are multiple queries in one article? What if (in my example) the actual query is in a template? What if that template includes other templates that contain queries?
I wouldn't have a clue to be honest. I would, in my non-coder and possibly naive imagination reckon that maybe a "one article-one checksum" principle should work with most pages. As regards Wikidata et alia, well, I dunno -- counting templates (and possibly single data sets; but I don't really know a lot about what you're building/you've built there) as articles may work as well. Then again, just having (only) ordinary articles intelligently cached as per the above proposal might solve the biggest part of our problem.
Yes, I think that it could be done. But, and I say that as someone who started programming with "spaghetti code", it looks like a mess to me. A dependency nightmare. We are already suffering from such effects (think categories in templates) without wikidata to look out for. Also, you will have to query the DB server and wait for its answer on *every* page view, including cached/anons, to deliver the checksum(s).
True. I'm really not trying to deliberately complicate things, but "we" could also make the DB servers report all checksums to a group of separate dedicated checksum cache servers (which would replicate between each other like there's no tomorrow, to make damn sure that all checksum cache servers would ALWAYS have the identical set of checksums). One of these redundant checksum servers would then be checked (which should spread the load), and if there's so much as a delay with querying one of them, then the checksum query could fail over (round robin) to the next checksum cache box. There prolly should also remain a last-resort fail-over option of querying the DB server directly and doing away with the entire checksum thing (which, after all, is only there to save time and shield the DB server from excessive queries).
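Very roughly (Python again; the server objects and the timeout value are invented for the sake of the example):

def current_checksum(title, checksum_servers, db_server, timeout=0.2):
    # Ask the replicated checksum boxes in turn; if one is slow or down,
    # fail over to the next. Last resort: ask the DB server directly.
    for server in checksum_servers:
        try:
            return server.get_checksum(title, timeout=timeout)
        except TimeoutError:
            continue
    return db_server.get_checksum(title)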
And, this will work only with the most rudimentary database structure, like "SELECT * from specieslist where name='Foo'". If wikidata is to become more complex than that (and I don't say it should, just speculating), if wikidata tables can be interlinked, then there will be no "simple" dependency on a single data entry anymore.
Does that make sense to people? Or am I reinventing the wheel or something? I'm just brainstorming, I'm not even a real programmer. (Translation: The above may be--or may not be--rubbish.)
Definitely not rubbish. But a lot more complicated than it looks at first glance, IMHO.
Yea, that's probably true. All the worse as I won't be the one coding it, because I, uh, lack the requisite programming skills :-/
--ropers
Magnus
On 22 Oct 2004, at 00:32, Jens Ropers wrote:
On 21 Oct 2004, at 23:31, Magnus Manske wrote:
Jens Ropers wrote:
Anyway, just musing and mulling: Could the PULL method be implemented w/ a checksum? Say, generate a fairly short checksum (we're talking versioning here, not security) for every article revision. Then, with each request hitting an Intarweb-facing (caching) webserver, have that cache box look if there's a version stored in its cache for said article (if not, fetch the article from the actual DB , etc.). IF however there is a cached version, ask the DB server for its current checksum on its current version. If this matches the checksum the cache has for its version, just don't bother the DB any further and serve the page from the cache. If the checksums differ, then again fetch the article from the DB and serve that (and cache the new article and checksum for potential subsequent requests). This entire checksum thing will NOT be required for any cached non-current revisions, because they won't change. So, yes, for each request hitting the cache server, there'd be a short checksum PULL with the actual DB server, but other than that (and provided the article hasn't changed) it can just be served from the cache.
So, the DB server keeps a list with a checksum (or a version number; this is supposed to be wiki-like) for each data entry, and likewise does the article, right?
Yup.
To add:
We shouldn't however confuse these "checksum version numbers" with the existing Wikipedia "article revision version numbers", because past revisions never change, and the entire point of this "checksum version number" system is to determine whether the CURRENT version of the article has changed without bothering the actual DB server.
On second thought, once there are version numbers for CURRENT articles (see bug 181 -- http://bugzilla.wikipedia.org/show_bug.cgi?id=181), these could/should be used as our checksums: the Internet-facing cache server would check whether its article version number matches the version number the DB presently knows as the CURRENT one.
I hope this makes sense.
What if there is more than a single data entry in that article? Like the list of species I mentioned. Say a new species was added at wikidata; how do we handle that one? What if there are multiple queries in one article? What if (in my example) the actual query is in a template? What if that template includes other templates that contain queries?
I wouldn't have a clue to be honest. I would, in my non-coder and possibly naive imagination reckon that maybe a "one article-one checksum" principle should work with most pages. As regards Wikidata et alia, well, I dunno -- counting templates (and possibly single data sets; but I don't really know a lot about what you're building/you've built there) as articles may work as well. Then again, just having (only) ordinary articles intelligently cached as per the above proposal might solve the biggest part of our problem.
Yes, I think that it could be done. But, and I say that as someone who started programming with "spaghetti code", it looks like a mess to me. A dependency nightmare. We are already suffering from such effects (think categories in templates) without wikidata to look out for. Also, you will have to query the DB server and wait for its answer on *every* page view, including cached/anons, to deliver the checksum(s).
True. I'm really not trying to deliberately complicate things , but "we" could also make the DB servers report all checksums to a group of separate dedicated checksum cache servers (which would replicate between each other like there's no tomorrow, to make damn sure that all checksum cache servers would ALWAYS have the identical set of checksums). One of these redundant checksum servers would then be checked (which should spread the load) and if there's so much as a delay with querying one of them, then the checksum query could fail-over (round robin) to the next checksum cache box. There prolly should also remain a last resort fail-over option of querying the DB server direct and doing away with the entire checksum thing (which, after all is only there to save time and shield the DB server from excessive queries).
And, this will work only with the most rudimentary database structure, like "SELECT * from specieslist where name='Foo'". If wikidata is to become more complex than that (and I don't say it should, just speculating), if wikidata tables can be interlinked, then there will be no "simple" dependency on a single data entry anymore.
Does that make sense to people? Or am I reinventing the wheel or something? I'm just brainstorming, I'm not even a real programmer. (Translation: The above may be--or may not be--rubbish.)
Definitely not rubbish. But a lot more complicated than it looks at first glance, IMHO.
Yea, that's probably true. All the worse as I won't be the one coding it, because I, uh, lack the requisite programming skills :-/
--ropers
Magnus
Magnus Manske wrote:
<data> <query database="wikispecies" result="r1">Some sort of XQuery or SQL query for wikispecies for {{{1}}}</query> Some species data table using <r1>latin_name</r1>, <r1>name_en</r1>, <r1>family</r1> etc. </data>
My approach would be to not use SQL, or anything similar. Use a custom syntax with a greatly restricted feature set. Think in terms of applications. Only allow queries which can be cached and invalidated. Fetching single rows would be a good place to start, that's all I would have implemented if I followed my WikiDB idea.
Cache invalidation or purging is the standard solution here. Make a list of every article which fetches a particular row, and update it on edit. Then when the row changes, invalidate all the articles in the list. Make a list of every article which contains a list of species in the Foobus family. Invalidate all articles in the list every time a species is added or removed from that family.
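Roughly, in Python (the map layout and function names are only meant to illustrate the idea, not a worked-out design):

from collections import defaultdict

row_dependents = defaultdict(set)      # (table, key) -> articles fetching that row
family_dependents = defaultdict(set)   # family name  -> articles listing that family

def register_article(title, fetched_rows, listed_families):
    # Rebuilt whenever the article is edited/saved.
    for row in fetched_rows:
        row_dependents[row].add(title)
    for family in listed_families:
        family_dependents[family].add(title)

def articles_to_invalidate_on_row_change(row):
    # A single data row was edited: purge only the articles that fetch it.
    return row_dependents.get(row, set())

def articles_to_invalidate_on_family_change(family):
    # A species was added to or removed from the family: purge every
    # article that lists that family.
    return family_dependents.get(family, set())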
It's disappointing to give up on some of the dream, but at some stage of the development process, you have to be realistic. My advice would be to set a short term goal (a few months or so), code something useful, admire your work, then go from there.
-- Tim Starling
Tim Starling wrote:
Magnus Manske wrote:
<data> <query database="wikispecies" result="r1">Some sort of XQuery or SQL query for wikispecies for {{{1}}}</query> Some species data table using <r1>latin_name</r1>, <r1>name_en</r1>, <r1>family</r1> etc. </data>
My approach would be to not use SQL, or anything similar. Use a custom syntax with a greatly restricted feature set. Think in terms of applications. Only allow queries which can be cached and invalidated. Fetching single rows would be a good place to start, that's all I would have implemented if I followed my WikiDB idea.
Reducing the possible queries would simplify things. But, I was under the impression that more is demanded from a WikiData system, and we should probably get this right from the start. But, I am not insisting on SQL or anything. It just seemed the natural choice for me, besides XQueries.
Cache invalidation or purging is the standard solution here. Make a list of every article which fetches a particular row, and update it on edit. Then when the row changes, invalidate all the articles in the list.
No problem here.
Make a list of every article which contains a list of species in the Foobus family. Invalidate all articles in the list every time a species is added or removed from that family.
Now *that* requires that we:
- keep the original query for each article, and its results
- rerun that query every time any entry has changed or was added, and compare it to the original results
Also, that works only if we have, for example, all the species data in one table. Like, kingdom, phylum, class, order, family, genus. If we, instead, decide to have one table for species which contains only the genus, then another table for the genus which, apart from information about the genus in general, contains the order, etc., then this will become a problem. *Theoretically*, an order could be moved from one subclass to another. Now all species in a genus in a family in a suborder in that order needs to be updated. Good luck with that.
It's disappointing to give up on some of the dream, but at some stage of the development process, you have to be realistic. My advice would be to set a short term goal (a few months or so), code something useful, admire your work, then go from there.
If there were consensus to limit WikiData to only the most simple queries ("... WHERE name='Foobus'"), and to give up on instant updates and just clear the cache once in a while to update data in articles, something could be done. Otherwise, I'll steer clear of this one, unless it turns out there's something obvious I missed.
Magnus
Magnus Manske wrote: <snip stop depressing!>
As good "wiki-fiddlers" (thanks so much, Register!) we would like to see every change in WikiData on the wikipedia pages real soon. Like, now. So the information that something changes, and what changed, has to pass from the data site to the display site. There are two ways to do that: push or pull.
PUSH means the data site will notify the display site that something has changed, and the display needs to be updated. For that, the data site has to know which pages of the display site are affected by which change. Then, it has to notify the display site of this. Bad things:
- Needs basically a cache of *all* queries *ever* asked of the data
site, as well as their results
- Has to recalculate *all* of these after *every* change to find which
queries produce different results
- Won't work if the display site is offline
- Won't work well with non-wikipedias
That can't be it.
<snip PULL>
Hello,
I would personally PUSH data from the wikidata site to the content publishers (like wikipedia).
A lot of blog systems have a feature known as trackback. When someone publishes an article which contains references to other blogs, their blog system will send a ping (known as an XML-RPC ping) to the referenced blogs, alerting them that their news got reused somewhere.
Simple example: the blog slashdot publishes a news item about NASA discovering martians.
MartianFan001, who is part of a "Life on Mars foundation", decides to publish a news item about it and references slashdot.
JohnDoe, who likes things about Mars, decides to publish a news item on his personal blog, and his article is something like:
<<The Mars foundation [http://marsfoundation/newsid/113] reports news originally posted by [http://slashdot/?newsid=123912 slashdot] about life on Mars!>>
He submits that news to his blog engine, which parses the links and tries to send pings to marsfoundation and slashdot saying: johndoe.com/newsid=5 references your article!
When receiving this ping, the marsfoundation and slashdot blogs can update their trackback lists:
slashdot news #123912 referenced by: "GeekHideout", "Nerds.com", "Mars foundation"
Marsfoundation news #113 referenced by: "JohnDoe"
So when a site wants to use wikidata, it sends a query to the wikidata server together with its internal reference (e.g. the name of the wikipedia article and its language). Wikidata then sends back the requested data along with the wikidata internal reference.
When a wikidata entry is changed, the site sends a ping with the update to every site referencing that set of data. From there, the site using the data will answer wikidata with a code: 1/ data change acknowledged; 2/ no more need for this data, remove me; 3/ doesn't answer.
If it doesn't answer, there could be a system that queues the ping so it can be sent later (and eventually drops it after x days).
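A rough sketch of the ping side (Python; send_ping stands in for an XML-RPC call, and all the names are only illustration):

ACK_CHANGE = 1      # 1/ data change acknowledged
ACK_REMOVE_ME = 2   # 2/ no more need for this data, remove me
# 3/ doesn't answer: the call times out and the ping is queued for retry

def push_update(reference, subscribers, retry_queue, send_ping):
    still_interested = []
    for site in subscribers:
        try:
            code = send_ping(site, reference)
        except TimeoutError:
            retry_queue.append((site, reference))   # send later, drop after x days
            still_interested.append(site)
            continue
        if code == ACK_CHANGE:
            still_interested.append(site)
        # ACK_REMOVE_ME: drop the site from the subscriber list
    return still_interested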
I believe the PULL method will generate too much traffic for data which is probably not going to change between each view. Data about species is probably much more stable than NASDAQ stocks.
cheers,
Ashar Voultoiz wrote: <snip attempt to rescue the thing>
So when a site wants to use wikidata, it sends a query to the wikidata server together with its internal reference (e.g. the name of the wikipedia article and its language). Wikidata then sends back the requested data along with the wikidata internal reference.
When a wikidata entry is changed, the site sends a ping with the update to every site referencing that set of data. From there, the site using the data will answer wikidata with a code: 1/ data change acknowledged; 2/ no more need for this data, remove me; 3/ doesn't answer.
If it doesn't answer, there could be a system that queues the ping so it can be sent later (and eventually drops it after x days).
That will work nicely, if we restrict WikiData access to "show me that specific row from that specific table in that specific database". Which is fine for "Show me data on that species".
But as soon as we allow queries to return lists (e.g., "show me all species of that family"), we cannot do that anymore. Suppose someone adds a species to WikiData. How can we know that a wikipedia page needs to be updated?
Only one way to do that:
- Store the original query, the wikipedia page for that query, and its results
- On changing any WikiData, rerun *all* these queries, compare their results to the stored ones, and notify wikipedias if necessary
Rerunning a million queries for each data change will dwarf the possible traffic generated from pull (pull isn't really better either; that's the dilemma).
Also, pushing will require extensive infrastructure on the recipient's site, which is not necessarily a wikimedia project (the data should be available to everyone).
Magnus