Good acumen, Magnus. A very incisive "rant".
Anyway, just musing and mulling:
Could the PULL method be implemented w/ a checksum?
Say, generate a fairly short checksum (we're talking versioning here,
not security) for every article revision.
Then, with each request hitting an Intarweb-facing (caching) webserver,
have that cache box check whether there's a version stored in its cache
for said article (if not, fetch the article from the actual DB, etc.).
If, however, there IS a cached version, ask the DB server for the
checksum of its current version. If this matches the checksum the cache
has for its version, just don't bother the DB any further and serve the
page from the cache. If the checksums differ, then again fetch the
article from the DB and serve that (and cache the new article and
checksum for potential subsequent requests). This entire checksum thing
will NOT be required for any cached non-current revisions, because they
won't change. So, yes, for each request hitting the cache server,
there'd be a short checksum PULL to the actual DB server, but other
than that (and provided the article hasn't changed) the page can just
be served from the cache.
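To make the idea concrete, here's a minimal Python sketch of that
checksum PULL. Everything here is hypothetical illustration: the
fetch_current_checksum() and fetch_article() helpers stand in for
whatever round trips the cache box would actually make to the DB
server, and the dicts stand in for the real DB and cache.

```python
import hashlib

db = {}      # stand-in for the actual DB server: title -> article text
cache = {}   # stand-in for the cache box: title -> (checksum, page)

def checksum(text):
    # Fairly short checksum -- we're talking versioning here, not
    # security, so a truncated digest is plenty.
    return hashlib.md5(text.encode("utf-8")).hexdigest()[:8]

def fetch_current_checksum(title):
    # Hypothetical: one short round trip to the DB server
    return checksum(db[title])

def fetch_article(title):
    # Hypothetical: the full fetch from the actual DB
    return db[title]

def serve(title):
    if title in cache:
        cached_sum, page = cache[title]
        # The short checksum PULL to the DB server...
        if fetch_current_checksum(title) == cached_sum:
            return page  # ...matches, so don't bother the DB further
    # Cache miss, or checksums differ: fetch, cache, serve
    text = fetch_article(title)
    cache[title] = (checksum(text), text)
    return text
```

So a repeat request for an unchanged article costs only the checksum
round trip, while an edit makes the very next request pull the new
version.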
Does that make sense to people?
Or am I reinventing the wheel or something?
I'm just brainstorming, I'm not even a real programmer. (Translation:
The above may be--or may not be--rubbish.)
-- ropers [[en:User:Ropers]]
www.ropersonline.com
On 21 Oct 2004, at 21:07, Magnus Manske wrote:
Warning: Long and depressing text follows. Don't
read it at home, save
it for work instead. Better spend a nice evening with your girlfriend.
(Then again, this list is probably like slashdot, so forget about the
imaginary girlfriend and continue reading ;-)
I thought I had it all figured out.
I created a demo version for data entry in a wiki-like fashion. It
uses a "one-table-fits-all" SQL schema, which some of you had worries
about. No problem. If someone else writes a better data entry
mechanism, I'm all for it. As far as I'm concerned, the WikiData site
should be like a black box to the outside, serving data to wikipedias
and everyone else who wants it. What's going on inside is only for
those who enter the data.
Today, I finished creating a rough draft for the query (the wikipedia)
side of the bargain. Instead of creating Yet Another Wikimarkup [{(like
this)}] I figured out that we should separate the query and the
display part, and hide the query part within the template system. Goes
like this:
{{speciesdata:Foobus Barus}}
in the article; [[Template:Speciesdata]] looks like this:
<data>
  <query database="wikispecies" result="r1">Some sort of XQuery or SQL
  query for wikispecies for {{{1}}}</query>
  Some species data table using <r1>latin_name</r1>, <r1>name_en</r1>,
  <r1>family</r1> etc.
</data>
For creating lists (like "all species within the family 'Foobus'"), a
<foreach> element could be used.
The <data> thingy would be a plugin ("plugins GOOD!"), but one that
returns wikitext to be parsed further. It would handle the <query> and
<foreach> tags etc.
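As a rough sketch of what such a plugin might do (hypothetical names
throughout -- run_query() stands in for whatever XQuery/SQL backend the
data site would expose, and the regexes are just illustration): run
each <query>, stash the result under its result name, then substitute
the <r1>field</r1>-style placeholders, returning plain wikitext to be
parsed further.

```python
import re

def run_query(database, query):
    # Stand-in for the real query against the data site; returns a
    # pretend wikispecies result row for the example.
    return {"latin_name": "Foobus barus",
            "name_en": "Common foob",
            "family": "Foobidae"}

def expand_data(template_body):
    results = {}

    def do_query(m):
        # Run the query, remember the result under its "result" name;
        # the <query> element itself produces no output.
        results[m.group("name")] = run_query(m.group("db"), m.group("q"))
        return ""

    body = re.sub(
        r'<query database="(?P<db>[^"]+)" result="(?P<name>\w+)">'
        r'(?P<q>.*?)</query>',
        do_query, template_body, flags=re.S)

    def do_field(m):
        # <r1>latin_name</r1> -> the latin_name field of result r1
        return str(results[m.group(1)][m.group(2)])

    return re.sub(r"<(\w+)>(\w+)</\1>", do_field, body).strip()
```

A <foreach> would be one more substitution pass in the same spirit,
repeating its body once per result row.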
So, we'd have *one* ugly m..........r of a <data><query> kind of
template which, once created, would rarely be edited again. All
the powerful, functional ugliness that could scare newbies away would
be hidden inside the template.
Yes, I got it all figured out.
Then it hit me.
As good "wiki-fiddlers" (thanks so much, Register!) we would like to
see every change in WikiData on the wikipedia pages real soon. Like,
now.
So the information that something changes, and what changed, has to
pass from the data site to the display site. There are two ways to do
that: push or pull.
PUSH means the data site will notify the display site that something
has changed, and the display needs to be updated. For that, the data
site has to know which pages of the display site are affected by which
change. Then, it has to notify the display site of this. Bad things:
* Needs basically a cache of *all* queries *ever* asked of the data
site, as well as their results
* Has to recalculate *all* of these after *every* change to find which
queries produce different results
* Won't work if the display site is offline
* Won't work well with non-wikipedias
That can't be it.
PULL means the display site asks the data site if anything has
changed, which basically means rerunning a query. Which means, doing
this for *every* pageview, even for anons. Which means, all caching
variants, including squids, are going bye-bye. Additionally, for every
page view, the display site has to wait for the data site to complete
the query.
Think wikipedia is slow today? Think again...
That can't be it, either.
Oh, sure, we can cache the queries with results on the display site,
or only update the data once a day/week, but then we won't be wiki
(=quick) anymore, no? Will this be the price to pay?
I think I'll have that autumn-depression now, please...
Magnus
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)wikimedia.org
http://mail.wikipedia.org/mailman/listinfo/wikitech-l