Re: [Wikitech-l] WikiData: A rant

21 Oct 2004

On 22 Oct 2004, at 00:32, Jens Ropers wrote:

...

 On 21 Oct 2004, at 23:31, Magnus Manske wrote:

  Jens Ropers wrote:

  Anyway, just musing and mulling:
 Could the PULL method be implemented w/ a checksum?
 Say, generate a fairly short checksum (we're talking versioning 
 here, not security) for every article revision.
 Then, with each request hitting an Intarweb-facing (caching) 
 webserver, have that cache box look if there's a version stored in 
 its cache for said article (if not, fetch the article from the 
 actual DB, etc.). IF however there is a cached version, ask the DB 
 server for its current checksum on its current version. If this 
 matches the checksum the cache has for its version, just don't 
 bother the DB any further and serve the page from the cache. If the 
 checksums differ, then again fetch the article from the DB and serve 
 that (and cache the new article and checksum for potential 
 subsequent requests). This entire checksum thing will NOT be 
 required for any cached non-current revisions, because they won't 
 change. So, yes, for each request hitting the cache server, there'd 
 be a short checksum PULL with the actual DB server, but other than 
 that (and provided the article hasn't changed) it can just be served 
 from the cache. 
 So, the DB server keeps a list with a checksum (or a version number; 
 this is supposed to be wiki-like) for each data entry, and likewise 
 does the article, right? 
 Yup. 
To add:

We shouldn't, however, confuse these "checksum version numbers" with 
the existing Wikipedia "article revision version numbers": past 
revisions never change, and the entire point of this "checksum version 
number" system is to determine whether the CURRENT version of the 
article has changed without bothering the actual DB server.

On second thought, once there are version numbers for CURRENT articles 
(see bug 181 -- http://bugzilla.wikipedia.org/show_bug.cgi?id=181), 
then these could/should be used as our checksums: The Internet-facing 
cache server would check if its article version number matches the 
version number the DB presently knows as being the CURRENT one.

I hope this makes sense.
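
As a rough illustration of the PULL idea above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: FakeDB, CacheServer and short_checksum are hypothetical names, CRC32 is just one possible "fairly short" checksum, and none of this reflects how MediaWiki actually does things. A current-version number could be swapped in for the checksum without changing the flow.

```python
import zlib
from dataclasses import dataclass, field


def short_checksum(text: str) -> str:
    """Short, non-cryptographic checksum (versioning, not security)."""
    return format(zlib.crc32(text.encode("utf-8")), "08x")


@dataclass
class FakeDB:
    """Hypothetical stand-in for the actual DB server."""
    articles: dict = field(default_factory=dict)
    full_fetches: int = 0  # counts the expensive full-article reads

    def save(self, title: str, text: str) -> None:
        self.articles[title] = text

    def current_checksum(self, title: str) -> str:
        # The cheap PULL: only the checksum of the CURRENT version.
        return short_checksum(self.articles[title])

    def fetch(self, title: str):
        # The expensive path: full article text plus its checksum.
        self.full_fetches += 1
        text = self.articles[title]
        return text, short_checksum(text)


class CacheServer:
    """Hypothetical Internet-facing cache box."""

    def __init__(self, db: FakeDB):
        self.db = db
        self.store = {}  # title -> (text, checksum)

    def serve(self, title: str) -> str:
        cached = self.store.get(title)
        if cached is not None:
            text, checksum = cached
            # Cached copy exists: ask the DB only for its current checksum.
            if checksum == self.db.current_checksum(title):
                return text  # unchanged, serve from the cache
        # Cache miss or stale copy: fetch, then cache article and checksum.
        text, checksum = self.db.fetch(title)
        self.store[title] = (text, checksum)
        return text
```

With this sketch, two consecutive serve() calls for an unchanged article cost one full fetch plus one checksum query; a further full fetch only happens after the article is saved again.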

...

  What if there is more than a single data entry in that article? Like 
 the list of species I mentioned. Say a new species was added at 
 wikidata; how to handle that one?
 What if there are multiple queries in one article?
 What if (in my example) the actual query is in a template?
 What if that template includes other templates that contain queries? 
 I wouldn't have a clue to be honest. I would, in my non-coder and 
 possibly naive imagination reckon that maybe a "one article-one 
 checksum" principle should work with most pages. As regards Wikidata 
 et alia, well, I dunno -- counting templates (and possibly single data 
 sets; but I don't really know a lot about what you're building/you've 
 built there) as articles may work as well. Then again, just having 
 (only) ordinary articles intelligently cached as per the above 
 proposal might solve the biggest part of our problem.

 Yes, I think that it could be done. But, and I say that as someone 
 who started programming with "spaghetti code", it looks like a mess 
 to me. A dependency nightmare. We are already suffering from such 
 effects (think categories in templates) without wikidata to look out 
 for.
 Also, you will have to query the DB server and wait for its answer on 
 *every* page view, including cached/anons, to deliver the 
 checksum(s). 
 True.
 I'm really not trying to deliberately complicate things, but "we" 
 could also make the DB servers report all checksums to a group of 
 separate dedicated checksum cache servers (which would replicate 
 between each other like there's no tomorrow, to make damn sure that 
 all checksum cache servers would ALWAYS have the identical set of 
 checksums). One of these redundant checksum servers would then be 
 checked (which should spread the load) and if there's so much as a 
 delay with querying one of them, then the checksum query could 
 fail-over (round robin) to the next checksum cache box. There prolly 
 should also remain a last resort fail-over option of querying the DB 
 server direct and doing away with the entire checksum thing (which, 
 after all is only there to save time and shield the DB server from 
 excessive queries).
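
The fail-over idea above might look roughly like the following Python sketch. All names here (ChecksumPool, ChecksumServerDown, db_lookup) are hypothetical, each checksum server is modelled as a plain callable, and a raised exception stands in for "down or too slow"; real timeout handling and the replication between the checksum boxes are left out.

```python
class ChecksumServerDown(Exception):
    """Raised by a (hypothetical) checksum server that is down or too slow."""


class ChecksumPool:
    """Round-robin over checksum cache servers, with the DB as last resort."""

    def __init__(self, servers, db_lookup):
        self.servers = list(servers)  # callables: title -> checksum
        self.db_lookup = db_lookup    # callable: title -> checksum
        self._next = 0                # index of the server to try first

    def lookup(self, title):
        n = len(self.servers)
        start = self._next
        # Rotate the starting point so load spreads across the pool.
        self._next = (self._next + 1) % n if n else 0
        for i in range(n):
            server = self.servers[(start + i) % n]
            try:
                return server(title)
            except ChecksumServerDown:
                continue  # fail over to the next checksum cache box
        # Every checksum server failed: query the DB server directly.
        return self.db_lookup(title)
```

Each lookup tries every checksum server at most once, so a fully dead pool costs one pass over the list before the DB is bothered.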

  And, this will work only with the most rudimentary database 
 structure, like "SELECT * from specieslist where name='Foo'". If 
 wikidata is to become more complex than that (and I don't say it 
 should, just speculating), if wikidata tables can be interlinked, 
 then there will be no "simple" dependency on a single data entry 
 anymore.

  Does that make sense to people?
 Or am I reinventing the wheel or something?
 I'm just brainstorming, I'm not even a real programmer. 
 (Translation: The above may be--or may not be--rubbish.) 
 Definitely not rubbish. But a lot more complicated than it looks at 
 first glance, IMHO. 
 Yea, that's probably true. All the worse as I won't be the one coding 
 it, because I, uh, lack the requisite programming skills :-/

 --ropers

  Magnus
 _______________________________________________
 Wikitech-l mailing list
 Wikitech-l(a)wikimedia.org
 http://mail.wikipedia.org/mailman/listinfo/wikitech-l

