On 23/04/12 18:42, Daniel Kinzler wrote:
On 23.04.2012 17:28, Platonides wrote:
On 23/04/12 14:45, Daniel Kinzler wrote:
- if we only update language links, the page doesn't even need to be re-parsed: we just update the languagelinks in the cached ParserOutput object.
It's not that simple; for instance, there may be several ParserOutputs for the same page. On the bright side, you probably don't need it: I'd expect that if interwikis are handled through Wikidata, they are completely replaced through a hook, so there's no need to touch the ParserOutput objects.
I would go that way if we were just talking about language links. But we have to provide for phase II (infoboxes) and III (automated lists) too. Since we'll have to re-parse in most cases anyway (and parsing pages without infoboxes tends to be cheap), I see no benefit in spending time on inventing a way to bypass parsing. It's tempting, granted, but it seems a distraction atm.
Sure, but in those cases you need to reparse the full page, so there's no need for tricks modifying the ParserOutput. :) If you want to skip the reparse for interwikis, fine, but just use a hook.
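Something along these lines could do it (completely untested; WikidataLinkCache is a made-up placeholder for whatever client-side store ends up holding the sitelinks, and the hook and OutputPage methods would need checking against the version of core we target):

<?php
// Sketch: inject the Wikidata language links at output time via a hook,
// so the cached ParserOutput never needs to be modified or re-parsed.
// WikidataLinkCache is hypothetical.

$wgHooks['OutputPageParserOutput'][] = 'wfWikidataAddLanguageLinks';

function wfWikidataAddLanguageLinks( &$out, $parserOutput ) {
	$title = $out->getTitle();

	// Look up the sitelinks for this page in whatever local store the
	// repo replicates/pushes them into.
	$links = WikidataLinkCache::getLanguageLinks( $title );

	if ( $links !== null ) {
		// e.g. array( 'de:Berlin', 'fr:Berlin', ... )
		$out->addLanguageLinks( $links );
	}

	return true; // let other handlers run
}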
I think a save/purge should always fetch the data. We can't store the copy in the parsed object.
Well, for language links, we already do, and will probably keep doing it. Other data, which will be used in the page content, shouldn't be stored in the parser output; the parser should take it from some cache.
The ParserOutput is a parsed representation of the wikitext. The cached Wikidata interwikis shouldn't be stored there (or at least not only there, in which case it would just hold the interwikis as they were at the last full render).
What we can do is fetch them from a local cache or directly from the origin one.
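For instance, something like this (the cache key scheme, the repo URL and the JSON format are all made up, and the HTTP call could just as well be a direct read from a replicated table):

<?php
// Sketch: "fetch from a local cache, else from the origin repository".

function wfWikidataFetchItem( $itemId ) {
	global $wgMemc;

	$key = wfMemcKey( 'wikidata', 'item', $itemId );
	$data = $wgMemc->get( $key );
	if ( $data !== false ) {
		return $data; // local cache hit
	}

	// Cache miss: ask the origin repository (hypothetical URL; could
	// equally be a cross-wiki database read).
	$json = Http::get( "https://wikidata.example.org/item/$itemId" );
	if ( $json === false ) {
		return null;
	}

	$data = FormatJson::decode( $json, true );
	$wgMemc->set( $key, $data, 3600 ); // keep it for an hour
	return $data;
}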
Indeed. Local or remote, DB directly or HTTP... we can have FileRepo-like plugins for that, sure. But:
The real question is how purging and updating will work. Pushing? Polling? Purge-and-pull?
You mention the cache for the push model, but I think it deserves a clearer separation.
Can you explain what you have in mind?
I mean, they are based on the same concept. What really matters is how things reach the db. I'd have the WikiData db replicated to {{places}}. For WMF, all wikis could connect directly to the main instance, have a slave "assigned" to each cluster... Then on each page render, the variables used could be checked against the latest version (unless they were checked in the last x minutes), triggering a re-render if they differ.
So, suppose a page uses the fact Germany{capital:"Berlin";language:"German"}; it would store that along with the version of WikiData used (e.g. WikiData 2.0, Germany 488584364).
When going to show it, it would check:
1) Is the latest WikiData version newer than 2.0? (No -> go to 5.)
2) Is the Germany module newer than 488584364? (No -> store that it's up to date with WikiData 3, go to 5.)
3) Fetch the Germany data. If the data actually used hasn't changed, update the metadata. Go to 5.
4) Re-render the page.
5) Show the contents.
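In code it would look roughly like this (WikidataPageMeta and WikidataRepo are invented helpers standing in for wherever the per-page metadata and the replicated repo data end up living):

<?php
// Sketch of the per-render freshness check from steps 1-5 above.

function wfWikidataNeedsRerender( Title $title ) {
	// Metadata stored at the last full render, e.g.
	// array( 'repoVersion' => '2.0',
	//        'items'       => array( 'Germany' => 488584364 ) )
	$used = WikidataPageMeta::getUsedVersions( $title );

	// 1) Is the latest WikiData version newer than the one we used?
	$latest = WikidataRepo::getLatestVersion();
	if ( version_compare( $latest, $used['repoVersion'], '<=' ) ) {
		return false; // 5) show the cached contents
	}

	foreach ( $used['items'] as $item => $usedRevision ) {
		// 2) Is this item newer than the revision we used?
		if ( WikidataRepo::getItemRevision( $item ) <= $usedRevision ) {
			continue;
		}
		// 3) Fetch the item; did any of the facts this page used change?
		$data = WikidataRepo::getItem( $item );
		if ( WikidataPageMeta::usedFactsChanged( $title, $item, $data ) ) {
			return true; // 4) re-render, then 5) show the fresh contents
		}
	}

	// Nothing this page used has changed: just record that it is up to
	// date with the latest repo version, then 5) show the cached contents.
	WikidataPageMeta::markUpToDate( $title, $latest );
	return false;
}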
As for actively purging the pages' content, that's interesting only for anons. You'd need a script able to replay a purge for a range of WikiData changes. That would basically perform the above steps, but do the rendering through the job queue. A normal wiki would call those functions while replicating, but wikis with a shared db (or ones receiving full files with newer data) would run it standalone (plus as a utility for screw-ups).
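Roughly (again untested; the change-feed accessor and the "pages using item X" lookup are invented, and this uses the current Job::batchInsert() API, which may well change):

<?php
// Sketch: purge/re-render all pages affected by a range of WikiData
// changes, pushed through the job queue.

$wgJobClasses['wikidataRerender'] = 'WikidataRerenderJob';

class WikidataRerenderJob extends Job {
	public function __construct( Title $title, $params ) {
		parent::__construct( 'wikidataRerender', $title, $params );
	}

	public function run() {
		// Reuse the freshness check from the earlier sketch; only purge
		// if something the page actually used has changed.
		if ( wfWikidataNeedsRerender( $this->title ) ) {
			WikiPage::factory( $this->title )->doPurge();
		}
		return true;
	}
}

/**
 * Queue re-render jobs for all local pages affected by WikiData changes
 * in the range [$fromChangeId, $toChangeId].
 */
function wfWikidataPurgeChangeRange( $fromChangeId, $toChangeId ) {
	$jobs = array();
	foreach ( WikidataRepo::getChanges( $fromChangeId, $toChangeId ) as $change ) {
		foreach ( WikidataPageMeta::getPagesUsing( $change['item'] ) as $title ) {
			$jobs[] = new WikidataRerenderJob( $title, array( 'change' => $change['id'] ) );
		}
	}
	Job::batchInsert( $jobs );
}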
You'd probably also want multiple dbs (let's call them WikiData repositories), partitioned by content (and its update frequency). You could then use different frontends (as Chad says, "similar to FileRepo"). So, a WikiData repository with the atomic properties of each element would happily live in a dba file, while interwikis would have to be on a MySQL db, etc.
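Roughly like this (class names, the table layout and the dba handler are all invented; WikidataRepo from the earlier sketches would just be a thin wrapper picking one of these backends per repository):

<?php
// Sketch of FileRepo-style backends: the client asks a repository object
// for an item and doesn't care whether the data sits in a dba file
// dropped onto the server or in a replicated MySQL table.

abstract class WikidataRepoBase {
	/** Fetch the stored record for one item, or null if unknown. */
	abstract public function getItem( $itemId );
}

// Slow-changing data (e.g. atomic properties of the elements) shipped
// as a dba file.
class DbaWikidataRepo extends WikidataRepoBase {
	private $path;

	public function __construct( $path ) {
		$this->path = $path;
	}

	public function getItem( $itemId ) {
		$handle = dba_open( $this->path, 'r', 'db4' ); // handler depends on what's compiled in
		$value = dba_fetch( (string)$itemId, $handle );
		dba_close( $handle );
		return $value === false ? null : FormatJson::decode( $value, true );
	}
}

// Fast-changing data (e.g. interwikis) replicated into a MySQL table.
class DbWikidataRepo extends WikidataRepoBase {
	public function getItem( $itemId ) {
		$dbr = wfGetDB( DB_SLAVE );
		$row = $dbr->selectRow(
			'wikidata_items',            // hypothetical table
			array( 'item_data' ),
			array( 'item_id' => $itemId ),
			__METHOD__
		);
		return $row ? FormatJson::decode( $row->item_data, true ) : null;
	}
}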
This is what I was aiming at with the DataTransclusion extension a while back.
But currently, we are not building a tool for including arbitrary data sources in wikipedia. We are building a central database for maintaining factual information. Our main objective is to get that done.
Not arbitrary, but having different sources (repositories), even if they are under the control of the same entity. Mostly interesting for separating slow- from fast-changing data, although I'm sure reusers would find more use cases, such as only downloading the db for the section they care about.
A design that is flexible enough to easily allow for future inclusion of other data sources would be nice. As long as the abstraction doesn't get in the way.
Anyway, it seems that it boils down to this:
1) The client needs some (abstracted?) way to access the repository/repositories.
2) The repo needs to be able to notify the client sites about changes, be it via push, purge, or polling.
3) We'll need a local cache or cross-site database access.
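To make the question concrete, the choices could end up as client configuration along these lines (every setting name here is invented, purely to illustrate the decision points):

<?php
// Hypothetical client-side settings covering the three points above:
// repo access, change propagation, and the local cache.

$wgWikidataClientSettings = array(
	// Repository access: 'db' (shared/replicated database) or 'http' (API).
	'repoAccess'   => 'db',
	'repoDatabase' => 'wikidatawiki',
	'repoApiUrl'   => null,

	// Change propagation: 'push', 'purge' (purge-and-pull) or 'poll'.
	'propagation'  => 'poll',
	'pollInterval' => 300,            // seconds between polls of the change feed

	// Local cache for fetched items.
	'cacheType'    => CACHE_MEMCACHED,
	'cacheExpiry'  => 3600,
);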
So, which combination of these techniques would you prefer?
-- daniel
I'd use a pull-based model. That seems to be what fits best with the current MediaWiki model. But it isn't too relevant at this point (or you may have advanced a lot by now!).