On 23.04.2012 17:28, Platonides wrote:
> On 23/04/12 14:45, Daniel Kinzler wrote:
>> *#* if we only update language links, the page doesn't even need to be
>> re-parsed: we just update the languagelinks in the cached ParserOutput object.
> It's not that simple; for instance, there may be several ParserOutputs
> for the same page. On the bright side, you probably don't need it. I'd
> expect that if interwikis are handled through Wikidata, they are
> completely replaced through a hook, so there's no need to touch the
> ParserOutput objects.
I would go that way if we were just talking about language links. But we have to
provide for phase II (infoboxes) and phase III (automated lists) too. Since we'll
have to re-parse in most cases anyway (and parsing pages without infoboxes tends
to be cheap), I see no benefit in spending time on inventing a way to bypass
parsing. It's tempting, granted, but it seems like a distraction at the moment.
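For illustration, the shortcut being declined here would look roughly like the
following. This is only a Python sketch with invented names; the real code would
live in MediaWiki's PHP parser-cache layer:

```python
# Illustrative sketch only: patching language links into a cached parser
# output instead of re-parsing. All class and function names are hypothetical.

class CachedParserOutput:
    def __init__(self, html, language_links):
        self.html = html                      # rendered page body
        self.language_links = language_links  # e.g. ["de:Berlin", "fr:Berlin"]

class ParserCache:
    def __init__(self):
        self._store = {}

    def get(self, page):
        return self._store.get(page)

    def put(self, page, output):
        self._store[page] = output

def update_language_links(cache, page, new_links):
    """Patch only the language links, skipping a full re-parse.

    This is the optimization under discussion: cheap for phase I, but it
    breaks down once item data is embedded in the page body (infoboxes),
    because then the cached HTML itself is stale and a real re-parse is
    unavoidable anyway.
    """
    cached = cache.get(page)
    if cached is None:
        return False   # nothing cached; the next page view parses anyway
    cached.language_links = list(new_links)
    cache.put(page, cached)
    return True
```

As the reply notes, this only pays off while language links are the sole data
coming from the repo, which is why it is set aside here.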
>> *# invalidate the (parser) cache for all pages that use the respective
>> item (for now we can assume that we know this from the language links)
> And in that case, you don't need to invalidate the parser cache. You only
> need to if factual data was embedded in the page.
Which will be a very frequent case in the next phase: most infoboxes will (at
some point) work like that.
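That phase-II case can be sketched as a reverse index from a repo item to the
client pages that embed its data, so that an item edit invalidates exactly those
pages. Again, a hedged Python sketch; none of these names are actual MediaWiki
APIs, and a real implementation would go through the job queue:

```python
# Sketch: item-to-page usage tracking plus targeted parser-cache
# invalidation. All names are illustrative.

from collections import defaultdict

class UsageIndex:
    def __init__(self):
        self._pages_by_item = defaultdict(set)

    def record_usage(self, item_id, page):
        # called during parsing, whenever a page embeds data from an item
        self._pages_by_item[item_id].add(page)

    def pages_using(self, item_id):
        return set(self._pages_by_item[item_id])

def invalidate_for_item(index, parser_cache, item_id):
    """Drop cached output for every page embedding data from item_id."""
    pages = index.pages_using(item_id)
    for page in pages:
        parser_cache.pop(page, None)   # plain dict as a stand-in cache
    return pages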
> I think a save/purge should always fetch the data. We can't store the
> copy in the parsed object.
Well, for language links we already do, and will probably keep doing it. Other
data that will be used in the page content shouldn't be stored in the parser
output; the parser should take it from some cache.
> What we can do is fetch it from a local cache, or directly from the
> origin one.
Indeed. Local or remote, DB directly or HTTP... we can have FileRepo-like
plugins for that, sure. But the real question is how purging and updating
will work: pushing? Polling? Purge-and-pull?
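The three propagation models being weighed can be contrasted in a short sketch.
This is not any proposed implementation, just an illustration with invented
names of how they differ on the client side:

```python
# Sketch of the three change-propagation models: push, purge-and-pull,
# and polling. All names here are hypothetical.

import time

class Client:
    def __init__(self):
        self.cache = {}        # item_id -> (value, fetched_at)

    # Push: the repo sends the new value itself; no later fetch is needed,
    # but the repo must know and reach every client.
    def on_push(self, item_id, value):
        self.cache[item_id] = (value, time.time())

    # Purge-and-pull: the repo only says "item X changed"; the client
    # drops its copy and re-fetches lazily on next use.
    def on_purge(self, item_id):
        self.cache.pop(item_id, None)

    # Polling: the client periodically asks the repo for changes since its
    # last check (the change feed is modeled as a list of (ts, item_id)).
    def poll(self, change_feed, since):
        for ts, item_id in change_feed:
            if ts > since:
                self.on_purge(item_id)

    def get(self, item_id, fetch):
        """Return the cached value, pulling from the repo on a miss."""
        if item_id not in self.cache:
            self.cache[item_id] = (fetch(item_id), time.time())
        return self.cache[item_id][0]
```

Note that purge-and-pull and polling converge on the same client behavior (drop
and re-fetch); they differ only in who initiates the notification.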
> You mention the cache for the push model, but I think it deserves a
> clearer separation.
Can you explain what you have in mind?
> You'd probably also want multiple DBs (let's call them WikiData
> repositories), partitioned by content (and its update frequency). You
> could then use different frontends (as Chad says, "similar to FileRepo").
> So, a WikiData repository with the atomic properties of each element
> would happily live in a dba file, interwikis would have to be in a
> MySQL DB, etc.
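A "FileRepo-like" plugin layer as suggested above might look something like
this. Purely a sketch under the assumption of one fetch interface with
swappable backends; every name is invented:

```python
# Sketch: one repository interface, multiple backends partitioned by
# content and update frequency, selected by configuration. Hypothetical.

from abc import ABC, abstractmethod

class WikidataRepo(ABC):
    @abstractmethod
    def fetch(self, item_id):
        """Return the raw data for one item, or None if unknown."""

class LocalDbaRepo(WikidataRepo):
    """Backend for rarely-changing data (stand-in for a local dba file)."""
    def __init__(self, records):
        self._records = dict(records)

    def fetch(self, item_id):
        return self._records.get(item_id)

class RemoteHttpRepo(WikidataRepo):
    """Backend for a remote repo reached over HTTP (transport stubbed)."""
    def __init__(self, transport):
        self._transport = transport    # callable(item_id) -> data

    def fetch(self, item_id):
        return self._transport(item_id)

def make_repo(config):
    """Pick a backend from configuration, FileRepo-style."""
    if config["type"] == "dba":
        return LocalDbaRepo(config["records"])
    if config["type"] == "http":
        return RemoteHttpRepo(config["transport"])
    raise ValueError("unknown repo type: %r" % config["type"])
```

The point of the abstraction is that client code only ever sees `fetch()`,
while the wiki's configuration decides which backend serves which data.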
This is what I was aiming at with the DataTransclusion extension a while back.
But currently we are not building a tool for including arbitrary data sources
in Wikipedia; we are building a central database for maintaining factual
information. Our main objective is to get that done.

A design that is flexible enough to easily allow for future inclusion of other
data sources would be nice, as long as the abstraction doesn't get in the way.
Anyway, it seems that it boils down to this:

1) The client needs some (abstracted?) way to access the repository/repositories.
2) The repo needs to be able to notify the client sites about changes, be it via
   push, or purge, or polling.
3) We'll need a local cache or cross-site database access.

So, which combination of these techniques would you prefer?
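For concreteness, the three requirements wire together roughly like this on a
client site (a minimal sketch, all names invented; "purge" stands in for
whichever notification mechanism is chosen):

```python
# Minimal wiring of the three requirements:
# 1) abstracted repo access, 2) change notification (purge model here),
# 3) a local cache in front of the repo. Names are hypothetical.

class ClientSite:
    def __init__(self, repo_fetch):
        self._repo_fetch = repo_fetch   # requirement 1: abstracted access
        self._local_cache = {}          # requirement 3: local cache

    def get_item(self, item_id):
        if item_id not in self._local_cache:
            self._local_cache[item_id] = self._repo_fetch(item_id)
        return self._local_cache[item_id]

    def notify_changed(self, item_id):
        # requirement 2: the repo tells us an item changed; drop our copy
        # and re-fetch lazily on next use
        self._local_cache.pop(item_id, None)
```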
-- daniel