On 23.04.2012 17:28, Platonides wrote:
> On 23/04/12 14:45, Daniel Kinzler wrote:
>> *#* if we only update language links, the page doesn't even need to be re-parsed: we just update the languagelinks in the cached ParserOutput object.
> It's not that simple; for instance, there may be several ParserOutput objects for the same page. On the bright side, you probably don't need it. I'd expect that if interwikis are handled through Wikidata, they are completely replaced through a hook, so there is no need to touch the ParserOutput objects.
I would go that way if we were just talking about language links. But we have to provide for phase II (infoboxes) and phase III (automated lists) too. Since we'll have to re-parse in most cases anyway (and parsing pages without infoboxes tends to be cheap), I see no benefit in spending time on inventing a way to bypass parsing. It's tempting, granted, but it seems a distraction atm.
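Just to make the trade-off concrete, this is roughly how I picture the two options on the client side. Pure sketch, all names (RepoChange, reparsePage, patchLanguageLinks) are made up, nothing of this exists:

<?php
// Sketch only -- RepoChange, reparsePage() and patchLanguageLinks() are
// invented names for illustration, not existing MediaWiki or Wikibase code.

class RepoChange {
    public $itemId;
    public $onlySitelinksChanged; // true if only the language links changed
    public function __construct( $itemId, $onlySitelinksChanged ) {
        $this->itemId = $itemId;
        $this->onlySitelinksChanged = $onlySitelinksChanged;
    }
}

// The simple path: treat every change the same and re-parse.
function handleChangeSimple( RepoChange $change, array $affectedPages ) {
    foreach ( $affectedPages as $page ) {
        reparsePage( $page );
    }
}

// The optimized path: patch the cached language links when nothing else changed.
function handleChangeWithFastPath( RepoChange $change, array $affectedPages ) {
    foreach ( $affectedPages as $page ) {
        if ( $change->onlySitelinksChanged ) {
            patchLanguageLinks( $page, $change->itemId );
        } else {
            reparsePage( $page );
        }
    }
}

// Stubs so the sketch runs stand-alone.
function reparsePage( $page ) { echo "re-parsing $page\n"; }
function patchLanguageLinks( $page, $itemId ) { echo "patching language links of $page for $itemId\n"; }

handleChangeSimple( new RepoChange( 'q64', true ), array( 'Berlin' ) );
handleChangeWithFastPath( new RepoChange( 'q64', true ), array( 'Berlin' ) );

The second variant is the fast path you describe; the point is that it can be bolted on later without changing anything structural.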
>> *# invalidate the (parser) cache for all pages that use the respective item (for now we can assume that we know this from the language links)
> And in that case, you don't need to invalidate the parser cache. Only if factual data was embedded into the page.
Which will be a very frequent case in the next phase: most infoboxes will (at some point) work like that.
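To be concrete about the fan-out I have in mind (again just a sketch: ItemUsageIndex and invalidateParserCache() are invented names, and the array stands in for whatever table ends up mapping repo items to the local pages that use them):

<?php
// Sketch of the invalidation fan-out; all names here are invented.

class ItemUsageIndex {
    private $usage;
    public function __construct( array $usage ) {
        $this->usage = $usage;
    }
    /** @return string[] titles of local pages that use the given item */
    public function getPagesUsing( $itemId ) {
        return isset( $this->usage[$itemId] ) ? $this->usage[$itemId] : array();
    }
}

function invalidateParserCache( $title ) {
    // In MediaWiki terms this boils down to bumping page_touched so the
    // cached ParserOutput is no longer considered fresh; here we just log.
    echo "invalidating parser cache for $title\n";
}

function onItemChanged( $itemId, ItemUsageIndex $index ) {
    foreach ( $index->getPagesUsing( $itemId ) as $title ) {
        invalidateParserCache( $title );
    }
}

// For phase I the usage index can simply be derived from the language links.
$index = new ItemUsageIndex( array( 'q64' => array( 'Berlin' ) ) );
onItemChanged( 'q64', $index );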
> I think a save/purge shall always fetch the data. We can't store the copy in the parsed object.
Well, for language links we already do, and will probably keep doing so. Other data that is used in the page content shouldn't be stored in the parser output; the parser should take it from some cache.
> What we can do is fetch it from a local cache or directly from the origin one.
Indeed. Local or remote, DB directly or HTTP... we can have FileRepo-like plugins for that, sure. But:
>> The real question is how purging and updating will work. Pushing? Polling? Purge-and-pull?
> You mention the cache for the push model, but I think it deserves a clearer separation.
Can you explain what you have in mind?
> You'd probably also want multiple DBs (let's call them WikiData repositories), partitioned by content (and its update frequency). You could then use different frontends (as Chad says, "similar to FileRepo"). So, a WikiData repository with the atomic properties of each element would happily live in a DBA file. Interwikis would have to be in a MySQL DB, etc.
This is what I was aiming at with the DataTransclusion extension a while back.
But currently, we are not building a tool for including arbitrary data sources in Wikipedia. We are building a central database for maintaining factual information. Our main objective is to get that done.
A design that is flexible enough to easily allow for future inclusion of other data sources would be nice. As long as the abstraction doesn't get in the way.
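To illustrate what I mean by "FileRepo-like", something along these lines: an interface that the parser-side code talks to, interchangeable backends, and a cache-aside wrapper around whichever backend is used. All names are invented and the URL is a placeholder; nothing of this exists yet.

<?php
// Sketch of a FileRepo-like access layer; WikidataRepoAccess,
// HttpRepoAccess and CachingRepoAccess are invented names.

interface WikidataRepoAccess {
    /** @return array|null item data, or null if the item is unknown */
    public function getItem( $itemId );
}

// Backend that goes to the origin over HTTP; a sibling class could do a
// direct cross-wiki DB query instead.
class HttpRepoAccess implements WikidataRepoAccess {
    private $baseUrl;
    public function __construct( $baseUrl ) {
        $this->baseUrl = $baseUrl;
    }
    public function getItem( $itemId ) {
        $json = file_get_contents( $this->baseUrl . '?id=' . urlencode( $itemId ) );
        return $json === false ? null : json_decode( $json, true );
    }
}

// Cache-aside wrapper: check a local cache first, fall back to the origin.
class CachingRepoAccess implements WikidataRepoAccess {
    private $backend;
    private $cache = array();
    public function __construct( WikidataRepoAccess $backend ) {
        $this->backend = $backend;
    }
    public function getItem( $itemId ) {
        if ( !array_key_exists( $itemId, $this->cache ) ) {
            $this->cache[$itemId] = $this->backend->getItem( $itemId );
        }
        return $this->cache[$itemId];
    }
    public function purge( $itemId ) {
        unset( $this->cache[$itemId] );
    }
}

// Usage would look something like:
// $repo = new CachingRepoAccess( new HttpRepoAccess( 'http://wikidata.example/api.php' ) );
// $item = $repo->getItem( 'q64' );

The purge() method on the wrapper is where a push or purge notification from the repo would land.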
Anyway, it seems that it boils down to this:
1) The client needs some (abstracted?) way to access the repository/repositories.
2) The repo needs to be able to notify the client sites about changes, be it via push, purge, or polling (see the sketch below).
3) We'll need a local cache or cross-site database access.
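For point 2, the three flavours would look roughly like this on the client side (again, everything here is made up for illustration, and the $localCache array stands in for whatever cache the client actually keeps):

<?php
// Sketch of the three notification flavours; all names are invented.

// Push: the repo sends the new data itself; the client stores it and can
// re-parse the affected pages right away.
function onPush( $itemId, array $newData, array $localCache ) {
    $localCache[$itemId] = $newData;
    return $localCache;
}

// Purge-and-pull: the repo only says "item X changed"; the client drops
// its copy and re-fetches the next time the data is actually needed.
function onPurge( $itemId, array $localCache ) {
    unset( $localCache[$itemId] );
    return $localCache;
}

// Polling: the client periodically asks the repo which items changed
// since a given timestamp and treats each one like a purge (or a push).
function pollChanges( $since, $fetchChangedIds, array $localCache ) {
    foreach ( call_user_func( $fetchChangedIds, $since ) as $itemId ) {
        $localCache = onPurge( $itemId, $localCache );
    }
    return $localCache;
}

// Tiny demo with a fake change feed.
$cache = array( 'q64' => array( 'label' => 'Berlin' ) );
$feed = function ( $since ) { return array( 'q64' ); };
$cache = pollChanges( '20120423000000', $feed, $cache );
var_dump( isset( $cache['q64'] ) ); // bool(false)

Push means the repo has to know all clients and retry on failure; purge and polling keep the repo simpler, but add latency before the client sees the change.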
So, which combination of these techniques would you prefer?
-- daniel