On 23/04/12 18:42, Daniel Kinzler wrote:
On 23.04.2012 17:28, Platonides wrote:
On 23/04/12 14:45, Daniel Kinzler wrote:
- if we only update language links, the page doesn't even need to be re-parsed: we just update the languagelinks in the cached ParserOutput object.
It's not that simple; for instance, there may be several ParserOutputs for the same page. On the bright side, you probably don't need it: I'd expect that if interwikis are handled through Wikidata, they are completely replaced through a hook, so there's no need to touch the ParserOutput objects.
I would go that way if we were just talking about language links. But we have to provide for phase II (infoboxes) and III (automated lists) too. Since we'll have to re-parse in most cases anyway (and parsing pages without infoboxes tends to be cheap), I see no benefit in spending time on inventing a way to bypass parsing. It's tempting, granted, but it seems a distraction atm.
Sure, but in those cases you need to reparse the full page, so there's no need for tricks modifying the ParserOutput. :) If you want to skip the reparse for interwikis, fine, but just use a hook.
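Something along these lines could do it (completely untested; WikidataLinkCache is a made-up placeholder for whatever client-side store ends up holding the sitelinks, and the hook and OutputPage methods would need checking against the version of core we target):

<?php
// Sketch: inject the Wikidata language links at output time via a hook,
// so the cached ParserOutput never needs to be modified or re-parsed.
// WikidataLinkCache is hypothetical.

$wgHooks['OutputPageParserOutput'][] = 'wfWikidataAddLanguageLinks';

function wfWikidataAddLanguageLinks( &$out, $parserOutput ) {
	$title = $out->getTitle();

	// Look up the sitelinks for this page in whatever local store the
	// repo replicates/pushes them into.
	$links = WikidataLinkCache::getLanguageLinks( $title );

	if ( $links !== null ) {
		// e.g. array( 'de:Berlin', 'fr:Berlin', ... )
		$out->addLanguageLinks( $links );
	}

	return true; // let other handlers run
}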
I think a save/purge should always fetch the data. We can't store the copy in the parsed object.
Well, for language links, we already do, and will probably keep doing it. Other data, which will be used in the page content, shouldn't be stored in the parser output; the parser should take it from some cache.
The ParserOutput is a parsed representation of the wikitext. The cached Wikidata interwikis shouldn't be stored there (or at least not only there, in which case it would just hold the interwikis as they were at the last full render).
What we can do is fetch them from a local cache or directly from the origin one.
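For instance, something like this (the cache key scheme, the repo URL and the JSON format are all made up, and the HTTP call could just as well be a direct read from a replicated table):

<?php
// Sketch: "fetch from a local cache, else from the origin repository".

function wfWikidataFetchItem( $itemId ) {
	global $wgMemc;

	$key = wfMemcKey( 'wikidata', 'item', $itemId );
	$data = $wgMemc->get( $key );
	if ( $data !== false ) {
		return $data; // local cache hit
	}

	// Cache miss: ask the origin repository (hypothetical URL; could
	// equally be a cross-wiki database read).
	$json = Http::get( "https://wikidata.example.org/item/$itemId" );
	if ( $json === false ) {
		return null;
	}

	$data = FormatJson::decode( $json, true );
	$wgMemc->set( $key, $data, 3600 ); // keep it for an hour
	return $data;
}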
Indeed. Local or remote, DB directly or HTTP... we can have FileRepo-like plugins for that, sure. But:
The real question is how purging and updating will work. Pushing? Polling? Purge-and-pull?
You mention the cache for the push model, but I think it deserves a clearer separation.
Can you explain what you have in mind?
I mean, they are based on the same concept. What really matters is how things reach the db. I'd have the WikiData db replicated to {{places}}. For WMF, all wikis could connect directly to the main instance, have a slave "assigned" to each cluster... Then on each page render, the variables used could be checked against the latest version (unless they were checked in the last x minutes), triggering a re-render if they differ.
So, suppose a page uses the fact Germany{capital:"Berlin";language:"German"}; it would store that along with the version of WikiData used (e.g. WikiData 2.0, Germany 488584364).
When going to show it, it would check:
1) Is the latest WikiData version newer than 2.0? (No -> go to 5.)
2) Is the Germany module newer than 488584364? (No -> store that it's up to date with WikiData 3, go to 5.)
3) Fetch the Germany data. If the data actually used hasn't changed, update the metadata. Go to 5.
4) Re-render the page.
5) Show the contents.
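In code it would look roughly like this (WikidataPageMeta and WikidataRepo are invented helpers standing in for wherever the per-page metadata and the replicated repo data end up living):

<?php
// Sketch of the per-render freshness check from steps 1-5 above.

function wfWikidataNeedsRerender( Title $title ) {
	// Metadata stored at the last full render, e.g.
	// array( 'repoVersion' => '2.0',
	//        'items'       => array( 'Germany' => 488584364 ) )
	$used = WikidataPageMeta::getUsedVersions( $title );

	// 1) Is the latest WikiData version newer than the one we used?
	$latest = WikidataRepo::getLatestVersion();
	if ( version_compare( $latest, $used['repoVersion'], '<=' ) ) {
		return false; // 5) show the cached contents
	}

	foreach ( $used['items'] as $item => $usedRevision ) {
		// 2) Is this item newer than the revision we used?
		if ( WikidataRepo::getItemRevision( $item ) <= $usedRevision ) {
			continue;
		}
		// 3) Fetch the item; did any of the facts this page used change?
		$data = WikidataRepo::getItem( $item );
		if ( WikidataPageMeta::usedFactsChanged( $title, $item, $data ) ) {
			return true; // 4) re-render, then 5) show the fresh contents
		}
	}

	// Nothing this page used has changed: just record that it is up to
	// date with the latest repo version, then 5) show the cached contents.
	WikidataPageMeta::markUpToDate( $title, $latest );
	return false;
}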
As for actively purging the pages' content, that's interesting only for anons. You'd need a script able to replay a purge for a range of WikiData changes. That would basically perform the above steps, but do the rendering through the job queue. A normal wiki would call those functions while replicating, but wikis with a shared db (or ones receiving full files with newer data) would run it standalone (plus as a utility for screw-ups).
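Roughly (again untested; the change-feed accessor and the "pages using item X" lookup are invented, and this uses the current Job::batchInsert() API, which may well change):

<?php
// Sketch: purge/re-render all pages affected by a range of WikiData
// changes, pushed through the job queue.

$wgJobClasses['wikidataRerender'] = 'WikidataRerenderJob';

class WikidataRerenderJob extends Job {
	public function __construct( Title $title, $params ) {
		parent::__construct( 'wikidataRerender', $title, $params );
	}

	public function run() {
		// Reuse the freshness check from the earlier sketch; only purge
		// if something the page actually used has changed.
		if ( wfWikidataNeedsRerender( $this->title ) ) {
			WikiPage::factory( $this->title )->doPurge();
		}
		return true;
	}
}

/**
 * Queue re-render jobs for all local pages affected by WikiData changes
 * in the range [$fromChangeId, $toChangeId].
 */
function wfWikidataPurgeChangeRange( $fromChangeId, $toChangeId ) {
	$jobs = array();
	foreach ( WikidataRepo::getChanges( $fromChangeId, $toChangeId ) as $change ) {
		foreach ( WikidataPageMeta::getPagesUsing( $change['item'] ) as $title ) {
			$jobs[] = new WikidataRerenderJob( $title, array( 'change' => $change['id'] ) );
		}
	}
	Job::batchInsert( $jobs );
}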
You'd probably also want multiple dbs (let's call them WikiData repositories), partitioned by content (and its update frequency). You could then use different frontends (as Chad says, "similar to FileRepo"). So, a WikiData repository with the atomic properties of each element would happily live in a dba file, while interwikis would have to be on a MySQL db, etc.
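Roughly like this (class names, the table layout and the dba handler are all invented; WikidataRepo from the earlier sketches would just be a thin wrapper picking one of these backends per repository):

<?php
// Sketch of FileRepo-style backends: the client asks a repository object
// for an item and doesn't care whether the data sits in a dba file
// dropped onto the server or in a replicated MySQL table.

abstract class WikidataRepoBase {
	/** Fetch the stored record for one item, or null if unknown. */
	abstract public function getItem( $itemId );
}

// Slow-changing data (e.g. atomic properties of the elements) shipped
// as a dba file.
class DbaWikidataRepo extends WikidataRepoBase {
	private $path;

	public function __construct( $path ) {
		$this->path = $path;
	}

	public function getItem( $itemId ) {
		$handle = dba_open( $this->path, 'r', 'db4' ); // handler depends on what's compiled in
		$value = dba_fetch( (string)$itemId, $handle );
		dba_close( $handle );
		return $value === false ? null : FormatJson::decode( $value, true );
	}
}

// Fast-changing data (e.g. interwikis) replicated into a MySQL table.
class DbWikidataRepo extends WikidataRepoBase {
	public function getItem( $itemId ) {
		$dbr = wfGetDB( DB_SLAVE );
		$row = $dbr->selectRow(
			'wikidata_items',            // hypothetical table
			array( 'item_data' ),
			array( 'item_id' => $itemId ),
			__METHOD__
		);
		return $row ? FormatJson::decode( $row->item_data, true ) : null;
	}
}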
This is what I was aiming at with the DataTransclusion extension a while back.
But currently, we are not building a tool for including arbitrary data sources in wikipedia. We are building a central database for maintaining factual information. Our main objective is to get that done.
Not arbitrary, but having different sources (repositories), even if they are under the control of the same entity. Mostly interesting for separating slow- from fast-changing data, although I'm sure reusers would find more use cases, such as only downloading the db for the section they care about.
A design that is flexible enough to easily allow for future inclusion of other data sources would be nice. As long as the abstraction doesn't get in the way.
Anyway, it seems that it boils down to this:
1) The client needs some (abstracted?) way to access the repository/repositories.
2) The repo needs to be able to notify the client sites about changes, be it via push, purge, or polling.
3) We'll need a local cache or cross-site database access.
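To make the question concrete, the choices could end up as client configuration along these lines (every setting name here is invented, purely to illustrate the decision points):

<?php
// Hypothetical client-side settings covering the three points above:
// repo access, change propagation, and the local cache.

$wgWikidataClientSettings = array(
	// Repository access: 'db' (shared/replicated database) or 'http' (API).
	'repoAccess'   => 'db',
	'repoDatabase' => 'wikidatawiki',
	'repoApiUrl'   => null,

	// Change propagation: 'push', 'purge' (purge-and-pull) or 'poll'.
	'propagation'  => 'poll',
	'pollInterval' => 300,            // seconds between polls of the change feed

	// Local cache for fetched items.
	'cacheType'    => CACHE_MEMCACHED,
	'cacheExpiry'  => 3600,
);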
So, which combination of these techniques would you prefer?
-- daniel
I'd use a pull-based model. That seems to be what fits best with the current MediaWiki model. But it isn't too relevant at this point (or you may have advanced a lot by now!).