On 07.11.2012 00:41, Tim Starling wrote:
On 06/11/12 23:16, Daniel Kinzler wrote:
On 05.11.2012 05:43, Tim Starling wrote:
On 02/11/12 22:35, Denny Vrandečić wrote:
- For re-rendering the page, the wiki needs access to the data.
We are not sure about how to do this best: have it per cluster, or in one place only?
Why do you need to re-render a page if only the language links are changed? Language links are only in the navigation area, the wikitext content is not affected.
Because AFAIK language links are cached in the ParserOutput object and rendered into the skin from there. Asking the database for them every time seems like overhead if the cached ParserOutput already has them... I believe we currently use the ones from the ParserOutput if they're there. Am I wrong about that?
You can use memcached.
Ok, let me see if I understand what you are suggesting.
So, in memcached, we'd have the language links for every page (or as many as fit in there); actually, three lists per page: one of the links defined on the page itself, one of the links defined by wikidata, and one of the wikidata links suppressed locally.
When generating the langlinks in the sidebar, these lists would be combined appropriately. If we don't find anything in memcached for a page, we of course need to parse it to get the locally defined language links.
When wikidata updates, we just update the record in memcached and invalidate the page.
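Concretely, I'm imagining something along these lines (just a sketch; the key layout and the placeholder variables are made up):

// sketch: per-page sitelink cache entry in memcached
$key = wfMemcKey( 'wikidata', 'sitelinks', $title->getArticleID() );
$links = $wgMemc->get( $key );
if ( $links === false ) {
	// cache miss: get the local links by parsing the page and the
	// wikidata links from the repo, then re-populate the cache
	$links = array(
		'local' => $localLinks,           // links defined in the wikitext (placeholder)
		'wikidata' => $repoLinks,         // links defined by the wikidata item (placeholder)
		'suppressed' => $suppressedLinks, // wikidata links suppressed locally (placeholder)
	);
	$wgMemc->set( $key, $links, 86400 );
}
// effective sidebar links: wikidata links minus suppressed ones, plus local ones
$effective = array_merge(
	array_diff( $links['wikidata'], $links['suppressed'] ),
	$links['local']
);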
As far as I can see, we then can get the updated language links before the page has been re-parsed, but we still need to re-parse eventually. And when someone actually looks at the page, it does get parsed/rendered right away, and the user sees the updated langlinks anyway. So... what do we need the pre-parse update of the langlinks for? Where and when would they even be used? I don't see the point.
We could get around this, but even then it would be an optimization for language links. But wikidata is soon going to provide data for infoboxes. Any aspect of a data item could be used in an {{#if:...}}. So we need to re-render the page whenever an item changes.
Wikidata is somewhere around 61000 physical lines of code now. Surely somewhere in that mountain of code, there is a class for the type of an item, where an update method can be added.
I don't understand what you are suggesting. At the moment, when EntityContent::save() is called, it will trigger a change notification, which is written to the wb_changes table. On the client side, a maintenance script polls that table. What could/should be changed about that?
I don't think it is feasible to parse pages very much more frequently than they are already parsed as a result of template updates (i.e. refreshLinks jobs).
I don't see why we would parse more frequently. An edit is an edit, locally or remotely. If you want a language link to be updated, the page needs to be reparsed, whether that is triggered by wikidata or a bot edit. At least, wikidata doesn't create a new revision.
The CPU cost of template updates is already very high. Maybe it would be possible if the updates were delayed, run say once per day, to allow more effective duplicate job removal. Template updates should probably be handled in the same way.
My proposal is indeed unclear on one point: it does not clearly distinguish between invalidating a page and re-rendering it. I think Denny mentioned re-rendering in his original mail. The fact is: at the moment, we do not re-render at all. We just invalidate. And I think that's good enough for now.
I don't see how that duplicate removal would work beyond the coalescing I already suggested - except that for a large batch that covers a whole day, a lot more can be coalesced.
Of course, with template updates, you don't have to wait for the refreshLinks job to run before the new content becomes visible, because page_touched is updated and Squid is purged before the job is run. That may also be feasible with Wikidata.
We call Title::invalidateCache(). That ought to do it, right?
If a page is only viewed once a week, you don't want to be rendering it 5 times per day. The idea is to delay rendering until the page is actually requested, and to update links periodically.
As I said, we currently don't re-render at all, and whether and when we should is up for discussion. Maybe there could just be a background job re-rendering all "dirty" pages every 24 hours or so, to keep the link tables up to date.
Note that we do need to re-parse eventually: Infoboxes will contain things like {{#property:population}}, which need to be invalidated when the data item changes. Any aspect of a data item can be used in conditionals:
{{#if:{{#property:commons-gallery}}|{{commons|{{#property:commons-gallery}}}}}}
Sitelinks (Language links) too can be accessed via parser functions and used in conditionals.
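On the client, such a parser function would be hooked up roughly like this (just a sketch; the lookup class is a placeholder, not the actual Wikibase client code):

// sketch: registering a {{#property:...}} parser function on the client
$wgHooks['ParserFirstCallInit'][] = function ( $parser ) {
	$parser->setFunctionHook( 'property', function ( $parser, $propertyName ) {
		// look up the item linked to the page being parsed and
		// return the requested property value as wikitext
		return ClientItemLookup::getPropertyValue( $parser->getTitle(), $propertyName ); // placeholder
	} );
	return true;
};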
The reason I think duplicate removal is essential is because entities will be updated in batches. For example, a census in a large country might result in hundreds of thousands of item updates.
Yes, but for different items. How can we remove any duplicate updates if there is just one edit per item? Why would there be multiple?
(Note: the current UI only supports atomic edits, one value at a time. The API however allows bots to change any number of values at once, reducing the number of change events.)
What I'm suggesting is not quite the same as what you call "coalescing" in your design document. Coalescing allows you to reduce the number of events in recentchanges, and presumably also the number of Squid purges and page_touched updates. I'm saying that even after coalescing, changes should be merged further to avoid unnecessary parsing.
Ok, so there would be a re-parse queue with duplicate removal. When a change notification is processed (after coalescing notifications), the target page is invalidated using Title::invalidateCache() and it's also placed in the re-parse queue to be processed later. How is this different from the job queue used for parsing after template edits?
Also, when the page is edited manually and then rendered, the wiki needs to somehow know a) which item ID is associated with this page, and b) how to load the item data to be able to render the page (just the language links, or also infobox data, or eventually also the result of a wikidata query as a list).
You could load the data from memcached while the page is being parsed, instead of doing it in advance, similar to what we do for images.
How does it get into memcached? What if it's not there?
Dedicating hundreds of processor cores to parsing articles immediately after every wikidata change doesn't sound like a great way to avoid a few memcached queries.
Yea, as I said above, this is a misunderstanding. We don't insist on immediate reparsing, we just think the pages need to be invalidated (i.e. *scheduled* for parsing). I'll adjust the proposal to reflect that distinction.
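In other words, the client-side handler for a (coalesced) change notification would do something like this (sketch; how the affected client titles are found is left out):

// sketch: handle a change notification for one item on a client wiki
$jobs = array();
foreach ( $titlesUsingItem as $title ) {
	$title->invalidateCache();   // bump page_touched
	$title->purgeSquid();        // make the change visible to readers right away
	// schedule the actual re-parse; duplicate jobs for the same
	// title can be dropped when the queue is drained
	$jobs[] = new RefreshLinksJob( $title );
}
Job::batchInsert( $jobs );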
As I've previously explained, I don't think the langlinks table on the client wiki should be updated. So you only need to purge Squid and add an entry to Special:RecentChanges.
If the language links from wikidata are not pulled in during rendering and stored in the ParserOutput object, and they're also not stored in the langlinks table, where are they stored, then?
In the wikidatawiki DB, cached in memcached.
How should we display it?
Use an OutputPage or Skin hook, such as OutputPageParserOutput.
Do I understand correctly that the point of this is to be able to update the sitelinks quickly, without parsing the page? We *do* need to parse the page anyway, though doing so later or only when the page is requested would probably be fine.
Note that I'd still suggest writing the *effective* language links to the langlinks table, for consistency. I don't see a problem with that.
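Just so we're talking about the same thing, I assume the hook approach would look something like this (sketch; the sitelink lookup is a placeholder):

// sketch: merge wikidata sitelinks into the sidebar at output time
$wgHooks['OutputPageParserOutput'][] = function ( $out, $parserOutput ) {
	$repoLinks = ClientSiteLinkLookup::getLinks( $out->getTitle() ); // placeholder
	$out->addLanguageLinks( $repoLinks );
	return true;
};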
You can get the namespace names from $wgConf and localisation cache, and then duplicate the code from Language::getNamespaces() to put it all together, along the lines of:
$wgConf->loadFullData();
$extraNamespaces = $wgConf->get( 'wgExtraNamespaces', $wiki );
$metaNamespace = $wgConf->get( 'wgMetaNamespace', $wiki );
$metaNamespaceTalk = $wgConf->get( 'wgMetaNamespaceTalk', $wiki );
list( $site, $lang ) = $wgConf->siteFromDB( $wiki );
$defaults = Language::getLocalisationCache()->getItem( $lang, 'namespaceNames' );
But using the web API and caching the result in a file in $wgCacheDirectory would be faster and easier. $wgConf->loadFullData() takes about 16ms, it's much slower than reading a small local file.
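If I follow, the web API variant would be something like this (sketch; the cache file name and $apiBase are placeholders):

// sketch: fetch another wiki's namespace names via its web API,
// cached in a local file under $wgCacheDirectory
$cacheFile = "$wgCacheDirectory/namespaces-$wiki.json";
if ( is_file( $cacheFile ) ) {
	$namespaces = FormatJson::decode( file_get_contents( $cacheFile ), true );
} else {
	$url = $apiBase . '?action=query&meta=siteinfo&siprop=namespaces&format=json';
	$data = FormatJson::decode( Http::get( $url ), true );
	$namespaces = $data['query']['namespaces'];
	file_put_contents( $cacheFile, FormatJson::encode( $namespaces ) );
}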
Writing to another wiki's database without a firm handle on that wiki's config sounds quite scary and brittle to me. It can be done, and we can pull together all the necessary info, but... do you really think this is a good idea? What are we gaining by doing it this way?
Like every other sort of link, entity links should probably be tracked using the page_id of the origin (local) page, so that the link is not invalidated when the page moves.
This is the wrong way around: sitelinks go from wikidata to wikipedia. As with all links, link targets are tracked by title, and break when stuff is renamed. When you move a page on Wikipedia, it loses its connection to the Wikidata item, unless you update the Wikidata item (we plan to offer a button on the page move form on Wikipedia to do this conveniently).
So when you update recentchanges, you can select the page_namespace from the page table. So the problem of namespace display would occur on the repo UI side.
There are two use cases to consider:
* when a change notification comes in, we need to inject the corresponding record into the rc table of every wiki using the respective item. To do that, we need access to some aspects of that wiki's config. Your proposal for caching the namespace info would cover that.
* when a page is re-rendered, we need access to the data item, so we can pull in the data fields via parser functions (in phase II). How does page Foo know that it needs to load item Q5432? And how does it load the item data?
I currently envision that the page <-> item mapping would be maintained locally, so a simple lookup would provide the item ID. And the item data could ideally be pulled from ES - that needs some refactoring though. Our current solution has a cache table with the full uncompressed item data (latest revision only), which could be maintained on every cluster or only on the repo. I'm now inclined though to implement direct ES access. I have poked around a bit, and it seems that this is possible without factoring out standalone BlobStore classes (although that would still be nice). I'll put a note into the proposal to that effect.
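To make that concrete, the lookup I have in mind is roughly this (table, column, and class names are all tentative):

// tentative: map the local page to its wikidata item, then load the item data
$dbr = wfGetDB( DB_SLAVE );
$itemId = $dbr->selectField(
	'wbc_item_mapping',                            // tentative table name
	'im_item_id',
	array( 'im_page_id' => $title->getArticleID() ),
	__METHOD__
);
// item data from the local cache table for now; later this could
// come directly from ES, as discussed above
$itemData = ClientItemCache::getItem( $itemId ); // tentative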
-- daniel