On 06/11/12 23:16, Daniel Kinzler wrote:
On 05.11.2012 05:43, Tim Starling wrote:
On 02/11/12 22:35, Denny Vrandečić wrote:
- For re-rendering the page, the wiki needs access to the data.
We are not sure about how to do this best: have it per cluster, or in one place only?
Why do you need to re-render a page if only the language links are changed? Language links are only in the navigation area, the wikitext content is not affected.
Because AFAIK language links are cached in the parser output object, and rendered into the skin from there. Asking the database for them every time seems like overhead if the cached ParserOutput already has them... I believe we currently use the one from the PO if it's there. Am I wrong about that?
You can use memcached.
We could get around this, but even then it would only be an optimization for language links. But Wikidata is soon going to provide data for infoboxes. Any aspect of a data item could be used in an {{#if:...}}. So we need to re-render the page whenever an item changes.
Wikidata is somewhere around 61000 physical lines of code now. Surely somewhere in that mountain of code, there is a class for the type of an item, where an update method can be added.
I don't think it is feasible to parse pages very much more frequently than they are already parsed as a result of template updates (i.e. refreshLinks jobs). The CPU cost of template updates is already very high. Maybe it would be possible if the updates were delayed and run, say, once per day, to allow more effective duplicate job removal. Template updates should probably be handled in the same way.
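For what it's worth, the existing job queue can already collapse this kind of work: a job that sets $removeDuplicates is merged with identical pending jobs. A rough sketch, with an invented class name and job type, and with the actual re-render reduced to a purge:

    /**
     * Hypothetical job: re-render a client page after a Wikidata change.
     * Identical pending jobs (same type, title and params) are collapsed
     * because $removeDuplicates is set, so a batch of entity edits only
     * queues one re-render per affected page.
     */
    class WikidataRefreshJob extends Job {
        public function __construct( $title, $params ) {
            parent::__construct( 'wikidataRefresh', $title, $params );
            $this->removeDuplicates = true;
        }

        public function run() {
            $page = WikiPage::factory( $this->title );
            if ( $page->exists() ) {
                // Re-parses on the next view and updates page_touched/Squid
                $page->doPurge();
            }
            return true;
        }
    }

    // Queueing it, e.g. from the change notification handler:
    Job::batchInsert( array( new WikidataRefreshJob( $title, array( 'item' => $itemId ) ) ) );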
Of course, with template updates, you don't have to wait for the refreshLinks job to run before the new content becomes visible, because page_touched is updated and Squid is purged before the job is run. That may also be feasible with Wikidata.
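The cheap, immediate part of that is roughly the following (assuming the affected client page is already known; no parsing is involved):

    // At notification time, before any job runs: the next view re-renders
    // because page_touched changes, and Squid drops its cached copy.
    $title = Title::makeTitle( $namespace, $dbKey );  // the affected client page
    $title->invalidateCache();  // bumps page_touched
    $title->purgeSquid();       // sends the HTCP/PURGE requests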
If a page is only viewed once a week, you don't want to be rendering it 5 times per day. The idea is to delay rendering until the page is actually requested, and to update links periodically.
A page which is viewed once per week is not an unrealistic scenario. We will probably have bot-generated geographical articles for just about every town in the world, in 200 or so languages, and all of them will pull many entities from Wikidata. The majority of those articles will be visited by search engine crawlers much more often than they are visited by humans.
The reason I think duplicate removal is essential is because entities will be updated in batches. For example, a census in a large country might result in hundreds of thousands of item updates.
What I'm suggesting is not quite the same as what you call "coalescing" in your design document. Coalescing allows you to reduce the number of events in recentchanges, and presumably also the number of Squid purges and page_touched updates. I'm saying that even after coalescing, changes should be merged further to avoid unnecessary parsing.
Also, when the page is edited manually, and then rendered, the wiki somehow needs to know a) which item ID is associated with this page, and b) it needs to load the item data to be able to render the page (just the language links, or also infobox data, or eventually also the result of a Wikidata query as a list).
You could load the data from memcached while the page is being parsed, instead of doing it in advance, similar to what we do for images. Dedicating hundreds of processor cores to parsing articles immediately after every Wikidata change doesn't sound like a great way to avoid a few memcached queries.
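To make the access pattern concrete (the cache key layout and the repo lookup function are made up; the point is memcached first, repo DB on a miss, fetched lazily when the parser needs it):

    /**
     * Sketch only: fetch an item's data the first time the parser needs it.
     * loadItemFromRepoDb() is a stand-in for whatever the repo access layer
     * ends up looking like.
     */
    function getItemData( $itemId ) {
        global $wgMemc;

        $key = wfForeignMemcKey( 'wikidatawiki', '', 'wikibase-item', $itemId );
        $data = $wgMemc->get( $key );
        if ( $data === false ) {
            $data = loadItemFromRepoDb( $itemId );  // hypothetical repo lookup
            $wgMemc->set( $key, $data, 86400 );     // cache for a day
        }
        return $data;
    }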
As I've previously explained, I don't think the langlinks table on the client wiki should be updated. So you only need to purge Squid and add an entry to Special:RecentChanges.
If the language links from Wikidata are not pulled in during rendering and stored in the ParserOutput object, and they are also not stored in the langlinks table, then where are they stored?
In the wikidatawiki DB, cached in memcached.
How should we display it?
Use an OutputPage or Skin hook, such as OutputPageParserOutput.
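Roughly like this; the two lookup helpers are invented, but OutputPageParserOutput and OutputPage::addLanguageLinks() are the existing interfaces:

    // Hypothetical handler: inject the Wikidata language links into the skin
    // without touching the wikitext or the client's langlinks table.
    $wgHooks['OutputPageParserOutput'][] = function ( OutputPage $out, ParserOutput $pout ) {
        $itemId = lookupItemIdForTitle( $out->getTitle() );   // hypothetical
        if ( $itemId !== null ) {
            // e.g. array( 'de:Berlin', 'fr:Berlin', ... ) from memcached
            $out->addLanguageLinks( getItemLanguageLinks( $itemId ) );
        }
        return true;
    };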
Purging Squid can certainly be done from the context of a wikidatawiki job. For RecentChanges the main obstacle is accessing localisation text. You could use rc_params to store language-independent message parameters, like what we do for log entries.
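Something like this, with an abbreviated attribute list and an invented message key; the point is that rc_params carries serialized, language-neutral data which is only formatted into text when the change is displayed:

    $rc = new RecentChange();
    $rc->setAttribs( array(
        'rc_namespace' => $namespace,
        'rc_title'     => $titleText,
        // language-independent parameters, rendered in the viewer's language
        'rc_params'    => serialize( array(
            'message'   => 'wikidata-comment-update',  // invented key
            'entity-id' => $entityId,
        ) ),
        // ... plus the remaining rc_* fields (rc_timestamp, rc_type, etc.)
    ) );
    $rc->save();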
We also need to resolve localized namespace names so we can put the correct namespace id into the RC table. I don't see a good way to do this from the context of another wiki (without using the web api).
You can get the namespace names from $wgConf and localisation cache, and then duplicate the code from Language::getNamespaces() to put it all together, along the lines of:
    $wgConf->loadFullData();
    $extraNamespaces = $wgConf->get( 'wgExtraNamespaces', $wiki );
    $metaNamespace = $wgConf->get( 'wgMetaNamespace', $wiki );
    $metaNamespaceTalk = $wgConf->get( 'wgMetaNamespaceTalk', $wiki );
    list( $site, $lang ) = $wgConf->siteFromDB( $wiki );
    $defaults = Language::getLocalisationCache()
        ->getItem( $lang, 'namespaceNames' );
But using the web API and caching the result in a file in $wgCacheDirectory would be faster and easier. $wgConf->loadFullData() takes about 16ms, it's much slower than reading a small local file.
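For example (the cache file name and expiry are arbitrary; the siteinfo query is the standard way to get the localised namespace names):

    /**
     * Sketch of the "web API plus local file cache" approach.
     */
    function getRemoteNamespaceNames( $apiUrl, $wiki ) {
        global $wgCacheDirectory;

        $cacheFile = "$wgCacheDirectory/namespaces-$wiki.json";
        if ( file_exists( $cacheFile ) && filemtime( $cacheFile ) > time() - 86400 ) {
            return FormatJson::decode( file_get_contents( $cacheFile ), true );
        }

        $json = Http::get( wfAppendQuery( $apiUrl, array(
            'action' => 'query',
            'meta'   => 'siteinfo',
            'siprop' => 'namespaces',
            'format' => 'json',
        ) ) );
        $data = FormatJson::decode( $json, true );
        $namespaces = array();
        foreach ( $data['query']['namespaces'] as $ns ) {
            $namespaces[$ns['id']] = $ns['*'];
        }
        file_put_contents( $cacheFile, FormatJson::encode( $namespaces ) );
        return $namespaces;
    }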
Like every other sort of link, entity links should probably be tracked using the page_id of the origin (local) page, so that the link is not invalidated when the page moves. Then, when you update recentchanges, you can select the page_namespace from the page table, and the problem of namespace display only occurs on the repo UI side.
-- Tim Starling