On 06/11/12 23:16, Daniel Kinzler wrote:
On 05.11.2012 05:43, Tim Starling wrote:
On 02/11/12 22:35, Denny Vrandečić wrote:
- For re-rendering the page, the wiki needs access to the data.
We are not sure about how to do this best: have it per cluster, or in one place only?
Why do you need to re-render a page if only the language links are changed? Language links are only in the navigation area, the wikitext content is not affected.
Because AFAIK language links are cached in the parser output object, and rendered into the skin from there. Asking the database for them every time seems like overhead if the cached ParserOutput already has them... I believe we currently use the one from the PO if it's there. Am I wrong about that?
You can use memcached.
We could get around this, but even then it would only be an optimization for language links. But Wikidata is soon going to provide data for infoboxes. Any aspect of a data item could be used in an {{#if:...}}. So we need to re-render the page whenever an item changes.
Wikidata is somewhere around 61000 physical lines of code now. Surely somewhere in that mountain of code, there is a class for the type of an item, where an update method can be added.
I don't think it is feasible to parse pages very much more frequently than they are already parsed as a result of template updates (i.e. refreshLinks jobs). The CPU cost of template updates is already very high. Maybe it would be possible if the updates were delayed and run, say, once per day, to allow more effective duplicate job removal. Template updates should probably be handled in the same way.
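For what it's worth, the existing job queue can already collapse this kind of work: a job that sets $removeDuplicates is merged with identical pending jobs. A rough sketch, with an invented class name and job type, and with the actual re-render reduced to a purge:

    /**
     * Hypothetical job: re-render a client page after a Wikidata change.
     * Identical pending jobs (same type, title and params) are collapsed
     * because $removeDuplicates is set, so a batch of entity edits only
     * queues one re-render per affected page.
     */
    class WikidataRefreshJob extends Job {
        public function __construct( $title, $params ) {
            parent::__construct( 'wikidataRefresh', $title, $params );
            $this->removeDuplicates = true;
        }

        public function run() {
            $page = WikiPage::factory( $this->title );
            if ( $page->exists() ) {
                // Re-parses on the next view and updates page_touched/Squid
                $page->doPurge();
            }
            return true;
        }
    }

    // Queueing it, e.g. from the change notification handler:
    Job::batchInsert( array( new WikidataRefreshJob( $title, array( 'item' => $itemId ) ) ) );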
Of course, with template updates, you don't have to wait for the refreshLinks job to run before the new content becomes visible, because page_touched is updated and Squid is purged before the job is run. That may also be feasible with Wikidata.
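The cheap, immediate part of that is roughly the following (assuming the affected client page is already known; no parsing is involved):

    // At notification time, before any job runs: the next view re-renders
    // because page_touched changes, and Squid drops its cached copy.
    $title = Title::makeTitle( $namespace, $dbKey );  // the affected client page
    $title->invalidateCache();  // bumps page_touched
    $title->purgeSquid();       // sends the HTCP/PURGE requests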
If a page is only viewed once a week, you don't want to be rendering it 5 times per day. The idea is to delay rendering until the page is actually requested, and to update links periodically.
A page which is viewed once per week is not an unrealistic scenario. We will probably have bot-generated geographical articles for just about every town in the world, in 200 or so languages, and all of them will pull many entities from Wikidata. The majority of those articles will be visited by search engine crawlers much more often than they are visited by humans.
The reason I think duplicate removal is essential is because entities will be updated in batches. For example, a census in a large country might result in hundreds of thousands of item updates.
What I'm suggesting is not quite the same as what you call "coalescing" in your design document. Coalescing allows you to reduce the number of events in recentchanges, and presumably also the number of Squid purges and page_touched updates. I'm saying that even after coalescing, changes should be merged further to avoid unnecessary parsing.
Also, when the page is edited manually, and then rendered, the wiki somehow needs to know a) which item ID is associated with this page, and b) it needs to load the item data to be able to render the page (just the language links, or also infobox data, or eventually also the result of a Wikidata query as a list).
You could load the data from memcached while the page is being parsed, instead of doing it in advance, similar to what we do for images. Dedicating hundreds of processor cores to parsing articles immediately after every Wikidata change doesn't sound like a great way to avoid a few memcached queries.
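To make the access pattern concrete (the cache key layout and the repo lookup function are made up; the point is memcached first, repo DB on a miss, fetched lazily when the parser needs it):

    /**
     * Sketch only: fetch an item's data the first time the parser needs it.
     * loadItemFromRepoDb() is a stand-in for whatever the repo access layer
     * ends up looking like.
     */
    function getItemData( $itemId ) {
        global $wgMemc;

        $key = wfForeignMemcKey( 'wikidatawiki', '', 'wikibase-item', $itemId );
        $data = $wgMemc->get( $key );
        if ( $data === false ) {
            $data = loadItemFromRepoDb( $itemId );  // hypothetical repo lookup
            $wgMemc->set( $key, $data, 86400 );     // cache for a day
        }
        return $data;
    }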
As I've previously explained, I don't think the langlinks table on the client wiki should be updated. So you only need to purge Squid and add an entry to Special:RecentChanges.
If the language links from Wikidata are not pulled in during rendering and stored in the ParserOutput object, and they are also not stored in the langlinks table, then where are they stored?
In the wikidatawiki DB, cached in memcached.
How should we display it?
Use an OutputPage or Skin hook, such as OutputPageParserOutput.
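Roughly like this; the two lookup helpers are invented, but OutputPageParserOutput and OutputPage::addLanguageLinks() are the existing interfaces:

    // Hypothetical handler: inject the Wikidata language links into the skin
    // without touching the wikitext or the client's langlinks table.
    $wgHooks['OutputPageParserOutput'][] = function ( OutputPage $out, ParserOutput $pout ) {
        $itemId = lookupItemIdForTitle( $out->getTitle() );   // hypothetical
        if ( $itemId !== null ) {
            // e.g. array( 'de:Berlin', 'fr:Berlin', ... ) from memcached
            $out->addLanguageLinks( getItemLanguageLinks( $itemId ) );
        }
        return true;
    };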
Purging Squid can certainly be done from the context of a wikidatawiki job. For RecentChanges the main obstacle is accessing localisation text. You could use rc_params to store language-independent message parameters, like what we do for log entries.
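Something like this, with an abbreviated attribute list and an invented message key; the point is that rc_params carries serialized, language-neutral data which is only formatted into text when the change is displayed:

    $rc = new RecentChange();
    $rc->setAttribs( array(
        'rc_namespace' => $namespace,
        'rc_title'     => $titleText,
        // language-independent parameters, rendered in the viewer's language
        'rc_params'    => serialize( array(
            'message'   => 'wikidata-comment-update',  // invented key
            'entity-id' => $entityId,
        ) ),
        // ... plus the remaining rc_* fields (rc_timestamp, rc_type, etc.)
    ) );
    $rc->save();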
We also need to resolve localized namespace names so we can put the correct namespace id into the RC table. I don't see a good way to do this from the context of another wiki (without using the web api).
You can get the namespace names from $wgConf and localisation cache, and then duplicate the code from Language::getNamespaces() to put it all together, along the lines of:
    $wgConf->loadFullData();
    $extraNamespaces = $wgConf->get( 'wgExtraNamespaces', $wiki );
    $metaNamespace = $wgConf->get( 'wgMetaNamespace', $wiki );
    $metaNamespaceTalk = $wgConf->get( 'wgMetaNamespaceTalk', $wiki );
    list( $site, $lang ) = $wgConf->siteFromDB( $wiki );
    $defaults = Language::getLocalisationCache()
        ->getItem( $lang, 'namespaceNames' );
But using the web API and caching the result in a file in $wgCacheDirectory would be faster and easier. $wgConf->loadFullData() takes about 16ms, it's much slower than reading a small local file.
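For example (the cache file name and expiry are arbitrary; the siteinfo query is the standard way to get the localised namespace names):

    /**
     * Sketch of the "web API plus local file cache" approach.
     */
    function getRemoteNamespaceNames( $apiUrl, $wiki ) {
        global $wgCacheDirectory;

        $cacheFile = "$wgCacheDirectory/namespaces-$wiki.json";
        if ( file_exists( $cacheFile ) && filemtime( $cacheFile ) > time() - 86400 ) {
            return FormatJson::decode( file_get_contents( $cacheFile ), true );
        }

        $json = Http::get( wfAppendQuery( $apiUrl, array(
            'action' => 'query',
            'meta'   => 'siteinfo',
            'siprop' => 'namespaces',
            'format' => 'json',
        ) ) );
        $data = FormatJson::decode( $json, true );
        $namespaces = array();
        foreach ( $data['query']['namespaces'] as $ns ) {
            $namespaces[$ns['id']] = $ns['*'];
        }
        file_put_contents( $cacheFile, FormatJson::encode( $namespaces ) );
        return $namespaces;
    }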
Like every other sort of link, entity links should probably be tracked using the page_id of the origin (local) page, so that the link is not invalidated when the page moves. Then, when you update recentchanges, you can select the page_namespace from the page table, and the problem of namespace display only occurs on the repo UI side.
-- Tim Starling