Hi all,
Wikidata aims to centralize structured data from the Wikipedias in one central wiki, starting with the language links. The main technical challenge we face is implementing the data flow efficiently on the WMF infrastructure. We invite peer review of our design.
I am trying to give a simplified overview here. The full description is on-wiki: http://meta.wikimedia.org/wiki/Wikidata/Notes/Change_propagation
There are a number of design choices. Here is our current thinking:
* Every change to the language links in Wikidata is stored in the wb_changes table on Wikidata.
* A script (or several, depending on load), run per wiki cluster, checks wb_changes, gets a batch of changes it has not seen yet, and creates jobs for all affected pages on all wikis of the given cluster (a rough sketch follows below).
* When the jobs are executed, the respective page is re-rendered and the local recentchanges table is filled.
* For re-rendering the page, the wiki needs access to the data. We are not sure how to do this best: have it per cluster, or in one place only?
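To make the polling step a bit more concrete, here is a rough sketch of what such a per-cluster script could look like. This is purely illustrative: the wb_changes column list is abbreviated, and getLastSeenChangeId(), saveLastSeenChangeId(), getAffectedPages() and the WikidataRefreshJob class are placeholders, not actual Wikibase code.

// Hypothetical polling step, run periodically per client cluster.
$dbr = wfGetDB( DB_SLAVE, array(), 'wikidatawiki' ); // read from the repo's DB
$lastSeenId = getLastSeenChangeId(); // placeholder: position stored per poller

$res = $dbr->select(
    'wb_changes',
    array( 'change_id', 'change_object_id', 'change_info' ),
    array( 'change_id > ' . intval( $lastSeenId ) ),
    __METHOD__,
    array( 'ORDER BY' => 'change_id ASC', 'LIMIT' => 1000 )
);

$jobs = array();
foreach ( $res as $row ) {
    // One job per affected client page; keying by page ID coalesces duplicates.
    foreach ( getAffectedPages( $row->change_object_id ) as $pageId ) {
        $title = Title::newFromID( $pageId );
        if ( $title ) {
            $jobs[$pageId] = new WikidataRefreshJob( $title, array( 'change_id' => $row->change_id ) );
        }
    }
    $lastSeenId = $row->change_id;
}

Job::batchInsert( array_values( $jobs ) ); // queue on the local cluster
saveLastSeenChangeId( $lastSeenId ); // placeholder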
We appreciate comments. A lot. This thing is make-or-break for the whole project, and it is getting kinda urgent.
Cheers, Denny
On 02/11/12 22:35, Denny Vrandečić wrote:
- For re-rendering the page, the wiki needs access to the data.
We are not sure how to do this best: have it per cluster, or in one place only?
Why do you need to re-render a page if only the language links are changed? Language links are only in the navigation area, the wikitext content is not affected.
As I've previously explained, I don't think the langlinks table on the client wiki should be updated. So you only need to purge Squid and add an entry to Special:RecentChanges.
Purging Squid can certainly be done from the context of a wikidatawiki job. For RecentChanges the main obstacle is accessing localisation text. You could use rc_params to store language-independent message parameters, like what we do for log entries.
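For illustration, the rc_params idea could look roughly like this (field list abbreviated, parameter layout invented; wfGetDB() with an explicit wiki ID is used here to write to the client wiki's DB from the wikidatawiki job):

// Hypothetical: add an external change to a client wiki's recentchanges
// from a wikidatawiki job. Only the rc_params part is of interest here.
$dbw = wfGetDB( DB_MASTER, array(), $clientWikiId );
$dbw->insert( 'recentchanges', array(
    'rc_timestamp' => $dbw->timestamp(),
    'rc_namespace' => $namespace,
    'rc_title'     => $titleDBkey,
    'rc_comment'   => '',
    // Language-independent parameters; the message is localised only when
    // the entry is rendered on the client wiki, as with log entries.
    'rc_params'    => serialize( array(
        'wikidata-item' => $itemId,     // e.g. 'q64'
        'change-type'   => $changeType, // e.g. 'sitelink-update'
    ) ),
    // ... remaining rc_* fields omitted ...
), __METHOD__ );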
-- Tim Starling
On 05.11.2012 05:43, Tim Starling wrote:
On 02/11/12 22:35, Denny Vrandečić wrote:
- For re-rendering the page, the wiki needs access to the data.
We are not sure how to do this best: have it per cluster, or in one place only?
Why do you need to re-render a page if only the language links are changed? Language links are only in the navigation area, the wikitext content is not affected.
Because AFAIK language links are cached in the parser output object, and rendered into the skin from there. Asking the database for them every time seems like overhead if the cached ParserOutput already has them... I believe we currently use the one from the PO if it's there. Am I wrong about that?
We could get around this, but even then it would be an optimization for language links. But wikidata is soon going to provide data for infoboxes. Any aspect of a data item could be used in an {{#if:...}}. So we need to re-render the page whenever an item changes.
Also, when the page is edited manually, and then rendered, the wiki needs to somehow know a) which item ID is associated with this page and b) it needs to load the item data to be able to render the page (just the language links, or also infobox data, or eventually also the result of a wikidata query as a list).
As I've previously explained, I don't think the langlinks table on the client wiki should be updated. So you only need to purge Squid and add an entry to Special:RecentChanges.
If the language links from wikidata are not pulled in during rendering and stored in the ParserOutput object, and they are also not stored in the langlinks table, where are they stored, then? How should we display them?
Purging Squid can certainly be done from the context of a wikidatawiki job. For RecentChanges the main obstacle is accessing localisation text. You could use rc_params to store language-independent message parameters, like what we do for log entries.
We also need to resolve localized namespace names so we can put the correct namespace id into the RC table. I don't see a good way to do this from the context of another wiki (without using the web api).
-- daniel
On 06/11/12 23:16, Daniel Kinzler wrote:
On 05.11.2012 05:43, Tim Starling wrote:
On 02/11/12 22:35, Denny Vrandečić wrote:
- For re-rendering the page, the wiki needs access to the data.
We are not sure how to do this best: have it per cluster, or in one place only?
Why do you need to re-render a page if only the language links are changed? Language links are only in the navigation area, the wikitext content is not affected.
Because AFAIK language links are cached in the parser output object, and rendered into the skin from there. Asking the database for them every time seems like overhead if the cached ParserOutput already has them... I believe we currently use the one from the PO if it's there. Am I wrong about that?
You can use memcached.
We could get around this, but even then it would be an optimization for language links. But wikidata is soon going to provide data for infoboxes. Any aspect of a data item could be used in an {{#if:...}}. So we need to re-render the page whenever an item changes.
Wikidata is somewhere around 61000 physical lines of code now. Surely somewhere in that mountain of code, there is a class for the type of an item, where an update method can be added.
I don't think it is feasible to parse pages very much more frequently than they are already parsed as a result of template updates (i.e. refreshLinks jobs). The CPU cost of template updates is already very high. Maybe it would be possible if the updates were delayed, run say once per day, to allow more effective duplicate job removal. Template updates should probably be handled in the same way.
Of course, with template updates, you don't have to wait for the refreshLinks job to run before the new content becomes visible, because page_touched is updated and Squid is purged before the job is run. That may also be feasible with Wikidata.
If a page is only viewed once a week, you don't want to be rendering it 5 times per day. The idea is to delay rendering until the page is actually requested, and to update links periodically.
A page which is viewed once per week is not an unrealistic scenario. We will probably have bot-generated geographical articles for just about every town in the world, in 200 or so languages, and all of them will pull many entities from Wikidata. The majority of those articles will be visited by search engine crawlers much more often than they are visited by humans.
The reason I think duplicate removal is essential is because entities will be updated in batches. For example, a census in a large country might result in hundreds of thousands of item updates.
What I'm suggesting is not quite the same as what you call "coalescing" in your design document. Coalescing allows you to reduce the number of events in recentchanges, and presumably also the number of Squid purges and page_touched updates. I'm saying that even after coalescing, changes should be merged further to avoid unnecessary parsing.
Also, when the page is edited manually, and then rendered, the wiki needs to somehow know a) which item ID is associated with this page and b) it needs to load the item data to be able to render the page (just the language links, or also infobox data, or eventually also the result of a wikidata query as a list).
You could load the data from memcached while the page is being parsed, instead of doing it in advance, similar to what we do for images. Dedicating hundreds of processor cores to parsing articles immediately after every wikidata change doesn't sound like a great way to avoid a few memcached queries.
As I've previously explained, I don't think the langlinks table on the client wiki should be updated. So you only need to purge Squid and add an entry to Special:RecentChanges.
If the language links from wikidata are not pulled in during rendering and stored in the ParserOutput object, and they are also not stored in the langlinks table, where are they stored, then?
In the wikidatawiki DB, cached in memcached.
How should we display them?
Use an OutputPage or Skin hook, such as OutputPageParserOutput.
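Something along these lines, perhaps (the memcached key layout, the fallback helper and the expiry are made up; the point is only that the sidebar links come from the cache, not from ParserOutput or langlinks):

// Hypothetical OutputPageParserOutput handler: pull sitelinks from memcached
// (falling back to the repo) and add them to the sidebar, without touching
// the client wiki's langlinks table.
$wgHooks['OutputPageParserOutput'][] = function ( $out, $parserOutput ) {
    global $wgMemc;

    $title = $out->getTitle();
    $key = wfMemcKey( 'wikidata', 'sitelinks', $title->getPrefixedDBkey() );

    $links = $wgMemc->get( $key );
    if ( $links === false ) {
        $links = loadSitelinksFromRepo( $title ); // placeholder for a repo lookup
        $wgMemc->set( $key, $links, 3600 );
    }

    // $links would be an array like array( 'de:Berlin', 'fr:Berlin', ... )
    $out->addLanguageLinks( $links );
    return true;
};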
Purging Squid can certainly be done from the context of a wikidatawiki job. For RecentChanges the main obstacle is accessing localisation text. You could use rc_params to store language-independent message parameters, like what we do for log entries.
We also need to resolve localized namespace names so we can put the correct namespace id into the RC table. I don't see a good way to do this from the context of another wiki (without using the web api).
You can get the namespace names from $wgConf and localisation cache, and then duplicate the code from Language::getNamespaces() to put it all together, along the lines of:
$wgConf->loadFullData();
$extraNamespaces = $wgConf->get( 'wgExtraNamespaces', $wiki );
$metaNamespace = $wgConf->get( 'wgMetaNamespace', $wiki );
$metaNamespaceTalk = $wgConf->get( 'wgMetaNamespaceTalk', $wiki );
list( $site, $lang ) = $wgConf->siteFromDB( $wiki );
$defaults = Language::getLocalisationCache()
    ->getItem( $lang, 'namespaceNames' );
But using the web API and caching the result in a file in $wgCacheDirectory would be faster and easier. $wgConf->loadFullData() takes about 16ms, it's much slower than reading a small local file.
Like every other sort of link, entity links should probably be tracked using the page_id of the origin (local) page, so that the link is not invalidated when the page moves. So when you update recentchanges, you can select the page_namespace from the page table. So the problem of namespace display would occur on the repo UI side.
-- Tim Starling
On 07.11.2012 00:41, Tim Starling wrote:
On 06/11/12 23:16, Daniel Kinzler wrote:
On 05.11.2012 05:43, Tim Starling wrote:
On 02/11/12 22:35, Denny Vrandečić wrote:
- For re-rendering the page, the wiki needs access to the data.
We are not sure how to do this best: have it per cluster, or in one place only?
Why do you need to re-render a page if only the language links are changed? Language links are only in the navigation area, the wikitext content is not affected.
Because AFAIK language links are cached in the parser output object, and rendered into the skin from there. Asking the database for them every time seems like overhead if the cached ParserOutput already has them... I believe we currently use the one from the PO if it's there. Am I wrong about that?
You can use memcached.
Ok, let me see if I understand what you are suggesting.
So, in memcached, we'd have the language links for every page (or as many as fit in there); actually, three lists per page: one of the links defined on the page itself, one of the links defined by wikidata, and one of the wikidata links suppressed locally.
When generating the langlinks in the sidebar, these lists would be combined appropriately. If we don't find anything in memcached for this, we of course need to parse the page to get the locally defined language links.
When wikidata updates, we just update the record in memcached and invalidate the page.
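The combination step could be as simple as this sketch (how suppression is expressed, and the exact list formats, are still open questions; this is not actual Wikibase code):

// Hypothetical merge of the three lists: wikidata-provided links are used
// unless suppressed locally, and locally defined links win for their language.
function combineLanguageLinks( array $localLinks, array $wikidataLinks, array $suppressedLangs ) {
    $byLang = array();
    foreach ( $wikidataLinks as $link ) { // e.g. 'de:Berlin'
        list( $lang ) = explode( ':', $link, 2 );
        if ( !in_array( $lang, $suppressedLangs ) ) {
            $byLang[$lang] = $link;
        }
    }
    foreach ( $localLinks as $link ) { // links defined in the local wikitext
        list( $lang ) = explode( ':', $link, 2 );
        $byLang[$lang] = $link;
    }
    return array_values( $byLang );
}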
As far as I can see, we then can get the updated language links before the page has been re-parsed, but we still need to re-parse eventually. And, when someone actually looks at the page, the page does get parsed/rendered right away, and the user sees the updated langlinks. So... what do we need the pre-parse-update-of-langlinks for? Where and when would they even be used? I don't see the point.
We could get around this, but even then it would be an optimization for language links. But wikidata is soon going to provide data for infoboxes. Any aspect of a data item could be used in an {{#if:...}}. So we need to re-render the page whenever an item changes.
Wikidata is somewhere around 61000 physical lines of code now. Surely somewhere in that mountain of code, there is a class for the type of an item, where an update method can be added.
I don't understand what you are suggesting. At the moment, when EntityContent::save() is called, it will trigger a change notification, which is written to the wb_changes table. On the client side, a maintenance script polls that table. What could/should be changed about that?
I don't think it is feasible to parse pages very much more frequently than they are already parsed as a result of template updates (i.e. refreshLinks jobs).
I don't see why we would parse more frequently. An edit is an edit, locally or remotely. If you want a language link to be updated, the page needs to be reparsed, whether that is triggered by wikidata or a bot edit. At least, wikidata doesn't create a new revision.
The CPU cost of template updates is already very high. Maybe it would be possible if the updates were delayed, run say once per day, to allow more effective duplicate job removal. Template updates should probably be handled in the same way.
My proposal is indeed unclear on one point: it does not clearly distinguish between invalidating a page and re-rendering it. I think Denny mentioned re-rendering in his original mail. The fact is: at the moment, we do not re-render at all. We just invalidate. And I think that's good enough for now.
I don't see how that duplicate removal would work beyond the coalescing I already suggested - except that for a large batch that covers a whole day, a lot more can be coalesced.
Of course, with template updates, you don't have to wait for the refreshLinks job to run before the new content becomes visible, because page_touched is updated and Squid is purged before the job is run. That may also be feasible with Wikidata.
We call Title::invalidateCache(). That ought to do it, right?
If a page is only viewed once a week, you don't want to be rendering it 5 times per day. The idea is to delay rendering until the page is actually requested, and to update links periodically.
As I said, we currently don't re-render at all, and whether and when we should is up for discussion. Maybe there could just be a background job re-rendering all "dirty" pages every 24 hours or so, to keep the link tables up to date.
Note that we do need to re-parse eventually: Infoboxes will contain things like {{#property:population}}, which need to be invalidated when the data item changes. Any aspect of a data item can be used in conditionals:
{{#if:{{#property:commons-gallery}}|{{commons|{{#property:commons-gallery}}}}}}
Sitelinks (Language links) too can be accessed via parser functions and used in conditionals.
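(For illustration only, a bare-bones idea of how such a parser function might be wired up; the actual Wikibase client code will look different, and the item lookup helpers are placeholders.)

// Hypothetical #property parser function. The item data lookup happens
// during parsing, which is why any item change can affect the rendered page.
$wgHooks['ParserFirstCallInit'][] = function ( $parser ) {
    $parser->setFunctionHook( 'property', function ( $parser, $propertyName ) {
        $itemId = getItemIdForTitle( $parser->getTitle() ); // placeholder
        if ( $itemId === false ) {
            return '';
        }
        $data = getItemData( $itemId ); // placeholder: memcached / repo lookup
        return isset( $data['properties'][$propertyName] )
            ? $data['properties'][$propertyName]
            : '';
    } );
    return true;
};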
The reason I think duplicate removal is essential is because entities will be updated in batches. For example, a census in a large country might result in hundreds of thousands of item updates.
Yes, but for different items. How can we remove any duplicate updates if there is just one edit per item? Why would there be multiple?
(Note: the current UI only supports atomic edits, one value at a time. The API however allows bots to change any number of values at once, reducing the number of change events.)
What I'm suggesting is not quite the same as what you call "coalescing" in your design document. Coalescing allows you to reduce the number of events in recentchanges, and presumably also the number of Squid purges and page_touched updates. I'm saying that even after coalescing, changes should be merged further to avoid unnecessaray parsing.
Ok, so there would be a re-parse queue with duplicate removal. When a change notification is processed (after coalescing notifications), the target page is invalidated using Title::invalidateCache() and it's also placed in the re-parse queue to be processed later. How is this different from the job queue used for parsing after template edits?
Also, when the page is edited manually, and then rendered, the wiki needs to somehow know a) which item ID is associated with this page and b) it needs to load the item data to be able to render the page (just the language links, or also infobox data, or eventually also the result of a wikidata query as a list).
You could load the data from memcached while the page is being parsed, instead of doing it in advance, similar to what we do for images.
How does it get into memcached? What if it's not there?
Dedicating hundreds of processor cores to parsing articles immediately after every wikidata change doesn't sound like a great way to avoid a few memcached queries.
Yea, as I said above, this is a misunderstanding. We don't insist on immediate reparsing, we just think the pages need to be invalidated (i.e. *scheduled* for parsing). I'll adjust the proposal to reflect that distinction.
As I've previously explained, I don't think the langlinks table on the client wiki should be updated. So you only need to purge Squid and add an entry to Special:RecentChanges.
If the language links from wikidata are not pulled in during rendering and stored in the ParserOutput object, and they are also not stored in the langlinks table, where are they stored, then?
In the wikidatawiki DB, cached in memcached.
How should we display them?
Use an OutputPage or Skin hook, such as OutputPageParserOutput.
Do I understand correctly that the point of this is to be able to update the sitelinks quickly, without parsing the page? We *do* need to parse the page anyway, though doing so later or only when the page is requested would probably be fine.
Note that I'd still suggest to write the *effective* language links to the langlink table, for consistency. I don't see a problem with that.
You can get the namespace names from $wgConf and localisation cache, and then duplicate the code from Language::getNamespaces() to put it all together, along the lines of:
$wgConf->loadFullData();
$extraNamespaces = $wgConf->get( 'wgExtraNamespaces', $wiki );
$metaNamespace = $wgConf->get( 'wgMetaNamespace', $wiki );
$metaNamespaceTalk = $wgConf->get( 'wgMetaNamespaceTalk', $wiki );
list( $site, $lang ) = $wgConf->siteFromDB( $wiki );
$defaults = Language::getLocalisationCache()
    ->getItem( $lang, 'namespaceNames' );
But using the web API and caching the result in a file in $wgCacheDirectory would be faster and easier. $wgConf->loadFullData() takes about 16ms, it's much slower than reading a small local file.
Writing to another wiki's database without a firm handle on that wiki's config sounds quite scary and brittle to me. It can be done, and we can pull together all the necessary info, but... do you really think this is a good idea? What are we gaining by doing it this way?
Like every other sort of link, entity links should probably be tracked using the page_id of the origin (local) page, so that the link is not invalidated when the page moves.
This is the wrong way around: sitelinks go from wikidata to wikipedia. As with all links, link targets are tracked by title, and break when stuff is renamed. When you move a page on Wikipedia, it loses its connection to the Wikidata item, unless you update the Wikidata item (we plan to offer a button on the page move form on wikipedia to do this conveniently).
So when you update recentchanges, you can select the page_namespace from the page table. So the problem of namespace display would occur on the repo UI side.
There's two use cases to consider:
* when a change notification comes in, we need to inject the corresponding record into the rc table of every wiki using the respective item. To do that, we need access to some aspects of that wiki's config. Your proposal for caching the namespace info would cover that.
* when a page is re-rendered, we need access to the data item, so we can pull in the data fields via parser functions (in phase II). How does page Foo know that it needs to load item Q5432? And how does it load the item data?
I currently envision that the page <-> item mapping would be maintained locally, so a simple lookup would provide the item ID. And the item data could ideally be pulled from ES - that needs some refactoring though. Our current solution has a cache table with the full uncompressed item data (latest revision only), which could be maintained on every cluster or only on the repo. I'm now inclined though to implement direct ES access. I have poked around a bit, and it seems that this is possible without factoring out standalone BlobStore classes (although that would still be nice). I'll put a note into the proposal to that effect.
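A lookup against such a local mapping could be as simple as the following sketch (table and column names are invented; this is also the kind of thing that would back the getItemIdForTitle() placeholder in the parser function sketch above):

// Hypothetical lookup in a locally maintained page <-> item mapping table.
function getItemIdForTitle( Title $title ) {
    $dbr = wfGetDB( DB_SLAVE );
    return $dbr->selectField(
        'wbc_item_mapping', // invented table name
        'im_item_id',
        array(
            'im_page_namespace' => $title->getNamespace(),
            'im_page_title'     => $title->getDBkey(),
        ),
        __METHOD__
    ); // returns false if the page has no associated item
}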
-- daniel
On 07/11/12 22:56, Daniel Kinzler wrote:
As far as I can see, we then can get the updated language links before the page has been re-parsed, but we still need to re-parse eventually.
Why does it need to be re-parsed eventually?
And, when someone actually looks at the page, the page does get parsed/rendered right away, and the user sees the updated langlinks. So... what do we need the pre-parse-update-of-langlinks for? Where and when would they even be used? I don't see the point.
For language link updates in particular, you wouldn't have to update page_touched, so the page wouldn't have to be re-parsed.
We could get around this, but even then it would be an optimization for language links. But wikidata is soon going to provide data for infoboxes. Any aspect of a data item could be used in an {{#if:...}}. So we need to re-render the page whenever an item changes.
Wikidata is somewhere around 61000 physical lines of code now. Surely somewhere in that mountain of code, there is a class for the type of an item, where an update method can be added.
I don't understand what you are suggesting. At the moment, when EntityContent::save() is called, it will trigger a change notification, which is written to the wb_changes table. On the client side, a maintenance script polls that table. What could/should be changed about that?
I'm saying that you don't really need the client-side maintenance script, it can be done just with repo-side jobs. That would reduce the job insert rate by a factor of the number of languages, and make the task of providing low-latency updates to client pages somewhat easier.
For language link updates, you just need to push to memcached, purge Squid and insert a row into recentchanges. For #property, you additionally need to update page_touched and construct a de-duplicated batch of refreshLinks jobs to be run on the client side on a daily basis.
I don't think it is feasible to parse pages very much more frequently than they are already parsed as a result of template updates (i.e. refreshLinks jobs).
I don't see why we would parse more frequently. An edit is an edit, locally or remotely. If you want a language link to be updated, the page needs to be reparsed, whether that is triggered by wikidata or a bot edit. At least, wikidata doesn't create a new revision.
Surely Wikidata will dramatically increase the amount of data available on the infoboxes of articles in small wikis, and improve the freshness of that data. If it doesn't, something must have gone terribly wrong.
Note that the current system is inefficient, sometimes to the point of not working at all. When bot edits on zhwiki cause a job queue backlog 6 months long, or data templates cause articles to take a gigabyte of RAM and 15 seconds to render, I tell people "don't worry, I'm sure Wikidata will fix it". I still think we can deliver on that promise, with proper attention to system design.
Of course, with template updates, you don't have to wait for the refreshLinks job to run before the new content becomes visible, because page_touched is updated and Squid is purged before the job is run. That may also be feasible with Wikidata.
We call Title::invalidateCache(). That ought to do it, right?
You would have to also call Title::purgeSquid(). But it's not efficient to use these Title methods when you have thousands of pages to purge, that's why we use HTMLCacheUpdate for template updates.
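For reference, the template case Tim mentions boils down to something like this (HTMLCacheUpdate batches the page_touched updates and Squid purges over everything recorded in templatelinks); a Wikidata equivalent would need its own usage tracking table to key on:

// How template edits do bulk invalidation today: one update object walks the
// templatelinks table in batches instead of touching titles one by one.
$update = new HTMLCacheUpdate( $templateTitle, 'templatelinks' );
$update->doUpdate();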
Sitelinks (Language links) too can be accessed via parser functions and used in conditionals.
Presumably that would be used fairly rarely. You could track it separately, or remove the feature, in order to provide efficient language link updates as I described.
The reason I think duplicate removal is essential is because entities will be updated in batches. For example, a census in a large country might result in hundreds of thousands of item updates.
Yes, but for different items. How can we remove any duplicate updates if there is just one edit per item? Why would there be multiple?
I'm not talking about removing duplicate item edits, I'm talking about avoiding running multiple refreshLinks jobs for each client page. I thought refreshLinks was what Denny was talking about when he said "re-render", thanks for clearing that up.
Ok, so there would be a re-parse queue with duplicate removal. When a change notification is processed (after coalescing notifications), the target page is invalidated using Title::invalidateCache() and it's also placed in the re-parse queue to be processed later. How is this different from the job queue used for parsing after template edits?
There's no duplicate removal with template edits, and no 24-hour delay in updates to improve the effectiveness of duplicate removal.
It's the same problem, it's just that the current system for template edits is cripplingly inefficient and unscalable. So I'm bringing up these performance ideas before Wikidata increases the edit rate by a factor of 10.
The change I'm suggesting is conservative. I'm not sure if it will be enough to avoid serious site performance issues. Maybe if we deploy Lua first, it will work.
Also, when the page is edited manually, and then rendered, the wiki needs to somehow know a) which item ID is associated with this page and b) it needs to load the item data to be able to render the page (just the language links, or also infobox data, or eventually also the result of a wikidata query as a list).
You could load the data from memcached while the page is being parsed, instead of doing it in advance, similar to what we do for images.
How does it get into memcached? What if it's not there?
Push it into memcached when the item is changed. If it's not there on parse, load it from the repo slave DB and save it back to memcached.
That's not exactly the scheme that we use for images, but it's the scheme that Asher Feldman recommends that we use for future performance work. It can probably be made to work.
-- Tim Starling
First off, TL;DR: (@Tim: did my best to summarize, please correct any misrepresentation.)
* Tim: don't re-parse when sitelinks change.
* Daniel: can be done, but do we really need to optimize for this case? Denny, can we get better figures on this?
* Daniel: how far do we want to limit the things we make available via parser functions and/or Lua binding? Could allow more with Lua (faster, and implementing complex functionality via parser functions is nasty anyway).
* Consensus: we want to coalesce changes before acting on them.
* Tim: we also want to avoid redundant rendering by removing duplicate render jobs (like multiple re-renderings of the same page) resulting from the changes.
* Tim: large batches (lower frequency) for re-rendering pages that are already invalidated would allow more dupes to be removed. (Pages would still be rendered on demand when viewed, but link tables would update later)
* Daniel: sounds good, but perhaps this should be a general feature of the re-render/linksUpdate job queue, so it's also used when templates get edited.
* Consensus: load items directly from ES (via remote access to the repo's text table), cache in memcached.
* Tim: Also get rid of local item <-> page mapping, just look each page up on the repo.
* Daniel: Ok, but then we can't optimize bulk ops involving multiple items.
* Tim: run the polling script from the repo, push to client wiki db's directly
* Daniel: that's scary, client wikis should keep control of how changes are handled.
Now, the nitty gritty:
On 08.11.2012 01:51, Tim Starling wrote:
On 07/11/12 22:56, Daniel Kinzler wrote:
As far as I can see, we then can get the updated language links before the page has been re-parsed, but we still need to re-parse eventually.
Why does it need to be re-parsed eventually?
For the same reason pages need to be re-parsed when templates change: because links may depend on the data items.
We are currently working on the assumption that *any* aspect of a data item is accessible via parser functions in the wikitext, and may thus influence any aspect of that page's parser output.
So, if *anything* about a data item changes, *anything* about the wikipedia page using it may change too. So that page needs to be re-parsed.
Maybe we'll be able to cut past the rendering for some cases, but for "normal" property changes, like a new value for the population of a country, all pages that use the respective data item need to be re-rendered soonish, otherwise the link tables (especially categories) will get out of whack.
So, let's think about what we *could* optimize:
* I think we could probably disallow access to wikidata sitelinks via parser functions in wikipedia articles. That would allow us to use an optimized data flow for changes to sitelinks (aka language links) which does not cause the page to be re-rendered.
* Maybe we can also avoid re-parsing pages on changes that apply only to languages that are not used on the respective wiki (let's call them unrelated translation changes). The tricky bit here is to figure out which language's changes affect which wiki in the presence of complex language fallback rules (e.g. nds->de->mul or even nastier stuff involving circular relations between language variants).
* Changes to labels, descriptions and aliases of items on wikidata will *probably* not influence the content of wikipedia pages. We could disallow access to these aspects of data items to make sure - this would be a shame, but not terrible. At least not for infoboxes. For automatically generated lists we'll need the labels at the very least.
* We could keep track of which properties of the data item are actually used on each page, and then only re-parse if those properties change. That would be quite a bit of data, and annoying to maintain, but possible. Whether this has a large impact on the need to re-parse remains to be seen, since it greatly depends on the infobox templates.
We can come up with increasingly complex rules for skipping rendering, but except perhaps for the sitelink changes, this seems brittle and confusing. I'd like to avoid it as much as possible.
And, when someone actually looks at the page, the page does get parsed/rendered right away, and the user sees the updated langlinks. So... what do we need the pre-parse-update-of-langlinks for? Where and when would they even be used? I don't see the point.
For language link updates in particular, you wouldn't have to update page_touched, so the page wouldn't have to be re-parsed.
If the languagelinks in the sidebar come from memcached and not the cached parser output, then yes.
We could get around this, but even then it would be an optimization for language links. But wikidata is soon going to provide data for infoboxes. Any aspect of a data item could be used in an {{#if:...}}. So we need to re-render the page whenever an item changes.
Wikidata is somewhere around 61000 physical lines of code now. Surely somewhere in that mountain of code, there is a class for the type of an item, where an update method can be added.
I don't understand what you are suggesting. At the moment, when EntityContent::save() is called, it will trigger a change notification, which is written to the wb_changes table. On the client side, a maintenance script polls that table. What could/should be changed about that?
I'm saying that you don't really need the client-side maintenance script, it can be done just with repo-side jobs. That would reduce the job insert rate by a factor of the number of languages, and make the task of providing low-latency updates to client pages somewhat easier.
So, instead of polling scripts running on each cluster, the polling script would run on the repo, and push... what exactly to the client wikis? We'd still need one job posted per client wiki, so that client wiki can figure out how to react to the change, no?
Or do you want the script on the repo to be smart enough to know this, and mess with the contents of the client wiki's memcache, page table, etc. directly? That sounds very scary, and also creates a bottleneck for updates (the single polling script).
I don't care where the polling scripts are running, that's pretty arbitrary. The idea is that each polling worker is configured to update a specific set of client wikis. Whether we use one process for every 10 wikis, or a single process for all of them, doesn't matter to me.
But I would like to avoid interfering with client wiki's internals directly, bypassing all well defined external interfaces. That's not just ugly, that's asking for trouble.
For language link updates, you just need to push to memcached, purge Squid and insert a row into recentchanges. For #property, you additionally need to update page_touched and construct a de-duplicated batch of refreshLinks jobs to be run on the client side on a daily basis.
refreshLinks jobs will re-parse the page, right?
So, we are just talking about not re-parsing when the sitelinks are changed? That can be done, but...
I'm wondering what percentage of changes that will be. Denny seems to think that this kind of edit will become relatively rare once the current state of the langlink graph has been imported into wikidata. The rate should roughly be the rate of creation of articles on all wikipedias combined.
Does this warrant a special case with nasty hacks to optimize for it?
I don't think it is feasible to parse pages very much more frequently than they are already parsed as a result of template updates (i.e. refreshLinks jobs).
I don't see why we would parse more frequently. An edit is an edit, locally or remotely. If you want a language link to be updated, the page needs to be reparsed, whether that is triggered by wikidata or a bot edit. At least, wikidata doesn't create a new revision.
Surely Wikidata will dramatically increase the amount of data available on the infoboxes of articles in small wikis, and improve the freshness of that data. If it doesn't, something must have gone terribly wrong.
Hm... so instead of bots updating the population value in the top 20 wikis, wikidata now pushes it to all 200 that have the respective article, causing a factor 10 increase in rendering?...
Note that the current system is inefficient, sometimes to the point of not working at all. When bot edits on zhwiki cause a job queue backlog 6 months long, or data templates cause articles to take a gigabyte of RAM and 15 seconds to render, I tell people "don't worry, I'm sure Wikidata will fix it". I still think we can deliver on that promise, with proper attention to system design.
Which is why we have been publishing our system design for about 8 months now :)
Wikidata can certainly help a lot with several things, but we won't get past re-rendering whenever anything that could be visible on the page changes.
The only way to reduce the number of changes that trigger re-rendering is to limit the aspects of data items that can be accessed from the wikitext: sitelinks? Labels/descriptions/aliases? Values in languages different from the wiki's content language? Changes to values marked as deprecated, and thus (usually) not shown? Changes to properties not currently used on the page, according to some tracking table?...
As I said earlier: we can come up with increasingly complex rules, which will get increasingly brittle. The question is now: how much do we need to do to make this efficient?
Of course, with template updates, you don't have to wait for the refreshLinks job to run before the new content becomes visible, because page_touched is updated and Squid is purged before the job is run. That may also be feasible with Wikidata.
We call Title::invalidateCache(). That ought to do it, right?
You would have to also call Title::purgeSquid(). But it's not efficient to use these Title methods when you have thousands of pages to purge, that's why we use HTMLCacheUpdate for template updates.
Currently we are still going page-by-page, so we should be using Title::purgeSquid(). But I agree that this should be changed to use bulk operations ASAP. Is there a bulk alternative to Title::invalidateCache(), or should we update the page table directly?
Denny: we should probably open a ticket for this. It's independent of other decisions around the data flow and update mechanisms.
I'm not talking about removing duplicate item edits, I'm talking about avoiding running multiple refreshLinks jobs for each client page. I thought refreshLinks was what Denny was talking about when he said "re-render", thanks for clearing that up.
Not sure how these things are fundamentally different. Coalescing/dupe removal can be done on multiple levels of course: once for the changes, and again (with larger batches/delays) for rendering.
Ok, so there would be a re-parse queue with duplicate removal. When a change notification is processed (after coalescing notifications), the target page is invalidated using Title::invalidateCache() and it's also placed in the re-parse queue to be processed later. How is this different from the job queue used for parsing after template edits?
There's no duplicate removal with template edits, and no 24-hour delay in updates to improve the effectiveness of duplicate removal.
Ok. But that would be relatively easy to do I think... would it not? And it would be nice to do it in a way that would also work for template edits etc.: basically, a re-render queue with dupe removal and low processing frequency (so the dupe removal has more impact).
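A very rough sketch of such a queue, just to illustrate the dedup idea (table name, column names and job parameters are invented; a real implementation would also need to handle races between marking and flushing):

// Hypothetical "dirty pages" buffer: marking a page twice is a no-op thanks
// to INSERT IGNORE; a low-frequency cron drains it into refreshLinks jobs.
function markPageDirty( $pageId ) {
    $dbw = wfGetDB( DB_MASTER );
    $dbw->insert( 'dirty_pages', array( 'dp_page' => $pageId ),
        __METHOD__, array( 'IGNORE' ) );
}

function flushDirtyPages() {
    $dbw = wfGetDB( DB_MASTER );
    $res = $dbw->select( 'dirty_pages', 'dp_page', '', __METHOD__ );

    $jobs = array();
    $pageIds = array();
    foreach ( $res as $row ) {
        $pageIds[] = $row->dp_page;
        $title = Title::newFromID( $row->dp_page );
        if ( $title ) {
            $jobs[] = new RefreshLinksJob( $title, array() );
        }
    }
    if ( $pageIds ) {
        Job::batchInsert( $jobs );
        $dbw->delete( 'dirty_pages', array( 'dp_page' => $pageIds ), __METHOD__ );
    }
}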
The change I'm suggesting is conservative. I'm not sure if it will be enough to avoid serious site performance issues. Maybe if we deploy Lua first, it will work.
One thing that we could do, and which would make life a LOT easier for me actually, is to strongly limit the functionality of the parser functions for accessing item data, so they only do the basics. These would be easier to predict and track. With Lua, we could offer more powerful functionality - which might lead to more frequent re-parsing, but since it's using Lua, that would not be so much of a problem.
You could load the data from memcached while the page is being parsed, instead of doing it in advance, similar to what we do for images.
How does it get into memcached? What if it's not there?
Push it into memcached when the item is changed. If it's not there on parse, load it from the repo slave DB and save it back to memcached.
That's not exactly the scheme that we use for images, but it's the scheme that Asher Feldman recommends that we use for future performance work. It can probably be made to work.
I'm fine with doing it that way. We'd basically have a getItemData function that checks memcached, and if the item isn't there, remotely accesses the repo's text table (and then ES) to load the data and push it into memcached.
How much data can we put into memcached, btw? I mean, we are talking about millions of items, several KB each... But yea, it'll help a lot with stuff that changes frequently. Which of course is the stuff where performance matters most.
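A minimal sketch of that getItemData function, assuming the item blobs are stored serialized and shared across client wikis via a foreign cache key (the repo access helper, key layout and expiry are placeholders):

// Sketch of getItemData(): memcached first, then the repo's text table / ES.
function getItemData( $itemId ) {
    global $wgMemc;

    // Foreign key so all client wikis share one cache entry per item.
    $key = wfForeignMemcKey( 'wikidatawiki', '', 'item-data', $itemId );
    $data = $wgMemc->get( $key );

    if ( $data === false ) {
        $blob = loadItemBlobFromRepo( $itemId ); // placeholder: text table / ES
        if ( $blob === false ) {
            return null; // no such item
        }
        $data = unserialize( $blob );
        $wgMemc->set( $key, $data, 3600 );
    }
    return $data;
}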
I think we have consensus wrt accessing the item data itself.
As to the item <-> page mapping... it would make bulk operations a lot easier and more efficient if we had that locally on each cluster. But optimizing bulk operations (say, by joining that table with the page table to set page_touched) is probably the only thing we need it for. If we don't need bulk operations involving the page table, we can do without it, and just access the mapping that exists on the repo.
If you prefer this approach, I'll work on it.
-- daniel
Hi Tim,
thank you for the input.
Wikidata unfortunately will not contain all language links: a wiki can locally overwrite the list (by extending the list, suppressing a link from Wikidata, or replacing a link). This is a requirement as not all language links are necessarily symmetric (although I wish they were). This means there is some interplay between the wikitext and the links coming from Wikidata. An update to the links coming from Wikidata can have different effects on the actually displayed language links depending on what is in the local wikitext.
Now, we could possibly also save the effects defined in the local wikitext (which links are suppressed, which are additionally locally defined) in the DB as well, and then, when the Wikidata links change externally, smartly combine the two and create the new correct list --- but this sounds like a lot of effort. It would potentially save cycles compared to today's situation. But the proposed solution does not *add* cycles compared to today's situation. Today, the bots that keep the language links in sync basically incur a re-rendering of the page anyway; we would not be adding any cost on top of that. We do not make matters worse with regards to server costs.
Also it would, as Daniel mentioned, be an optimization that only works for the language links. Once we add further data that will be available to the wikitext, this will not work at all anymore.
I hope this explains why we think that the re-rendering is helpful.
Having said that, here's an alternative scenario: Assuming we do not send any re-rendering jobs to the Wikipedias, what is the worst that would happen?
To answer that, I first need the answer to this question: do the Squids and caches hold their content indefinitely, or would the data, in the worst case, just be out of sync for, say, up to 24 hours on a Wikipedia article that didn't have an edit at all?
If we do not re-render, I assume editors will come up with their own workflows (e.g. changing some values in Wikidata, going to their home wiki and purging the affected page, or writing a script that gives them a "purge my home wiki page" link on Wikidata), which is fine, and still cheaper than if we initiate re-rendering of all pages every time. It just means that in some cases some pages will not be up to date.
So, we could go without re-rendering at all, if there is consensus that this is the preferred solution and that this is better than the solution we suggested.
Anyone having any comments, questions, or insights?
Cheers, Denny
2012/11/5 Tim Starling tstarling@wikimedia.org:
On 02/11/12 22:35, Denny Vrandečić wrote:
- For re-rendering the page, the wiki needs access to the data.
We are not sure how to do this best: have it per cluster, or in one place only?
Why do you need to re-render a page if only the language links are changed? Language links are only in the navigation area, the wikitext content is not affected.
As I've previously explained, I don't think the langlinks table on the client wiki should be updated. So you only need to purge Squid and add an entry to Special:RecentChanges.
Purging Squid can certainly be done from the context of a wikidatawiki job. For RecentChanges the main obstacle is accessing localisation text. You could use rc_params to store language-independent message parameters, like what we do for log entries.
-- Tim Starling
On 06.11.2012 17:01, Denny Vrandečić wrote:
So, we could go without re-rendering at all, if there is consensus that this is the preferred solution and that this is better than the solution we suggested.
Anyone having any comments, questions, or insights?
I already suggested that changes to the same page in the same batch could be coalesced together. The larger we make the batches, the more updates we optimize away, but the longer it takes until the Wikipedia pages update.
Once we implement this, I think polling frequency and batch size are good tuning knobs we can easily use to control the burden of re-rendering and other kinds of updates. But we would still have control over how and when changes get pushed.
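As an illustration of the coalescing step itself (structures invented; real coalescing would also merge the change info rather than just keep the newest entry):

// Within one polled batch, all changes affecting the same client page
// collapse into a single update.
function coalesceChanges( array $changes ) {
    $byPage = array();
    foreach ( $changes as $change ) {
        // $change = array( 'id' => ..., 'page' => ..., 'info' => ... )
        $pageId = $change['page'];
        if ( !isset( $byPage[$pageId] ) || $change['id'] > $byPage[$pageId]['id'] ) {
            $byPage[$pageId] = $change;
        }
    }
    return array_values( $byPage );
}

The larger the polled batch (i.e. the lower the polling frequency), the more entries collapse into one, which is exactly the trade-off described above.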
-- daniel