First off, TL;DR: (@Tim: did my best to summarize, please correct any misrepresentation.)
* Tim: don't re-parse when sitelinks change.
* Daniel: can be done, but do we really need to optimize for this case? Denny, can we get better figures on this?
* Daniel: how far do we want to limit the things we make available via parser functions and/or Lua binding? Could allow more with Lua (faster, and implementing complex functionality via parser functions is nasty anyway).
* Consensus: we want to coalesce changes before acting on them.
* Tim: we also want to avoid redundant rendering by removing duplicate render jobs (like multiple re-rendering of the same page) resulting from the changes.
* Tim: large batches (lower frequency) for re-rendering pages that are already invalidated would allow more dupes to be removed. (Pages would still be rendered on demand when viewed, but link tables would update later)
* Daniel: sounds good, but perhaps this should be a general feature of the re-render/linksUpdate job queue, so it's also used when templates get edited.
* Consensus: load items directly from ES (via remote access to the repo's text table), cache in memcached.
* Tim: Also get rid of local item <-> page mapping, just look each page up on the repo.
* Daniel: Ok, but then we can't optimize bulk ops involving multiple items.
* Tim: run the polling script from the repo, push to client wiki DBs directly.
* Daniel: that's scary, client wikis should keep control of how changes are handled.
Now, the nitty gritty:
On 08.11.2012 01:51, Tim Starling wrote:
On 07/11/12 22:56, Daniel Kinzler wrote:
As far as I can see, we then can get the updated language links before the page has been re-parsed, but we still need to re-parse eventually.
Why does it need to be re-parsed eventually?
For the same reason pages need to be re-parsed when templates change: because links may depend on the data items.
We are currently working on the assumption that *any* aspect of a data item is accessible via parser functions in the wikitext, and may thus influence any aspect of that page's parser output.
So, if *anything* about a data item changes, *anything* about the wikipedia page using it may change too. So that page needs to be re-parsed.
Maybe we'll be able to cut past the rendering for some cases, but for "normal" property changes, like a new value for the population of a country, all pages that use the respective data item need to be re-rendered soonish, otherwise the link tables (especially categories) will get out of whack.
So, let's think about what we *could* optimize:
* I think we could probably disallow access to wikidata sitelinks via parser functions in wikipedia articles. That would allow us to use an optimized data flow for changes to sitelinks (aka language links) which does not cause the page to be re-rendered.
* Maybe we can also avoid re-parsing pages on changes that apply only to languages that are not used on the respective wiki (let's call them unrelated translation changes). The tricky bit here is to figure out which language's changes affect which wiki in the presence of complex language fallback rules (e.g. nds->de->mul or even nastier stuff involving circular relations between language variants).
* Changes to labels, descriptions and aliases of items on wikidata will *probably* not influence the content of wikipedia pages. We could disallow access to these aspects of data items to make sure - this would be a shame, but not terrible. At least not for infoboxes. For automatically generated lists we'll need the labels at the very least.
* We could keep track of which properties of the data item are actually used on each page, and then only re-parse if those properties change. That would be quite a bit of data, and annoying to maintain, but possible (see the sketch after this list). Whether this has a large impact on the need to re-parse remains to be seen, since it greatly depends on the infobox templates.
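To make the last point concrete, here's a rough sketch of how per-page property usage could be tracked during parsing. The function name and the page_props key are invented for illustration; this is not the actual Wikibase code:

    // Illustrative only: record which item properties a page actually uses,
    // so that changes to other properties could skip the re-parse.
    function recordUsedProperty( Parser $parser, $propertyId ) {
        $out = $parser->getOutput();
        $used = $out->getProperty( 'wb_used_properties' );
        $list = $used === false ? array() : explode( '|', $used );
        if ( !in_array( $propertyId, $list ) ) {
            $list[] = $propertyId;
            // Saved to page_props by LinksUpdate, so a change dispatcher
            // could query which pages use a given property.
            $out->setProperty( 'wb_used_properties', implode( '|', $list ) );
        }
    }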
We can come up with increasingly complex rules for skipping rendering, but except perhaps for the sitelink changes, this seems brittle and confusing. I'd like to avoid it as much as possible.
And, when someone actually looks at the page, the page does get parsed/rendered right away, and the user sees the updated langlinks. So... what do we need the pre-parse-update-of-langlinks for? Where and when would they even be used? I don't see the point.
For language link updates in particular, you wouldn't have to update page_touched, so the page wouldn't have to be re-parsed.
If the languagelinks in the sidebar come from memcached and not the cached parser output, then yes.
We could get around this, but even then it would be an optimization for language links only. But wikidata is soon going to provide data for infoboxes. Any aspect of a data item could be used in an {{#if:...}}. So we need to re-render the page whenever an item changes.
Wikidata is somewhere around 61000 physical lines of code now. Surely somewhere in that mountain of code, there is a class for the type of an item, where an update method can be added.
I don't understand what you are suggesting. At the moment, when EntityContent::save() is called, it will trigger a change notification, which is written to the wb_changes table. On the client side, a maintenance script polls that table. What could/should be changed about that?
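For reference, the polling loop is essentially just this (stripped down to its shape; the wb_changes column names are from memory and may not match the actual schema exactly):

    // Minimal sketch of polling wb_changes from the client side; not the
    // actual maintenance script.
    $dbr = wfGetDB( DB_SLAVE, array(), $repoDbName ); // $repoDbName: placeholder
    $res = $dbr->select(
        'wb_changes',
        '*',
        array( 'change_id > ' . (int)$lastSeenId ),
        __METHOD__,
        array( 'ORDER BY' => 'change_id ASC', 'LIMIT' => 100 )
    );
    foreach ( $res as $change ) {
        // coalescing and client-side handling (cache update, invalidation,
        // recentchanges entry) would happen here
        $lastSeenId = $change->change_id;
    }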
I'm saying that you don't really need the client-side maintenance script, it can be done just with repo-side jobs. That would reduce the job insert rate by a factor of the number of languages, and make the task of providing low-latency updates to client pages somewhat easier.
So, instead of polling scripts running on each cluster, the polling script would run on the repo, and push... what exactly to the client wikis? We'd still need one job posted per client wiki, so that the client wiki can figure out how to react to the change, no?
Or do you want the script on the repo to be smart enough to know this, and mess with the contents of the client wiki's memcache, page table, etc directly? That sounds very scary, and also creates a bottleneck for updates (the single polling script).
I don't care where the polling scripts are running, that's pretty arbitrary. The idea is that each polling worker is configured to update a specific set of client wikis. Whether we use one process for every 10 wikis, or a single process for all of them, doesn't matter to me.
But I would like to avoid interfering with client wikis' internals directly, bypassing all well-defined external interfaces. That's not just ugly, it's asking for trouble.
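If it helps to make the "one worker per set of wikis" idea concrete, the configuration would amount to something like this (the setting name is made up, not an existing variable):

    // Purely illustrative: one dispatcher process per group of client wikis.
    $wgWBDispatchGroups = array(
        'worker-0' => array( 'enwiki', 'dewiki', 'frwiki' ),
        'worker-1' => array( 'nlwiki', 'itwiki', 'plwiki' ),
        // ...
    );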
For language link updates, you just need to push to memcached, purge Squid and insert a row into recentchanges. For #property, you additionally need to update page_touched and construct a de-duplicated batch of refreshLinks jobs to be run on the client side on a daily basis.
refreshLinks jobs will re-parse the page, right?
So, we are just talking about not re-parsing when the sitelinks are changed? That can be done, but...
I'm wondering what percentage of changes that will be. Denny seems to think that this kind of edit will become relatively rare once the current state of the langlink graph has been imported into wikidata. The rate should roughly be the rate of creation of articles on all wikipedias combined.
Does this warrant a special case with nasty hacks to optimize for it?
I don't think it is feasible to parse pages very much more frequently than they are already parsed as a result of template updates (i.e. refreshLinks jobs).
I don't see why we would parse more frequently. An edit is an edit, locally or remotely. If you want a language link to be updated, the page needs to be reparsed, whether that is triggered by wikidata or a bot edit. At least, wikidata doesn't create a new revision.
Surely Wikidata will dramatically increase the amount of data available on the infoboxes of articles in small wikis, and improve the freshness of that data. If it doesn't, something must have gone terribly wrong.
Hm... so instead of bots updating the population value in the top 20 wikis, wikidata now pushes it to all 200 that have the respective article, causing a factor 10 increase in rendering?...
Note that the current system is inefficient, sometimes to the point of not working at all. When bot edits on zhwiki cause a job queue backlog 6 months long, or data templates cause articles to take a gigabyte of RAM and 15 seconds to render, I tell people "don't worry, I'm sure Wikidata will fix it". I still think we can deliver on that promise, with proper attention to system design.
Which is why we have been publishing our system design for about 8 months now :)
Wikidata can certainly help a lot with several things, but we won't get past re-rendering whenever anything that could be visible on the page changes.
The only way to reduce the number of changes that trigger re-rendering is to limit the aspects of data items that can be accessed from the wikitext: sitelinks? label/description/aliases? Values in languages different from the wiki's content language? Changes to statements marked as deprecated, and thus (usually) not shown? Changes to properties not currently used on the page, according to some tracking table?...
As I said earlier: we can come up with increasingly complex rules, which will get increasingly brittle. The question is now: how much do we need to do to make this efficient?
Of course, with template updates, you don't have to wait for the refreshLinks job to run before the new content becomes visible, because page_touched is updated and Squid is purged before the job is run. That may also be feasible with Wikidata.
We call Title::invalidateCache(). That ought to do it, right?
You would have to also call Title::purgeSquid(). But it's not efficient to use these Title methods when you have thousands of pages to purge, that's why we use HTMLCacheUpdate for template updates.
Currently we are still going page-by-page, so we should be using Title::purgeSquid(). But I agree that this should be changed to use bulk operations ASAP. Is there a bulk alternative to Title::invalidateCache(), or should we update the page table directly?
Denny: we should probably open a ticket for this. It's independent of other decisions around the data flow and update mechanisms.
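For the record, the kind of bulk operation I have in mind, assuming we already know the affected page IDs and titles (this just mirrors what HTMLCacheUpdate does for backlinks, driven by our own list; a real implementation would batch and wait for slaves):

    // Sketch only: bulk page_touched update plus one batched Squid purge.
    $dbw = wfGetDB( DB_MASTER );
    $dbw->update(
        'page',
        array( 'page_touched' => $dbw->timestamp() ),
        array( 'page_id' => $pageIds ),
        __METHOD__
    );
    // One batched purge instead of per-title Title::purgeSquid() calls.
    $urls = array();
    foreach ( $titles as $title ) {
        $urls = array_merge( $urls, $title->getSquidURLs() );
    }
    SquidUpdate::purge( $urls );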
I'm not talking about removing duplicate item edits, I'm talking about avoiding running multiple refreshLinks jobs for each client page. I thought refreshLinks was what Denny was talking about when he said "re-render", thanks for clearing that up.
Not sure how these things are fundamentally different. Coalescing/dupe removal can be done on multiple levels of course: once for the changes, and again (with larger batches/delays) for rendering.
Ok, so there would be a re-parse queue with duplicate removal. When a change notification is processed (after coalescing notifications), the target page is invalidated using Title::invalidateCache() and it's also placed in the re-parse queue to be processed later. How is this different from the job queue used for parsing after template edits?
There's no duplicate removal with template edits, and no 24-hour delay in updates to improve the effectiveness of duplicate removal.
Ok. But that would be relatively easy to do, I think... would it not? And it would be nice to do it in a way that would also work for template edits, etc.: basically, a re-render queue with dupe removal and a low processing frequency (so the dupe removal has more impact).
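Something along these lines, I imagine (a sketch; $titles is assumed to be the list of affected client pages, and RefreshLinksJob already marks itself as removeDuplicates, so identical jobs can be collapsed before they run):

    // Queue one refreshLinks job per affected client page and let the job
    // queue's duplicate removal collapse repeats.
    $jobs = array();
    foreach ( $titles as $title ) {
        $jobs[] = new RefreshLinksJob( $title, array() );
    }
    Job::batchInsert( $jobs );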
The change I'm suggesting is conservative. I'm not sure if it will be enough to avoid serious site performance issues. Maybe if we deploy Lua first, it will work.
One thing that we could do, and which would make life a LOT easier for me actually, is to strongly limit the functionality of the parser functions for accessing item data, so they only do the basics. These would be easier to predict and track. With Lua, we could offer more powerful functionality - which might lead to more frequent re-parsing, but since it's using Lua, that would not be so much of a problem.
You could load the data from memcached while the page is being parsed, instead of doing it in advance, similar to what we do for images.
How does it get into memcached? What if it's not there?
Push it into memcached when the item is changed. If it's not there on parse, load it from the repo slave DB and save it back to memcached.
That's not exactly the scheme that we use for images, but it's the scheme that Asher Feldman recommends that we use for future performance work. It can probably be made to work.
I'm fine with doing it that way. We'd basically have a getItemdata function that checks memcached, and if the item isn't there, remotely accesses the repo's text table (and then ES) to load the data and push it into memcached.
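Roughly like this, I imagine (names are placeholders; the repo DB name, cache key and blob loading are hand-waved):

    // Cache-aside sketch for item access during parsing.
    function getItemData( $itemId ) {
        global $wgMemc;
        // A repo-wide key so all client wikis share the same cache entries;
        // 'repowiki' is a placeholder for the repo's DB name.
        $key = wfForeignMemcKey( 'repowiki', '', 'wikibase-item', $itemId );
        $data = $wgMemc->get( $key );
        if ( $data === false ) {
            // Miss: read the item blob from the repo's slave DB
            // (text table / External Storage) and write it back.
            $dbr = wfGetDB( DB_SLAVE, array(), 'repowiki' );
            $data = loadItemBlob( $dbr, $itemId ); // hypothetical helper
            $wgMemc->set( $key, $data, 86400 ); // 24h TTL, invalidated on change
        }
        return $data;
    }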
How much data can we put into memcached, btw? I mean, we are talking about millions of items, several KB each - that quickly adds up to many gigabytes. But yea, it'll help a lot with stuff that changes frequently. Which of course is the stuff where performance matters most.
I think we have consensus wrt accessing the item data itself.
As to the item <-> page mapping... it would make bulk operations a lot easier and more efficient if we had that locally on each cluster. But optimizing bulk operations (say, by joining that table with the page table to set page_touched) is probably the only thing we need it for. If we don't need bulk operations involving the page table, we can do without it, and just access the mapping that exists on the repo.
If you prefer this approach, I'll work on it.
-- daniel