On 07/11/12 22:56, Daniel Kinzler wrote:
As far as I can see, we then can get the updated
language links before the page
has been re-parsed, but we still need to re-parse eventually.
Why does it need to be re-parsed eventually?
And, when someone
actually looks at the page, the page does get parsed/rendered right away, and
the user sees the updated langlinks. So... what do we need the
pre-parse-update-of-langlinks for? Where and when would they even be used? I
don't see the point.
For language link updates in particular, you wouldn't have to update
page_touched, so the page wouldn't have to be re-parsed.
We could get around this, but even then it would be an
optimization for language
links. But Wikidata is soon going to provide data for infoboxes. Any aspect of a
data item could be used in an {{#if:...}}. So we need to re-render the page
whenever an item changes.
Wikidata is somewhere around 61000 physical lines of code now. Surely
somewhere in that mountain of code, there is a class for the type of
an item, where an update method can be added.
I don't understand what you are suggesting. At the moment, when
EntityContent::save() is called, it will trigger a change notification, which is
written to the wb_changes table. On the client side, a maintenance script polls
that table. What could/should be changed about that?
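(For reference, that polling step amounts to something like the sketch
below; the column names and the position bookkeeping are illustrative
rather than the exact wb_changes schema.)

    $dbr = wfGetDB( DB_SLAVE );
    $lastSeenId = 0; // in reality read from the poller's saved position
    $res = $dbr->select(
        'wb_changes',
        '*',
        array( 'change_id > ' . intval( $lastSeenId ) ),
        __METHOD__,
        array( 'ORDER BY' => 'change_id' )
    );
    foreach ( $res as $row ) {
        // dispatch the change to the affected client pages
    }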
I'm saying that you don't really need the client-side maintenance
script, it can be done just with repo-side jobs. That would reduce the
job insert rate by a factor of the number of languages, and make the
task of providing low-latency updates to client pages somewhat easier.
For language link updates, you just need to push to memcached, purge
Squid and insert a row into recentchanges. For #property, you
additionally need to update page_touched and construct a de-duplicated
batch of refreshLinks jobs to be run on the client side on a daily basis.
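Roughly, such a repo-side job could look like the sketch below.
Everything other than the core Title and memcached calls is an
assumption, and the cross-wiki details (selecting the client wiki's
database, writing its recentchanges row) are glossed over.

    class ClientSitelinkUpdateJob extends Job {
        // Hypothetical job: carries the client page title, the new
        // sitelinks and a shared memcached key in $this->params.
        public function run() {
            global $wgMemc;

            // 1. Push the new sitelinks to the shared cache so the next
            //    parse of the client page picks them up directly.
            $wgMemc->set( $this->params['memcKey'],
                $this->params['sitelinks'], 86400 );

            // 2. Purge Squid for the client page so readers see the change.
            $title = Title::newFromText( $this->params['clientTitle'] );
            if ( $title ) {
                $title->purgeSquid();
            }

            // 3. Insert a row into the client's recentchanges (omitted;
            //    it would go through the RecentChange class).

            // For #property changes we would additionally call
            // $title->invalidateCache() and queue a de-duplicated
            // refreshLinks job, as described above.
            return true;
        }
    }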
I don't
think it is feasible to parse pages very much more frequently
than they are already parsed as a result of template updates (i.e.
refreshLinks jobs).
I don't see why we would parse more frequently. An edit is an edit, locally or
remotely. If you want a language link to be updated, the page needs
to be reparsed, whether that is triggered by Wikidata or a bot edit. At least
Wikidata doesn't create a new revision.
Surely Wikidata will dramatically increase the amount of data
available in the infoboxes of articles in small wikis, and improve the
freshness of that data. If it doesn't, something must have gone
terribly wrong.
Note that the current system is inefficient, sometimes to the point of
not working at all. When bot edits on zhwiki cause a job queue backlog
6 months long, or data templates cause articles to take a gigabyte of
RAM and 15 seconds to render, I tell people "don't worry, I'm sure
Wikidata will fix it". I still think we can deliver on that promise,
with proper attention to system design.
Of course,
with template updates, you don't have to wait for the
refreshLinks job to run before the new content becomes visible,
because page_touched is updated and Squid is purged before the job is
run. That may also be feasible with Wikidata.
We call Title::invalidateCache(). That ought to do it, right?
You would have to also call Title::purgeSquid(). But it's not
efficient to use these Title methods when you have thousands of pages
to purge, that's why we use HTMLCacheUpdate for template updates.
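In code, the two paths are roughly (the titles are placeholders):

    // Per-page purge, fine for a handful of titles:
    $title = Title::newFromText( 'Some client page' );
    $title->invalidateCache();   // bumps page_touched
    $title->purgeSquid();        // purges the Squid copy

    // Batch purge of everything that uses a given template, the way
    // template updates do it:
    $templateTitle = Title::newFromText( 'Template:Example' );
    $update = new HTMLCacheUpdate( $templateTitle, 'templatelinks' );
    $update->doUpdate();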
Sitelinks (language links), too, can be accessed via parser functions
and used in conditionals.
Presumably that would be used fairly rarely. You could track it
separately, or remove the feature, in order to provide efficient
language link updates as I described.
The reason I
think duplicate removal is essential is because entities
will be updated in batches. For example, a census in a large country
might result in hundreds of thousands of item updates.
Yes, but for different items. How can we remove any duplicate updates if there
is just one edit per item? Why would there be multiple?
I'm not talking about removing duplicate item edits, I'm talking about
avoiding running multiple refreshLinks jobs for each client page. I
thought refreshLinks was what Denny was talking about when he said
"re-render", thanks for clearing that up.
Ok, so there would be a re-parse queue with duplicate
removal. When a change
notification is processed (after coalescing notifications), the target page is
invalidated using Title::invalidateCache() and it's also placed in the re-parse
queue to be processed later. How is this different from the job queue used for
parsing after template edits?
There's no duplicate removal with template edits, and no 24-hour delay
in updates to improve the effectiveness of duplicate removal.
It's the same problem, it's just that the current system for template
edits is cripplingly inefficient and unscalable. So I'm bringing up
these performance ideas before Wikidata increases the edit rate by a
factor of 10.
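Concretely, a de-duplicated daily batch could be as simple as buffering
the affected client page IDs in a table and flushing it once a day. In
the sketch below the buffer table name is made up; the rest is ordinary
job queue usage.

    // When an item used on a client page changes, remember the page ID.
    // (The table name 'wbc_pending_refresh' is hypothetical.)
    $dbw = wfGetDB( DB_MASTER );
    $dbw->replace( 'wbc_pending_refresh', array( 'pr_page' ),
        array( array( 'pr_page' => $pageId ) ) );

    // Daily flush: one refreshLinks job per distinct page.
    $dbr = wfGetDB( DB_SLAVE );
    $res = $dbr->select( 'wbc_pending_refresh', 'pr_page', '', __METHOD__ );
    $jobs = array();
    foreach ( $res as $row ) {
        $title = Title::newFromID( $row->pr_page );
        if ( $title ) {
            $jobs[] = new RefreshLinksJob( $title, array() );
        }
    }
    Job::batchInsert( $jobs );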
The change I'm suggesting is conservative. I'm not sure if it will be
enough to avoid serious site performance issues. Maybe if we deploy
Lua first, it will work.
Also, when the page is edited manually, and then rendered, the wiki needs to
somehow know a) which item ID is associated with this page and b) that it needs to
load the item data to be able to render the page (just the language links, or
also infobox data, or eventually also the result of a Wikidata query as a list).
You could load the data from memcached while the page is being parsed,
instead of doing it in advance, similar to what we do for images.
How does it get into memcached? What if it's not there?
Push it into memcached when the item is changed. If it's not there on
parse, load it from the repo slave DB and save it back to memcached.
That's not exactly the scheme that we use for images, but it's the
scheme that Asher Feldman recommends that we use for future
performance work. It can probably be made to work.
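The read side of that is the usual cache-aside pattern, something like
the sketch below; the key construction and the loader function are
assumptions, and the key would have to be shared between the repo and
its clients.

    // Repo side: push the new item data when the item is edited.
    $key = wfMemcKey( 'wikibase-item', $itemId ); // shared-key details glossed over
    $wgMemc->set( $key, $itemData, 86400 );

    // Client side, during the parse: cache first, repo slave DB as fallback.
    $itemData = $wgMemc->get( $key );
    if ( $itemData === false ) {
        $itemData = loadItemFromRepoSlave( $itemId ); // hypothetical loader
        $wgMemc->set( $key, $itemData, 86400 );
    }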
-- Tim Starling