On 07/11/12 22:56, Daniel Kinzler wrote:
As far as I can see, we then can get the updated language links before the page has been re-parsed, but we still need to re-parse eventually.
Why does it need to be re-parsed eventually?
And, when someone actually looks at the page, the page does get parsed/rendered right away, and the user sees the updated langlinks. So... what do we need the pre-parse-update-of-langlinks for? Where and when would they even be used? I don't see the point.
For language link updates in particular, you wouldn't have to update page_touched, so the page wouldn't have to be re-parsed.
We could get around this, but even then it would only be an optimization for language links. But Wikidata is soon going to provide data for infoboxes, and any aspect of a data item could be used in an {{#if:...}}, so we need to re-render the page whenever an item changes.
Wikidata is somewhere around 61000 physical lines of code now. Surely somewhere in that mountain of code, there is a class for the type of an item, where an update method can be added.
I don't understand what you are suggesting. At the moment, when EntityContent::save() is called, it will trigger a change notification, which is written to the wb_changes table. On the client side, a maintenance script polls that table. What could/should be changed about that?
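Roughly, the repo-side half is just an insert into that table when the item is saved; something like this (the column names are from memory, so treat them as approximate):

    // On the repo, after the item has been stored:
    $dbw = wfGetDB( DB_MASTER );
    $dbw->insert( 'wb_changes', array(
        'change_type'      => 'wikibase-item~update',
        'change_time'      => $dbw->timestamp( wfTimestampNow() ),
        'change_object_id' => $item->getId(),            // numeric item ID
        'change_info'      => serialize( $changeInfo ),  // serialized diff of the edit
    ), __METHOD__ );

The client-side script then polls wb_changes for rows it hasn't processed yet and dispatches updates to the affected pages.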
I'm saying that you don't really need the client-side maintenance script, it can be done just with repo-side jobs. That would reduce the job insert rate by a factor of the number of languages, and make the task of providing low-latency updates to client pages somewhat easier.
For language link updates, you just need to push to memcached, purge Squid and insert a row into recentchanges. For #property, you additionally need to update page_touched and construct a de-duplicated batch of refreshLinks jobs to be run on the client side on a daily basis.
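To make that concrete, a repo-side job for a plain language link change could look something like this (the class name, memcached key and job params are invented, and I'm glossing over the cross-wiki plumbing):

    class ClientLangLinkJob extends Job {   // hypothetical job class
        public function __construct( Title $title, array $params ) {
            parent::__construct( 'clientLangLink', $title, $params );
        }

        public function run() {
            global $wgMemc;

            // Push the new language links to memcached so the next render
            // of the client page can pick them up.
            $key = wfMemcKey( 'wikibase', 'langlinks', $this->title->getArticleID() );
            $wgMemc->set( $key, $this->params['langlinks'], 86400 );

            // Purge Squid so readers see the updated links...
            $this->title->purgeSquid();

            // ...and insert the recentchanges row (details omitted here).

            // Note there is no $this->title->invalidateCache(): for a pure
            // language link change, page_touched stays as it is and the page
            // is not re-parsed.  A #property change would additionally bump
            // page_touched and feed a de-duplicated refreshLinks batch.
            return true;
        }
    }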
I don't think it is feasible to parse pages very much more frequently than they are already parsed as a result of template updates (i.e. refreshLinks jobs).
I don't see why we would parse more frequently. An edit is an edit, locally or remotely: if you want a language link to be updated, the page needs to be reparsed, whether that is triggered by Wikidata or by a bot edit. At least Wikidata doesn't create a new revision the way a bot edit does.
Surely Wikidata will dramatically increase the amount of data available in the infoboxes of articles on small wikis, and improve the freshness of that data. If it doesn't, something must have gone terribly wrong.
Note that the current system is inefficient, sometimes to the point of not working at all. When bot edits on zhwiki cause a job queue backlog 6 months long, or data templates cause articles to take a gigabyte of RAM and 15 seconds to render, I tell people "don't worry, I'm sure Wikidata will fix it". I still think we can deliver on that promise, with proper attention to system design.
Of course, with template updates, you don't have to wait for the refreshLinks job to run before the new content becomes visible, because page_touched is updated and Squid is purged before the job is run. That may also be feasible with Wikidata.
We call Title::invalidateCache(). That ought to do it, right?
You would have to also call Title::purgeSquid(). But it's not efficient to use these Title methods when you have thousands of pages to purge, that's why we use HTMLCacheUpdate for template updates.
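For comparison, the per-title calls versus the batched update we already do for template edits (a Wikidata client would presumably get its own backlink table in place of templatelinks):

    // Fine for a handful of pages:
    $title->invalidateCache();   // bumps page_touched
    $title->purgeSquid();        // sends the Squid purges

    // For thousands of backlinked pages, batch it via the links table,
    // the way template edits do:
    $update = new HTMLCacheUpdate( $templateTitle, 'templatelinks' );
    $update->doUpdate();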
Sitelinks (language links) can also be accessed via parser functions and used in conditionals.
Presumably that would be used fairly rarely. You could track it separately, or remove the feature, in order to provide efficient language link updates as I described.
The reason I think duplicate removal is essential is because entities will be updated in batches. For example, a census in a large country might result in hundreds of thousands of item updates.
Yes, but for different items. How can we remove any duplicate updates if there is just one edit per item? Why would there be multiple?
I'm not talking about removing duplicate item edits; I'm talking about avoiding running multiple refreshLinks jobs for each client page. I thought refreshLinks was what Denny was talking about when he said "re-render". Thanks for clearing that up.
Ok, so there would be a re-parse queue with duplicate removal. When a change notification is processed (after coalescing notifications), the target page is invalidated using Title::invalidateCache() and it's also placed in the re-parse queue to be processed later. How is this different from the job queue used for parsing after template edits?
There's no duplicate removal with template edits, and no 24-hour delay in updates to improve the effectiveness of duplicate removal.
It's the same problem, it's just that the current system for template edits is cripplingly inefficient and unscalable. So I'm bringing up these performance ideas before Wikidata increases the edit rate by a factor of 10.
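The duplicate removal I have in mind is nothing fancy; keyed on page ID it's just something like this (the change accessor is hypothetical):

    // Collect every client page touched by a day's worth of item changes,
    // keyed by page ID so each page is queued at most once.
    $pages = array();
    foreach ( $changes as $change ) {
        foreach ( $change->getAffectedPageIds() as $pageId ) { // hypothetical accessor
            $pages[$pageId] = true;
        }
    }

    // Then turn the de-duplicated set into a single batch of refreshLinks jobs.
    $jobs = array();
    foreach ( array_keys( $pages ) as $pageId ) {
        $title = Title::newFromID( $pageId );
        if ( $title ) {
            $jobs[] = new RefreshLinksJob( $title, array() );
        }
    }
    Job::batchInsert( $jobs );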
The change I'm suggesting is conservative. I'm not sure if it will be enough to avoid serious site performance issues. Maybe if we deploy Lua first, it will work.
Also, when the page is edited manually and then rendered, the wiki somehow needs to a) know which item ID is associated with this page, and b) load the item data to be able to render the page (just the language links, or also infobox data, or eventually also the result of a Wikidata query as a list).
You could load the data from memcached while the page is being parsed, instead of doing it in advance, similar to what we do for images.
How does it get into memcached? What if it's not there?
Push it into memcached when the item is changed. If it's not there on parse, load it from the repo slave DB and save it back to memcached.
That's not exactly the scheme that we use for images, but it's the scheme that Asher Feldman recommends that we use for future performance work. It can probably be made to work.
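In outline, with the key name and the repo lookup as stand-ins:

    // Repo side, when an item changes: push the fresh data into memcached.
    $wgMemc->set( $key, $itemData, 86400 );

    // Client side, while parsing: try the cache first, fall back to the
    // repo's slave DB on a miss, and write the result back.
    $itemData = $wgMemc->get( $key );
    if ( $itemData === false ) {
        $itemData = loadItemFromRepoSlave( $itemId ); // stand-in for the real lookup
        $wgMemc->set( $key, $itemData, 86400 );
    }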
-- Tim Starling