From: "Mark Christensen" mchristensen@humantech.com To: wikitech-l@wikipedia.org Reply-To: wikitech-l@wikipedia.org
A quick start might be to temporarily disable all checking of links, and see if that helps much.
This seems to be a helpful suggestion. Without profiling, it's hard to tell where the bottleneck is, but I think link checking is a good guess.
Thanks very much!! I think measuring without link-checking would be great; it would certainly answer many questions. I don't have a machine I can test on, sadly; does someone else?
It's worth noting that link-checking not only causes additional processing - if link-checking is disabled, and user formatting is limited, many OTHER optimizations become easy. In particular, caching becomes really easy if article text doesn't depend on other state (i.e., doesn't require link-checking and processing to support fancy user options). For example, without link-checking, you don't have to follow lists to invalidate "related" caches. The most effective optimization is to do nothing at all :-).

In-filesystem caches of HTML fragments would make sense in such a situation, and Linux's sendfile() could do a rather impressive job of improving performance when sending cached article text. Allowing users to select which stylesheet to blast back at them would give users a limited amount of control, but seriously improve performance.
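To make the sendfile() idea concrete, here is a minimal sketch (Python, purely illustrative; the cache directory, file naming, and bare-socket HTTP serving are my assumptions, not anything that exists today):

import os
import socket

CACHE_DIR = "/var/cache/wiki/html"     # assumed location of cached fragments

def send_cached_article(conn: socket.socket, title: str) -> bool:
    """Serve a pre-rendered article straight from the filesystem cache
    using Linux's sendfile(); return False on a cache miss."""
    path = os.path.join(CACHE_DIR, title + ".html")
    try:
        fd = os.open(path, os.O_RDONLY)
    except FileNotFoundError:
        return False                   # cache miss: fall back to a full render
    try:
        size = os.fstat(fd).st_size
        header = ("HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n"
                  f"Content-Length: {size}\r\n\r\n").encode()
        conn.sendall(header)
        sent = 0
        while sent < size:             # sendfile() copies in-kernel, no userspace buffer
            sent += os.sendfile(conn.fileno(), fd, sent, size - sent)
        return True
    finally:
        os.close(fd)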
One problem with link-checking is that it's not as useful as you'd wish, anyway. After all, it only identifies existence. An article with only a tiny amount of content appears "complete" to the link-check, but clearly you'd want people to work on that article too! If disabling link-checking (and thereby enabling the optimizations it currently complicates) turns out to seriously improve performance, then I think it's an obvious capability to disable (at least as a configurable option).
My two cents, hope they help.
Hi David,
nice to see you here -- I enjoyed reading your Linux/OSS-related papers. I have to say that disabling link checking on the live Wikipedia, even for a short time, is hardly acceptable. It is essential for both readers and authors. For readers, having to check pages via "trial and error", esp. on link-heavy lists where anywhere from 10% to 90% of the articles may already be written, is very painful, aside from the irritation factor in normal articles. For authors, link checking is essential to tell them whether what they were trying to do worked -- is there an article at [[J.K. Rowling]], or at [[J. K. Rowling]], or at [[Joanne K. Rowling]] ...
Having a clear and simple distinction - blue = exists, red = doesn't exist - is one of the trademark signs of Wikipedia, helps people to understand the very idea of an open, growing encyclopedia, and is essential to its operation. It is, in fact, essential to all wikis.
However, the shared memory idea sounds very reasonable. It would be tremendously cool to store more than just 1/0 for each of these links in the hash, and to store the byte size instead (we have to update the hash on save anyway). Then we could use the stub checking feature without a bad conscience (we currently have it disabled by default for performance reasons), which adds a third link state for very short articles.
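To illustrate what that might look like (a rough sketch, not actual code from the codebase -- the threshold, the names, and the plain dict standing in for the shared-memory hash are all assumptions):

STUB_THRESHOLD = 256          # assumed cut-off, in bytes, for "stub" styling

link_sizes = {}               # title -> byte size of the article text (0 = missing)

def record_save(title, text):
    """Update the hash whenever an article is saved."""
    link_sizes[title] = len(text.encode("utf-8"))

def link_state(title):
    """Return 'missing', 'stub' or 'exists' for rendering the link."""
    size = link_sizes.get(title, 0)
    if size == 0:
        return "missing"      # render as a "doesn't exist" link
    if size < STUB_THRESHOLD:
        return "stub"         # the third link state for very short articles
    return "exists"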
Regards,
Erik
Erik Moeller wrote:
nice to see you here -- I enjoyed reading your Linux/OSS-related papers. I have to say that disabling link checking on the live Wikipedia, even for a short time, is hardly acceptable. It is essential for both readers and authors.
I know I expressed sympathy to disabling link checking yesterday or the day before, but several developers have come out against it, and I'm now swayed to the view that it's an essential, not a frill.
I now wonder about our usage patterns and whether it would be best to update a cache at edit-time, rather than at read-time.
Changes in 'existence' (yes or no) come up infrequently. When someone creates a brand new article, all other cached articles that are affected by the change could be updated at that time. So a hundred times a day (if that), we have to do a fancy cache update to change affected other articles. But most edits don't affect other articles, because they are edits to already existing pages.
Just an idea...
Jimmy Wales wrote:
Erik Moeller wrote:
nice to see you here -- I enjoyed reading your Linux/OSS-related papers. I have to say that disabling link checking on the live Wikipedia, even for a short time, is hardly acceptable. It is essential for both readers and authors.
I know I expressed sympathy to disabling link checking yesterday or the day before, but several developers have come out against it, and I'm now swayed to the view that it's an essential, not a frill.
I now wonder about our usage patterns and whether it would be best to update a cache at edit-time, rather than at read-time.
Changes in 'existence' (yes or no) come up infrequently. When someone creates a brand new article, all other cached articles that are affected by the change could be updated at that time. So a hundred times a day (if that), we have to do a fancy cache update to change affected other articles. But most edits don't affect other articles, because they are edits to already existing pages.
More than that, you don't have to re-generate the cached pages; you only have to invalidate them. Thus, when a page is created, you have to "touch" all pages with a link to that previously nonexistent article, and when a page is deleted, you do the same to all pages that link to that article.
This is good, because page access patterns are so sparse: many pages go days or weeks without being accessed, but traffic is high because there are so many pages in aggregate.
By not updating articles until they are accessed, you can defer a lot of work that would otherwise bog the system down at update time. Lazy evaluation is much nicer: when the page is demanded, the code should first look for a cached page, and generate it if necessary, before using that data to generate the article output.
Note that editing the page itself would go through the same code-path: just store the new content, invalidate the cached page, and then serve the page, forcing a re-render.
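In rough Python, the bookkeeping above might look something like this (illustrative only; the plain dicts stand in for the real backlink table, cache and article store):

html_cache = {}   # title -> cached rendered article body
backlinks = {}    # title -> set of page titles that link to it
wikitext = {}     # title -> stored wikitext

def invalidate(title):
    html_cache.pop(title, None)          # cheap: just drop the stale rendering

def on_article_created_or_deleted(title):
    # pages linking to `title` now render that link differently, so their
    # cached HTML is stale; "touch" (invalidate) them, don't re-render them
    for page in backlinks.get(title, ()):
        invalidate(page)

def on_article_edited(title, new_text):
    wikitext[title] = new_text           # store the new content
    invalidate(title)                    # the next view forces a re-render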
What can be cached:
* wiki parsing
* link lookup
* article content HTML generation

What can't be cached:
* page skin (changes per user)
* user details (ditto)
* menu links (ditto)
* things like {{NUMBEROFARTICLES}}
However, applying these as a final pass should be much cheaper than current page serving.
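A sketch of that read path (again Python, purely illustrative; the render and skin functions are trivial stand-ins for the real expensive and cheap steps respectively):

page_cache = {}   # title -> cached article-body HTML

def render_body(wikitext):
    # stand-in for the expensive part: wiki parsing, link lookup, HTML generation
    return "<div class='article'>" + wikitext + "</div>"

def apply_skin(body_html, user_name, article_count):
    # the cheap, uncacheable final pass: skin, user details, live counters
    return ("<html><body><p>Logged in as %s; %d articles</p>%s</body></html>"
            % (user_name, article_count, body_html))

def view_article(title, wikitext_store, user_name, article_count):
    body = page_cache.get(title)
    if body is None:                     # lazy regeneration on a cache miss
        body = render_body(wikitext_store[title])
        page_cache[title] = body
    return apply_skin(body, user_name, article_count)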
Potential downsides:
* some articles are linked from thousands of pages (like the pages linked by auto-generated articles). Hitting these could cause a significant pause as the cache is invalidated -- or will it?
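For what it's worth, if invalidation really is only a "touch" per linking page, the worst case is bounded by one cheap filesystem (or DB) operation per backlink, with no re-rendering at all -- something like the following (paths and names made up):

import os

CACHE_DIR = "/var/cache/wiki/html"       # assumed on-disk cache location

def invalidate_backlinks(backlink_titles):
    # one unlink per linking page; no parsing or rendering happens here, so
    # even thousands of backlinks amount to filesystem metadata work only
    for title in backlink_titles:
        try:
            os.unlink(os.path.join(CACHE_DIR, title + ".html"))
        except FileNotFoundError:
            pass                         # not cached; nothing to do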
-- Neil