I discussed some ideas a few days ago on #wikimedia-tech, here's a summary email about that.
overall goals:
- preserve all wikipedia functionality
- make it more resistant to scaling forces
- highly modular - reduce interdependence
- identify core "must have" functionality - push everything else out to the edge as far as possible
- cache as much as possible at every level - design caching into the system
Five logical components I'm focusing on:
- the article (all articles = the content)
- a content cache
- authentication server(s)
- UI server(s)
- squid cache(s)
As I understand the architecture today, several of these functions are currently being performed monolithically by the appservers... so in part I'm proposing a refactoring where the core functions (article, content cache, authentication) are protected from traffic by a ring of "defenses" in the UI servers and squid caches. Here's a breakdown of each layer:
ARTICLE
- for storing the content
The unit of content is a 3-tuple: {wikitext, red coloured links, templates}
- each time I am edited:
--- change my content
--- if I'm a new/moved article, change colour of links in articles that reference me
--- if I'm a template, change each article that uses me
- goal:
--- when I'm edited, propagate those changes as *efficiently* as possible to my fellow articles
--- insist that when I'm changed, directly or indirectly, I am only read *once* by the content cache
--- insist that I'm only changed by the authentication server
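To make that concrete, here's a rough Python sketch of the kind of thing I mean. Every name in it (Article, store, dirty, on_edit) is invented for illustration, and the scans over the store are just the naive version:

    # Hypothetical sketch, not existing MediaWiki code: the article as a
    # 3-tuple and the propagation that ought to happen on an edit.
    class Article:
        def __init__(self, title, wikitext, red_links, templates):
            self.title = title            # article name
            self.wikitext = wikitext      # raw wiki markup
            self.red_links = red_links    # set of linked titles that don't exist yet
            self.templates = templates    # set of templates transcluded here

    store = {}     # title -> Article; stands in for the article storage
    dirty = set()  # titles the content cache must re-read, exactly once each

    def on_edit(title, new_wikitext, red_links, templates):
        is_new = title not in store
        store[title] = Article(title, new_wikitext, red_links, templates)
        dirty.add(title)                          # I changed directly

        if is_new:
            # My title just came into existence: links pointing at me turn blue.
            for other in store.values():
                if title in other.red_links:
                    other.red_links.discard(title)
                    dirty.add(other.title)        # changed indirectly
        if title.startswith("Template:"):
            # I'm a template: everyone transcluding me changed indirectly.
            for other in store.values():
                if title in other.templates:
                    dirty.add(other.title)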
CONTENT CACHE
- for caching articles for browsing
- goal:
--- I only hit an article *once* for each change to that article
--- no one else ever *reads* from the articles but me
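Continuing the same rough sketch (reusing the store and dirty set from above), the content cache's contract might look something like this; again, all names are illustrative:

    # Same sketch, continued: the content cache re-reads an article at most
    # once per change (via the dirty set) and nothing else reads the store.
    rendered = {}   # title -> rendered HTML fragment

    def render(wikitext):
        return "<p>%s</p>" % wikitext       # placeholder for real wikitext -> HTML

    def get_rendered(title):
        if title in dirty or title not in rendered:
            article = store[title]          # the single read for this change
            rendered[title] = render(article.wikitext)
            dirty.discard(title)
        return rendered[title]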
AUTHENTICATION SERVER
- for authenticating users for editing
- goal:
--- I'm only involved when you have to be *certain* of a user's ID
--- that is, first log-in, and when they submit an edit
--- no one else ever *writes* to the articles but me (once I've ID'd the user)
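In the same hypothetical sketch, the authentication server's contract could be as simple as: it holds the sessions and is the only code path that calls on_edit (check_password is a stub, obviously):

    # Same sketch, continued: the auth server holds sessions and is the only
    # code path that ever writes articles (by calling on_edit above).
    sessions = {}   # session token -> username, issued at first log-in

    def check_password(username, password):
        return True                         # stub for illustration only

    def log_in(username, password, token):
        if check_password(username, password):
            sessions[token] = username
            return True
        return False

    def submit_edit(token, title, new_wikitext, red_links, templates):
        username = sessions.get(token)
        if username is None:
            raise PermissionError("not logged in, edit refused")
        # Identity confirmed; this is the only writer to the article store.
        on_edit(title, new_wikitext, red_links, templates)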
UI SERVER
- for serving up HTML pages
- goal:
--- for browsing, I read from the content cache, add user dressing, and serve
--- for submitting edits, I send them to the authentication server
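Still in the same illustrative sketch, a UI server request path then reduces to "read from the content cache and dress it up" for browsing, and "hand it to the auth server" for edits:

    # Same sketch, continued: a UI server request path.  Browsing reads only
    # from the content cache; edits are handed straight to the auth server.
    def handle_request(method, title, user=None, token=None, form=None):
        if method == "GET":
            body = get_rendered(title)               # content cache, never the store
            skin = ("logged in as %s" % user) if user else "anonymous"
            return "<html><div>%s</div>%s</html>" % (skin, body)
        if method == "POST":
            submit_edit(token, title, form["wikitext"],
                        form.get("red_links", set()), form.get("templates", set()))
            return "edit accepted"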
tricks:
- could get into tricks with javascript, IFRAMEs, whatever to push work farther to the edge
- could create a distributed UI server system that can be replicated and run by universities, etc.
SQUID CACHE
- especially for non-logged-in users
- goal:
--- remove browsing load from the UI server
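For instance, the UI server could mark anonymous page views as shared-cacheable and logged-in views as private, so the squids absorb the anonymous load. The header values below are illustrative, not a tuned policy:

    # Illustrative only: the UI server marks anonymous views as cacheable by
    # the squids and logged-in views as private, so shared caches never serve
    # someone else's user dressing.
    def cache_headers(user):
        if user is None:
            # Anonymous view: let the squids keep it until it's purged on edit.
            return {"Cache-Control": "public, s-maxage=86400"}
        # Logged-in view carries user dressing, so it must not be shared.
        return {"Cache-Control": "private, no-cache"}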
On 11/25/05, S. Woodside <sbwoodside@yahoo.com> wrote:
> ARTICLE
> - for storing the content
> The unit of content is a 3-tuple: {wikitext, red coloured links, templates}
> - each time I am edited:
> --- change my content
> --- if I'm a new/moved article, change colour of links in articles that reference me
> --- if I'm a template, change each article that uses me
> - goal:
> --- when I'm edited, propagate those changes as *efficiently* as possible to my fellow articles
> --- insist that when I'm changed, directly or indirectly, I am only read *once* by the content cache
> --- insist that I'm only changed by the authentication server
And so you perform a linear scan of all articles' redlinks to find the ones you must remove, and a reparse of all articles to find the redlinks you must add, every time there is a move or delete?
Moves only decrease redlinks, but deletes must be handled as well.
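To be concrete about what that implies: you'd want a reverse link index so a move or delete only touches the articles that actually point at the title, rather than scanning everything. A rough illustrative sketch of the shape of that index (not anything that exists in the proposal):

    # Illustrative sketch of the alternative: a reverse link index, so a move
    # or delete only touches the articles that actually point at the title.
    backlinks = {}   # linked-to title -> set of titles that link to it

    def record_links(title, linked_titles):
        for target in linked_titles:
            backlinks.setdefault(target, set()).add(title)

    def titles_to_recolour(title):
        # On create/move-in, these go blue; on delete, the same set goes red.
        return backlinks.get(title, set())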
We have enough people dreaming up ideas, myself included. Show us the code, and the benchmarks.