After getting back into wikiland, catching up with wikipedia-l was pretty easy, but catching up with the wikitech list took a little longer. It seems you guys have had interesting times lately (in the Chinese-curse sense). Sorry I abandoned you, but you guys do seem to have risen to the challenge.
Magnus did a great service by giving us code with features that made Wikipedia usable and popular. When that code bogged down to the point where the wiki became nearly unusable, there wasn't much time to sit down and properly architect and develop a solution, so I just reorganized the existing architecture for better performance and hacked all the code. This got us over the immediate crisis, but now my code is bogging down, and we are having to remove useful features to keep performance up.
I think it's time for Phase IV. We need to sit down and design an architecture that will allow us to grow without constantly putting out fires, and that can become a stable base for a fast, reliable Wikipedia in years to come. I'm now available and equipped to help in this, but I thought I'd start out by asking a few questions here and making a few suggestions.
* Question 1: How much time do we have?
Can we estimate how long we'll be able to limp along with the current code, adding performance hacks and hardware to keep us going? If it's a year, that will give us certain opportunities and guide some choices; if it's only a month or two, that will constrain a lot of those choices.
* Suggestion 1: The test suite.
I think the most critical piece of code to develop right now is a comprehensive test suite, because it enables so much else. For example, if we have a performance question, I can set up one version of the wiki code on my test server, run the suite to get timing data, tweak the code, then run the suite again to get new timings. The suite's pass/fail results will tell us whether anything broke, and the timings will tell us whether we're on the right track. This will be useful even during the limp-along-with-the-current-code phase, and it will also let us refactor code safely. I have a three-machine network at home, one machine of which I plan to dedicate entirely to wiki code testing, plus my test server in San Antonio that we can use. I'd like to use something like Latka for the suite (see http://jakarta.apache.org/commons/latka/index.html ).
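To make that concrete, here is a rough sketch (in Python, purely illustrative; the URL, page titles, and expected strings are made up, and a real suite -- driven by Latka or anything else -- would cover far more cases) of the kind of timed, pass/fail run I have in mind:

    #!/usr/bin/env python
    # Illustrative timing/regression harness -- not a real suite.
    # Base URL, page titles and expected strings below are invented.

    import time
    import urllib.request

    BASE_URL = "http://localhost/wiki/"   # hypothetical test install

    # Each case: (page title, string that must appear in the response)
    CASES = [
        ("Main_Page", "Wikipedia"),
        ("Special:Recentchanges", "Recent changes"),
    ]

    def run_suite():
        failures = 0
        total = 0.0
        for title, expected in CASES:
            start = time.time()
            with urllib.request.urlopen(BASE_URL + title) as resp:
                body = resp.read().decode("utf-8", "replace")
            elapsed = time.time() - start
            total += elapsed
            ok = expected in body
            if not ok:
                failures += 1
            print("%-30s %6.3fs %s" % (title, elapsed,
                                        "ok" if ok else "FAILED"))
        print("total: %.3fs, %d failure(s)" % (total, failures))

    if __name__ == "__main__":
        run_suite()

Run it once against the current code to get a baseline, tweak, run it again, and diff the numbers; the pass/fail column catches regressions at the same time.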
* Question 2: How wedded are we to the current tools?
Apache/MySQL/PHP seems like a good combo, and it could probably be scaled up further, but there certainly are other options. Also, are we willing to take chances on semi-production-quality versions like Apache 2.X and MySQL 4.X? I'd even like to revisit the decision to use a database at all. After all, a good file system like ReiserFS (or, to a lesser extent, ext3) is itself a pretty well-optimized database for storing pieces of free-form text, there are good tools available for text indexing, and a filesystem-based store would be easier to maintain and port.
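Just to illustrate the no-database option, here is a toy sketch of articles living as one file per revision on disk. The directory layout, paths, and hashing scheme are invented, and it ignores locking, metadata, and history listings entirely; the point is only that the filesystem can do the raw text-storage work:

    # Sketch of a filesystem-backed article store -- illustrative only.
    import os
    import hashlib

    STORE_ROOT = "/var/wiki/articles"   # hypothetical path

    def article_dir(title):
        # Spread articles over subdirectories so no one directory gets huge.
        h = hashlib.md5(title.encode("utf-8")).hexdigest()[:2]
        return os.path.join(STORE_ROOT, h, title.replace("/", "%2F"))

    def save_revision(title, rev_id, text):
        d = article_dir(title)
        os.makedirs(d, exist_ok=True)
        path = os.path.join(d, "%d.txt" % rev_id)
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)

    def load_revision(title, rev_id):
        path = os.path.join(article_dir(title), "%d.txt" % rev_id)
        with open(path, encoding="utf-8") as f:
            return f.read()

Ordinary text-indexing tools could then be pointed straight at the files.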
* Suggestion 2: Use the current code for testing features.
In re-architecting the codebase, we will almost certainly come to points where we think a minor feature change would make a big performance difference without hurting usability, or where we find features we want to implement anyway. For example, we could probably make page requests easier to cache if we made most of the article HTML independent of the skin, by tagging elements well and using CSS appropriately. We also probably want to render valid XHTML eventually. I propose that while we are building the Phase IV code, we add little features like these to the existing code to gauge things like user reactions and visual impact.
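As a sketch of what skin-independent article HTML would buy us for caching (the function names and the in-memory dict here are invented; real code would hook into the parser and use something more durable than a dict):

    # Sketch of caching rendered article bodies independently of skin.
    # Names are illustrative; the idea is that parser output carries only
    # CSS classes/ids, never skin-specific markup, so one cached copy
    # serves every skin.

    rendered_cache = {}   # stand-in for a real cache backend

    def render_article_body(title, wikitext):
        # Imagine this is the wikitext -> XHTML parser; its output is
        # tagged with classes only, not per-skin colors, fonts or tables.
        return '<div class="article">' + wikitext + '</div>'   # placeholder

    def get_article_html(title, wikitext, revision):
        key = (title, revision)          # note: skin is NOT part of the key
        if key not in rendered_cache:
            rendered_cache[key] = render_article_body(title, wikitext)
        return rendered_cache[key]

    def serve_page(title, wikitext, revision, skin):
        body = get_article_html(title, wikitext, revision)
        # Only the thin skin wrapper (stylesheet link, header, sidebar) is
        # built per request; the expensive part comes from the cache.
        return ('<html><head><link rel="stylesheet" href="/skins/%s.css">'
                '</head><body>%s</body></html>' % (skin, body))

The point is that the cache key contains no skin, so one rendered copy of an article body serves readers of every skin.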
Other suggestions/questions/answers humbly requested (including "Are you nuts? Let's stick with Phase III!" if you have that opinion).