After getting back into wikiland, catching up with wikipedia-l was pretty easy, but catching up with the wikitech list took a little longer. It seems you guys have had interesting times lately (in the Chinese-curse sense). Sorry I abandoned you, but you do seem to have risen to the challenge.
Magnus did a great service by giving us code with features that made Wikipedia usable and popular. When that code bogged down to the point where the wiki became nearly unusable, there wasn't much time to sit down and properly architect and develop a solution, so I just reorganized the existing architecture for better performance and hacked all the code. This got us over the immediate crisis, but now my code is bogging down, and we are having to remove useful features to keep performance up.
I think it's time for Phase IV. We need to sit down and design an architecture that will allow us to grow without constantly putting out fires, and that can become a stable base for a fast, reliable Wikipedia in years to come. I'm now available and equipped to help in this, but I thought I'd start out by asking a few questions here and making a few suggestions.
* Question 1: How much time do we have?
Can we estimate how long we'll be able to limp along with the current code, adding performance hacks and hardware to keep us going? If it's a year, that will give us certain opportunities and guide some choices; if it's only a month or two, that will constrain a lot of those choices.
* Suggestion 1: The test suite.
I think the most critical piece of code to develop right now is a comprehensive test suite. This will enable lots of things. For example, if we have a performance question, I can set up one set of wiki code on my test server, run the suite to get timing data, tweak the code, then run the suite again to get new timings. Whether the suite still passes will tell us if anything broke, and the timings will tell us if we're on the right track. This will be useful even during the limp-along-with-the-current-code phase. I have a three-machine network at home, one machine of which I plan to dedicate 100% to wiki code testing, plus my test server in San Antonio that we can use. This will also allow us to safely refactor code. I'd like to use something like Latka for the suite (see http://jakarta.apache.org/commons/latka/index.html ).
* Question 2: How wedded are we to the current tools?
Apache/MySQL/PHP seems a good combo, and it probably would be possible to scale them up further, but there certainly are other options. Also, are we willing to take chances on semi-production quality versions like Apache 2.X and MySQL 4.X? I'd even like to revisit the decision of using a database at all. After all, a good file system like ReiserFS (or to a lesser extent, ext3) is itself a pretty well-optimized database for storing pieces of free-form text, and there are good tools available for text indexing, etc. Plus it's easier to maintain and port.
* Suggestion 2: Use the current code for testing features.
In re-architecting the codebase, we will almost certainly find minor feature changes that we think will make a big performance difference without hurting usability, or features that we want to implement anyway. For example, we could probably make it easier to cache page requests if we made most of the article content HTML not dependent on skin by tagging elements well and using CSS appropriately. Also, we probably want to eventually render valid XHTML. I propose that while we are building the phase IV code, we add little features like this to the existing code to gauge things like user reactions and visual impact.
Other suggestions/questions/answers humbly requested (including "Are you nuts? Let's stick with Phase III!" if you have that opinion).
Lee,
I don't think we should completely redesign things from scratch. See http://www.joelonsoftware.com/articles/fog0000000069.html about rewriting in general.
We are getting to the point where we know what the performance bottlenecks are, and we are fixing them. Brion has built some basic profiling into the code, and we've checked the slow query log. We still haven't fully understood everything, but I think we are pretty certain about the following:
1) PHP is not and has never been a problem. Virtually all our performance problems have been related to specific SQL queries (either a very high number of them, or complex ones). I do not see any reason at all to stop using PHP.
2) Getting MySQL to perform properly largely depends on using indexes the right way. This means providing composite indexes where needed. In the case of timestamps, we had to add a reverse timestamp column so results can be index-sorted quickly; this is a hack, but a needed one until MySQL 4.
It is specifically complex SQL queries which require ordering the whole result set that create headaches. These have been disabled for the moment, but I believe we can fix them. None of them are mission critical.
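Roughly, the reverse-timestamp trick looks like this in SQL (just a sketch; don't hold me to the exact table and column names, I'm writing them from memory):

    -- Sketch only: table/column names are illustrative, not necessarily the live schema.
    -- Current MySQL won't sort "newest first" off the index efficiently, so we store a
    -- reversed timestamp and let "newest first" become an ascending index scan.
    ALTER TABLE cur ADD COLUMN inverse_timestamp CHAR(14) NOT NULL DEFAULT '';
    UPDATE cur SET inverse_timestamp = 99999999999999 - cur_timestamp;
    CREATE INDEX cur_ns_inv_ts ON cur (cur_namespace, inverse_timestamp);

    -- "Newest first" is then an index-ordered scan instead of a filesort:
    SELECT cur_title
    FROM cur
    WHERE cur_namespace = 0
    ORDER BY inverse_timestamp
    LIMIT 50;

That is the extra column that should become unnecessary with MySQL 4.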
As for MySQL4, I support trying it out on our test server, test.wikipedia.org, and possibly on meta.wikipedia.org as well. We shouldn't switch the main site(s) until these two have run on it for a while.
We also need to keep in mind that we are growing very fast. We now have several highly active Wikipedias, all of them residing on the same server. While I think our server still has some room, at some point we will have to upgrade and no amount of hacking will prevent that. Separating web and database server, as is planned, should help, but I don't know how much.
I think our priorities should be this:
1) Get some of the other language Wikipedias up that people are waiting for. If there are motivated users who want to start a Wikipedia in their language, we should not let them wait.
2) Fix known bugs and try to improve the speed in case of remaining bottlenecks.
3) Implement suggested improvements.
 - Improve search + redirect handling
 - Finish Magnus' interlanguage links redesign
 - Fix Recent changes layout
 - Redesign image pages
 - Redesign talk pages
 - Improved edit conflict handling (CVS-style merge)
 - Backends (SVG, Lilypond), syntax improvements etc.
Aside from this, an entirely new project is the dedicated Wikipedia client, for offline reading and, hopefully, ultimately for editing as well. Magnus has started working on this.
What I understand to be "Phase IV" is, then, a point where we have finished all the important fixes and improvements and then decide to move on to "nice to have" stuff. Among this is the much requested multilanguage portal for Wikipedia, with a multilanguage search, RC etc., and possibly merging the databases of the different Wikipedias (at least the user data). I do think the current software can and should be used as a basis for the next phase(s).
Regards,
Erik
(Erik Moeller erik_moeller@gmx.de):
> Lee,
> I don't think we should completely redesign things from scratch. See http://www.joelonsoftware.com/articles/fog0000000069.html about rewriting in general.
I'm well aware of refactoring; that's the main reason I want the test suite first. But this is a case a little different from what Joel is describing: we're starting with a complete feature set and (at least initially) not making any changes at all to that. That in a sense makes it a refactor job even if we do replace the actual code. If we decide that Apache/PHP/MySQL is the tool of choice, we will of course not throw away code at all but just refactor all the way.
> - PHP is not and has never been a problem. Virtually all our performance problems have been related to specific SQL queries (either a very high number of them, or complex ones). I do not see any reason at all to stop using PHP.
I can believe that.
> - Getting MySQL to perform properly largely depends on using indexes the right way. This means providing composite indexes where needed. In the case of timestamps, we had to add a reverse timestamp column so results can be index-sorted quickly; this is a hack, but a needed one until MySQL 4...
I'm still concerned, though, that even if we optimize all the indexing, we'll never achieve a speedup of more than 2-3x. I don't know if that will be enough in the long run. After the test suite is done, I can do some head-to-head testing of things like indexes and MySQL 4.X.
> I think our priorities should be this:
> - Get some of the other language Wikipedias up that people are waiting for. If there are motivated users who want to start a Wikipedia in their language, we should not let them wait.
I agree that this should be a priority for the project. I'm not as convinced that it's the best use of /my/ time, and since my absence I'm more committed to ensuring that I don't burn out again by spending my own time on things I'm not best suited to. I think it should be up to motivated foreign users to migrate their own wiki. The presence of someone skilled enough to do that should be evidence of the level of desire.
> - Fix known bugs and try to improve the speed in case of remaining bottlenecks.
Agreed. This can also be done in parallel with new development if needed, and if the speedups are dramatic enough, perhaps it will show that new development isn't needed after all.
> - Implement suggested improvements.
> - Improve search + redirect handling
> - Finish Magnus' interlanguage links redesign
> - Fix Recent changes layout
That's one of my concerns too: I spent about three weeks trying to do this, but it just wasn't possible to get the features I wanted with the current architecture and acceptable performance.
> - Redesign image pages
> - Redesign talk pages
> - Improved edit conflict handling (CVS-style merge)
Hmm. I'm not sure about that one.
> - Backends (SVG, Lilypond), syntax improvements etc.
> Aside from this, an entirely new project is the dedicated Wikipedia client, for offline reading and, hopefully, ultimately for editing as well. Magnus has started working on this.
That's great. That's probably better suited to his talents.
> What I understand to be "Phase IV" is, then, a point where we have finished all the important fixes and improvements and then decide to move on to "nice to have" stuff. Among this is the much requested multilanguage portal for Wikipedia, with a multilanguage search, RC etc., and possibly merging the databases of the different Wikipedias (at least the user data). I do think the current software can and should be used as a basis for the next phase(s).
Cross-language stuff is a big issue too. I confess that I ignored that issue 100% in the present design. If we can add those features without looking like a hack, I'm all for it, but I suspect a new architecture will help there more than anywhere.
> I think our priorities should be this:
> 3) Implement suggested improvements.
>  - Improve search + redirect handling
>  - Finish Magnus' interlanguage links redesign
>  - Fix Recent changes layout
>  - Redesign image pages
>  - Redesign talk pages
>  - Improved edit conflict handling (CVS-style merge)
>  - Backends (SVG, Lilypond), syntax improvements etc.
Improvements in the management of deleted pages. Right now, it's *very* difficult to find a page which has been deleted, and then to check it to decide whether it should be restored or not. It would be great if this page could at least be sorted by date of deletion, deleting user, namespace, and alphabetical order. Does anybody else have trouble with this log?
> Improvements in the management of deleted pages. Right now, it's *very* difficult to find a page which has been deleted, and then to check it to decide whether it should be restored or not. It would be great if this page could at least be sorted by date of deletion, deleting user, namespace, and alphabetical order. Does anybody else have trouble with this log?
Okay, the noise of comments hurt my ears, so I'll try this another time.
Here's our deletion log http://fr.wikipedia.org/wiki/Wikip%E9dia:Deletion_log
Maybe 100 pages were deleted in the past four days.
I think that, though we of course all trust one another, an *error* is always possible.
Especially since the "votes for deletion" page is not used at all, so any sysop basically deletes whatever pages he wants to delete, whenever he decides to (yup, me too; when in Rome, do as the Romans do).
We rely on *peer pressure* when creating/editing articles, with peers being given the same tools as the editors.
Right now, we have *very* few tools for *checking* our fellow sysops' deletions (you know, *sample checking*...).
It takes about 15 seconds to delete an article, and about 5 minutes to find it (when we find it at all) in the page of deleted articles.
Look at that huge deletion log!
OK, if nobody wants to help me on that, could someone just tell me... a sort of SQL query... which would allow me to get a list of the last deletions, with direct access to the articles in the bin (something like the deletion log, but with access to the deleted articles)?
--------
Or imagine the worst: a sysop goes mad and starts deleting articles very unwisely. It's going to be a mess to recover everything quickly...
On Mon, 24 Feb 2003, Anthere wrote:
> OK, if nobody wants to help me on that, could someone just tell me... a sort of SQL query... which would allow me to get a list of the last deletions, with direct access to the articles in the bin (something like the deletion log, but with access to the deleted articles)?
The archive table (which holds deleted pages) does not keep track of when articles were deleted. Currently the only way to see when a page was deleted is to look in the deletion log, which provides no connection to the undelete system.
Yeah, it sucks.
What I've suggested before is to have a deletion log _table_ (or perhaps a general 'event log' which also keeps track of bans, protections, unbans, unprotections, creation of user accounts, sysopization, etc., from which we can extract just the deletions for purposes of showing a deletion log). If I have time I'll try to implement it, but it'd be lovely if someone else got to it, because I'm stretched a little thin right now.
This could then be easily sorted by timestamp or by deleting user, and for sysops a link to the undelete could be made instantly available.
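To make that a bit more concrete, something along these lines (names and types here are placeholders, not a worked-out schema):

    -- Placeholder sketch of an event/deletion log table; not a final schema.
    CREATE TABLE log (
      log_timestamp CHAR(14) NOT NULL,      -- when the action happened
      log_user INT UNSIGNED NOT NULL,       -- user id of the sysop who did it
      log_action VARCHAR(16) NOT NULL,      -- 'delete', 'protect', 'ban', ...
      log_namespace INT NOT NULL,
      log_title VARCHAR(255) NOT NULL,
      log_comment TINYBLOB,
      INDEX (log_timestamp),
      INDEX (log_user, log_timestamp),
      INDEX (log_action, log_timestamp)
    );

    -- The "recent deletions" list Anthere is asking for, newest first;
    -- a sysop-only page could link each row straight to the undelete form.
    SELECT log_timestamp, log_user, log_namespace, log_title
    FROM log
    WHERE log_action = 'delete'
    ORDER BY log_timestamp DESC
    LIMIT 100;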
-- brion vibber (brion @ pobox.com)
On Fri, 21 Feb 2003, Lee Daniel Crocker wrote:
> Can we estimate how long we'll be able to limp along with the current code, adding performance hacks and hardware to keep us going? If it's a year, that will give us certain opportunities and guide some choices; if it's only a month or two, that will constrain a lot of those choices.
The immediate crisis is over. Now that we're on the track of proper indexing, performance should no longer significantly degrade with increased size.
The special pages that are currently disabled just need to be rewritten to have and use appropriate indexes or summary tables. Performance hacks? Sure.
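(By a summary table I mean something like the following, purely as an illustration; the brokenlinks table and all the names here are assumptions on my part, not the actual schema, and a real version would handle namespaces properly.)

    -- Illustration only: assumes a brokenlinks table of (bl_from, bl_to) pairs
    -- recording links to pages that don't exist yet.
    CREATE TABLE wantedpages (
      wp_title VARCHAR(255) NOT NULL,
      wp_links INT UNSIGNED NOT NULL,   -- how many pages link to the missing page
      INDEX (wp_links)
    );

    -- Refreshed from a cron job every so often, instead of aggregating on every view:
    DELETE FROM wantedpages;
    INSERT INTO wantedpages (wp_title, wp_links)
    SELECT bl_to, COUNT(*) FROM brokenlinks GROUP BY bl_to;

    -- The special page itself then becomes a cheap indexed read:
    SELECT wp_title, wp_links FROM wantedpages ORDER BY wp_links DESC LIMIT 50;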
We're planning to move the database and web server to two separate machines, which should help quite a bit as well, and there's still a lot of optimization to be done in the common path. (Caching HTML would save trips to the database as well as rendering time, though it's not the biggest priority yet.)
I'd feel quite confident giving us another year with the current codebase.
> - Suggestion 1: The test suite.
AMEN BROTHER!
> I'd even like to revisit the decision of using a database at all. After all, a good file system like ReiserFS (or to a lesser extent, ext3) is itself a pretty well-optimized database for storing pieces of free-form text, and there are good tools available for text indexing, etc. Plus it's easier to maintain and port.
Really though, our text _isn't_ free-form. It's tagged with metadata that either needs to be tucked into a filesystem (non-portably) or a structured file format (XML?). And now we have to worry about locking multiple files for consistency, which likely means separate lockfiles... and we quickly find we've reinvented the database, just using more file descriptors. ;)
The great advantage of the database though is the ability to perform ad-hoc queries. Obviously our regular operations have to be optimized, and special queries have to be set up such that they don't bog down the general functioning of the wiki, but in general the coolest thing about the phase II/III PediaWiki is the SQL query ability: savvy (and responsible) users can cook up their own queries to do useful little things such as:
* looking up new user accounts who haven't yet been greeted
* checking for "orphan" talk pages
* most frequent contributors
etc, without downloading a 4-gigabyte database to their home machines or begging the developers to write a special-purpose script.
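Just to illustrate the kind of thing I mean, an "orphan talk pages" check might look something like this, assuming the usual cur table layout (column names approximate):

    -- Illustrative only: assumes talk pages are namespace 1 and articles namespace 0,
    -- with matching titles; cur/cur_namespace/cur_title are assumed names.
    SELECT talk.cur_title
    FROM cur AS talk
    LEFT JOIN cur AS subject
      ON subject.cur_namespace = 0
     AND subject.cur_title = talk.cur_title
    WHERE talk.cur_namespace = 1
      AND subject.cur_title IS NULL;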
Now, it may well be that it would make sense to store the rendered HTML in files which could be rapidly spit out on request, but that's supplementary to what the database does for us.
> For example, we could probably make it easier to cache page requests if we made most of the article content HTML not dependent on skin by tagging elements well and using CSS appropriately.
You mean, like we had in phase II before you rewrote it? ;)
-- brion vibber (brion @ pobox.com)
> The great advantage of the database though is the ability to perform ad-hoc queries.
Yep, that's a big one, no doubt about it. That's why I'm mostly just brainstorming at this point. I _suspect_ that the database costs us a lot in things like unnecessary locking and indexing, but that may well be offset by the gains.
> > For example, we could probably make it easier to cache page requests if we made most of the article content HTML not dependent on skin by tagging elements well and using CSS appropriately.
> You mean, like we had in phase II before you rewrote it? ;)
Now, waitaminnit, that's not true at all. Magnus had all kinds of dynamic nonsense that I removed--his code changed the actual HTML of links depending on the user's preference for link color, for example. I removed a lot of those, but I don't know if I caught all of them. I know for sure that the sidebars are fully dynamic and likely uncacheable, but the article content should be mostly cacheable now or close to it.
On Fri, 21 Feb 2003, Lee Daniel Crocker wrote:
> > You mean, like we had in phase II before you rewrote it? ;)
> Now, waitaminnit, that's not true at all. Magnus had all kinds of dynamic nonsense that I removed--his code changed the actual HTML of links depending on the user's preference for link color, for example.
You must have been working from an older version of the codebase, because I know for a fact that I replaced those with stylesheets, which is how we had caching of rendered HTML in phase II.
-- brion vibber (brion @ pobox.com)
(Brion Vibber vibber@aludra.usc.edu):
> On Fri, 21 Feb 2003, Lee Daniel Crocker wrote:
> > > You mean, like we had in phase II before you rewrote it? ;)
> > Now, waitaminnit, that's not true at all. Magnus had all kinds of dynamic nonsense that I removed--his code changed the actual HTML of links depending on the user's preference for link color, for example.
> You must have been working from an older version of the codebase, because I know for a fact that I replaced those with stylesheets, which is how we had caching of rendered HTML in phase II.
That's possible, I suppose, but I'm sure I didn't _add_ any dynamic HTML except outside the article content. If I did, then mea maxima culpa.
Yes, it is necessary to eliminate dynamically rendered article content to make caching effective. I don't think that's where the biggest bang for the buck will be, though. The things I personally think would be the biggest wins are (1) fine-tuning the hell out of the queries needed to do RecentChanges, and (2) making the link cache persistent across requests so we don't have to look up every page that's linked to when rendering.
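For the link cache, the idea is roughly this (schema names assumed, just to show the shape of it): instead of checking each link's existence with its own query while rendering, collect the titles on the page, check them in one batched query, and keep the answers around between requests.

    -- Sketch only: cur/cur_namespace/cur_title are assumed names.
    -- One existence check per rendered page instead of one per link:
    SELECT cur_namespace, cur_title
    FROM cur
    WHERE cur_namespace = 0
      AND cur_title IN ('Foo', 'Bar', 'Baz');

Keep those results in a persistent cache keyed by title, and most renders never have to hit the database for link existence at all.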
Could a User contributions page also give a "diff" link for "top" edits?
thus:
18:45 Feb 22, 2003 2000s http://www.wikipedia.org/wiki/2000s (top) [diff] [rollback http://www.wikipedia.org/w/wiki.phtml?title=2000s&action=rollback]
On Fri, Feb 21, 2003 at 05:31:51PM -0600, Lee Daniel Crocker wrote:
> Other suggestions/questions/answers humbly requested (including "Are you nuts? Let's stick with Phase III!" if you have that opinion).
My suggestion:
* first, add all the most important features we need (SVG, more rendering engines, XHTML, PS/PDF output, what else?) and move everything to UTF-8 and some nice skin like Cologne Blue without link underlining, with a minimal amount of changes
* then, make Phase IV
If we do step 2 before step 1 we will have to redesign it into Phase V soon.