There are a couple of possibilities here. One is to rewrite MediaWiki in some other language entirely, say Python or C# or what have you.
The other is to keep refactoring the PHP codebase (and it's been much changed since you left it, Lee) and, optionally, rewrite particular hotspots in another language.
A possible middle road is to rewrite the core wiki engine to a separate daemon, and adapt the existing PHP user interface to call into it for much of the backend work that actually touches data.
The disadvantages of a big rewrite are the disadvantages that all rewrites carry: a long lead time, getting behind in the game as we lose useful code and break things again that had been solved, not to mention introducing brand new exciting bugs. Even the phase 3 rewrite, which was helpful in many areas, broke a lot of features, reintroduced previously solved bugs, and set back localization work by many months.
Possible advantages of a complete or core rewrite include performance enhancements (a JITed or precompiled language could perform better on some tasks than bytecode-interpreted PHP) and a change in server architecture.
In particular, PHP tends to impose an architecture where each request is served by an entirely new script invocation: you have to build any information up from scratch on each hit, and sharing things like localization tables between invocations is kind of hard. In another language it might be easier to run MediaWiki as a standalone server, which can keep shared data in memory and use it transparently from each thread.
On the other hand, it's not clear that that's the biggest burden, and we would still have to synchronize changes in cached data between multiple servers. Even within PHP we could probably make better use of local shared memory caches if necessary.
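To make that last point concrete, here is a rough sketch of the kind of local shared-memory caching that is possible even from PHP, assuming the APC extension is available; the cache key and the loadMessagesFromDb() helper are invented for illustration:

<?php
// Illustrative only: keep the localization message table in APC's
// per-server shared memory so each request doesn't rebuild it.
function getMessageCache($lang) {
    $key = "messages:$lang";
    $messages = apc_fetch($key);                // local shared memory, no network hop
    if ($messages === false) {
        $messages = loadMessagesFromDb($lang);  // hypothetical slow path
        apc_store($key, $messages, 3600);       // cache for an hour
    }
    return $messages;
}

The catch, as noted above, is that each server still has to notice when the cached data changes on another server.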
Parsing and rendering speed, as Tim has said, really *is* something we can throw hardware at. A faster rewritten or plug-in parser might let us do more with the same hardware, but scaling it is straightforward either way.
Our biggest, nastiest burden is with internal communications: the database changes too much. We have to wait on things getting sent, received, applied, and copied around, and lagging databases send everything into the toilet fast.
We ask the database too often for information that hasn't changed. We make a lot of roundtrips, making it hard to make efficient use of remotely hosted web servers. We make huge writes that produce huge lag in the replication updates, and then we talk to the master too much because that lagging replication breaks basic things like page views in annoying ways.
The new schema is designed to help with these things by minimizing what has to get written to the database. (We still need to fix up the links tables to really do this right.) A separate cache-purging daemon could further help in dealing with things like template invalidation in a consistent way in the background. Better distinguishing between requests that need to be absolutely current and requests where it's ok to load a page that's 30 seconds out of date could make much better use of the slave servers.
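As a rough illustration of that last point (not actual MediaWiki code), a request could declare how much staleness it can tolerate and only fall through to the master when it really has to; wfGetMaster(), wfGetSlave() and getReplicationLag() are invented names:

<?php
// Hypothetical routing: ordinary page views tolerate slightly stale data,
// so they can be served from a slave unless replication lag is too high.
function getConnectionFor($maxStalenessSeconds) {
    if ($maxStalenessSeconds <= 0) {
        return wfGetMaster();                // edits, watchlist updates, etc.
    }
    $slave = wfGetSlave();
    if (getReplicationLag($slave) <= $maxStalenessSeconds) {
        return $slave;                       // a page view 30 seconds stale is fine
    }
    return wfGetMaster();                    // slave lagging too far; fall back
}

$dbr = getConnectionFor(30);  // plain page view
$dbw = getConnectionFor(0);   // saving an edit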
As for the code architecture within MediaWiki itself: originally it was really really hackish, with lots of cut-n-pasted code and SQL queries embedded in the middle of HTML output generation. Over time we've been refactoring things bit by bit to a cleaner interface where backend code is centralized and consistent, and frontend code can concentrate on UI. This isn't fully done, but it's still proceeding.
If there's a really strong reason for a rewrite, then we should start planning a (cleanly implemented, *compatible*, complete) rewrite, making use of the existing parser tests and other tests in existence and yet to be written to make sure it'll really be able to take over. If we do, we should probably target the end of this year or early next year for taking it live. (No need to rush though; if we're going to rewrite, the point is to plan ahead and do it right.)
I'm not totally convinced that there is a really strong reason, though.
-- brion vibber (brion @ pobox.com)
The other is to keep refactoring the PHP codebase (and it's been much changed since you left it, Lee) and, optionally, rewrite particular hotspots in another language.
The downside of this is for people like myself who run an external (though not public) version of MediaWiki exclusively for taking parts of it and converting them to other formats (in my case, mobile and handheld formats). If you change the core language MediaWiki is driven by, you further burden external contributors and supporters of MediaWiki as a whole.
...unless you're talking about a rewrite _exclusively_ for use on the Wikipedia/etc. servers, and not for the main SF.net project as distributed to the community. I suspect you're not talking about this approach because that means maintaining two separate (and gradually diverging) codebases.
In particular, PHP tends to impose an architecture where each request is served by an entirely new script invocation: you have to build any information up from scratch on each hit, and sharing things like localization tables between invocations is kind of hard.
(Below comments are paraphrased from a conversation I just had about an hour ago with Rasmus Lerdorf):
Isn't this exactly what ICU[1] was developed to solve? ICU automatically sticks requests in shared memory in order to optimize itself across different processes on the same server.
The goal should be that scalability is pushed out into the individual layers and doesn't become a factor at the language level like with Java or ASP.
To be truly scalable, nothing should prevent subsequent requests from being handled by different physical web servers. If you want to impose an application-layer dependence on shared memory on a single physical server, you'll need to do that yourself.
And no, it is not hard to use shared memory from PHP.
[1] http://www-306.ibm.com/software/globalization/icu/index.jsp
David A. Desrosiers desrod@gnu-designs.com http://gnu-designs.com
David A. Desrosiers wrote:
The other is to keep refactoring the PHP codebase (and it's been much changed since you left it, Lee) and, optionally, rewrite particular hotspots in another language.
The downside of this is for people like myself who run an external (though not public) version of MediaWiki exclusively for taking parts of it and converting them to other formats (in my case, mobile and handheld formats). If you change the core language MediaWiki is driven by, you further burden external contributors and supporters of MediaWiki as a whole.
...unless you're talking about a rewrite _exclusively_ for use on the Wikipedia/etc. servers, and not for the main SF.net project as distributed to the community. I suspect you're not talking about this approach because that means maintaining two separate (and gradually diverging) codebases.
MediaWiki is primarily targeted at Wikipedia and Wikimedia's other projects (and other similar large-scale sites with people running their own servers), secondarily at people running local instances to work with data from our sites, and only incidentally at anyone else.
If to serve our primary target users we have to do something that cuts out the guy running in a hyper-limited cheap hosting account, sorry but we may have to do that. Someone running their own installation on their own box should always be able to obtain the necessary tools, however; we're committed to always being able to run on a pure free software stack.
The old 'pure PHP' MediaWiki would continue to exist even if we go in a different direction, and it could be maintained separately for that user segment if there's interest.
In particular, PHP tends to impose an architecture where each request is served by an entirely new script invocation: you have to build any information up from scratch on each hit, and sharing things like localization tables between invocations is kind of hard.
(Below comments are paraphrased from a conversation I just had about an hour ago with Rasmus Lerdorf):
Isn't this exactly what ICU[1] was developed to solve? ICU automatically sticks requests in shared memory in order to optimize itself across different processes on the same server.
I was primarily thinking of the localized user interface messages there, but yes the Unicode normalization tables also need to be loaded when they're needed (though just from source, as they're not user-editable!)
I'm not entirely sure what you're getting at, but we do have an experimental PHP extension for using ICU to do Unicode normalization. It needs some more thorough testing before we take it live on our own servers, though. (When I last tried it it failed completely, returning empty strings for everything. This may have been an ICU library version mismatch, I haven't had a chance to fiddle with it again.)
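If it does pan out, the calling code can stay defensive about it; something like this sketch, where both the extension function name and the pure-PHP fallback class are made up for illustration:

<?php
// Illustrative dispatch: use the ICU-backed extension when it's loaded and
// behaving, otherwise fall back to a pure-PHP normalizer.
function toNFC($utf8String) {
    if (function_exists('utf8_normalize_nfc')) {       // hypothetical extension function
        $out = utf8_normalize_nfc($utf8String);
        if ($out !== '') {                              // guard against the empty-string failure described above
            return $out;
        }
    }
    return PurePhpNormalizer::toNFC($utf8String);       // hypothetical pure-PHP path
}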
To be truly scalable, nothing should prevent subsequent requests from being handled by different physical web servers.
Subsequent requests are virtually always handled by different physical web servers, and we would always expect this to be so. Server-local retained data thus either needs to be for things that don't change (such as Unicode normalization tables!) or things that can easily be updated when necessary (such as caches of the localized UI messages).
Otherwise we have and use the cluster-wide memcached cloud; this involves some network latency and serialization/deserialization.
And no, it is not hard to use shared memory from PHP.
PHP has a shared memory extension, but IIRC it basically entails copying a binary string into and out of a shared memory segment, and uses serializing/deserializing to store arrays and objects. It works, sure, but if we're _trying_ to avoid constantly copying around large chunks of data that's only a limited help over what we're already doing.
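For concreteness, this is roughly what that looks like with the shmop extension; note the serialize()/unserialize() plus full string copy on every access, which is the limitation being described (the key and sizes are arbitrary):

<?php
// Store an array in a shared memory segment: the data must be serialized
// to a string, copied in, then copied out and unserialized again by every
// reader; there is no in-place shared object.
$key  = 0xc0de;                                   // arbitrary segment key
$size = 1024 * 1024;                              // 1 MB segment

$shm  = shmop_open($key, "c", 0644, $size);
$blob = serialize(array('en' => 'Main Page', 'de' => 'Hauptseite'));
shmop_write($shm, $blob, 0);                      // copy in

$copy = unserialize(shmop_read($shm, 0, strlen($blob)));  // copy back out
shmop_close($shm);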
Compare this with a multithreaded Java or C# app which can simply refer to the live object or array in memory, in a synchronization block if necessary.
-- brion vibber (brion @ pobox.com)
On Mon, 2005-05-09 at 12:43 -0700, Brion Vibber wrote:
A possible middle road is to rewrite the core wiki engine to a separate daemon, and adapt the existing PHP user interface to call into it for much of the backend work that actually touches data.
That's the approach I'd favor. The existing PHP code represents a lot of good user-interface work for which PHP is perfectly suited. The underlying stuff could easily be split up into multiple daemons (say, one for wikitext, one for images, one for equations,...) that could feed the PHP front-end.
This would also allow incremental development, since each of the daemons could be written and attached individually without disturbing the rest of the codebase.
Our biggest, nastiest burden is with internal communications: the database changes too much. We have to wait on things getting sent, received, applied, and copied around, and lagging databases send everything into the toilet fast.
Yep. That's why I think the bulk of the text should be stored in a plain filesystem, where those problems are already well-known and solved, for the most part. That would reduce the database to just metadata, which would be much smaller and more efficient.
...Better distinguishing between requests that need to be absolutely current and requests where it's ok to load a page that's 30 seconds out of date could make much better use of the slave servers.
Yes! There's only one tricky part for which we may have to consider creative implementations: I tried as much as possible to take style markup (especially skin-specific) out of the rendered wikitext to allow it to be cached, but there's one case that's still a problem: red links (i.e., links to non-existent pages). Users shouted at me that this was a sine-qua-non feature, and so I had to leave it in. But it makes caching rendered wikitext hard, and slows down rendering. One alternative is to simply tolerate them being out of date for the life of the cache. Another is to possibly update the cache in some cheaper way. Yet another is to optimize the hell out of discovering the simple existence of a page, so that it's not a bottleneck in rendering (say, by having a daemon that keeps a one-bit field for every page using a spell-checker data structure)
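A cheaper step in that direction, whatever happens with the daemon idea, is to resolve existence for all the links on a page in one batched query instead of one query per link. A sketch, with the table name and the $db helper methods used purely illustratively:

<?php
// Hypothetical batched existence check: one round trip for all links on a
// page, returning a title => bool map the renderer can consult.
function getExistenceMap($db, array $titles) {
    if (!$titles) {
        return array();
    }
    $quoted = array_map(array($db, 'quote'), $titles);   // assumes a quote() helper
    $rows = $db->query(
        'SELECT page_title FROM page WHERE page_title IN (' .
        implode(',', $quoted) . ')'
    );
    $exists = array_fill_keys($titles, false);            // default: red link
    foreach ($rows as $row) {
        $exists[$row['page_title']] = true;                // render blue
    }
    return $exists;
}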
If there's a really strong reason for a rewrite, then we should start planning a (cleanly implemented, *compatible*, complete) rewrite, making use of the existing parser tests and other tests in existence and yet to be written to make sure it'll really be able to take over. If we do, we should probably target the end of this year or early next year for taking it live. (No need to rush though; if we're going to rewrite, the point is to plan ahead and do it right.)
I'm not totally convinced that there is a really strong reason, though.
I'm all for your method, and I agree it's not an urgent need. But I think we can slip the timeline even more. The existing codebase will eventually be a liability, but I think we can throw hardware at it for a year or two. Also, if we go the route of making independent daemons linked into the existing UI code, we don't have to deploy all at once. We could, for example, make and deploy the math daemon as a proof-of-concept, work out bugs with that, then do the others afterward.
Another thing to consider: at least some of the Wikipedia-driven development will be totally unnecessary for MediaWiki as a general-purpose open source project. We may want to decouple those projects at some point.
On 5/10/05, Lee Daniel Crocker lee@piclab.com wrote:
Yes! There's only one tricky part for which we may have to consider creative implementations: I tried as much as possible to take style markup (especially skin-specific) out of the rendered wikitext to allow it to be cached, but there's one case that's still a problem: red links (i.e., links to non-existent pages).
[...] Yet another is to optimize the hell out of discovering the simple existence of a page, so that it's not a bottleneck in rendering (say, by having a daemon that keeps a one-bit field for every page using a spell-checker data structure)
This seems like the best method.
Maybe then you could also send the cached text to the user along with a separate list of which links should be red, using a simple JavaScript to modify the HTML on the browser side? Not sure this would be better than just modifying an existing cached version on the server side though.
Fredrik
That's the approach I'd favor. The existing PHP code represents a lot of good user-interface work for which PHP is perfectly suited. The underlying stuff could easily be split up into multiple daemons (say, one for wikitext, one for images, one for equations,...) that could feed the PHP front-end.
The math module actually works a lot like this at the moment; it's not a daemon, but it would be easy to make it one, since it communicates with the rest of the code through STDOUT and STDIN anyway.
Lee Daniel Crocker wrote:
Yes! There's only one tricky part for which we may have to consider creative implementations: I tried as much as possible to take style markup (especially skin-specific) out of the rendered wikitext to allow it to be cached, but there's one case that's still a problem: red links (i.e., links to non-existent pages). Users shouted at me that this was a sine-qua-non feature, and so I had to leave it in. But it makes caching rendered wikitext hard, and slows down rendering. One alternative is to simply tolerate them being out of date for the life of the cache. Another is to possibly update the cache in some cheaper way. Yet another is to optimize the hell out of discovering the simple existence of a page, so that it's not a bottleneck in rendering (say, by having a daemon that keeps a one-bit field for every page using a spell-checker data structure)
We already optimised it, didn't we? In the last public profiling run:
http://meta.wikimedia.org/wiki/Profiling/20050328
...it came in at 2.6% for the non-stub bundled query and 0.4% for the stub query. I'd hardly call that a bottleneck. Individual link existence tests came in at 5.9%, mostly due to special pages, but I've largely fixed that in 1.5 by bundling the existence tests for commonly requested special pages. It wasn't so long ago it was taking 15% for individual queries and 15% for LinkCache::preFill():
http://meta.wikimedia.org/wiki/Profiling/Live_aggregate_20040604
...so we've come a long way.
I'm all for your method, and I agree it's not an urgent need. But I think we can slip the timeline even more. The existing codebase will eventually be a liability, but I think we can throw hardware at it for a year or two. Also, if we go the route of making independent daemons linked into the existing UI code, we don't have to deploy all at once. We could, for example, make and deploy the math daemon as a proof-of-concept, work out bugs with that, then do the others afterward.
We've already got two proof-of-concept daemons: the Chinese word segmenter and Lucene.
There is a technical problem with Lucene at the moment: it uses file() to fetch the result over HTTP, but that has an unconfigurable 3 minute timeout. If the search daemon goes down, we hit apache connection limits within a minute and the site stops working. We can either patch PHP to use default_socket_timeout in this case, or switch to another method like DIY pfsockopen or curl.
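The DIY socket route could look roughly like this, with a short connect timeout plus a read timeout via stream_set_timeout(), so a dead search daemon fails fast instead of tying up Apache children for three minutes (host, port and query path are placeholders):

<?php
// Fetch a search result over HTTP with explicit, short timeouts.
$host = 'search.internal';                 // placeholder
$fp = @fsockopen($host, 8123, $errno, $errstr, 2);   // 2-second connect timeout
if (!$fp) {
    $result = false;                       // daemon down: fail immediately
} else {
    stream_set_timeout($fp, 5);            // 5-second read timeout
    fwrite($fp, "GET /search?q=foo HTTP/1.0\r\nHost: $host\r\n\r\n");
    $result = '';
    while (!feof($fp)) {
        $chunk = fread($fp, 8192);
        $meta  = stream_get_meta_data($fp);
        if ($chunk === false || $meta['timed_out']) {
            $result = false;               // treat a stalled read as failure
            break;
        }
        $result .= $chunk;
    }
    fclose($fp);
}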
Another thing to consider: at least some of the Wikipedia-driven development will be totally unnecessary for MediaWiki as a general-purpose open source project. We may want to decouple those projects at some point.
Brion doesn't want to.
-- Tim Starling
On 5/10/05, Tim Starling t.starling@physics.unimelb.edu.au wrote:
Lee Daniel Crocker wrote:
Another thing to consider: at least some of the Wikipedia-driven development will be totally unnecessary for MediaWiki as a general-purpose open source project. We may want to decouple those projects at some point.
Brion doesn't want to.
-- Tim Starling
We've gotten pretty good debugging/testing because they're coupled, though, right?
On Tue, 2005-05-10 at 13:01 +0000, Dori wrote:
On 5/10/05, Tim Starling t.starling@physics.unimelb.edu.au wrote:
Lee Daniel Crocker wrote:
Another thing to consider: at least some of the Wikipedia-driven development will be totally unnecessary for MediaWiki as a general-purpose open source project. We may want to decouple those projects at some point.
Brion doesn't want to.
-- Tim Starling
We've gotten pretty good debugging/testing because they're coupled, though, right?
That's probably true; leaving them coupled just means that there will need to be more configuration work, because there will certainly be configurations needed for the largest wiki on the planet that would be a nuisance to any other wiki. Testing is a good argument in favor, though.
Lee Daniel Crocker wrote:
On Mon, 2005-05-09 at 12:43 -0700, Brion Vibber wrote:
A possible middle road is to rewrite the core wiki engine to a separate daemon, and adapt the existing PHP user interface to call into it for much of the backend work that actually touches data.
That's the approach I'd favor. The existing PHP code represents a lot of good user-interface work for which PHP is perfectly suited. The underlying stuff could easily be split up into multiple daemons (say, one for wikitext, one for images, one for equations,...) that could feed the PHP front-end.
This would also allow incremental development, since each of the daemons could be written and attached individually without disturbing the rest of the codebase.
<nod>
Our biggest, nastiest burden is with internal communications: the database changes too much. We have to wait on things getting sent, received, applied, and copied around, and lagging databases send everything into the toilet fast.
Yep. That's why I think the bulk of the text should be stored in a plain filesystem, where those problems are already well-known and solved, for the most part. That would reduce the database to just metadata, which would be much smaller and more efficient.
Replication's still an issue with filesystems; it seems like every network filesystem a) sucks and b) is a SPOF, and every distributed filesystem a) sucks and b) sucks.
This is one reason we've been talking about using an external object store rather than a filesystem. But a filesystem might work too; anyway that's a separate issue.
The 1.5 schema separates the text storage backend from the revision metadata for just this purpose; the metadata tables are smaller, and we can more easily transition to putting bulk text elsewhere entirely. In fact we can do a gradual transition by storing serialized reference objects in the text table.
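The gradual transition could work something like this when loading revision text; the 'object' flag and the stub class here are illustrative rather than the exact 1.5 implementation:

<?php
// Hypothetical: a text row either holds the text itself, or a serialized
// pointer object that knows how to fetch the bulk text from elsewhere.
function getRevisionText($row) {
    $flags = explode(',', $row->old_flags);
    $data  = $row->old_text;
    if (in_array('object', $flags)) {
        $obj = unserialize($data);     // e.g. a stub pointing at an external store
        return $obj->getText();        // fetch from wherever the bulk text lives
    }
    return $data;                      // plain text stored inline
}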
Yes! There's only one tricky part for which we may have to consider creative implementations: I tried as much as possible to take style markup (especially skin-specific) out of the rendered wikitext to allow it to be cached, but there's one case that's still a problem: red links (i.e., links to non-existent pages). Users shouted at me that this was a sine-qua-non feature, and so I had to leave it in. But it makes caching rendered wikitext hard, and slows down rendering.
That doesn't really make output caching harder at all; you just follow the link tables back a step and purge the affected pages.
Currently this is done 'inline' during save; we could return from certain saves a little faster if we can shunt the purging to a background daemon, and it could handle the occasional mass-change case more gracefully than a giant 30,000-row update.
(There are two steps here: updating the page_touched timestamp on page records, and sending PURGE requests to the squid proxies. The updated touched time invalidates the parser cache and client-side cached output on the next visit, while making explicit squid purges lets the squids serve the cached pages they do have to anonymous visitors without having to check with the master servers on every hit.)
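The squid side of that is just a tiny HTTP request per cached URL, along these lines (proxy address and article URL are placeholders), which is what makes it cheap enough to farm out to a background daemon:

<?php
// Send an HTTP PURGE for one URL to one squid proxy.
function purgeSquid($proxyHost, $proxyPort, $url) {
    $fp = @fsockopen($proxyHost, $proxyPort, $errno, $errstr, 2);
    if (!$fp) {
        return false;                          // proxy unreachable; skip it
    }
    fwrite($fp, "PURGE $url HTTP/1.0\r\nConnection: close\r\n\r\n");
    $status = fgets($fp);                      // e.g. "HTTP/1.0 200 OK"
    fclose($fp);
    return $status !== false && strpos($status, '200') !== false;
}

purgeSquid('10.0.0.1', 3128, 'http://en.wikipedia.org/wiki/Example');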
At one point I had started to write a purge daemon to handle the squid purge requests, but put it off after various improvements were made to how they're handled which sped it up quite a bit. (Currently we're using some kind of multicast thing, IIRC.)
Note that updates to template pages share roughly the same issue here, exacerbated by the possibility that a changed template will contain different links. Currently we make no attempt to update the link references from pages containing a template when the template changes, as I recall.
Somewhat different but worth mentioning is the updating of link tables on save: we've tended to flip-flop between deleting and rewriting all the rows, and have ended up locking things during the save, which turned ugly. We should make sure those are maintained cleanly, swiftly, and without too much pain.
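One way to keep those updates clean is to diff the old and new link sets on save and touch only the changed rows, rather than deleting and re-inserting everything; a sketch with illustrative table/column names, not the exact schema:

<?php
// Hypothetical incremental update of a links table on page save.
function updateLinks($db, $pageId, array $newLinks) {
    $res = $db->query(
        'SELECT pl_target FROM pagelinks WHERE pl_from = ' . intval($pageId)
    );
    $oldLinks = array();
    foreach ($res as $row) {
        $oldLinks[] = $row['pl_target'];
    }

    $added   = array_diff($newLinks, $oldLinks);   // rows to insert
    $removed = array_diff($oldLinks, $newLinks);   // rows to delete

    foreach ($added as $target) {
        $db->query('INSERT INTO pagelinks (pl_from, pl_target) VALUES (' .
                   intval($pageId) . ', ' . $db->quote($target) . ')');
    }
    if ($removed) {
        $quoted = array_map(array($db, 'quote'), $removed);
        $db->query('DELETE FROM pagelinks WHERE pl_from = ' . intval($pageId) .
                   ' AND pl_target IN (' . implode(',', $quoted) . ')');
    }
}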
[snip various other stuff mostly addressed by Tim & co]
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
Somewhat different but worth mentioning is the updating of link tables on save: we've tended to flip-flop between deleting and rewriting all the rows, and have ended up locking things during the save, which turned ugly. We should make sure those are maintained cleanly, swiftly, and without too much pain.
Currently we're not locking, it's patched live to do a non-locking read and then an incremental update. So the site keeps running, but inconsistencies will inevitably be introduced.
The problem with locking reads turns out to be a bug in InnoDB. Well, I'm calling it a bug anyway. The problem is that if you select all links from a particular article for update, every link table row linking to any of those pages is locked. So if one person is updating a stub, nobody else can update a stub. This is easily repeatable in a controlled situation with the following test. With two threads labelled T1 and T2:
T1> create table ltest (f int, t int, unique key (f,t), key t(t)) type=innodb;
T1> insert into ltest values (1,10),(2,10),(3,10);
T1> begin;
T1> select * from ltest where f=1 for update;
T2> begin;
T2> select * from ltest where f=2 for update;
(T2 waits for T1)
Last time I reported a locking problem to the MySQL bug tracker, I was told that it wasn't a bug, and that I just needed to enable the reassuringly-named innodb_locks_unsafe_for_binlog option. I'm happy to concentrate on public ridicule if that's what they prefer.
The other problem was also related to link table updates. If you do locking reads, you get many unnecessary deadlocks.
-- Tim Starling
The problem with locking reads turns out to be a bug in InnoDB. Well, I'm calling it a bug anyway.
I know this has been covered before, but if we are talking about "longterm software strategy", I'd like to point out that PostgreSQL is still totally free, has no locking issues, and now offers replication. There are even some PG hackers lurking on this list (besides me) that would be more than happy to help move things in that direction. :)
-- Greg Sabino Mullane greg@turnstep.com PGP Key: 0x14964AC8 200505100626 http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
Greg Sabino Mullane wrote:
The problem with locking reads turns out to be a bug in InnoDB. Well, I'm calling it a bug anyway.
I know this has been covered before, but if we are talking about "longterm software strategy", I'd like to point out that PostgreSQL is still totally free, has no locking issues, and now offers replication. There are even some PG hackers lurking on this list (besides me) that would be more than happy to help move things in that direction. :)
Why do people keep pushing for PostgreSQL as if MySQL was some sort of evil daemon? (No pun intended...)
I have never seen any substantial evidence that PostgreSQL is in any way more suitable for the task at hand; people keep claiming it's "more efficient" or "more robust" or whatever, but in the end, LiveJournal and Slashdot are still using MySQL.
Brion Vibber wrote:
Lee Daniel Crocker wrote:
On Mon, 2005-05-09 at 12:43 -0700, Brion Vibber wrote:
A possible middle road is to rewrite the core wiki engine to a separate daemon, and adapt the existing PHP user interface to call into it for much of the backend work that actually touches data.
That's the approach I'd favor. The existing PHP code represents a lot of good user-interface work for which PHP is perfectly suited. The underlying stuff could easily be split up into multiple daemons (say, one for wikitext, one for images, one for equations,...) that could feed the PHP front-end.
This would also allow incremental development, since each of the daemons could be written and attached individually without disturbing the rest of the codebase.
<nod>
Our biggest, nastiest burden is with internal communications: the database changes too much. We have to wait on things getting sent, received, applied, and copied around, and lagging databases send everything into the toilet fast.
Yep. That's why I think the bulk of the text should be stored in a plain filesystem, where those problems are already well-known and solved, for the most part. That would reduce the database to just metadata, which would be much smaller and more efficient.
Replication's still an issue with filesystems; it seems like every network filesystem a) sucks and b) is a SPOF, and every distributed filesystem a) sucks and b) sucks.
This is one reason we've been talking about using an external object store rather than a filesystem. But a filesystem might work too; anyway that's a separate issue.
See my previous comments about Mr Torvalds's lock-free "git" system. The ideas behind this look ideal as a way of running an object store for article versions, images and other BLOBs, and the whole linux-kernel list are kicking the **** out of it as real-life beta-testers as we speak. So far, no-one's found any show-stopping problems with this radical approach to large-scale version management. (or rather, version lack-of-management, since it more-or-less discards all previous wisdom about version control).
-- Neil
See my previous comments about Mr Torvalds's lock-free "git" system. The ideas behind this look ideal as a way of running an object store for article versions, images and other BLOBs, and the whole linux-kernel list are kicking the **** out of it as real-life beta-testers as we speak. So far, no-one's found any show-stopping problems with this radical approach to large-scale version management. (or rather, version lack-of-management, since it more-or-less discards all previous wisdom about version control).
I played with an early version, and whatever else I can say about it, it's a nice proof of concept. Fast, too. Not sure how well it'd adapt to this situation, though.
-- Austin
There are a couple of possibilities here. One is to rewrite MediaWiki in some other language entirely, say Python or C# or what have you.
This is obviously the most difficult and problematic, but could perhaps be a target for 3.0 (or even 4.0).
The other is to keep refactoring the PHP codebase (and it's been much changed since you left it, Lee) and, optionally, rewrite particular hotspots in another language.
This would be a minimum, I think. I'd like to see what the Flex parser gives us, although I suspect an OCaml implementation may be able to do one better.
A possible middle road is to rewrite the core wiki engine to a separate daemon, and adapt the existing PHP user interface to call into it for much of the backend work that actually touches data.
Any external parser may need to be daemonized anyway for speed, even taking OS-level binary caching into consideration. Breaking out the database engine with it offers the possibility of a more flexible database solution, although that may be best handled as part of the object store project.
If there's a really strong reason for a rewrite, then we should start planning a (cleanly implemented, *compatible*, complete) rewrite, making use of the existing parser tests and other tests in existence and yet to be written to make sure it'll really be able to take over. If we do, we should probably target the end of this year or early next year for taking it live. (No need to rush though; if we're going to rewrite, the point is to plan ahead and do it right.)
I'm not totally convinced that there is a really strong reason, though.
Making it a goal now ensures that we *can* take our time "doing it right," *before* it becomes an immediate need. A comprehensive test suite is a must for this, and I know a lot of work has been put into improving the one we have over the past few weeks. And, of course, there's still room for optimization in the current codebase.
-- brion vibber (brion @ pobox.com)
Making it a goal now ensures that we *can* take our time "doing it right," *before* it becomes an immediate need. A comprehensive test suite is a must for this, and I know a lot of work has been put into improving the one we have over the past few weeks. And, of course, there's still room for optimization in the current codebase.
Agreed, but our test suite isn't nearly as good as it should be; we only have a few tests for each feature, most of which test only the basic functionality of that feature, as opposed to testing it in the kinds of situations that come up "in the wild" and/or are expected.
The parser tests are currently a lot better than nothing, and if something passes them it's certainly close to being a drop-in replacement for the current parser, but they aren't quite perfect at the moment.
Making it a goal now ensures that we *can* take our time "doing it right," *before* it becomes an immediate need. A comprehensive test suite is a must for this, and I know a lot of work has been put into improving the one we have over the past few weeks. And, of course, there's still room for optimization in the current codebase.
Fully agreed; I just hope the whole thing doesn't disappear in some colossal programming-language war (and that I haven't just started one right now).
On 5/9/05, Brion Vibber brion@pobox.com wrote:
There are a couple of possibilities here. One is to rewrite MediaWiki in some other language entirely, say Python or C# or what have you.
...
In particular, PHP tends to impose an architecture where each request is served by an entirely new script invocation: you have to build any information up from scratch on each hit, and sharing things like localization tables between invocations is kind of hard. In another language it might be easier to run MediaWiki as a standalone server, which can keep shared data in memory and use it transparently from each thread.
Blaming the language is rarely a productive way to fix such a problem. In particular, Python usually uses the same initialize-and-run model you complain about PHP using, and mod_mono doesn't seem to be widely used. (And I assume Wikimedia has no interest in switching to Windows servers.)
Most reasonably mature languages are fast enough that the performance bottlenecks are usually in user code. And at least one of the PHP problems--lack of a JIT--will be solved when PHP-on-Parrot is available. (There's at least one project to do this; the interpreter itself is already quite fast and has JITs for several platforms, although much of it still has to be written.)
The other is to keep refactoring the PHP codebase (and it's been much changed since you left it, Lee) and, optionally, rewrite particular hotspots in another language.
I like the idea of porting hotspots, but keep in mind that we want people to be able to use this even if they don't have access to a C compiler.
A possible middle road is to rewrite the core wiki engine to a separate daemon, and adapt the existing PHP user interface to call into it for much of the backend work that actually touches data.
I like this idea, although it carries its own costs. In particular, communicating between processes has inherent overhead; we'd have to be reasonably sure that the caching would outweigh that. Also note that IPC and process-management mechanisms tend to vary across operating systems; we might lose Windows support, for example.
One advantage of this split is that we could rewrite one of the components in another language if we want, without affecting the other one. If the front end becomes little more than a shell around a daemon, we could provide a version written in C as an Apache module, which is about as fast as it gets.
On the back end, Perl 6 is specced to have one of the most powerful pattern-matching engines ever shipped with a language; it should be able to eat wikicode for breakfast. It's also designed to allow easy interoperability with other languages, so just the wikicode parser could be written in it while the rest is left as PHP. Implementation is moving quickly, with the backend Parrot Grammar Engine in progress and the Pugs (Perl 6 in Haskell) compiler just starting on the syntax. (And, of course, you'd have a very happy Perl hacker here.)
I like the idea of porting hotspots, but keep in mind that we want people to be able to use this even if they don't have access to a C compiler.
That's going to be a problem with any non-PHP code, to a greater or lesser extent, but there's no reason someone can't maintain a (now modularized) native-PHP parser.
A possible middle road is to rewrite the core wiki engine to a separate daemon, and adapt the existing PHP user interface to call into it for much of the backend work that actually touches data.
I like this idea, although it carries its own costs. In particular, communicating between processes has inherent overhead; we'd have to be reasonably sure that the caching would outweigh that. Also note that IPC and process-management mechanisms tend to vary across operating systems; we might lose Windows support, for example.
I'd like to see as much portability maintained as possible, but to be honest, I'm not personally worried about it. MediaWiki is, first and foremost, the software that runs the Wikimedia projects. I can't see us running Windows on the cluster anytime soon.
One advantage of this split is that we could rewrite one of the components in another language if we want, without affecting the other one. If the front end becomes little more than a shell around a daemon, we could provide a version written in C as an Apache module, which is about as fast as it gets.
There are numerous options available to us, and that's one of them. Making MediaWiki the server itself is at the far end of that extreme, and there are countless possibilities between it and the other.
On the back end, Perl 6 is specced to have one of the most powerful pattern-matching engines ever shipped with a language; it should be able to eat wikicode for breakfast. It's also designed to allow easy interoperability with other languages, so just the wikicode parser could be written in it while the rest is left as PHP. Implementation is moving quickly, with the backend Parrot Grammar Engine in progress and the Pugs (Perl 6 in Haskell) compiler just starting on the syntax. (And, of course, you'd have a very happy Perl hacker here.)
I've yet to be sold on Perl 6, but that's just my personal opinion. It may yet turn out to be the best option, especially if its pattern-matching capabilities are all the evangelists say they are.
-- Brent 'Dax' Royal-Gordon brent@brentdax.com Perl and Parrot hacker
At least you're open about your bias. ;)
-- Austin
Brent 'Dax' Royal-Gordon wrote:
Most reasonably mature languages are fast enough that the performance bottlenecks are usually in user code.
What a horrible generalisation. Most reasonably mature languages have shockingly bad tight-loop performance, and tight loops are required for parsing wikitext.
Here's the minimalistic example. This C++ code:
int main() {for (unsigned i=0; i<(unsigned)1e9; i++);}
executes in 1.58 seconds on my desktop computer. It was compiled with g++ -O3. The equivalent PHP code:
<?php for ($i=0; $i<1e7; $i++); ?>
...took 6.3 seconds for one hundredth as many iterations, which works out to about 400 times slower than C++ on this task. Perl is 173 times slower. For old times' sake I ran this:
FOR i& = 1 TO 1E+08
NEXT
... on QuickBASIC in safe mode in a DOS box. It took 8.5 seconds for 10^8 iterations, a mere 54 times slower than 32-bit C++. So don't tell me interpreted languages are getting faster, they're as slow as the day they were invented. If you need to execute large numbers of calls to short-running functions, you're sunk.
-- Tim Starling (considering rewriting parser in QuickBASIC)
On 5/10/05, Tim Starling t.starling@physics.unimelb.edu.au wrote:
So don't tell me interpreted languages are getting faster, they're as slow as the day they were invented. If you need to execute large numbers of calls to short-running functions, you're sunk.
I've had some good experience with Python+Psyco. Psyco can deliver excellent performance boosts for tight loops doing simple operations on numbers and strings, and does a good job with function calls.
Using your example, I just timed incrementing an integer variable 10^9 times: 113 seconds with plain Python and 2.35 seconds with Psyco.
In the benchmarks at The Great Computer Language Shootout [1], Python+Psyco is about 5x faster than PHP. It is within a factor of ten of the speed of C/gcc for most benchmarks. (The worst case, n-body simulation, is 85 times slower than C. However, I tried a trivial optimization for the inner loop and it got twice as fast :-)
[1] http://shootout.alioth.debian.org/great/benchmark.php?test=all&lang=psyc...
Fredrik
I've had some good experience with Python+Psyco. Psyco can deliver excellent performance boosts for tight loops doing simple operations on numbers and strings, and does a good job with function calls.
Don't forget Pyrex[1] also.. (I abhor Python, but I will give it a fair shake in arguments such as this).
[1] http://www.cosc.canterbury.ac.nz/~greg/python/Pyrex/
David A. Desrosiers desrod@gnu-designs.com http://gnu-designs.com
On Tue, 2005-05-10 at 22:06 +0200, Fredrik Johansson wrote:
On 5/10/05, Tim Starling t.starling@physics.unimelb.edu.au wrote:
So don't tell me interpreted languages are getting faster, they're as slow as the day they were invented. If you need to execute large numbers of calls to short-running functions, you're sunk.
I've had some good experience with Python+Psyco. Psyco can deliver excellent performance boosts for tight loops doing simple operations on numbers and strings, and does a good job with function calls.
Using your example, I just timed incrementing an integer variable 10^9 times: 113 seconds with plain Python and 2.35 seconds with Psyco.
In the benchmarks at The Great Computer Language Shootout [1], Python+Psyco is about 5x faster than PHP. It is within a factor of ten of the speed of C/gcc for most benchmarks. (The worst case, n-body simulation, is 85 times slower than C. However, I tried a trivial optimization for the inner loop and it got twice as fast :-)
I did a small benchmark based on fibonacci(40) recently: Java 1.5 won with 3.95 seconds over C with 4.1 seconds, Python/Psyco took 9.3 seconds, and PHP 5 with eAccelerator didn't finish within two minutes. The Java number is probably the result of some clever caching of previous results within the JVM.
For what it's worth, I'd strongly support a MediaWiki rewrite in Python. I've played with Twisted recently, which seems particularly well suited to the task of a distributed wiki system with many supported interfaces and protocols, a long-running process and better performance than LAMP. Even MoinMoin on Twisted beats MediaWiki: MediaWiki 1.4/PHP 5: 4.7 req/sec, MediaWiki 1.4/PHP 5/eAccelerator: 16.2 req/sec, Moin 1.3 on Twisted: 34.6 req/sec. This is without any use of Psyco, Pyrex or Twisted's new C modules. A simple DB interface app using Nevow on Twisted does 230 req/sec on my laptop.
On Tuesday 10 May 2005 21:18, Tim Starling wrote:
int main() {for (unsigned i=0; i<(unsigned)1e9; i++);} executes in 1.58 seconds on my desktop computer.
Hm, we all know how badly microbenchmarks can go wrong. In this case I'd have expected the loop to be optimized away, since it does not affect the state of the rest of the program in any way.
daniel
On 5/17/05, Daniel Wunsch the.gray@gmx.net wrote:
Hm, we all know how badly microbenchmarks can go wrong. In this case I'd have expected the loop to be optimized away, since it does not affect the state of the rest of the program in any way.
What if it's purposefully used to create a delay to avoid a clash between two threads in the control system for a nuclear reactor? ;-)
- Fredrik
On 5/17/05, Fredrik Johansson fredrik.johansson@gmail.com wrote:
On 5/17/05, Daniel Wunsch the.gray@gmx.net wrote:
Hm, we all know how badly microbenchmarks can go wrong. In this case I'd have expected the loop to be optimized away, since it does not affect the state of the rest of the program in any way.
What if it's purposefully used to create a delay to avoid a clash between two threads in the control system for a nuclear reactor? ;-)
In that case, anyone who uses a waiting loop for a specified number of iterations, or even a specified time, rather than any explicit clash handling should be fired immediately.
Andre Engels
On Tue, May 17, 2005 at 09:20:15PM +0200, Daniel Wunsch wrote:
On Tuesday 10 May 2005 21:18, Tim Starling wrote:
int main() {for (unsigned i=0; i<(unsigned)1e9; i++);} executes in 1.58 seconds on my desktop computer.
Hm, we all know how badly microbenchmarks can go wrong. In this case I'd have expected the loop to be optimized away, since it does not affect the state of the rest of the program in any way.
It might be pretty off-topic, but not much more than the rest of this thread.
This loop is indeed optimized away in any decent SSA-style compiler. gcc3 experimentally supports SSA, and it's enabled by -fssa -fssa-dce -fssa-ccp. gcc4 is supposed to be totally SSA-based.
It is possible in non-SSA compilers too (there is no such thing as "the" optimization algorithm, every compiler has different bag of tricks), but they usually assume that if something affects control flow (and the i variable does), it should not be optimized away.
$ cat foo.cc
int main() { for (unsigned i=0; i<(unsigned)1e9; i++); return 0; }
$ g++ --version
g++ (GCC) 3.3.6 (Debian 1:3.3.6-5)
Copyright (C) 2003 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ g++ -O6 foo.cc -o foo
$ time ./foo
real    0m1.882s
user    0m1.573s
sys     0m0.101s
$ g++ -fssa -fssa-dce -O6 foo.cc -o foo
$ time ./foo
real    0m0.006s
user    0m0.004s
sys     0m0.002s
$ g++ -fssa -fssa-dce -O6 foo.cc -S
$ cat foo.s
        .file   "foo.cc"
        .text
        .align 2
        .p2align 4,,15
.globl main
        .type   main, @function
main:
.LFB3:
        pushl   %ebp
.LCFI0:
        movl    %esp, %ebp
.LCFI1:
        subl    $8, %esp
.LCFI2:
        andl    $-16, %esp
        xorl    %eax, %eax
        movl    %ebp, %esp
        popl    %ebp
        ret
.LFE3:
        .size   main, .-main
        .section        .note.GNU-stack,"",@progbits
        .ident  "GCC: (GNU) 3.3.6 (Debian 1:3.3.6-5)"
$
Please adjust your expectations now. :-)
Daniel Wunsch wrote:
On Tuesday 10 May 2005 21:18, Tim Starling wrote:
int main() {for (unsigned i=0; i<(unsigned)1e9; i++);} executes in 1.58 seconds on my desktop computer.
Hm, we all know how badly microbenchmarks can go wrong. In this case I'd have expected the loop to be optimized away, since it does not affect the state of the rest of the program in any way.
If I was using such a compiler, I would have just disabled that optimisation. For all compilers/interpreters, I tried with a few different loop sizes to make sure it was actually taking O(N) time. As it happens, for g++ -O3 I did dump the assembly code:
        xor     eax, eax
        .p2align 4,,15
L6:
        inc     eax
        cmp     eax, 999999999
        jbe     L6
I thought I'd be clever and make a tighter loop by hand:
        mov     ecx, 1000000000
        .p2align 4,,15
L6:
        loop    L6
But to my disappointment it was slower than the machine generated version. That's pipelining for you, I guess. I could have unrolled the loop, but, well, that's getting a bit silly.
The question I was trying to answer was: what is the fastest possible loop in a given language? I think it was a reasonable question to ask. For the interpreters, it was mainly a test of variable access speed. I did try Java, but I didn't include it because it was clear that it was using a JIT compiler to produce native code similar to the C++ version. My purpose was to make fun of the interpreters, the compiled language was just included as reference for what a fast language is.
Don't worry, I do have ways to make fun of Java, that just isn't one of them :)
Another point I would make is that, fundamentally, it is weak typing that makes Perl and PHP slow. To optimise Perl or PHP properly, you'd have to not only convert it to a low-level language but also inline the simpler variable-access functions and then use an optimiser capable of identifying the types as loop invariants, thus removing some of the conditional branches from the loop. Object-oriented C++ has the same problem.
-- Tim Starling
On 5/18/05, Tim Starling t.starling@physics.unimelb.edu.au wrote:
I thought I'd be clever and make a tighter loop by hand:
        mov     ecx, 1000000000
        .p2align 4,,15
L6:
        loop    L6
But to my disappointment it was slower than the machine generated version.
Note that if that has not changed (I do not really study instruction-level optimization nowadays), the "loop" instruction was slower than an almost identical (except for flags) dec (e)cx / jnz label pair. Go figure.
-- [[cs:User:Mormegil | Petr Kadlec]]
Brent 'Dax' Royal-Gordon wrote:
Blaming the language is rarely a productive way to fix such a problem. In particular, Python usually uses the same initialize-and-run model you complain about PHP using, and mod_mono doesn't seem to be widely used. (And I assume Wikimedia has no interest in switching to Windows servers.)
Just a note: I've specifically mentioned standalone daemons as an alternative possibility, *not* CGI programs or ASP.NET.
PHP really is much more strongly tied to the CGI-bin script execution model than say Python, which is fairly routinely used to write network servers and GUI programs.
There's nothing in PHP-the-language that makes a standalone daemon written in PHP _impossible_, but it's rather awkward with PHP-the-implementation's limitations -- no threading, awkward subprocess control, limited exception handling. (Domas has experimented a bit with this.)
(A complaint about PHP-the-language I do have though is its total lack of Unicode string support. The only sensible way to deal with non-ASCII material ends up being using byte-oriented strings in UTF-8 encoding, and you have to worry about character boundaries and invalid character sequences yourself where relevant. Not insurmountable, but it's certainly an annoyance. Python carries the legacy of transitioning from this model, and now has both byte-oriented and Unicode string types; perhaps one day PHP will make the jump too.)
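In practice that means remembering the encoding at every call site, with the /u regex modifier and the mbstring functions standing in for real Unicode strings; a small illustration (not MediaWiki code):

<?php
// Byte-oriented UTF-8 handling: lengths and boundaries go wrong unless the
// programmer says "UTF-8" explicitly everywhere.
$title = "Наука";                                  // 5 characters, 10 bytes

echo strlen($title);                               // 10: bytes, not characters
echo mb_strlen($title, 'UTF-8');                   // 5: what we actually meant

// Reject byte sequences that aren't valid UTF-8 before storing them.
$isValid = (bool) preg_match('/^.*$/us', $title);  // /u makes PCRE validate UTF-8

// Truncating by bytes can split a character in half; use mb_substr instead.
$prefix = mb_substr($title, 0, 3, 'UTF-8');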
Most reasonably mature languages are fast enough that the performance bottlenecks are usually in user code. And at least one of the PHP problems--lack of a JIT--will be solved when PHP-on-Parrot is available. (There's at least one project to do this; the interpreter itself is already quite fast and has JITs for several platforms, although much of it still has to be written.)
Well, we'll see when it gets there. :)
-- brion vibber (brion @ pobox.com)
So, on an alternate note: Wikimedia's needs notwithstanding, LAMP is a ubiquitous platform that's very useful for a large number of communities for various reasons. If Wikimedia decides to move on from Linux + Apache + PHP + MySQL and rewrite from scratch, or almost scratch, it would be a generous gesture to launch the current MediaWiki codebase onto its own trajectory as a separate project -- either under the same name, or something similar.
If that's the plan, it might be a good idea to start moving MediaWiki to be more of a general-purpose wiki plus Wikimedia add-ons. That would make the transition easier and more fruitful.
Just one man's opinion.
~Evan
Brent 'Dax' Royal-Gordon wrote:
Most reasonably mature languages
Well, that excludes PHP for starters. *ducks, runs, and hides* :)
I like the idea of porting hotspots, but keep in mind that we want people to be able to use this even if they don't have access to a C compiler.
I can't believe people are still bringing forward this argument. No, we don't! We do *not* want to limit our set of alternatives for the benefit of other webmasters. Whoever needs a pure-PHP wiki engine should fork MediaWiki and maintain it themselves when it becomes too un-PHP for their taste. This shouldn't - rather, MUST NOT - be Wikimedia's responsibility, and it MUST NOT adversely affect Wikipedia's performance.
Also note that IPC and process-management mechanisms tend to vary across operating systems; we might lose Windows support, for example.
Funny, first you clearly state that Wikimedia is not likely to run Windows servers anytime soon, and then you call for maintenance of Windows support. :-)
On the back end, Perl 6 is specced to have one of the most powerful pattern-matching engines ever shipped with a language; it should be able to eat wikicode for breakfast.
Yet Another Not-Really-A-Parser (a.k.a. RegExp HodgePodge)?
Timwi
Timwi timwi@gmx.net wrote:
I can't believe people are still bringing forward this argument. No, we don't! We do *not* want to limit our set of alternatives for the benefit of other webmasters. Whoever needs a pure-PHP wiki engine should fork MediaWiki and maintain it themselves when it becomes too un-PHP for their taste. This shouldn't - rather, MUST NOT - be Wikimedia's responsibility, and it MUST NOT adversely affect Wikipedia's performance.
...
Funny, first you clearly state that Wikimedia is not likely to run Windows servers anytime soon, and then you call for maintenance of Windows support. :-)
Then throw in an if($wgUseCExtensions) or something.
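That is, something along these lines, so sites without a compiler simply take the pure-PHP path; both function names are made up for illustration:

<?php
// Hypothetical: prefer a compiled parser hotspot if the site has built it.
$wgUseCExtensions = true;   // site configuration

function doQuickParse($wikitext) {
    global $wgUseCExtensions;
    if ($wgUseCExtensions && function_exists('mw_fast_parse')) {
        return mw_fast_parse($wikitext);      // made-up C extension function
    }
    return PurePhpParser::parse($wikitext);   // made-up pure-PHP fallback
}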
Although I contribute to en.wikipedia.org, understand that my primary concern and interest in MediaWiki is as a piece of software I can use on my own websites. I don't use hosting accounts without shell access (and in fact most of my sites are on my own server), but I can sympathize with people who do; I don't use Windows servers, but I can sympathize with those people too.
On the back end, Perl 6 is specced to have one of the most powerful pattern-matching engines ever shipped with a language; it should be able to eat wikicode for breakfast.
Yet Another Not-Really-A-Parser (a.k.a. RegExp HodgePodge)?
grammar MediaWiki::Wikitext {
    rule start :parsetree { ^ <wikitext> $ }
    rule wikitext { <literal> [ <command> <literal> ]*: }
    rule literal { <-[<>[]'{}\n]>*? }
    rule command {
          <bold_italic>
        | <extlink>
        | <wikilink>
        | <template>
        | (\n\h*\n)
        | .            # fallback
    }
    ...
}
Clearly that isn't complete (or even entirely functional), but you get the idea.