Hi - clearly, it'd be great if Wikipedia had better performance.
I looked at some of the "Database benchmarks" postings, but I don't see any analysis of what's causing the ACTUAL bottlenecks on the real system (with many users & full database). Has someone done that analysis?
I suspect you guys have considered far more options, but as a newcomer who's just read the source code documentation, maybe some of these ideas will be helpful:
1. Perhaps for simple reads of the current article (cur), you could completely skip MySQL and use the filesystem instead. Simple encyclopedia articles could be stored in the filesystem, one article per file. To avoid the huge-directory problem (which many filesystems don't handle well, though Reiser does), you could use the terminfo trick: create subdirectories for the first, second, and maybe even the third characters, e.g., "Europe" lives in "wiki/E/u/r/Europe.text". The existence of a file can serve as the link test. This may or may not beat MySQL, but it probably does: OS developers have been optimizing file access for a very long time, and instead of a userspace<->kernel<->userspace round trip you only have userspace<->kernel. You also completely avoid locking and other joyless issues. (A rough sketch of the path scheme follows this list.)
2. The generation of HTML from the wiki format could be cached, as has been discussed. It could also be sped up, e.g., by rewriting it in flex. I suspect it would be easy to rewrite the wiki-to-HTML translation in flex and produce something quite fast; my "html2wikipedia" is written in flex - it's really fast and didn't take long to write. The real problem is that I suspect rendering isn't the bottleneck anyway.
3. You could start sending out text ASAP, instead of batching it. Many browsers start displaying text as it's available, so to users it might _feel_ faster. Also, holding text in-memory may create memory pressure that forces more useful stuff out of memory.
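Here's a rough sketch of what I mean by the path scheme in point 1 (the layout, function name, and file extension are just illustrative, not actual MediaWiki code):

<?php
// Map an article title to a file path, bucketing on the first few
// characters so no single directory gets huge.
// e.g. "Europe" -> "wiki/E/u/r/Europe.text"
function article_path( $title, $root = 'wiki', $depth = 3 ) {
    $parts = array( $root );
    $len = min( $depth, strlen( $title ) );
    for ( $i = 0; $i < $len; $i++ ) {
        $parts[] = $title[$i];
    }
    $parts[] = $title . '.text';
    return implode( '/', $parts );
}

// The existence of the file doubles as the link test.
$path = article_path( 'Europe' );            // "wiki/E/u/r/Europe.text"
if ( file_exists( $path ) ) {
    $text = file_get_contents( $path );      // serve the raw article text
}
?>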
Anyway, I don't know if these ideas are all that helpful, but I hope they are.
(David A. Wheeler dwheeler@dwheeler.com):
- Perhaps for simple reads of the current article (cur), you
could completely skip using MySQL and use the filesystem instead.
In other words, caching. Yes, various versions of that have been tried and proposed, and more will be. The major hassles are (1) links, which are displayed differently when they point to existing pages, so a page may appear differently from one view to the next depending on the existence of other pages, and (2) user settings, which will cause a page to appear differently for different users. But caching is still possible within limits, and using the filesystem rather than the database to store cached page info is certainly one possible implementation to be tried.
[Rendering] could also be sped up, e.g., by rewriting it in flex. My "html2wikipedia" is written in flex - it's really fast and didn't take long to write. The real problem is that I suspect rendering isn't the bottleneck anyway.
It isn't. And there's no reason to expect flex to be any faster than any other language.
- You could start sending out text ASAP, instead of batching it.
Many browsers start displaying text as it's available, so to users it might _feel_ faster. Also, holding text in-memory may create memory pressure that forces more useful stuff out of memory.
Not an issue. HTML is sent out immediately after it's rendered. Things like database updates are deferred until after sending; the only time taken before that is spent in rendering, and as I said, that's not a bottleneck.
One thing that would be nice is if the HTTP connection could be dropped immediately after sending and before those database updates. That's easy to do with threads in Java Servlets, but I haven't found any way to do it with Apache/PHP.
On Tue, 2003-04-29 at 23:33, Lee Daniel Crocker wrote:
(David A. Wheeler dwheeler@dwheeler.com):
- Perhaps for simple reads of the current article (cur), you
could completely skip using MySQL and use the filesystem instead.
In other words, caching.
Not necessarily; it would also be possible to keep the wiki text in files. But I'm not sure what great benefit this would have, as you still have to go looking up various information to render it.
Yes, various versions of that have been tried and proposed, and more will be. The major hassles are (1) links, which are displayed differently when they point to existing pages, so a page may appear differently from one view to the next depending on the existence of other pages,
That's not a problem; one simply invalidates the caches of all linking pages when creating/deleting.
This is already done in order to handle browser-side caching; each page's cur_touched timestamp is updated whenever a linked page is created or deleted. Simply regenerate the page if cur_touched is more recent than the cached HTML.
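In code terms, the check is roughly this (cache path and helper names are hypothetical, not the actual implementation):

// Serve the cached HTML only if it is newer than cur_touched, which is
// bumped whenever the page, or a page linking to it, is created or deleted.
$cacheFile = '/var/cache/wiki/' . urlencode( $title ) . '.html';
$touched = wfGetCurTouched( $title );  // hypothetical: cur_touched as a Unix timestamp
if ( file_exists( $cacheFile ) && filemtime( $cacheFile ) >= $touched ) {
    readfile( $cacheFile );            // cache hit, no re-rendering needed
} else {
    $html = renderPage( $title );      // hypothetical renderer
    $fp = fopen( $cacheFile, 'w' );    // refresh the cache
    fwrite( $fp, $html );
    fclose( $fp );
    echo $html;
}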
- You could start sending out text ASAP, instead of batching it.
Many browsers start displaying text as it's available, so to users it might _feel_ faster.
A few things (like language links) currently require parsing the entire wikitext before we output the topbar. Hypothetically we could output the topbar after the text and let CSS take care of its location, as we do for the sidebar, but this may be problematic (e.g., with varying vertical size due to word wrap) and would leave users navigationally stranded while the page loads.
Also, holding text in-memory may create memory pressure that forces more useful stuff out of memory.
Not an issue. HTML is sent out immediately after it's rendered.
Well... many passes of processing are done over the wikitext on its way to HTML, then the whole bunch is dumped out in a chunk.
Things like database updates are deferred until after sending;
I'm not 100% sure how safe this is; if the user closes the connection from their browser deliberately (after all, the page _seems_ to be done loading, why is the icon still spinning?) or due to an automatic timeout, does the script keep running through the end or is it halted in between queries?
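As an aside, PHP does expose a knob for this; assuming stock Apache/mod_php behaviour, something like the following should keep the script running to completion even if the client disconnects (untested sketch):

// Ask PHP not to kill the script when the client disconnects, so any
// queries still pending after the page is sent run to completion.
ignore_user_abort( true );
set_time_limit( 0 );   // and don't let a slow update trip the time limit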
One thing that would be nice is if the HTTP connection could be dropped immediately after sending and before those database updates. That's easy to do with threads in Java Servlets, but I haven't found any way to do it with Apache/PHP.
For some things (search index updates) we use INSERT/REPLACE DELAYED queries, whose actual action will happen at some point in the future, taken care of for us by the database. There doesn't seem to be an equivalent for UPDATE queries.
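For instance, something along these lines (table and column names are made up for illustration, not the actual schema):

// DELAYED returns immediately and lets MySQL perform the actual write later.
mysql_query( "REPLACE DELAYED INTO searchindex (si_title, si_text) VALUES ('"
    . mysql_escape_string( $title ) . "', '"
    . mysql_escape_string( $text ) . "')" );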
Hypothetically we could have an entirely separate process to perform asynchronous updates and just shove commands at it via a pipe or shared memory, but that's probably more trouble than it's worth.
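One workaround that sometimes gets suggested for the connection-drop problem is to buffer the whole page, send an explicit Content-Length and flush, then carry on with the updates. Whether Apache actually hands the connection back at that point depends on the configuration (output filters like mod_gzip can defeat it), so this is only an untested sketch:

ob_start();                                   // buffer the whole rendered page
// ... render and echo the HTML ...
header( 'Connection: close' );
header( 'Content-Length: ' . ob_get_length() );
ob_end_flush();                               // hand the buffer to Apache
flush();                                      // and push it to the client
// ... deferred database updates go here, combined with ignore_user_abort() as above ...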
-- brion vibber (brion @ pobox.com)
Hi,
I have a question about the Wikipedia code. I noticed that the way it accesses GET/POST variables from the URL is by using global variables. There are two problems with that:
- it doesn't work if the register_globals option is off (which is the default in newer versions of PHP)
- it is considered to be a security risk (http://www.php.net/manual/en/configuration.directives.php#ini.register-globa..., http://www.php.net/manual/en/security.registerglobals.php)
The fix for those problems is very simple: for each variable passed through GET/POST, add code like this: $title = $HTTP_GET_VARS['title'];
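For example (the variable names are just illustrative):

// At the top of the entry point, pull in exactly the variables we expect,
// instead of relying on register_globals to inject them.
$title  = isset( $HTTP_GET_VARS['title'] )  ? $HTTP_GET_VARS['title']  : '';
$action = isset( $HTTP_GET_VARS['action'] ) ? $HTTP_GET_VARS['action'] : 'view';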
My questions: a) is there any special reason it's being done this way in Wikipedia? b) any chance it can be changed? If yes, what can I do to help make this happen? (I can write the code, test it, and submit a patch.)
Thanks,
Krzysztof Kowalczyk
[register_globals problem] My questions: a) is there any special reason it's being done this way in Wikipedia? b) any chance it can be changed? If yes, what can I do to help make this happen? (I can write the code, test it, and submit a patch.)
No, and yes. You'll notice I already started doing that for SearchEngine.php. If you want to help me out with the others, go for it. Please see that as a model.
On Wed, 30 Apr 2003, Krzysztof Kowalczyk wrote:
I have a question about Wikipedia code. I noticed that the way it accesses GET/POST variables from URL is by using global variables. There are two problems with that:
- it doesn't work if register_globals options is off (which is a default
in newer versions of PHP)
The wiki uses a number of non-standard options...
- it is considered to be a security risk
Sure, if you use *uninitialized* global variables and assume they can only have trusted values. Don't do that. :)
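The textbook example (hypothetical functions, not from the wiki code):

// BAD: $authorized is never initialized, so with register_globals on
// a request for page.php?authorized=1 grants itself access.
if ( check_password( $user, $pass ) ) {
    $authorized = true;
}
if ( $authorized ) {
    show_secret_stuff();
}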
My questions: a) is there any special reason it's being done this way in Wikipedia?
Force of habit.
b) any chance it can be changed? If yes, what can I do to help make this happen (I can write the code, test it and submit a patch)
Sure, please send patches. $_GET / $_POST are ugly as heck, but it's theoretically a better coding practice.
Keep in mind that a few things might work by either GET or POST (searches; some legit bots).
-- brion vibber (brion @ pobox.com)
- it doesn't work if register_globals options is off (which is
a default in newer versions of PHP)
The wiki uses a number of non-standard options...
Actually, register_globals is the only thing you have to change in php.ini to get the wiki running.
- it is considered to be a security risk
Sure, if you use *uninitialized* global variables and assume they can only have trusted values. Don't do that. :)
Hopefully. I'm not confident that we don't do that now, or that future coders won't, so I think avoiding the problem by coding so that register_globals isn't needed is a good idea.
My questions: a) is there any special reason it's being done this way in Wikipedia?
Force of habit.
Don't forget laziness. :-)
Sure, please send patches. $_GET / $_POST are ugly as heck, but it's theoretically a better coding practise.
In SearchEngine.php, I used $_REQUEST[], because I don't really care whether the variables come from a GET or a POST.
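i.e., something like this (just the pattern, not the literal SearchEngine.php code):

// $_REQUEST merges GET and POST, so a search submitted either way
// ends up in the same variables.
$search   = isset( $_REQUEST['search'] )   ? $_REQUEST['search']   : '';
$fulltext = isset( $_REQUEST['fulltext'] ) ? $_REQUEST['fulltext'] : '';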
On Wed, 30 Apr 2003, Lee Daniel Crocker wrote:
The wiki uses a number of non-standard options...
Actually, register_globals is the only thing you have to change in php.ini to get the wiki running.
You also need iconv support compiled in, although for a latin-1-only wiki that doesn't need to interact with incoming and outgoing links in UTF-8 it _probably_ won't get triggered.
Sure, if you use *uninitialized* global variables and assume they can only have trusted values. Don't do that. :)
Hopefully. I'm not that confident that either we don't do that, or that future coders won't do that, so I think avoiding the problem by coding so that register_globals isn't needed is a good idea.
Yup. Like overflowing your buffers: nobody does it on _purpose_. :)
In SearchEngine.php, I used $_REQUEST[], because I don't really care whether the variables come from a GET or a POST.
Oh hey, I learn something new every day. :)
-- brion vibber (brion @ pobox.com)