Does anybody have any guesses why the site has slowed down recently? What did we change? Is it possible that the new diff engine eats too many resources?
I think it would be good to have a user account on wikipedia.com so that we can monitor the load status and the SQL queries in order to diagnose these problems better.
Axel
Axel Boldt wrote:
Does anybody have any guesses why the site has slowed down recently? What did we change? Is it possible that the new diff engine eats too many resources?
I think it would be good to have a user account on wikipedia.com so that we can monitor the load status and the SQL queries in order to diagnose these problems better.
Axel
I agree. What would be useful would be a server logfile with some timing/load figures (a real logfile, not in the database). This could be made available at a static URL, and we could run crunching scripts on it to try to analyze what the problem might be related to.
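To make that concrete, here is a rough sketch of the sort of crunching script I have in mind. It assumes a made-up one-line-per-request format of timestamp, URL and response time in seconds; the real script would of course follow whatever format we actually decide to log:

    #!/usr/bin/env python
    # crunch.py -- summarize a timing logfile.  Assumed (hypothetical)
    # format: "<timestamp> <URL> <seconds>", one entry per request.
    import sys
    from collections import defaultdict

    def crunch(path):
        count, total = 0, 0.0
        per_url = defaultdict(list)        # URL -> list of response times
        for line in open(path):
            parts = line.split()
            if len(parts) != 3:
                continue                   # skip malformed lines
            timestamp, url, secs = parts
            secs = float(secs)
            count += 1
            total += secs
            per_url[url].append(secs)

        print("requests: %d, mean time: %.2fs" % (count, total / max(count, 1)))
        slowest = sorted(per_url.items(),
                         key=lambda item: sum(item[1]) / len(item[1]),
                         reverse=True)[:10]
        for url, times in slowest:
            print("%6.2fs avg over %4d hits  %s"
                  % (sum(times) / len(times), len(times), url))

    if __name__ == "__main__":
        crunch(sys.argv[1])

Even something this crude would tell us whether the slowdown is across the board or concentrated on a few expensive pages.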
-- Neil
At the end of this month, I am planning to buy a new server for some of the other stuff that's on the wikipedia server (jimmywales.com/timshell.com/kirawales.com/ and various other odds and ends). Once I get that stuff off of this machine, I plan to isolate this machine from the rest of my network (which sounds fancy but just involves making sure there are no stray ssh keys around) and give login accounts to active developers, including root access to anyone who REALLY needs it.
This will help to remove *me* as the bottleneck to improvements, which I am right now.
In the meantime, here's the latest log of slow queries -- this should be helpful in diagnosing our current problems.
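If anyone wants a head start on crunching it, something along these lines should give a rough ranking of the worst offenders. It is only a sketch -- it assumes the stock slow-query log layout with '# Query_time:' header lines followed by the SQL, which can vary a bit between MySQL versions:

    # slowlog.py -- crude summary of a MySQL slow-query log.
    import sys

    def parse_slow_log(path):
        entries = []                       # list of (seconds, query text)
        secs, query_lines = None, []
        for line in open(path):
            if line.startswith("# Query_time:"):
                if secs is not None and query_lines:
                    entries.append((secs, " ".join(query_lines)))
                secs = float(line.split()[2])
                query_lines = []
            elif line.startswith("#"):
                continue                   # other headers (Time, User@Host)
            elif secs is not None:
                query_lines.append(line.strip())
        if secs is not None and query_lines:
            entries.append((secs, " ".join(query_lines)))
        return entries

    if __name__ == "__main__":
        for secs, query in sorted(parse_slow_log(sys.argv[1]), reverse=True)[:20]:
            print("%7.1fs  %s" % (secs, query[:120]))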
Should I install the latest from the cvs?
--Jimbo
On Tue, 2002-04-16 at 12:12, Jimmy Wales wrote:
At the end of this month, I am planning to buy a new server for some of the other stuff that's on the wikipedia server (jimmywales.com/timshell.com/kirawales.com/ and various other odds and ends). Once I get that stuff off of this machine, I plan to isolate this machine from the rest of my network (which sounds fancy but just involves making sure there are no stray ssh keys around) and give login accounts to active developers, including root access to anyone who REALLY needs it.
This will help to remove *me* as the bottleneck to improvements, which I am right now.
In the meantime, here's the latest log of slow queries -- this should be helpful in diagnosing our current problems.
Should I install the latest from the cvs?
Probably a good idea -- there are a number of bugs fixed there that we've been getting multiple bug reports on (including the 'bad link causes following text on same line to disappear' and 'edit links for articles with non-ascii characters in title are mysteriously converted to UTF-8 by certain versions of Internet Explorer, resulting in the wrong page being edited' bugs).
-- brion vibber (brion @ pobox.com)
With the recent increases in traffic (which fit well with my estimates earlier) I think we should be giving some thought now to how the Wikipedia can be made massively scalable. If we start now, we should have a solution available before it is needed, rather than hitting a crunch point and then fixing things in a hurry.
Some thoughts on this:
* The single image of the database is the ultimate bottleneck. This should be held in RAM as far as possible. Thought should be given to using PostgreSQL rather than MySQL, since PostgreSQL should scale better under heavy load.
* Wikipedia is read an order of magnitude more often than it is written. Caching can give a major performance boost, and will help palliate underlying scaling problems.
* Wikipedia read accesses appear to obey [[Zipf's law]]: most pages are read relatively infrequently, and a few very frequently. However, most of the load is from the aggregated accesses of the many low-traffic pages. Therefore, a 'hot spot' cache will not work: the cache has to be big enough to encompass the whole Wikipedia (a rough calculation illustrating this is sketched just after this list). Again, the cache will have to be held in RAM to avoid the 1000-times slower performance of disk accesses.
* Web serving consumes RAM: a slow modem transfer of a page will lock down resources in RAM for the duration of a page load. At one end of the scale, this is simply socket buffers. At the other end of the scale, it can be a page-rendering thread and its stack, unable to progress to do other work because it can't commit its output to the lower levels of the web server. On low-traffic sites, this is insignificant. On high-traffic sites, it can become a bottleneck. Machines with vast amounts of RAM are very expensive: it is cheaper to buy a number of smaller machines, each with a reasonable amount of RAM -- and you get the benefit of extra CPUs for nothing.
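To put some very rough numbers on the 'hot spot' point above: the little calculation below estimates what fraction of reads a cache of only the top K pages would catch, if popularity really is Zipf-like. The article count and exponent are placeholders -- the real distribution should be measured from the access logs:

    # How much of the read load would a cache of the K most popular
    # pages catch, under an assumed Zipf-like distribution?  The page
    # count (30000) and exponent (1.0) are made-up placeholders.

    def zipf_coverage(n_pages, cache_size, s=1.0):
        weights = [1.0 / (rank ** s) for rank in range(1, n_pages + 1)]
        return sum(weights[:cache_size]) / sum(weights)

    if __name__ == "__main__":
        for k in (100, 1000, 10000):
            print("top %5d pages cover about %2.0f%% of reads"
                  % (k, 100 * zipf_coverage(30000, k)))

With these particular placeholder numbers, even a cache of the hottest 1,000 pages misses roughly a third of all reads, which is why the cache really needs to cover essentially the whole Wikipedia.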
=== The grand vision ===
For all these reasons, I propose a solution in which the database is kept on a separate server from the user-facing servers. The extra overhead involved in doing this should be repaid many times over in the long run by the fact that the front-end servers will take more than 90% of the load off the central server.
The front-end servers should be kept _dumb_, and all the control logic, program code, and configuration files should reside on the master server. This removes the need to keep the configuration of a large number of machines in sync.
The front-end server code can also be kept small and tight, with most of the work (page rendering, parsing, etc.) being done by the existing code. (Later, rendering and parsing could be moved to the front-end servers if needed).
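To make the division of labour concrete, the front end's entire request handling could be as small as the sketch below. All the names here are invented -- the point is just that the front end holds no wiki logic of its own, only caching and decoration (how it decides whether its cached copy is still valid is covered in the next section):

    # Sketch of a dumb front-end's request handling.  'cache' and
    # 'master' are hypothetical objects standing in for the local page
    # cache and the central server; none of this is the real code.

    def handle_request(title, user_prefs, cache, master):
        raw = cache.get(title)                 # raw rendered page, no skin
        if raw is None or not master.is_fresh(title, cache.timestamp(title)):
            raw = master.fetch_raw_page(title) # rendered by the existing code
            cache.store(title, raw)
        return decorate(raw, user_prefs)       # apply skin, CSS, user links

    def decorate(raw_page, user_prefs):
        # Purely cosmetic work, so it can safely live on the front-end boxes.
        return user_prefs.skin_header() + raw_page + user_prefs.skin_footer()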
=== How to get there easily ===
The first step is to implement a separate caching server.
The system can even be trialled now, by running the master and slave servers on the same box! The existing Wikipedia just needs to have a special skin that serves up 'raw' rendered pages, without any page decorations, CSS, or whatever.
The caching server will deal with all user interaction: cookies, skins, etc. It should serve up 'raw' pages decorated with user-specific content and style sheets.
The caching system could be put on an experimental server, or later put on the main site, as a beta test.
There is only one other thing needed to implement this. Each page in the central DB needs to have a 'last changed' time, and a way to get hold of it. Now, this is where things get slightly tricky. This 'last changed' timestamp should be updated, not only when the page itself is edited, but also when any page linked in the page is created or deleted. That is to say: when creating or deleting a page, timestamps on all pages that link to that page should be touched.
Note that there is no need to touch other pages' timestamps when a page is edited. This is good, as otherwise the extra overhead of all that marking would be unreasonably high.
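In code terms, the touching would look something like the sketch below. The table and column names are invented for illustration -- they are not our actual schema -- and 'cursor' is a generic Python DB-API cursor:

    # What 'touching' might look like.  Hypothetical schema: a 'pages'
    # table with (title, last_changed) and a 'links' table with
    # (from_title, to_title).

    def touch_on_edit(cursor, title, now):
        # An ordinary edit only touches the page itself.
        cursor.execute("UPDATE pages SET last_changed = %s WHERE title = %s",
                       (now, title))

    def touch_on_create_or_delete(cursor, title, now):
        # Creating or deleting a page also touches every page that links
        # to it, since the link colouring on those pages just changed.
        touch_on_edit(cursor, title, now)
        cursor.execute(
            "UPDATE pages SET last_changed = %s"
            " WHERE title IN (SELECT from_title FROM links WHERE to_title = %s)",
            (now, title))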
Now, this last-modified timestamp can be determined using a special request format: the underlying HTTP mechanism (Last-Modified / If-Modified-Since headers) could be used, but this is too awkward to fiddle with as a way of supporting experimental software. Better to do something like:
http://www.wikipedia.com/wiki/last_changed?Example_article
(which even avoids invoking the main code at all)
result: a page with the content:
20020417120254
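The freshness check on the front end is then trivial. A sketch, assuming the response body is nothing but the bare timestamp shown above:

    # Ask the central server for a page's last_changed timestamp and
    # compare it with the one we recorded when we cached the raw page.
    # The URL format is the one proposed above; the rest is illustrative.
    import urllib.parse
    import urllib.request

    LAST_CHANGED_URL = "http://www.wikipedia.com/wiki/last_changed?%s"

    def is_fresh(title, cached_timestamp):
        url = LAST_CHANGED_URL % urllib.parse.quote(title)
        with urllib.request.urlopen(url) as response:
            current = response.read().strip().decode("ascii")
        # Timestamps in YYYYMMDDHHMMSS form compare correctly as strings.
        return current <= cached_timestamp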
=== Criticisms ===
But, I hear you say -- surely you have added another bottleneck? For every page hit, you still generate a hit on the central DB for the 'last changed' timestamp. The answer is: yes, but -- the timestamp db is very small, performs a very simple lightweight operation, has a single writer and multiple readers, and can therefore later be made separate from the main DB, replicated, etc.
So, to sum up, the ideas are:
* Client/server database split
* Front-end/back-end script split
Does anyone have any idea how reasonable this is as an idea, based on retrofitting it onto the existing script by refactoring? I think that if things were done slowly, the front-end code could be made about 10% of the overall complexity, mostly refactored from existing code.
-- Neil
A few links regarding comparisons of PostgreSQL vs. MySQL.
The general consensus seems to be that MySQL is fast but slightly flaky and may have scaling problems. PostgreSQL is slower under low loads, but getting better, and may perform better than MySQL under heavy loads.
Really, the only way to find out is to try both and compare.
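The comparison does not need to be elaborate, either: load the same dump into both databases and time a handful of our own typical queries against each. A minimal harness is sketched below -- 'conn' is whatever Python DB-API connection each database's driver provides, and the example query is a placeholder, not our real schema:

    # Time one query, many times, against whichever database 'conn'
    # points at.  Run it with identical data loaded into MySQL and
    # PostgreSQL and compare the numbers.
    import time

    def time_query(conn, sql, iterations=100):
        cur = conn.cursor()
        start = time.time()
        for _ in range(iterations):
            cur.execute(sql)
            cur.fetchall()
        elapsed = time.time() - start
        print("%8.2f ms/query  %s" % (1000.0 * elapsed / iterations, sql))

    # e.g. time_query(mysql_conn, "SELECT text FROM pages WHERE title = 'Example'")
    #      time_query(pg_conn,    "SELECT text FROM pages WHERE title = 'Example'")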
---------------------------------------------------------------------
From:
http://linux.oreillynet.com/pub/a/linux/2002/01/31/worldforge.html?page=2
Quotes:
Riddoch:
"MySQL performs poorly under a heavy load," he says. "In particular, it does not handle very large tables well and does not optimize complex queries well.
Harrington:
"For read-only applications [PostgreSQL] can be significantly slower than MySQL, but we have not done any benchmarking or timing so we can't really say much there.
http://www.mysql.com/information/benchmarks.html
shows MySQL generally outperforming PostgreSQL on simple reads.
http://www.cbbrowne.com/info/rdbmssql.html
mostly critiques MySQL as unstable compared to other databases.
the Open Source Database Benchmark
http://openacs.org/philosophy/why-not-mysql.html
generally criticises MySQL!