Gabriel Wicke wrote:
On Tue, 13 Jan 2004 00:07:50 +0000, Nick Hill wrote:
The most commonly used pages will be in the memory of the database server, so these are not costly to serve. The costly pages are those that need disk seeks: the more I/O seek operations a page requires, the more costly it is to serve.
Yup. So let's avoid them.
Given that popular articles will be in the database memory cache, requests for them should not lead to HDD seeking on the database server.
I would expect a Squid proxy to be best at serving popular pages and poor at serving less popular pages, so I can't see how Squid is very helpful at saving HDD seeks.
The proxy server will need to make a database lookup (for the URL)
Nope. Only if a page is *not* in the cache or marked as not cacheable.
I meant that the Squid server will need to look up its own database (in whatever form that may be - filesystem or indexed DBMS) to check whether it has a copy of the required data, using the URL as the key. If it has a copy, it must pull it either from memory or from disk. If not, it forwards the request and then adds the page to its own database.
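To make that lookup path concrete, here is a rough sketch of its shape (this is not Squid's actual implementation; the cache directory and MD5 keying are just placeholders): the URL is the key, memory is checked first, then the on-disk store, and only on a miss does the request go back to the web server.

    import hashlib, os

    CACHE_DIR = "/var/cache/proxy"       # hypothetical disk store
    memory_cache = {}                    # URL -> page body (bytes)

    def disk_path(url):
        return os.path.join(CACHE_DIR, hashlib.md5(url.encode()).hexdigest())

    def fetch(url, fetch_from_backend):
        if url in memory_cache:                  # no disk I/O at all
            return memory_cache[url]
        path = disk_path(url)
        if os.path.exists(path):                 # one seek on the proxy box
            with open(path, "rb") as f:
                body = f.read()
        else:                                    # miss: hit the web server
            body = fetch_from_backend(url)
            os.makedirs(CACHE_DIR, exist_ok=True)
            with open(path, "wb") as f:
                f.write(body)
        memory_cache[url] = body
        return body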
If the Squid server needs to pull an article page off its disk, then disk I/O is required in just the same way as when the web server reads an uncached piece of data from the database.
As I/O is the bottleneck, the Squid server is likely to suffer the same problems as the underlying database server, and the problems are likely to be worse because the chunks of data handled by Squid (compressed, fully formed HTML pages) are larger than the chunks handled by the database server (article text).
If performance is the criterion, I suggest a proxy isn't a good idea.
Well- please read up on the docs, or benchmark http://www.aulinx.de/ - a commodity server (Celeron 2GHz) running Squid.
I am not contending that Squid isn't a very high performance server - I believe it is, and that it can substantially reduce the bandwidth ISPs need to serve web surfers.
The issue for wikipedia is how many disk accesses, in total, are needed for each article hit.
Wikipedia has millions of discrete pieces of data, most of them referenced individually by a unique URL. Squid will not be able to hold a substantial proportion of these in memory. In a given amount of memory, Squid will be able to hold fewer of these chunks (rendered HTML pages) than a database server could, because the article text stored in the database is smaller than the article rendered as HTML. For the larger articles the compressed HTML page will be smaller than the stored text, but for most articles the compressed HTML page will be bigger: the relative weight of the HTML markup is much greater for short articles than for long ones, and compression only reduces page size by about half.
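This size argument is easy to measure directly. The sketch below (with the article strings left as placeholders rather than real data) simply compares the raw article text with the rendered page and its gzipped form:

    import gzip

    def report(title, article_text, rendered_html):
        # article_text and rendered_html would come from a real dump and
        # render; no figures are assumed here, the point is to measure.
        html_bytes = rendered_html.encode("utf-8")
        gzipped = gzip.compress(html_bytes)
        print(title,
              "- text:", len(article_text.encode("utf-8")), "bytes,",
              "HTML:", len(html_bytes), "bytes,",
              "gzipped HTML:", len(gzipped), "bytes")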
I assume viewing an article history page requires several pieces of information, leading to multiple seeks per request. If Squid could serve article histories, then a single I/O on the Squid box could save several database seeks on the database server - a substantial economy. However, each individual page history is requested fairly rarely, and forcing a Squid cache reload of the history page whenever an article is updated may be a poor use of resources.
I suggest four avenues for investigation:

1) Store articles in the MySQL table in compressed (gzip) format. This reduces the size of the articles, making them fit more easily into the available cache memory and increasing the chance of a cache hit by almost a factor of two. Perhaps this could be made as a patch to MySQL (a sketch of doing it at the application layer follows this list).

2) Investigate ways of prioritising data cached in memory so that smaller chunks have a higher value than larger chunks and are not flushed by the basic least-recently-used algorithm, to reflect the relative cost of reading a small chunk of data from the HDD (also sketched below).

3) If the SQL code underlying Wikipedia relies on temporary tables as part of its queries, investigate whether the I/O of writing temporary tables tends to flush data from the disk cache. If so, write temporary tables to ramdisk or other storage which does not cause flushing. More recent versions of MySQL support sub-queries, which may obviate the need for temporary tables.

4) Judicious use of solid state storage. This could dramatically reduce seek times and the I/O bottleneck. There are issues to resolve regarding flash memory durability and possible MySQL hotspots, as well as the cost of mass solid state storage, but it might be worthwhile for some wiki data.
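For avenue 1, here is a minimal sketch of what this could look like at the application layer rather than as a MySQL patch: compress the article text before it goes into the text column and decompress it on the way out. The function names are mine, not anything in the existing code.

    import gzip

    def pack_article(text):
        # stored in a BLOB column instead of the raw article text
        return gzip.compress(text.encode("utf-8"))

    def unpack_article(blob):
        return gzip.decompress(blob).decode("utf-8")

For avenue 2, a minimal sketch of a size-aware eviction policy, assuming a simple in-memory cache rather than anything inside MySQL or Squid: instead of plain LRU, the victim is the entry with the worst size-times-age score, so one big stale page is dropped before many small ones.

    import time

    class SizeAwareCache:
        # LRU variant that also weighs entry size when choosing a victim.

        def __init__(self, capacity_bytes):
            self.capacity = capacity_bytes
            self.entries = {}              # key -> [value, size, last_used]
            self.used = 0

        def get(self, key):
            entry = self.entries.get(key)
            if entry is None:
                return None
            entry[2] = time.time()         # refresh recency
            return entry[0]

        def put(self, key, value):
            size = len(value)
            if size > self.capacity:
                return                     # too big to bother caching
            old = self.entries.pop(key, None)
            if old is not None:
                self.used -= old[1]
            while self.used + size > self.capacity:
                # score = size * age: a big, stale entry is evicted before
                # many small ones, since reloading any chunk costs a seek
                now = time.time()
                victim = max(self.entries,
                             key=lambda k: self.entries[k][1] *
                                           (now - self.entries[k][2]))
                self.used -= self.entries[victim][1]
                del self.entries[victim]
            self.entries[key] = [value, size, time.time()]
            self.used += size

The size-times-age score is only one possible weighting; the point is simply that eviction can account for the fact that re-reading a small chunk from disk costs roughly the same seek as re-reading a large one.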