== Access patterns ==
I did some stats ages ago, showing an approximately Zipfian distribution for page accesses. A bit of calculation shows that, for small numbers of articles, this means a small amount of cache will give a large performance boost. However, as the number of articles increases, the vast majority of seldom-accessed articles will come to dominate article-fetch behaviour. Thus, RAM caching will decrease in usefulness over time as the project grows, unless the RAM cache stays close to the size of the entire working set.
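To put rough numbers on this, here is a back-of-envelope sketch (the Zipf exponent s = 1 and the article counts are assumptions of mine, not measurements) of how the hit rate of a fixed-size cache of the most popular articles falls as the article count grows:

 # Estimate the fraction of requests served by caching the top k articles,
 # assuming accesses follow a Zipf distribution with exponent s.
 def zipf_hit_rate(n_articles, cache_size, s=1.0):
     harmonic = lambda n: sum(1.0 / i ** s for i in range(1, n + 1))
     return harmonic(cache_size) / harmonic(n_articles)
 print(zipf_hit_rate(100000, 1000))    # ~0.62: caching 1% of articles serves ~62% of requests
 print(zipf_hit_rate(1000000, 1000))   # ~0.52: same cache, ten times the articles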
Nick's suggestion of tuning the filing system page size to the article size is a good idea; it will tend to make the currently available RAM cache more effective. I'm rather dubious about some of his other suggestions.
Where RAM caching is really important is in the "hot" data such as article timestamps and link tables. These have already been partially addressed by the use of memcached, I believe. These commonly accessed pieces of data should be small enough to keep in RAM all the time, giving a large speedup to the system.
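As a concrete sketch of the hot-data idea (using the python-memcached client; the key scheme and the database fallback are hypothetical, purely for illustration):

 import memcache
 mc = memcache.Client(['127.0.0.1:11211'])    # assumed memcached host:port
 def get_article_timestamp(article_id, fetch_from_db):
     key = 'ts:%d' % article_id       # hypothetical key scheme
     ts = mc.get(key)                 # RAM lookup: no disk seek
     if ts is None:
         ts = fetch_from_db(article_id)   # fall back to the database on a miss
         mc.set(key, ts)              # cache for subsequent requests
     return ts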
== Seek bound performance ==
Since disk I/O requests are effectively random, the load will be dominated by seek and rotational latency: it will cost very nearly the same to pick 64 KB off the disk for an article as to get 4 bytes for a timestamp.
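To illustrate with assumed numbers (6 ms average access and 50 MB/s sustained transfer are guesses for a fast disk, not measurements):

 access_ms = 6.0               # assumed average seek + rotational latency
 transfer_mb_s = 50.0          # assumed sustained transfer rate
 def request_ms(size_bytes):
     return access_ms + size_bytes / (transfer_mb_s * 1e6) * 1000.0
 print(request_ms(4))          # 4-byte timestamp: ~6.0 ms
 print(request_ms(64 * 1024))  # 64 KB article:    ~7.3 ms, barely slower

Either way, a single spindle manages only about 130-170 random requests per second.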
Using high-performance disks and spreading the database across many RAID spindles should greatly increase performance, since random I/O capacity scales roughly with the number of independent spindles.
I agree with the posters who are arguing for software RAID: it outperforms hardware RAID in many cases, and again, we can fine-tune stripe sizes etc. to our application. (Big stripe sizes are a bad idea for random-seek loads, but give better performance for streaming loads.) We should also consider kernel 2.6: it brings major gains in disk I/O performance, and most of its teething troubles are in areas that don't affect servers.
== Not all disks are equal ==
Consider buying the disks specifically on access-time statistics. In particular, high-performance SCSI disks should greatly out-perform IDE for random-seek access patterns, even though their streaming performance may be roughly the same. SCSI tagged command queueing will further increase performance where there is concurrency on a single spindle.
See http://www.storagereview.com/php/benchmark/bench_sort.php for some interesting stats:
* a Fujitsu MAS3735 has an average read access time of 5.6 ms, for a price of $700 for 73 GB
* a Hitachi Deskstar 7K250 has an average read access time of 12.1 ms, for a price of $250 for 250 GB
* a Seagate U6 has an average read access time of 20.0 ms, for a price of ??? for 80 GB
According to this, if performance is dominated by read access time, the most expensive drive should have almost four times the random-read performance of the cheapest, all else being equal.
Using price and performance figures such as those above, we should be able to calculate the best price/performance/storage compromise for this application.
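For instance, a quick sketch using the two drives with known prices, approximating random-read throughput as 1000 ms divided by average access time (the U6 is omitted because its price is unknown):

 drives = {
     'Fujitsu MAS3735':        {'access_ms': 5.6,  'price': 700, 'gb': 73},
     'Hitachi Deskstar 7K250': {'access_ms': 12.1, 'price': 250, 'gb': 250},
 }
 for name, d in drives.items():
     iops = 1000.0 / d['access_ms']   # ~random reads/sec per spindle
     print('%-24s %5.0f IOPS  %.2f IOPS/$  %.2f GB/$'
           % (name, iops, iops / d['price'], d['gb'] / d['price']))

Interestingly, on these figures the cheap drive wins on both IOPS per dollar and GB per dollar; the expensive drive only wins when spindle count (rack space, power, controller ports) is the limiting factor.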
== The Google strategy for article caching ==
Google seem to use a large number of RAM-based cache servers, based on the observation that network latency on a small LAN is tiny, while disk latency is large. This does not make any sense for us now: we don't have the resources (unless Google open-source their Google File System).
For future expansion, it might be cheaper to buy ten 4 GB RAM commodity machines than one 40 GB enterprise-class machine, and spread the load across them (a sketch of one way to do this follows). Although this would still be costly, the performance of serving data directly from RAM would be very high.
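A minimal sketch of spreading the load, assuming we hash each key to one of N cache machines as memcached clients commonly do (the hostnames are hypothetical):

 import hashlib
 cache_nodes = ['cache0:11211', 'cache1:11211', 'cache2:11211']
 def node_for(key):
     # stable hash, so every front end picks the same node for a given key
     h = int(hashlib.md5(key.encode()).hexdigest(), 16)
     return cache_nodes[h % len(cache_nodes)]
 print(node_for('article:Disk_performance'))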
-- Neil