A quick start might be to temporarily disable all checking of links, and see if that helps much.
This seems to be a helpful suggestion. Without profiling, it's hard to tell where the bottleneck is, but I think link checking is a good guess.
And this is fairly simple to test (now that Lee has created a functioning test suite), and doing so could probably tell us whether this is a bottleneck. If so, then at least we know where to focus our optimization efforts.
If this is the problem, we are in luck, because there have been a lot of good improvement suggestions. But they all add complexity to the code (or database setup), and "premature optimization is the root of all kinds of evil," so if link checking isn't a bottleneck it would be counterproductive to spend a lot of time trying to optimize it.
--Mark
(Mark Christensen mchristensen@humantech.com):
A quick start might be to temporarily disable all checking of links, and see if that helps much.
This seems to be a helpful suggestion. Without profiling, it's hard to tell where the bottleneck is, but I think link checking is a good guess.
And this is fairly simple to test (now that Lee has created a functioning test suite), and doing so could probably tell us whether this is a bottleneck. If so, then at least we know where to focus our optimization efforts.
If this is the problem, we are in luck, because there have been a lot of good improvement suggestions. But they all add complexity to the code (or database setup), and "premature optimization is the root of all kinds of evil," so if link checking isn't a bottleneck it would be counterproductive to spend a lot of time trying to optimize it.
I think it's a bit premature for that yet. I think the differently-rendered missing links feature is pretty critical, not just a frill. In the short term the hardware will bail us out until we can find a solution.
Along those lines, I'd like to ask for some feedback: if I do the Bloom filter in shared memory thing, I can choose parameters to optimize things. So here's the first question: Bloom filters have no false negatives (that is, there's no risk that they'll show an existing page as non-existing), but there are false positives. What is an acceptable false-positive rate? With a 16-bit filter, the rate will be one in 65,000; I think that's a bit too high. With a 32-bit filter, it's one in 4 billion, which seems reasonable. 24 bits is one in 16 million, which might also be OK.
A 32-bit filter for 200,000 pages will fit into 2Mb of memory, which seems reasonable. It would require a 768-bit hash function. Anybody have recommendations for a good 768-bit hash function for titles?
Alternatively, 2Mb is probably enough space for a traditional hash table of titles and article IDs, provided the average length of a title isn't too long, and there's some reasonably efficient storage method for it.
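For the curious, here's a rough Python sketch of the kind of title filter being discussed. The class name, the use of SHA-512 slices as the hash functions, and the parameter choices are illustrative assumptions, not a final design:

    # Toy Bloom filter for page titles (illustrative only).
    # Approximate false-positive rate: (1 - e^(-k*n/m))^k for n titles,
    # m bits of filter, and k hash positions per title.
    import hashlib
    import math

    class TitleBloomFilter:
        def __init__(self, num_bits, num_hashes):
            self.m = num_bits
            self.k = num_hashes
            self.bits = bytearray(num_bits // 8 + 1)

        def _positions(self, title):
            # Slice one wide digest into k indexes, in the spirit of the
            # "wide hash function" idea above (SHA-512 is assumed here).
            digest = hashlib.sha512(title.encode("utf-8")).digest()
            for i in range(self.k):
                yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m

        def add(self, title):
            for pos in self._positions(title):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, title):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(title))

    # Rough numbers: 200,000 titles, 2 MB of filter (16 million bits), 8 slices.
    n, m, k = 200000, 2 * 1024 * 1024 * 8, 8
    f = TitleBloomFilter(m, k)
    f.add("Main Page")
    print("Main Page" in f)
    print("approx false-positive rate: %.2e" % ((1 - math.exp(-k * n / m)) ** k))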
On Thu, May 01, 2003 at 02:47:35PM -0500, Lee Daniel Crocker wrote:
Along those lines, I'd like to ask for some feedback: if I do the Bloom filter in shared memory thing, I can choose parameters to optimize things. So here's the first question: Bloom filters have no false negatives (that is, there's no risk that they'll show an existing page as non-existing), but there are false positives. What is an acceptable false-positive rate? With a 16-bit filter, the rate will be one in 65,000; I think that's a bit too high. With a 32-bit filter, it's one in 4 billion, which seems reasonable. 24 bits is one in 16 million, which might also be OK.
If all the content of CUR is moved to, or duplicated in, a filesystem, as David Wheeler suggested (and I agree completely with him), you don't need the Bloom filter; the filesystem will do the work.
---- Luc Van Oostenryck aka User:Looxix
On Thu, May 01, 2003 at 10:31:06PM +0200, Luc Van Oostenryck wrote:
On Thu, May 01, 2003 at 02:47:35PM -0500, Lee Daniel Crocker wrote:
Along those lines, I'd like to ask for some feedback: if I do the Bloom filter in shared memory thing, I can choose parameters to optimize things. So here's the first question: Bloom filters have no false negatives (that is, there's no risk that they'll show an existing page as non-existing), but there are false positives. What is an acceptable false-positive rate? With a 16-bit filter, the rate will be one in 65,000; I think that's a bit too high. With a 32-bit filter, it's one in 4 billion, which seems reasonable. 24 bits is one in 16 million, which might also be OK.
If all the content of CUR is moved to, or duplicated in, a filesystem, as David Wheeler suggested (and I agree completely with him), you don't need the Bloom filter; the filesystem will do the work.
Actually, I think the best win would come from having OLD on the filesystem. CUR would be excellent, too. As it is, OLD has complete copies of every edit ever made, and it forces MySQL to consume an unholy amount of memory. :(
Actually, I think the best win would come from having OLD on the filesystem. CUR would be excellent, too. As it is, OLD has complete copies of every edit ever made, and it forces MySQL to consume an unholy amount of memory. :(
There's nothing magical about filesystems--the same work has to be done whether the data is in a database or a filesystem. However it is true that filesystems tend to be optimized for different kinds of access patterns than databases, and those access patterns may be better suited to big chunks of text.
Also, I don't see why the size of the old table has anything to do with the amount of memory used by MySQL. There's almost no difference in performance between a database with 10,000 entries and one with 100,000 entries. A more significant win is likely to be reducing the size of the /individual records/, perhaps by putting the full text in the filesystem and just having pointers and stats in the database.
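Roughly, what I have in mind is something like this sketch (the schema, directory layout, and use of SQLite are invented purely for illustration):

    # Illustrative sketch only: small pointer/metadata rows stay in the
    # database, while the bulk revision text lives as plain files on the
    # filesystem. Table name, columns, and paths are all invented here.
    import os
    import sqlite3

    TEXT_DIR = "old_text"  # hypothetical directory for revision text

    conn = sqlite3.connect("wiki_meta.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS old_meta (
                        old_id    INTEGER PRIMARY KEY,
                        title     TEXT,
                        timestamp TEXT,
                        comment   TEXT,
                        text_path TEXT)""")

    def save_revision(old_id, title, timestamp, comment, text):
        # Full text goes to a file; only a small pointer row stays in the DB.
        os.makedirs(TEXT_DIR, exist_ok=True)
        path = os.path.join(TEXT_DIR, "%d.txt" % old_id)
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        conn.execute("INSERT OR REPLACE INTO old_meta VALUES (?, ?, ?, ?, ?)",
                     (old_id, title, timestamp, comment, path))
        conn.commit()

    def load_revision_text(old_id):
        row = conn.execute("SELECT text_path FROM old_meta WHERE old_id = ?",
                           (old_id,)).fetchone()
        if row is None:
            return None
        with open(row[0], encoding="utf-8") as f:
            return f.read()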
Moving only the old data to the filesystem would also be less of a problem than moving the cur records, because we don't have the problem of losing MySQL's fulltext index, which we need to keep for the cur table to implement the search function.
Another win for the two-server setup might be keeping old text in the front-end file system rather than the database machine, to reduce traffic over the wire.
On Thu, 2003-05-01 at 15:56, Lee Daniel Crocker wrote:
Moving only the old data to the filesystem would also be less of a problem than moving the cur records, because we don't have the problem of losing MySQL's fulltext index, which we need to keep for the cur table to implement the search function.
Actually, that's not the case. We've always used a separate, munged text field for the searchable text (stripping HTML markup, accounting for things like changing "[[foobar]]s" to "foobar foobars", massaging Unicode chars for the non-Latin wikis to make search work right in Esperanto, Polish, Japanese, Chinese, etc).
Originally these were fields in cur, but now they're in a separate table (searchindex) since cur is InnoDB and we can only do fulltext searches on MyISAM tables.
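As a toy illustration of the kind of munging involved (this is not the real code, just a sketch of the idea):

    # Toy sketch of search-text munging: expand "[[foobar]]s" so both the
    # link target and the displayed word are indexed, keep only the label
    # of piped links, and strip HTML tags. Not the actual MediaWiki code.
    import re

    def munge_for_search(wikitext):
        # "[[foobar]]s" -> "foobar foobars"
        def expand_link(match):
            target, tail = match.group(1), match.group(2)
            return "%s %s%s" % (target, target, tail) if tail else target
        text = re.sub(r"\[\[([^]|]+)\]\](\w*)", expand_link, wikitext)
        # "[[target|label]]" -> "label"
        text = re.sub(r"\[\[[^]|]+\|([^]]+)\]\]", r"\1", text)
        # Crude HTML tag stripping
        text = re.sub(r"<[^>]+>", " ", text)
        return text.lower()

    print(munge_for_search("Some [[foobar]]s and a <b>bold</b> [[target|label]]."))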
-- brion vibber (brion @ pobox.com)
(Brion Vibber brion@pobox.com):
Moving only the old data to the filesystem would also be less of a problem than moving the cur records, because we don't have the problem of losing MySQL's fulltext index, which we need to keep for the cur table to implement the search function.
Actually, that's not the case. We've always used a separate, munged text field for the searchable text (stripping HTML markup, accounting for things like changing "[[foobar]]s" to "foobar foobars", massaging Unicode chars for the non-Latin wikis to make search work right in Esperanto, Polish, Japanese, Chinese, etc).
Originally these were fields in cur, but now they're in a separate table (searchindex) since cur is InnoDB and we can only do fulltext searches on MyISAM tables.
Ah, that's right. Funny how one can forget details about one's own code (sort of; I wrote the text munging, you moved it to the new table). At any rate, yes, we could store cur text in the filesystem as well. I'll have to try that sometime.
Also, I don't see why the size of the old table has anything to do with the amount of memory used by MySQL. There's almost no difference in performance between a database with 10,000 entries and one with 100,000 entries. A more significant win is likely to be reducing the size of the /individual records/, perhaps by putting the full text in the filesystem and just having pointers and stats in the database.
I didn't say having more entries will reduce performance. A select on one of the columns probably doesn't take a lot of time. Just having a lot of records in old isn't making MySQL slow. But, it is trying to keep each individual record in memory. Just look at the bz2 files available for download. The cur table is 73MB, and old is 740MB. The old table makes the database roughly 10x as big. Ouch! Since it is that big, the processes grow and grow, eventually causing the machine to swap like mad.
Even with the new machine, I think it would be helpful to perhaps split old into two databases. The database we have now would contain everything we have now, plus a key to articles on the other database. That way, looking at comments and dates of old articles (ie, Page History) would still be fast, and we could reduce the memory footprint of the primary server. If you wanted to retrieve an older article, it would take that key, connect to the other database and pull it down. That other database could have perhaps a lower priority, and a much lower memory buffer. Or, we could store it on the filesystem, whatever.
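In sketch form, the lookup path might look something like this (the schema and database names are invented for illustration, and SQLite stands in for the real MySQL servers):

    # Illustrative sketch of the split: history metadata on the primary
    # server, bulk old text in a secondary archive keyed by text_id.
    import sqlite3

    primary = sqlite3.connect("primary.db")   # hypothetical primary server
    archive = sqlite3.connect("old_text.db")  # hypothetical secondary server

    primary.execute("""CREATE TABLE IF NOT EXISTS old_meta (
                           old_id    INTEGER PRIMARY KEY,
                           title     TEXT,
                           user      TEXT,
                           timestamp TEXT,
                           comment   TEXT,
                           text_id   INTEGER)""")
    archive.execute("""CREATE TABLE IF NOT EXISTS old_text (
                           text_id INTEGER PRIMARY KEY,
                           text    TEXT)""")

    def page_history(title):
        # Fast path: Page History only touches the primary database.
        return primary.execute(
            "SELECT timestamp, user, comment FROM old_meta "
            "WHERE title = ? ORDER BY timestamp DESC", (title,)).fetchall()

    def old_revision_text(old_id):
        # Slow path: follow the key over to the archive database.
        row = primary.execute("SELECT text_id FROM old_meta WHERE old_id = ?",
                              (old_id,)).fetchone()
        if row is None:
            return None
        hit = archive.execute("SELECT text FROM old_text WHERE text_id = ?",
                              (row[0],)).fetchone()
        return hit[0] if hit else None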
In the evening (1:00 AM GMT), I see:

    MemTotal:   2059088 kB
    MemFree:      93096 kB
    MemShared:       40 kB
    Buffers:      14892 kB
    SwapTotal:  2047992 kB
    SwapFree:   1289140 kB
That's pretty heavy swap usage (33%), and heavy memory usage.
Nick Reinking wrote:
I didn't say having more entries will reduce performance. A select on one of the columns probably doesn't take a lot of time. Just having a lot of records in old isn't making MySQL slow. But, it is trying to keep each individual record in memory. Just look at the bz2 files available for download. The cur table is 73MB, and old is 740MB. The old table makes the database roughly 10x as big. Ouch! Since it is that big, the processes grow and grow, eventually causing the machine to swap like mad.
The database has to handle that. And I think a database is used here precisely because a lot of data is involved. Possibly you can set up the database with appropriate memory usage options, but I think that's already been done.
Even with the new machine, I think it would be helpful to perhaps split old into two databases. The database we have now would contain everything we have now, plus a key to articles on the other database. That way, looking at comments and dates of old articles (ie, Page History) would still be fast, and we could reduce the memory footprint of the primary server. If you wanted to retrieve an older article, it would take that key, connect to the other database and pull it down. That other database could have perhaps a lower priority, and a much lower memory buffer. Or, we could store it on the filesystem, whatever.
In the evening (1:00 AM GMT), I see:

    MemTotal:   2059088 kB
    MemFree:      93096 kB
    MemShared:       40 kB
    Buffers:      14892 kB
    SwapTotal:  2047992 kB
    SwapFree:   1289140 kB
That's pretty heavy swap usage (33%), and heavy memory usage.
I hope you know that these values by themselves are fairly useless. You have to know the ratio of page faults to page hits. Linux often keeps a lot of rarely used data in swap space.
Smurf
On Fri, May 02, 2003 at 06:46:08AM +0200, Thomas Corell wrote:
In the evening (1:00 AM GMT), I see:

    MemTotal:   2059088 kB
    MemFree:      93096 kB
    MemShared:       40 kB
    Buffers:      14892 kB
    SwapTotal:  2047992 kB
    SwapFree:   1289140 kB
That's pretty heavy swap usage (33%), and heavy memory usage.
I hope you know that these values by themselves are fairly useless. You have to know the ratio of page faults to page hits. Linux often keeps a lot of rarely used data in swap space.
That's the basic point. We have one symptom: Wikipedia is slow. We have lots of ideas about what might be the cause. We don't have any useful statistics about what hits the server:

- Swapping activity
- Process memory allocation
- Disk I/O ratio
What Lee Daniel's and my benchmarks have indicated is that the effect of checking links or checking for stubs is probably minor. The cause must be somewhere else.
From a previous mail (Brion once sent a ps aux) I remember that there weren't many unused processes idling around. Definitely not enough processes to allocate 800 MB of swap space. So what else affects the machine?
Can someone with server access please send us a recent "ps auxwww" output and the output of "vmstat 1 60"? (vmstat is included in recent versions of procps.)
If we really have a swap-related problem, then probably all our tuning activities won't show any improvements.
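Once we have that vmstat output, a throwaway script along these lines could summarize the swap columns (the column names are assumed from a standard procps vmstat; this is just a sketch):

    # Quick helper sketch for summarizing "vmstat 1 60" output: average the
    # swap-in (si) and swap-out (so) columns, which says more about actual
    # paging activity than a static snapshot of swap usage does.
    import subprocess

    def swap_activity(interval=1, count=60):
        out = subprocess.run(["vmstat", str(interval), str(count)],
                             capture_output=True, text=True).stdout
        lines = [line.split() for line in out.splitlines()]
        header = next(l for l in lines if "si" in l and "so" in l)
        si_col, so_col = header.index("si"), header.index("so")
        samples = [l for l in lines
                   if len(l) == len(header) and l[0].isdigit()]
        si = sum(int(l[si_col]) for l in samples) / len(samples)
        so = sum(int(l[so_col]) for l in samples) / len(samples)
        return si, so

    if __name__ == "__main__":
        si, so = swap_activity()
        print("average swap-in %.1f kB/s, swap-out %.1f kB/s" % (si, so))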
Best regards,
JeLuF
On Fri, May 02, 2003 at 12:01:20PM -0700, Brion Vibber wrote:
On Fri, 2 May 2003, Jens Frank wrote:
Can someone with server access please send us a recent "ps auxwww"-output and the output of vmstat 1 60 ?
Attached.
-- brion vibber (brion @ pobox.com)
I feel like I should note that there are a couple of differences from what I saw mid-week. Load was about twice as high, mysqld processes were using 49.2% memory, and si was hovering around 20-30.
A quick start might be to temporarily disable all checking of links, and see if that helps much.
This seems to be a helpful suggestion. Without profiling, it's hard to tell where the bottleneck is, but I think link checking is a good guess.
It's more than a guess--one of the very first things we did to test the new software was to run bots to fetch lots of pages, which we then sorted by response time. The pages that sank to the bottom of that list were special pages with complex queries and long pages with lots of links. Long pages without many links were not a problem.
I just did an ad-hoc benchmark on Piclab of the same installation with and without link-checking, and on the limited set of pages used by the test suite, the speedup was only about 3%. Of course, benchmarks on a single server may be less applicable to the multiple-server installation we're going to have soon.
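For reference, the ad-hoc benchmark amounted to little more than this kind of timing loop (the URLs and page list below are placeholders, not the actual Piclab setup):

    # Rough timing sketch: fetch the same pages from two test installations
    # (one with link checking, one without) and compare total wall time.
    # URLs and page titles are placeholders, not the real configuration.
    import time
    import urllib.request

    PAGES = ["Main_Page", "Sandbox"]  # stand-ins for the test-suite pages
    INSTALLS = {
        "with link checking":    "http://localhost/wiki-linkcheck/index.php?title=",
        "without link checking": "http://localhost/wiki-nolinkcheck/index.php?title=",
    }

    def time_install(base_url, repeats=5):
        start = time.time()
        for _ in range(repeats):
            for title in PAGES:
                urllib.request.urlopen(base_url + title).read()
        return time.time() - start

    for label, base in INSTALLS.items():
        print("%-22s %.2f s" % (label, time_install(base)))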