There are two things I don't understand here. First, at the test site this query now seems to take around 7 seconds. So did you already implement this there?
Yes, it's in place already. I tested it on my own server first, as always, but I'm putting this and other bug fixes that I judge either non-risky or important enough to justify some risk on the live server as well, so they can get hit with Neil's bots.
Secondly, I don't see why counting links should be easier unless you are doing something really weird like allowing duplicates in the table 'brokenlinks'. You are still using this table, aren't you?
There are no unique keys in the links or brokenlinks tables, so yes, there are "duplicates" if you want to call them that--but they represent real info, in that if page A links to non-existent page B twice, there will be exactly two records "A -> B" in the table. And of course there are lots and lots of duplicates of the individual "from" and "to" fields, because there have to be.
I could easily enough add a unique key on the combination of the "from" and "to" fields. Is there a good reason to do that? I never access the table by that--I only access it by the individual fields (i.e., there's never any query in which both fields appear in the where clause or an order by or group by).
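If we did want it, it should just be something like the following (untested, and I'm guessing at the real column names here--call them bl_from and bl_to for the sake of the example):

    ALTER TABLE brokenlinks
        ADD UNIQUE KEY bl_from_to (bl_from, bl_to);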
Temp file? Are you now doing the sorting yourself?
MySQL is doing it, but it uses a temp file ("explain" on the query says so, anyway). And the error we were getting before was the size of the tempfile bumping into a disk quota.
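For the record, the query I'm explaining is roughly this (column names approximate, from memory):

    EXPLAIN SELECT bl_to, COUNT(*) AS n
    FROM brokenlinks
    GROUP BY bl_to
    ORDER BY n DESC;

and the Extra column comes back with "Using temporary; Using filesort", which is where the temp file comes from.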
On Mon, Jul 08, 2002 at 04:01:38PM -0700, lcrocker@nupedia.com wrote:
I could easily enough add a unique key on the combination of the "from" and "to" fields.
Actually, I don't think you can. I tried that at the time and found out that MySQL cannot define unique indexes over such large (in bytes) combinations. So unless you changed the types of the columns, that is not possible.
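(As I recall, the limit is 500 bytes per key for MyISAM tables, and two 255-character title columns add up to more than that. You could in principle define a unique index over column prefixes, e.g.

    ALTER TABLE brokenlinks
        ADD UNIQUE KEY bl_from_to (bl_from(200), bl_to(200));

but then uniqueness is only enforced on the prefixes, which is not really the constraint we want.)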
Is there a good reason to do that? I never access the table by that--I only access it by the individual fields (i.e., there's never any query in which both fields appear in the where clause or an order by or group by).
The main reason would be data integrity and perhaps compactness, not access, and, well, we would be breaking Codd's first commandment: "Thou shalt always have a primary key". :-) But seriously, you could quite easily maintain this constraint by making the variable for broken links an associative array that associates the link names with "1" or something. Thanks to the excellent object-oriented way in which you organized the code, that would probably be a simple change in the class LinkCache plus some changes when reading/writing it from the database.
Would it be worth it? I think that #pages is a slightly better metric than #links, because although it is policy to link every word only once on a page, a newcomer might inadvertently make a certain title look very wanted by linking it many times on the same page.
MySQL is doing it, but it uses a temp file ("explain" on the query says so, anyway). And the error we were getting before was the size of the tempfile bumping into a disk quota.
Ultimately the best solution would be to have a table wanted(title, #pages) with an index on #pages (and a unique index on title); then MySQL wouldn't need to sort at all. I don't know off the top of my head if there are any other queries that depend on 'brokenlinks', but I don't believe so, and if there are not, then I would recommend replacing it. That would really speed things up and save resources.
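Something like this, just as a sketch (I'm writing the count column as n_pages since '#' is not usable in a column name, and the types are guesses):

    CREATE TABLE wanted (
        title   VARCHAR(255) NOT NULL,
        n_pages INT NOT NULL DEFAULT 0,
        UNIQUE KEY (title),
        KEY (n_pages)
    );

The Most Wanted page then becomes a plain SELECT ... ORDER BY n_pages DESC LIMIT ..., which can walk the n_pages index instead of sorting.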
Translating the updates of 'brokenlinks' into updates of 'wanted' is not trivial, but not very difficult either. If a page is deleted/replaced, you subtract 1 from #pages for every title it had a broken link to. If a page is created/updated, you add 1 to #pages for every title it has a broken link to. The only extra work is that you have to check before the increment whether the title is already in the table (otherwise you have to create it first), and check after the decrement whether #pages has become zero (in which case you probably want to delete the row).
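In SQL it would be roughly this, per title T that gains or loses a broken link (names as in the sketch above):

    -- page created/updated: one more page wants T
    INSERT IGNORE INTO wanted (title, n_pages) VALUES ('T', 0);
    UPDATE wanted SET n_pages = n_pages + 1 WHERE title = 'T';

    -- page deleted/replaced: one page fewer wants T
    UPDATE wanted SET n_pages = n_pages - 1 WHERE title = 'T';
    DELETE FROM wanted WHERE title = 'T' AND n_pages = 0;

The INSERT IGNORE takes care of the "create it first" case in one statement, because of the unique key on title.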
-- Jan Hidders
(It seems that we're discussing a bit too much technical stuff here instead of on wikitech-l?)
lcrocker@nupedia.com wrote:
MySQL is doing it, but it uses a temp file ("explain" on the query says so, anyway). And the error we were getting before was the size of the tempfile bumping into a disk quota.
Since the new machine is wikipedia-only, and since it has a bigger disk, if there is a way to increase this quota, we might want to feel free to do it.
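(Alternatively, I'm told MySQL can be pointed at a roomier temp directory, which I gather is one line in my.cnf, something like

    [mysqld]
    tmpdir = /path/to/big/disk/tmp

though someone who actually administers the box should confirm that.)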
Although, for performance reasons, we might strongly prefer to force ourselves not to go about sorting long lists willy-nilly anyway.
--Jimbo