[Wikipedia-l] "Wanted pages" fix

Jan.Hidders hidders at uia.ua.ac.be
Tue Jul 9 10:12:09 UTC 2002


On Mon, Jul 08, 2002 at 04:01:38PM -0700, lcrocker at nupedia.com wrote:
> 
> I could easily enough add a unique key with a combination of 
> the "from" and "to" fields.

Actually, I don't think you can. I tried that at the time and found out that
MySQL cannot define unique indexes over such large (in bytes) combinations.
So unless you changed the types of the columns, that is not possible.

> Is there a good reason to do that?  I 
> never access the table by that--I only access it by the individual 
> fields *i.e., there's never any query in which both fields appear in 
> the where clause or an order by or group by).

The main reason would be data integrity and perhaps compactness, not access,
and, well, we would be breaking Codd's first commandment: "Thou shalt always
have a primary key". :-) But seriously, you could quite easily maintain this
constraint by making the variabel for broken links an associative array that
associates the links names with "1" or something. Thanks to the excellent
object-oriented way in which you organized the code, that would probably be
a simple change in the class LinkCache plus some changes when
reading/writing it from the database.

Would it be worth it? I think that #pages is a slightly better metric than
#links because although it is policy to link every word only once on a page,
a newcomer might make a certain titel inadvertantly very wanted by linking
it very often on the same page.

> MySQL is doing it, but it uses a temp file ("explain" on the query 
> says so, anyway).  And the error we were getting before was the size 
> of the tempfile bumping into a disk quota.

Ultimately the best solution would be to have a table wanted(title, #pages)
with an index on #pages (and a unique index on title), then MySQL wouldn't
need to sort at all. I don't know of the top of my head if there are any
other queries that depend on 'brokenlinks' but I don't believe so and if
there are not then I would recommend replacing it. That would really speed
things up and save resources.

Translating the updates of 'brokenlinks' to updates of 'mostwanted' is not
trivial but also not very difficult either. If a page is deleted/replaced
you subtract 1 of #pages for every title it had a broken link to. If a page
is created/updated you add 1 to #pages for every title it has a broken link
to. The only extra work is that you have to check before the increment if
the title is already in the table (otherwise you have to create it first) and
on the other hand if after the decremetn #pages has become zero (in which
case you probaly want to delete it).

-- Jan Hidders




More information about the Wikipedia-l mailing list