----- Original Message -----
From: "Brion Vibber" <brion(a)pobox.com>
To: <wikitech-l(a)wikipedia.org>
Sent: Sunday, November 17, 2002 11:35 PM
Subject: Re: [Wikitech-l] questions about the SQL structure
On Sat, 2002-11-16 at 01:10, Magnus Manske wrote:
> Jonathan Walther wrote:
> > 13) Similar to the question about the orphanedlinks table, why is
l_from
a string, while l_to is an integer?
I think l_from could be an integer as well. Would probably increase
lookup speed, and save memory. Some things might get more complicated,
though, like "what links here".
On closer inspection, I think using an integer would make whatlinkshere
*simpler*. Currently, every linking page's title has to be parsed into
namespace/title and checked for redirect status:
SELECT DISTINCT l_from FROM links WHERE l_to={$lid}
(now loop over the results with:)
SELECT cur_id,cur_is_redirect FROM cur
WHERE cur_namespace={$ns} AND cur_title='{$t}'
If we used an integer index, I think we could do this in one step:
SELECT cur_id,cur_namespace,cur_title,cur_is_redirect
FROM cur,links WHERE l_from=cur_id AND l_to={$lid}
Of course, I'm the guy flipping through "Learn SQL in a weekend" as I
type, so take all my SQL advice with a suitably large grain of salt. ;)
An even better move would be to use a 64-bit number for the index rather
than an integer. If this were to be done it would become possible to assign
article id numbers even to articles which did not exist by using a
cryptographic checksum function to create an article ID directly from the
article title text. This then gives the simpler query benefits that
Jonathan Walter was talking about earlier without the need to actually
create a whole load of empty articles some of which will inevitably be
orphaned before they are actually used. It is necessary to use at least a
64-bit number because anything smaller gives too much chance of a hash
collision occurring and two articles accidentally ending up with the same
id.
By the way an SQL clause like -- SELECT a FROM b WHERE c = NULL -- should
always select zero records because comparing anything to NULL always
produces FALSE as a result. The correct way to do this in standard SQL
s -- SELECT a FROM b WHERE c IS NULL.
Cheers
Derek