On Sat, Nov 21, 2009 at 6:39 PM, Aryeh Gregor
<Simetrical+wikilist(a)gmail.com> wrote:
Selecting a list of all titles that are not
redirects will take a long time on any database, unless you have
everything in memory, because it requires a table scan -- there's no
index that covers the relevant columns (IIRC).
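For reference, a minimal sketch of the query under discussion,
assuming the standard MediaWiki page-table columns (as in
enwiki-*-page.sql.gz):

```sql
-- All non-redirect titles. With no index covering
-- page_is_redirect, this forces a full scan of `page`.
SELECT page_namespace, page_title
FROM page
WHERE page_is_redirect = 0;
```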
You could build an index on page_is_redirect in the "page" table (see
enwiki-*-page.sql.gz). But I'm pretty sure PostgreSQL wouldn't use
it, and would do a sequential scan, since pretty much all the pages
are going to have to be accessed anyway.
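As a sketch (the index name is illustrative), building that index is
a one-liner, but since the overwhelming majority of pages are not
redirects, the planner should estimate that nearly every row matches
and pick a sequential scan anyway:

```sql
-- Hypothetical index; the name is illustrative.
CREATE INDEX page_is_redirect_idx ON page (page_is_redirect);

-- EXPLAIN shows which plan the planner actually chooses;
-- expect "Seq Scan on page" here despite the index.
EXPLAIN SELECT page_title FROM page WHERE page_is_redirect = 0;
```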
Five or ten minutes sounds about right. I can't imagine this is a
query you want to run over and over again. If it is, you'd probably
want to use partitioning
(http://www.postgresql.org/docs/8.4/interactive/ddl-partitioning.html),
but you're still not going to cut down the query time very much, as
it's going to be returning millions of rows.
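In 8.4 that means table inheritance plus CHECK constraints; a rough
sketch (child table names are hypothetical) splitting redirects from
ordinary pages might look like:

```sql
-- Parent `page` keeps the schema; children hold the rows.
CREATE TABLE page_redirects (
    CHECK (page_is_redirect = 1)
) INHERITS (page);

CREATE TABLE page_articles (
    CHECK (page_is_redirect = 0)
) INHERITS (page);

-- With constraint exclusion on, the planner can skip the
-- redirects partition entirely for this query:
SET constraint_exclusion = on;
SELECT page_title FROM page WHERE page_is_redirect = 0;
```

Even then, partitioning only spares you scanning the redirect rows;
the query still has to return millions of article titles, which is
where most of the time goes.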
It's also worth pointing out that Wikipedia uses a
version of MySQL
with substantial modifications, and Wikimedia sysadmins are very
familiar with its behavior. Switching to a new technology might
theoretically be better in the long term (although I wouldn't take
that for granted in this case), but the transition cost would be
substantial. Heck, Wikipedia hasn't even upgraded to MySQL 4.1, let
alone a whole different DBMS.
Yes, it can be very hard to switch your DBMS, and that's a very good
thing for MySQL. :)