On Sat, Nov 21, 2009 at 6:39 PM, Aryeh Gregor <Simetrical+wikilist@gmail.com> wrote:
Selecting a list of all titles that are not redirects will take a long time on any database, unless you have everything in memory, because it requires a table scan -- there's no index that covers the relevant columns (IIRC).
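For concreteness, the query in question would look something like this (column names taken from the standard MediaWiki page table; adjust to your import):

    -- List every non-redirect title. This forces a scan of the whole
    -- table, since no index covers (page_is_redirect, page_title).
    SELECT page_namespace, page_title
    FROM page
    WHERE page_is_redirect = 0;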
You could build an index on page_is_redirect in the "page" table (see enwiki-*-page.sql.gz). But I'm pretty sure PostgreSQL wouldn't use it, and would do a sequential scan, since pretty much all the pages are going to have to be accessed anyway.
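If you want to verify that yourself, the index is a one-liner, and EXPLAIN will show whether the planner actually touches it (my guess is it won't; the index name here is made up):

    -- Index on the redirect flag. The planner will likely ignore it,
    -- because nearly all rows match page_is_redirect = 0, so a
    -- sequential scan is cheaper than index lookups.
    CREATE INDEX page_is_redirect_idx ON page (page_is_redirect);

    -- Show the chosen plan without running the full query:
    EXPLAIN SELECT page_namespace, page_title
    FROM page
    WHERE page_is_redirect = 0;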
Five or ten minutes sounds about right. I can't imagine this is a query you want to run over and over again. If it is, you'd probably want to use partitioning (http://www.postgresql.org/docs/8.4/interactive/ddl-partitioning.html), but you're still not going to cut down the query time very much, as it's going to be returning millions of rows.
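For reference, in 8.4 that means inheritance-based partitioning. A minimal sketch, splitting on the redirect flag (the child table names are invented for illustration):

    -- Child tables inherit the parent's columns; the CHECK constraints
    -- let the planner skip the irrelevant partition when
    -- constraint_exclusion is enabled.
    CREATE TABLE page_articles (
        CHECK (page_is_redirect = 0)
    ) INHERITS (page);

    CREATE TABLE page_redirects (
        CHECK (page_is_redirect = 1)
    ) INHERITS (page);

    SET constraint_exclusion = on;

    -- Now this scans only page_articles -- but it still has to
    -- return millions of rows, so don't expect a big win.
    SELECT page_namespace, page_title
    FROM page
    WHERE page_is_redirect = 0;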
It's also worth pointing out that Wikipedia uses a version of MySQL with substantial modifications, and Wikimedia sysadmins are very familiar with its behavior. Switching to a new technology might theoretically be better in the long term (although I wouldn't take that for granted in this case), but the transition cost would be substantial. Heck, Wikipedia hasn't even upgraded to MySQL 4.1, let alone a whole different DBMS.
Yes, it can be very hard to switch your DBMS, and that's a very good thing for MySQL. :)