New subject: 1 million redirects missing from redirect.sql file?

24 Oct 2007


      My apologies on that last blank email.
I will add to my explanation that yesterday I was able to extract over 2
million legitimate redirects.  By legitimate I mean that both the source and
the target are pages that exist in the page.sql file.  I obtained these by
extracting them from the large pages-articles.xml file.
This figure of 2 million is consistent with the 1 million which exist in
the redirect.sql file plus the 1 million more which I see listed in the
page.sql file (pages flagged as redirects).  So my question stands: why are
these missing from the redirect.sql file?
Is this the right place to ask this question, or is there a more direct
contact I should make or bug I should file?
thanks!!
John
Date: Tue, 23 Oct 2007 10:43:58 -0500
From: "John Lehmann" john.lehmann@gmail.com
Subject: [Wikitech-l] 1 million redirects missing from redirect.sql
       file?
To: wikitech-l@lists.wikimedia.org
Message-ID:
       91ba8f10710230843n3407bf4vefedc42269987ecb@mail.gmail.com
Content-Type: text/plain; charset=ISO-8859-1
It looks to me like there are a large number of redirects (as many as 1
million) missing from the redirect.sql file.
My script extracts redirects from the redirect.sql file and looks up the page
IDs using the page.sql file.  Most of these pages can be resolved (about 1
million).  However, when I scan the page.sql file for pages which are flagged
as redirects but which never matched any entry in the redirect.sql file,
there are about 1 million more.
Here are some examples (the ones on the left are missing from redirect.sql).
These were derived from the 20070908 dump, but I believe the problem is not
limited to this date:
 Alstrom's syndrome -> Alstrom syndrome
 Tito's Handmade Vodka -> Tito's Vodka
 Titov_Drvar -> Drvar
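
The comparison described above can be sketched roughly as follows.  This is
only an illustration, not the actual script: the tuples stand in for rows
already parsed out of page.sql and redirect.sql, the column names
(page_is_redirect, rd_from, and so on) are assumed from the MediaWiki schema,
the page IDs are made up, and the "Hypothetical_*" rows are invented so that
one redirect is present in both files.

```python
# Toy rows standing in for parsed dump data; IDs are invented.
# (page_id, page_namespace, page_title, page_is_redirect)
pages = [
    (1, 0, "Alstrom's_syndrome", 1),
    (2, 0, "Alstrom_syndrome", 0),
    (3, 0, "Titov_Drvar", 1),
    (4, 0, "Drvar", 0),
    (5, 0, "Hypothetical_redirect", 1),
    (6, 0, "Hypothetical_target", 0),
]

# (rd_from, rd_namespace, rd_title)
redirects = [
    (5, 0, "Hypothetical_target"),
]


def missing_redirects(pages, redirects):
    """Titles of pages flagged as redirects that have no redirect row."""
    # Page IDs that appear as the source of some redirect row.
    resolved = {rd_from for rd_from, _ns, _title in redirects}
    return [
        title
        for page_id, _ns, title, is_redirect in pages
        if is_redirect and page_id not in resolved
    ]


print(missing_redirects(pages, redirects))
```

On the real dumps the same set difference is what yields the roughly 1
million unmatched pages.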
Another experiment which seems to confirm this is that I can extract 2.4
million redirects from the pages-articles.xml file, which is approximately
the number of redirects I get from redirect.sql plus the number which seem
to be missing according to page.sql.
Am I misunderstanding something?
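
For anyone wanting to repeat the XML count, here is a minimal sketch of how a
redirect target can be pulled out of a page's text.  It assumes the usual
"#REDIRECT [[Target]]" wikitext form; the function name and the regular
expression are illustrative, and the real parser accepts more variants than
this pattern does.

```python
import re

# Matches "#REDIRECT [[Target]]" at the start of the page text, ignoring
# case, and stops the title at "]", "|" or "#" (section anchor).
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*:?\s*\[\[([^\]|#]+)", re.IGNORECASE)


def redirect_target(wikitext):
    """Return the redirect target title, or None if the text is not a redirect."""
    match = REDIRECT_RE.match(wikitext)
    return match.group(1).strip() if match else None


for text in ("#REDIRECT [[Drvar]]", "Drvar is a town."):
    print(repr(text), "->", redirect_target(text))
```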
A related question is why the redirect.sql file stores the destination link
as a string and not as a page ID.  The categorylinks.sql file does this also.
Is this just for readability?  It takes more effort to construct linked
databases this way.
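
To show the extra step involved, here is a rough sketch of turning the string
targets back into page IDs.  It assumes a title is unique within a namespace,
so (namespace, title) can serve as a lookup key; the rows and IDs are made up
for illustration.

```python
# (page_id, page_namespace, page_title) -- invented sample rows
pages = [
    (1, 0, "Titov_Drvar"),
    (2, 0, "Drvar"),
    (3, 0, "Hypothetical_redirect"),
]

# (rd_from, rd_namespace, rd_title) -- invented sample rows
redirects = [
    (1, 0, "Drvar"),
    (3, 0, "Hypothetical_page_not_in_dump"),
]


def resolve_targets(pages, redirects):
    """Map each redirect's source ID to its target's page ID (None if absent)."""
    ids = {(ns, title): page_id for page_id, ns, title in pages}
    return [
        (rd_from, ids.get((rd_ns, rd_title)))
        for rd_from, rd_ns, rd_title in redirects
    ]


print(resolve_targets(pages, redirects))
```

A target that does not exist as a page comes back as None, which a string
column can represent but a page ID column could not.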
I hope I have posted this in the right place.
thanks!!
John

Re: [Wikitech-l] 1 million redirects missing from redirect.sql file?