It looks to me like a large number of redirects (as many as 1 million) are missing from the redirect.sql file.
My script extracts redirects from the redirect.sql file and looks up their page ids using the page.sql file. Most of these pages (about 1 million) can be resolved. However, when I scan the page.sql file for page names that are redirects but were never resolved to any relation in the redirect.sql file, there are about 1 million more.
Here are some examples (the ones on the left are missing from redirect.sql), which were derived from 20070908, but I believe the problem is not limited to this date:

Alstrom's syndrome -> Alstrom syndrome
Tito's Handmade Vodka -> Tito's Vodka
Titov_Drvar -> Drvar
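For anyone who wants to reproduce the check, here is a minimal sketch of the comparison described above. It assumes the two dumps have already been parsed into tuples (the parsing itself is left out), and that the relevant columns are the page id, namespace, title, and redirect flag from page.sql and the source page id, target namespace, and target title from redirect.sql. The function name and the sample ids are made up for illustration.

```python
def find_unresolved_redirects(page_rows, redirect_rows):
    """Return pages flagged as redirects that have no row in the redirect table.

    page_rows:     iterable of (page_id, namespace, title, is_redirect)
    redirect_rows: iterable of (from_page_id, target_namespace, target_title)
    Both shapes are assumptions about how the dumps were parsed.
    """
    resolved_ids = {from_id for from_id, _ns, _title in redirect_rows}
    return [
        (page_id, namespace, title)
        for page_id, namespace, title, is_redirect in page_rows
        if is_redirect and page_id not in resolved_ids
    ]


# Illustrative data only: the ids and the "Hypothetical_*" titles are invented.
pages = [
    (1, 0, "Alstrom's_syndrome", 1),
    (2, 0, "Alstrom_syndrome", 0),
    (3, 0, "Hypothetical_redirect", 1),
    (4, 0, "Hypothetical_target", 0),
]
redirects = [(3, 0, "Hypothetical_target")]

missing = find_unresolved_redirects(pages, redirects)
print(missing)
```

Counting the returned list over the full dump should give the number of redirect pages that redirect.sql does not cover.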
Another experiment seems to confirm this: I can extract 2.4 million redirects from the page-articles.xml file, which is approximately the number of redirects I get from redirect.sql plus the number that seem to be missing according to page.sql.
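The XML cross-check can be sketched roughly like this, assuming a redirect page is recognised by its wikitext starting with `#REDIRECT [[Target]]`. The regular expression and the underscore normalisation (to match titles as they appear in page.sql) are simplifying assumptions, not a full reimplementation of MediaWiki's redirect parsing.

```python
import re

# Assumption: a redirect's wikitext begins with "#REDIRECT [[Target]]",
# case-insensitively, optionally with a "#section" or "|label" after the title.
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*:?\s*\[\[([^\]|#]+)", re.IGNORECASE)


def redirect_target(wikitext):
    """Return the redirect target title, or None if the text is not a redirect."""
    match = REDIRECT_RE.match(wikitext)
    if match is None:
        return None
    # Normalise spaces to underscores so titles compare equal to page.sql titles.
    return match.group(1).strip().replace(" ", "_")


print(redirect_target("#REDIRECT [[Alstrom syndrome]]"))
print(redirect_target("An ordinary article."))
```

Applying this to the text of every page in the XML dump and counting the non-None results gives a redirect total to compare against the two SQL files.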
Am I misunderstanding something?
A related question: why does the redirect.sql file store the destination link as a string and not as a page id? The category-links.sql file does this too. Is this just for readability? It takes more effort to construct linked databases this way.
I hope I have posted this in the right place. Thanks!!
John
On 10/23/07, John Lehmann john.lehmann@gmail.com wrote:
> A related question: why does the redirect.sql file store the destination link as a string and not as a page id? The category-links.sql file does this too. Is this just for readability? It takes more effort to construct linked databases this way.
Categories and redirects can link to non-existent pages, so there is not always a page id to store.
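To illustrate, here is a small sketch of resolving target titles to page ids, where a target without a page simply has no id. The row shapes, the function name, and the sample data are assumptions made up for the example.

```python
def resolve_targets(redirect_rows, page_rows):
    """Map each redirect's source page id to its target page id, or None.

    redirect_rows: iterable of (from_page_id, target_namespace, target_title)
    page_rows:     iterable of (page_id, namespace, title)
    """
    title_to_id = {(ns, title): page_id for page_id, ns, title in page_rows}
    return {
        from_id: title_to_id.get((target_ns, target_title))
        for from_id, target_ns, target_title in redirect_rows
    }


# Invented data: page 11 redirects to a title that has no page.
pages = [(10, 0, "Drvar"), (11, 0, "Hypothetical_redirect"), (12, 0, "Titov_Drvar")]
redirects = [(12, 0, "Drvar"), (11, 0, "Page_that_does_not_exist")]

resolved = resolve_targets(redirects, pages)
print(resolved)
```

A table keyed on target page ids would have nothing to hold for the second redirect, whereas a title string can always be stored.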
Bryan
wikitech-l@lists.wikimedia.org