My apologies for that last blank email.
I will add to my explanation that yesterday I was able to extract over 2 million legitimate redirects. By legitimate I mean that both the source and the target are pages that exist in the page.sql file. I obtained these by extracting them from the large pages-articles.xml file.
This 2 million number is very consistent with the 1 million which exist in the redirect.sql file plus the 1 million more which I see listed in the page.sql file (pages flagged as redirects). So my question stands as to why these are missing from the redirect.sql file.
Is this the right place to ask this question, or is there a more direct contact I should make or bug I should file?
thanks!! John
Date: Tue, 23 Oct 2007 10:43:58 -0500
From: "John Lehmann" john.lehmann@gmail.com
Subject: [Wikitech-l] 1 million redirects missing from redirect.sql file?
To: wikitech-l@lists.wikimedia.org
Message-ID: 91ba8f10710230843n3407bf4vefedc42269987ecb@mail.gmail.com
Content-Type: text/plain; charset=ISO-8859-1
It looks to me like a large number of redirects (as many as 1 million) are missing from the redirect.sql file.
My script extracts redirects from the redirect.sql file and looks up the page IDs using the page.sql file. Most of these redirects can be resolved (about 1 million). However, when I scan the page.sql file for pages that are flagged as redirects but never matched any row in the redirect.sql file, there are about 1 million more.
Here are some examples (the ones on the left are missing from redirect.sql), which were derived from the 20070908 dump, but I believe the problem is not limited to this date:
Alstrom's syndrome -> Alstrom syndrome
Tito's Handmade Vodka -> Tito's Vodka
Titov_Drvar -> Drvar
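For anyone trying to reproduce the comparison described above, here is a minimal sketch of the check. It assumes the rows of page.sql and redirect.sql have already been parsed into tuples; the `Page` and `Redirect` types, the function name, and the page IDs in the sample data are hypothetical and only illustrate the idea, not the actual script.

```python
from typing import Iterable, List, NamedTuple


class Page(NamedTuple):
    """Hypothetical parsed row of page.sql."""
    page_id: int
    namespace: int
    title: str
    is_redirect: bool


class Redirect(NamedTuple):
    """Hypothetical parsed row of redirect.sql (target stored as a title string)."""
    from_id: int
    namespace: int
    title: str


def missing_redirects(pages: Iterable[Page],
                      redirects: Iterable[Redirect]) -> List[Page]:
    """Return pages flagged as redirects that have no row in the redirect table."""
    resolved_ids = {r.from_id for r in redirects}
    return [p for p in pages if p.is_redirect and p.page_id not in resolved_ids]


# Sample data: titles come from the examples above, the IDs are made up.
pages = [
    Page(1, 0, "Alstrom's_syndrome", True),
    Page(2, 0, "Alstrom_syndrome", False),
    Page(3, 0, "Titov_Drvar", True),
    Page(4, 0, "Drvar", False),
]
redirects = [Redirect(3, 0, "Drvar")]

missing = missing_redirects(pages, redirects)
```

With this sample data, `missing` contains only the page whose redirect flag is set but which has no matching row in the redirect table.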
Another experiment that seems to confirm this: I can extract 2.4 million redirects from the pages-articles.xml file, which is approximately the number of redirects I get from redirect.sql plus the number that seem to be missing according to page.sql.
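As an illustration of that second experiment, a redirect can be recognized in the XML dump by looking at the start of each page's wikitext. The sketch below assumes the page text has already been pulled out of the XML, and it only handles the English `#REDIRECT [[Target]]` form; the function name and the regular expression are my own assumptions, not the script actually used.

```python
import re
from typing import Optional

# Matches the English redirect keyword at the start of the page text,
# capturing the target up to any "]", "|" or "#" (section anchor).
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*:?\s*\[\[\s*([^\]\|#]+)", re.IGNORECASE)


def parse_redirect_target(wikitext: str) -> Optional[str]:
    """Return the redirect target title, or None if the text is not a redirect."""
    match = REDIRECT_RE.match(wikitext)
    if match is None:
        return None
    return match.group(1).strip()


target_a = parse_redirect_target("#REDIRECT [[Alstrom syndrome]]")
target_b = parse_redirect_target("#redirect [[Drvar]]")
target_c = parse_redirect_target("Drvar is a town.")
```

Counting the pages for which this returns a target gives a redirect total that can be compared with the row counts from redirect.sql and page.sql.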
Am I misunderstanding something?
A related question: why does the redirect.sql file store the destination link as a string and not as a page ID? The category-links.sql file does this also. Is this just for readability? It takes more effort to construct linked databases this way.
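To show the extra effort involved, here is a minimal sketch of turning the string destinations into page IDs. It assumes rows have been parsed into simple tuples and that a title is unique within its namespace; the types, the function name, and the IDs are hypothetical.

```python
from typing import Dict, Iterable, List, NamedTuple, Optional, Tuple


class Page(NamedTuple):
    """Hypothetical parsed row of page.sql."""
    page_id: int
    namespace: int
    title: str


class Redirect(NamedTuple):
    """Hypothetical parsed row of redirect.sql."""
    from_id: int
    namespace: int
    title: str


def resolve_targets(pages: Iterable[Page],
                    redirects: Iterable[Redirect]) -> List[Tuple[int, Optional[int]]]:
    """Map each redirect to (source page ID, target page ID or None if not found)."""
    # Index pages by (namespace, title) so each target string is one lookup.
    by_title: Dict[Tuple[int, str], int] = {
        (p.namespace, p.title): p.page_id for p in pages
    }
    return [(r.from_id, by_title.get((r.namespace, r.title))) for r in redirects]


# Made-up IDs; titles use underscores, as in the examples above.
pages = [Page(3, 0, "Titov_Drvar"), Page(4, 0, "Drvar")]
redirects = [Redirect(3, 0, "Drvar"), Redirect(5, 0, "No_such_page")]

links = resolve_targets(pages, redirects)
```

A target that resolves to `None` points at a title with no page row, which a string-based table can represent but an ID-based one could not.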
I hope I have posted this in the right place. thanks!! John
Usually, Bugzilla is the place to report these things. There is already an open bug report for this; hopefully it's on someone's todo list :) http://bugzilla.wikimedia.org/show_bug.cgi?id=10931
r.
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
http://lists.wikimedia.org/mailman/listinfo/wikitech-l