Hi, I am wondering if someone could help me create a list of Page Titles, and list of Redirect Pages from the wikidb database in MySQL. I have downloaded the XML/SQL dumps along with the SQL dumps of the remaining Tables and imported them into the wikidb database in MySQL. (This is the English version of Wikipedia.) I am looking at the Mediawiki Database Schema at http://upload.wikimedia.org/wikipedia/commons/4/41/Mediawiki-database-schema... . I also know that there is a list of Titles in NS0 provided along with the dumps. This does not suit my purpose, because I would like to have all the Titles i.e. including those that are out of NS0 too. I would also like to have the list of redirects and where they redirect to.
The Mediawiki Database Schema mentioned above, does provide a lot of information – but I am not experienced enough to make use of it. I hope some of you can help me.
For the list of Page Titles, I looked into the Page Table, and simply got a list of all of the Titles. In SQL I assumed this would be the result of the Query: “select page_title from wikidb.page;” – The problem with this is that I am getting a number of Titles that are repeated?? So I think I am doing something wrong?
As far as getting the List of Redirects, I hope someone would clarify if the way to do this would be to go through the Redirect Table, for each rd_from – look in the Page Table for the matching Page ID, and then get the Title from that row. This can be done using a Join – but I have not implemented this for now. Is this the correct way to go?
Thanks again to all you guys, O. O.
Hi!
2009/3/18 O. O. olson_ot@yahoo.com:
For the list of Page Titles, I looked into the Page Table, and simply got a list of all of the Titles. In SQL I assumed this would be the result of the Query: “select page_title from wikidb.page;” – The problem with this is that I am getting a number of Titles that are repeated?? So I think I am doing something wrong?
page_title does not contains the full title, only its namespace-relative part. You need to use select page_namespace, page_title from wikidb.page Only this whole tuple (page_namespace, page_title) is a unique identifier of a page (this is true for the whole MediaWiki).
HTH, -- [[cs:User:Mormegil | Petr Kadlec]]
On Wed, Mar 18, 2009 at 6:18 AM, Petr Kadlec petr.kadlec@gmail.com wrote:
page_title does not contains the full title, only its namespace-relative part. You need to use select page_namespace, page_title from wikidb.page Only this whole tuple (page_namespace, page_title) is a unique identifier of a page (this is true for the whole MediaWiki).
And note that the namespace is stored as a number. You'll need to refer to a list of the namespace numbers on the specific wiki you're dealing with to translate it into the appropriate prefix. There's a way to get the list from the API, but I don't know it offhand.
Aryeh Gregor schreef:
On Wed, Mar 18, 2009 at 6:18 AM, Petr Kadlec petr.kadlec@gmail.com wrote:
page_title does not contains the full title, only its namespace-relative part. You need to use select page_namespace, page_title from wikidb.page Only this whole tuple (page_namespace, page_title) is a unique identifier of a page (this is true for the whole MediaWiki).
And note that the namespace is stored as a number. You'll need to refer to a list of the namespace numbers on the specific wiki you're dealing with to translate it into the appropriate prefix. There's a way to get the list from the API, but I don't know it offhand.
http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=...
Note that namespaces with an ID of 100 or higher are specific to enwiki and may have different names or not be used at all on other wikis. To get an accurate list for another wiki, ask that wiki's api.php .
As for redirects: yes, you'll want to do something like:
SELECT page_namespace, page_title, rd_namespace, rd_title FROM page LEFT JOIN redirect ON rd_from=page_id;
This'll list all page titles and their redirect targets, with rd_namespace and rd_title set to NULL for pages that aren't redirects. Note that the redirect table doesn't handle section redirects (like redirects to [[Foo#Bar]], which are stored as redirects to [[Foo]]) and interwiki redirects (like redirects to [[wikt:dog]], which are stored as redirects to [[dog]]) too well and that some redirects may be missing from it entirely (IIRC about half a million redirects are missing from enwiki's redirect table). Even worse, the data dump you downloaded might not even contain the redirect table. You can rebuild the redirect table with:
php maintenance/refreshLinks.php --redirects-only
(Use --old-redirects-only to only add missing entries rather than checking existing entries for validity as well.)
Roan Kattouw (Catrope)
On Wed, Mar 18, 2009 at 11:06 AM, Roan Kattouw roan.kattouw@home.nl wrote:
http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=...
Note that namespaces with an ID of 100 or higher are specific to enwiki and may have different names or not be used at all on other wikis. To get an accurate list for another wiki, ask that wiki's api.php .
The same is pretty much true for all namespaces. There's no guarantee that any namespaces other than main will have the same names on other wikis. To ensure that, you need to use the canonical name if one exists (it's helpfully provided in the API result . . . actually, what does it mean that "Portal" and "Portal talk" are canonical? shouldn't there be no canonical attribute if the namespace is custom?).
In particular, "Wikipedia" and "Wikipedia talk" will likely not work on most other wikis.
On Wed, Mar 18, 2009 at 11:54 AM, O. O. olson_ot@yahoo.com wrote:
Thanks Petr and Aryeh for getting back. From the Documentation at http://www.mediawiki.org/wiki/Page_table and http://meta.wikimedia.org/wiki/Help:Namespace you can get the names of the Real and Virtual Namespeaces in includes/Defines.php and then get what text they convert to in English using languages/messages/MessagesEn.php.
This is fine, but where can I find information on custom namespaces i.e. those that lie above 100.
Use Roan's link:
http://en.wikibooks.org/w/api.php?action=query&meta=siteinfo&siprop=...
You might prefer this to Defines.php/MessagesEn.php. Those will give you the canonical names, which will always work, but which might not be the ones used on Wikipedia. For instance, namespace 4 is canonically "Project", but on Wikipedia the normal name for it is "Wikipedia". "Project" URLs will work on Wikipedia, but automatically redirect to "Wikipedia". E.g.,
Aryeh Gregor wrote:
On Wed, Mar 18, 2009 at 11:06 AM, Roan Kattouw roan.kattouw@home.nl wrote:
http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=...
Note that namespaces with an ID of 100 or higher are specific to enwiki and may have different names or not be used at all on other wikis. To get an accurate list for another wiki, ask that wiki's api.php .
The same is pretty much true for all namespaces. There's no guarantee that any namespaces other than main will have the same names on other wikis. To ensure that, you need to use the canonical name if one exists (it's helpfully provided in the API result . . . actually, what does it mean that "Portal" and "Portal talk" are canonical? shouldn't there be no canonical attribute if the namespace is custom?).
Agree. Portal and Portal talk could still be acceptable, since the namespace ids 100-101 are more or less reserved for portals across the wikis. What is scaryier is seeing <ns id="102" canonical="Cookbook"> on enwikibooks whereas the same ns 102 mean Wikiproject on some pedias.
Since the API provides namespacealiases linked to the id, not to the "informal canonical name" I see no reason to keep the canonical parameter on the extra ns.
Platonides schreef:
(it's helpfully provided in the API result . . . actually, what does it mean that "Portal" and "Portal talk" are canonical? shouldn't there be no canonical attribute if the namespace is custom?).
Agree. Portal and Portal talk could still be acceptable, since the namespace ids 100-101 are more or less reserved for portals across the wikis. What is scaryier is seeing <ns id="102" canonical="Cookbook"> on enwikibooks whereas the same ns 102 mean Wikiproject on some pedias.
Since the API provides namespacealiases linked to the id, not to the "informal canonical name" I see no reason to keep the canonical parameter on the extra ns.
This was brought up before in bug 16672 comment #5. My reply was:
b) custom namespaces shouldn't have a canonical name
Maybe, maybe not; I see arguments for and against. But since $wgCanonicalNames contains canonical names for custom namespaces too and since removing the canonical attribute for some namespaces but not others would violate expectations and be a breaking change, I'll just keep stuff the way it is. Regardless of whether custom namespaces should or shouldn't have a canonical name, removing it from the API output isn't worth the trouble.
Roan Kattouw (Catrope)
Thanks Roan.
Roan Kattouw wrote:
Aryeh Gregor schreef: http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=...
Note that namespaces with an ID of 100 or higher are specific to enwiki and may have different names or not be used at all on other wikis. To get an accurate list for another wiki, ask that wiki's api.php .
Your above link helped me a lot. I am only concerned about the English Version of the Wikipedia – so I think that is all I need. I was afraid that there might have been other values of “namespace” but the result of
SELECT page_namespace, page_title FROM wikidb.page WHERE (page_namespace < -2 OR page_namespace >15) AND page_namespace != 100 AND page_namespace != 101;
confirms that there are no custom namespaces other than those listed in your Link. So I am good here.
As for redirects: yes, you'll want to do something like:
SELECT page_namespace, page_title, rd_namespace, rd_title FROM page LEFT JOIN redirect ON rd_from=page_id;
Actually you simply need a JOIN here i.e. Inner Join (that would eliminate the NULLs) something like
SELECT page_namespace, page_title, rd_namespace, rd_title FROM wikidb.page JOIN wikidb.redirect ON rd_from=page_id;
(NOTE: I did not rebuild the Redirect Table, but I simply imported it as it is provided in SQL format with the other Wiki Dumps.)
My problem is the following:
I cannot understand why there are more redirects in the redirect table than there are in the pages in the Page table with the redirect flag set.
select count(*) from wikidb.redirect ; --- Gives 3422938
while
select count(*) from wikidb.page where page_is_redirect=1; --- Gives 3242340
Why should there be this difference? Are there redirects that are not listed in the Page Table?
Also SELECT count(*) FROM wikidb.page JOIN wikidb.redirect ON rd_from=page_id; --- Gives 3210708
Any ideas why the difference? Thanks again, O. O.
On Fri, Mar 20, 2009 at 1:08 PM, O. O. olson_ot@yahoo.com wrote:
I cannot understand why there are more redirects in the redirect table than there are in the pages in the Page table with the redirect flag set.
select count(*) from wikidb.redirect ; --- Gives 3422938
while
select count(*) from wikidb.page where page_is_redirect=1; --- Gives 3242340
Why should there be this difference? Are there redirects that are not listed in the Page Table?
It's denormalized data, so I'd assume that those are just errors of some kind. If you need to be really really sure whether something is a redirect, you have to parse its text (it's just a regex, should be somewhere in includes/Article.php or something). That's the authoritative check, which IIRC is used when actually deciding whether to redirect.
Thanks Aryeh.
Aryeh Gregor wrote:
On Fri, Mar 20, 2009 at 1:08 PM, O. O. olson_ot@yahoo.com wrote:
I cannot understand why there are more redirects in the redirect table than there are in the pages in the Page table with the redirect flag set.
select count(*) from wikidb.redirect ; --- Gives 3422938
while
select count(*) from wikidb.page where page_is_redirect=1; --- Gives 3242340
Why should there be this difference? Are there redirects that are not listed in the Page Table?
It's denormalized data, so I'd assume that those are just errors of some kind. If you need to be really really sure whether something is a redirect, you have to parse its text (it's just a regex, should be somewhere in includes/Article.php or something). That's the authoritative check, which IIRC is used when actually deciding whether to redirect.
As, SELECT count(*) FROM wikidb.page JOIN wikidb.redirect ON rd_from=page_id; --- Gives 3210708
Which is less than the total rows in the Redirect Table i.e.
select count(*) from wikidb.redirect ; --- Gives 3422938 which is more than 3210708
I think this might mean that there are entries in the Redirect Table where the rd_from does not correspond to a valid page_id in the Page Table? I don’t know what these would be used for though.
Thanks again, O. O.
On Fri, Mar 20, 2009 at 1:26 PM, O. O. olson_ot@yahoo.com wrote:
I think this might mean that there are entries in the Redirect Table where the rd_from does not correspond to a valid page_id in the Page Table? I don’t know what these would be used for though.
They're probably just wrong. The page may have been deleted and the redirect entry not, for some reason. This sort of thing happens with denormalized data. *In theory* everything is in transactions, so it shouldn't happen, except when they aren't and so it does, or they are and it still does somehow.
Aryeh Gregor wrote:
On Fri, Mar 20, 2009 at 1:26 PM, O. O. olson_ot@yahoo.com wrote:
I think this might mean that there are entries in the Redirect Table where the rd_from does not correspond to a valid page_id in the Page Table? I don’t know what these would be used for though.
They're probably just wrong. The page may have been deleted and the redirect entry not, for some reason. This sort of thing happens with denormalized data. *In theory* everything is in transactions, so it shouldn't happen, except when they aren't and so it does, or they are and it still does somehow.
Thanks Aryeh for the explanations. Now it makes sense i.e. I should take these numbers with a pinch of salt, because there might be errors in the Wikipedia Database itself :) .
O. O.
Aryeh Gregor wrote:
On Wed, Mar 18, 2009 at 6:18 AM, Petr Kadlec petr.kadlec@gmail.com wrote:
page_title does not contains the full title, only its namespace-relative part. You need to use select page_namespace, page_title from wikidb.page Only this whole tuple (page_namespace, page_title) is a unique identifier of a page (this is true for the whole MediaWiki).
And note that the namespace is stored as a number. You'll need to refer to a list of the namespace numbers on the specific wiki you're dealing with to translate it into the appropriate prefix. There's a way to get the list from the API, but I don't know it offhand.
Thanks Petr and Aryeh for getting back. From the Documentation at http://www.mediawiki.org/wiki/Page_table and http://meta.wikimedia.org/wiki/Help:Namespace you can get the names of the Real and Virtual Namespeaces in includes/Defines.php and then get what text they convert to in English using languages/messages/MessagesEn.php.
This is fine, but where can I find information on custom namespaces i.e. those that lie above 100.
Thanks, O. O.
2009/3/18 O. O. olson_ot@yahoo.com:
This is fine, but where can I find information on custom namespaces i.e. those that lie above 100.
In $wgExtraNamespaces (see http://www.mediawiki.org/wiki/Manual:Using_custom_namespaces)
-- [[cs:User:Mormegil | Petr Kadlec]]
wikitech-l@lists.wikimedia.org