To keep one's website's links fresh, one uses a linkchecker to detect broken links. But how is a linkchecker to check whether a Wikipedia article exists, given that
$ HEAD -H User-Agent: http://*.wikipedia.org/wiki/* | sed q
returns
200 OK
for any article, existent or non-existent, and even for the 'our servers are experiencing technical problems' message? (-H avoids a 403 Forbidden.)
Should I make a list of the Wikipedia URLs I want to check and send each to an API URL for it? Will that API URL return a "more correct" HTTP code?
Or must I do something like
$ GET $URL | grep 'wiki.* does not currently have an article called .*, maybe it was deleted' && echo $URL Broken
jidanni@jidanni.org wrote:
> To keep one's website's links fresh, one uses a linkchecker to detect broken links. But how is a linkchecker to check whether a Wikipedia article exists, given that
> $ HEAD -H User-Agent: http://*.wikipedia.org/wiki/* | sed q
> returns
> 200 OK
> for any article, existent or non-existent,
This is bug 2585:
https://bugzilla.wikimedia.org/show_bug.cgi?id=2585
As noted there, we ran into problems when originally implementing it (which may or may not have been legitimate, and may or may not still apply), and we haven't got round to reimplementing it.
-- brion vibber (brion @ wikimedia.org)
jidanni@jidanni.org wrote:
> To keep one's website's links fresh, one uses a linkchecker to detect broken links. But how is a linkchecker to check whether a Wikipedia article exists, given that
> $ HEAD -H User-Agent: http://*.wikipedia.org/wiki/* | sed q
> returns
> 200 OK
> for any article, existent or non-existent, and even for the 'our servers are experiencing technical problems' message? (-H avoids a 403 Forbidden.)
> Should I make a list of the Wikipedia URLs I want to check and send each to an API URL for it? Will that API URL return a "more correct" HTTP code?
> Or must I do something like
> $ GET $URL | grep 'wiki.* does not currently have an article called .*, maybe it was deleted' && echo $URL Broken
First, save up a list of articles you wanna check. When you've got a couple hundred of them (or have run out of articles to check), issue an API request like:
http://en.wikipedia.org/w/api.php?action=query&titles=Dog%7CWP:WAX%7CJid...
It returns some basic data (namespace and existence) for every article. For production use, you probably want pure XML, so use &format=xml. Alternatively, you can use format=json or format=php.
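For example, something along these lines should work (an untested sketch, assuming curl and a file titles.txt with one article title per line; as far as I know the XML output marks non-existent titles with a missing attribute, and the API caps how many titles you may pass per request, hence the batch of 50):

# take the first 50 titles and join them with pipes
titles=$(sed -n '1,50p' titles.txt | paste -sd'|' -)
# ask the API about all of them in one GET request, then pick out
# the <page> elements the API flags as missing
curl -sG 'http://en.wikipedia.org/w/api.php' \
     --data 'action=query&format=xml' \
     --data-urlencode "titles=$titles" \
  | grep -o '<page [^>]*missing=""[^>]*>'

Each matched element is a title that does not exist; format=json is just as easy to post-process if you'd rather not grep XML.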
Roan Kattouw (Catrope)
jidanni wrote:
> To keep one's website's links fresh, one uses a linkchecker to detect broken links. But how is a linkchecker to check whether a Wikipedia article exists, given that
> $ HEAD -H User-Agent: http://*.wikipedia.org/wiki/* | sed q
> returns
> 200 OK
> for any article, existent or non-existent, and even for the 'our servers are experiencing technical problems' message? (-H avoids a 403 Forbidden.)
> Should I make a list of the Wikipedia URLs I want to check and send each to an API URL for it? Will that API URL return a "more correct" HTTP code?
> Or must I do something like
> $ GET $URL | grep 'wiki.* does not currently have an article called .*, maybe it was deleted' && echo $URL Broken
Check the database: all existing pages appear in the 'page' table.
The links are stored in the pagelinks, templatelinks, imagelinks and categorylinks tables (although a non-existent category still functions).
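For example (a rough sketch, assuming you have a toolserver/replica connection to the enwiki database; host, user and the 'Dog' title are placeholders): an article exists if and only if it has a row in the page table for its namespace number, with the title stored using underscores instead of spaces.

# empty result set = no such article; redirects do have page rows, so they count as existing
$ mysql -h "$DB_HOST" -u "$DB_USER" -p enwiki \
    -e "SELECT page_id FROM page WHERE page_namespace=0 AND page_title='Dog';"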