On 2013-09-16 8:01 PM, Gabriel Wicke wrote:
> On 09/16/2013 07:24 PM, Daniel Friesen wrote:
>> On 2013-09-16 7:09 PM, Gabriel Wicke wrote:
>>> Any of the entry points? Any new entry point? Anything we ever want
>>> to put into the root? We should be able to avoid most conflicts by
>>> picking prefixed entry points. However, as we can't drop the clashing
>>> /w/api.php any time soon, I have removed the /wiki/ part from the RFC:
>>> https://www.mediawiki.org/wiki/Requests_for_comment/Clean_up_URLs
>>> So now only the conversion from
>>> /w/index.php?title=foo&action=history to /foo?action=history
>>> is under discussion.
>>> Gabriel
>> Has the practice of disallowing /w/ or /index.php inside robots.txt
>> to force search engines to completely ignore search, edit pages,
>> exponential pagination, etc., been considered?
> See https://www.mediawiki.org/wiki/Requests_for_comment/Clean_up_URLs#Migration

Ok. Though even assuming the non-standard * wildcard and Allow: directive are supported by all the bots we want to target, I don't like the idea of blacklisting /wiki/*? in this way.
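
For reference, the rules under discussion would look roughly like this (a sketch only; the exact paths depend on how the migration shakes out, and both Allow: and the mid-path * wildcard are extensions that not every crawler honours):

    User-agent: *
    Allow: /wiki/
    Disallow: /wiki/*?    # block every /wiki/ URL carrying a query string
    Disallow: /w/
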
I don't think every URL with a query string qualifies as something we want to blacklist from search engines. Plenty do, but some content is served with a query string and would be well worth indexing.
For example, the non-first pages of long categories and Special:AllPages' pagination. The latter carries robots=noindex (though I think we may want to reconsider that), but the former is not noindexed, and with the introduction of rel="next" etc. it would be pretty reasonable to index; yet robots.txt currently blacklists it. Additionally, while we normally want to noindex edit pages, that isn't true of every redlink. Take redlinked category links: they point to an action=edit&redlink=1 URL which, for a search engine, would redirect back to the pretty URL for the category. But because of robots.txt that link is masked, since the crawler is forbidden from fetching the intermediate redirect.
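
Concretely, the chain looks something like this (Category:Example is just a placeholder title):

    /wiki/Some_page
      -> links the redlinked category as
    /w/index.php?title=Category:Example&action=edit&redlink=1
      -> which redirects a non-editing visitor back to
    /wiki/Category:Example

With Disallow: /w/ in robots.txt, the crawler may never fetch the middle URL, so it never discovers the final one.
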
The idea I had to fix that was to make MediaWiki aware of the distinction: whether through a new routing system or simply through filters for specific simple queries, have it output /wiki/title?query URLs for the queries we do want indexed, and leave the robots-blacklisted stuff under /w/ (though I also considered a separate short URL path like /w/page/$1 to make internal, robots-blacklisted URLs pretty). However, adding Disallow: /wiki/*? to robots.txt would preclude doing that.
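
To make that concrete, the filter logic I have in mind would look something like this sketch (illustrative Python, not actual MediaWiki code; the function name and the whitelist contents are hypothetical):

    from urllib.parse import urlencode

    # Query-parameter sets we would be happy to let search engines index.
    INDEXABLE_QUERIES = {
        ("pagefrom",),   # category pagination
        ("from",),       # Special:AllPages-style pagination
    }

    def local_url(title, query):
        """Choose the URL form based on whether the query should be indexed."""
        if not query:
            return "/wiki/" + title
        if tuple(sorted(query)) in INDEXABLE_QUERIES:
            # Indexable query: keep it on the pretty, crawlable /wiki/ path.
            return "/wiki/%s?%s" % (title, urlencode(query))
        # Everything else stays on the robots-blacklisted /w/ path.
        return "/w/index.php?%s" % urlencode(dict(title=title, **query))

So local_url("Category:Foo", {"pagefrom": "Kilo"}) would give /wiki/Category:Foo?pagefrom=Kilo, while local_url("Foo", {"action": "edit"}) stays under /w/. Once robots.txt says Disallow: /wiki/*?, there is no crawlable path left for the first case.
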
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://danielfriesen.name/]