Hi,
while tinkering with a RESTful content API I was reminded of an old pet peeve of mine: The URLs we use in Wikimedia projects are relatively long and ugly. I believe that we now have the ability to clean this up if we want to.
It would be nice to
* drop the /wiki/ prefix: https://en.wikipedia.org/Foo instead of https://en.wikipedia.org/wiki/Foo
* use simple action URLs: https://en.wikipedia.org/Foo?action=history instead of https://en.wikipedia.org/w/index.php?title=Foo&action=history
The details of this proposal are discussed in the following RFC:
https://www.mediawiki.org/wiki/Requests_for_comment/Clean_up_URLs
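For concreteness, the /wiki/-less scheme could in principle be approximated with standard Apache mod_rewrite rules. This is only a hypothetical sketch, not the RFC's text; the /w/ script path and the exclusion conditions are assumptions:

```apache
RewriteEngine On
# Don't rewrite the script path or real files/directories
# (robots.txt, favicon.ico, etc.).
RewriteCond %{REQUEST_URI} !^/w/
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
# /Foo → index.php?title=Foo; QSA preserves ?action=history etc.
RewriteRule ^/?(.*)$ /w/index.php?title=$1 [L,QSA]
```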
I'm looking forward to your input!
Gabriel
On Mon, Sep 16, 2013 at 6:12 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
- drop the /wiki/ prefix https://en.wikipedia.org/Foo instead of https://en.wikipedia.org/wiki/Foo
Where would we put the API entry point? It can't be at https://en.wikipedia.org/w/api.php because there might be an article named "w/api.php".
- use simple action urls https://en.wikipedia.org/Foo?action=history instead of https://en.wikipedia.org/w/index.php?title=Foo&action=history
This already works.
-- Tyler Romeo
Stevens Institute of Technology, Class of 2016
Major in Computer Science
www.whizkidztech.com | tylerromeo@gmail.com
On 09/16/2013 03:21 PM, Tyler Romeo wrote:
On Mon, Sep 16, 2013 at 6:12 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
- drop the /wiki/ prefix https://en.wikipedia.org/Foo instead of https://en.wikipedia.org/wiki/Foo
Where would we put the API entry point? It can't be at https://en.wikipedia.org/w/api.php because there might be an article named "w/api.php".
There *might* be, in theory. In practice I doubt that there are any articles starting with 'w/'. To avoid future conflicts, we should probably prefix private paths with an underscore as titles cannot start with it (and REST APIs often use it for special resources).
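To illustrate why a leading underscore is safe: MediaWiki treats underscores as spaces during title normalization and strips leading/trailing whitespace, so no valid title can begin with "_". A rough sketch (not the real normalization code, which handles many more cases):

```python
def normalize_title(raw: str) -> str:
    """Toy sketch of MediaWiki title normalization: underscores become
    spaces, surrounding whitespace is stripped, and the first letter is
    uppercased. Hence "_w/api.php" can never be a page title."""
    text = raw.replace("_", " ").strip()
    return text[:1].upper() + text[1:] if text else text
```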
- use simple action urls https://en.wikipedia.org/Foo?action=history instead of https://en.wikipedia.org/w/index.php?title=Foo&action=history
This already works.
Both parts of the proposal have been working for a long time. The RFC is mainly about using the capability in Wikimedia projects.
Gabriel
On Mon, Sep 16, 2013 at 6:34 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
There *might* be, in theory. In practice I doubt that there are any articles starting with 'w/'. To avoid future conflicts, we should probably prefix private paths with an underscore as titles cannot start with it (and REST APIs often use it for special resources).
When talking about URI design and REST, this has nothing to do with functionality, but with organization and logical design. In a URI, the path is considered a hierarchical structure. It doesn't make sense for api.php to be a sub-resource of the wiki itself. Even some sort of underscore design wouldn't make sense, because you'd be implying that the _images/ resource is a sub-resource at the same level as a normal article.
-- Tyler Romeo
Stevens Institute of Technology, Class of 2016
Major in Computer Science
www.whizkidztech.com | tylerromeo@gmail.com
On Tue, Sep 17, 2013 at 12:34 AM, Gabriel Wicke gwicke@wikimedia.org wrote:
In practice I doubt that there are any articles starting with 'w/'.
Actually, there are. Looking at enwiktionary only, there are 10 pages starting with "w/". Some of those are redirects (e.g. "w/r/t"), but others are normal articles (e.g. "w/", "w/e").
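Checks like this can be done with the standard list=allpages API module and its apprefix parameter. A small sketch that only builds the query URL (the endpoint and parameters are the standard API; fetching and counting the results is left out):

```python
from urllib.parse import urlencode

def prefix_query_url(site: str, prefix: str) -> str:
    """Build a MediaWiki API query URL listing all pages whose titles
    start with `prefix`, using the standard list=allpages module."""
    params = {
        "action": "query",
        "list": "allpages",
        "apprefix": prefix,
        "aplimit": "max",
        "format": "json",
    }
    return f"https://{site}/w/api.php?" + urlencode(params)
```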
Petr Onderka [[en:User:Svick]]
On 09/16/2013 04:09 PM, Petr Onderka wrote:
On Tue, Sep 17, 2013 at 12:34 AM, Gabriel Wicke gwicke@wikimedia.org wrote:
In practice I doubt that there are any articles starting with 'w/'.
Actually, there are. Looking at enwiktionary only, there are 10 pages starting with "w/". Some of those are redirects (e.g. "w/r/t"), but others are normal articles (e.g. "w/", "w/e").
Ah, ok. That would make it hard to keep /w/api.php working. /_w/api.php would not suffer from that problem, but then current API users would break.
So I guess that kills the /wiki/ removal in the shorter term. Maybe we should, however, consider using something like /_w/ if we ever introduce a new API entry point, to avoid conflicts with valid article names in the future.
Gabriel
On Mon, Sep 16, 2013 at 7:51 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
Ah, ok. That would make it hard to keep /w/api.php working. /_w/api.php would not suffer from that problem, but then current API users would break.
So I guess that kills the /wiki/ removal in the shorter term. Maybe we should, however, consider using something like /_w/ if we ever introduce a new API entry point, to avoid conflicts with valid article names in the future.
I disagree. Having separate naming conventions for our entry points just makes things more inconsistent. Also, I don't think it's even necessary to get rid of the /wiki/ in the first place. It doesn't look messy at all.
-- Tyler Romeo
Stevens Institute of Technology, Class of 2016
Major in Computer Science
www.whizkidztech.com | tylerromeo@gmail.com
On Tue, Sep 17, 2013 at 8:34 AM, Gabriel Wicke gwicke@wikimedia.org wrote:
There *might* be, in theory. In practice I doubt that there are any articles starting with 'w/'. To avoid future conflicts, we should probably prefix private paths with an underscore as titles cannot start with it (and REST APIs often use it for special resources).
I bet people have said that about single-letter interwikis, but we do have quite a few "<single letter>:" page titles around. Having "<single letter>/" is not unbelievable.
On 17/09/13 10:24, K. Peachey wrote:
On Tue, Sep 17, 2013 at 8:34 AM, Gabriel Wicke gwicke@wikimedia.org wrote:
There *might* be, in theory. In practice I doubt that there are any articles starting with 'w/'. To avoid future conflicts, we should probably prefix private paths with an underscore as titles cannot start with it (and REST APIs often use it for special resources).
I bet people have said that about single-letter interwikis, but we do have quite a few "<single letter>:" page titles around. Having "<single letter>/" is not unbelievable.
I have found 2476 pages in English Wikipedia that start with '[something]/', including pages starting with '//'. None of them start with a lowercase letter though, for obvious reasons.
On 2013-09-17 2:29 AM, Nikola Smolenski wrote:
On 17/09/13 10:24, K. Peachey wrote:
On Tue, Sep 17, 2013 at 8:34 AM, Gabriel Wicke gwicke@wikimedia.org wrote:
There *might* be, in theory. In practice I doubt that there are any articles starting with 'w/'. To avoid future conflicts, we should probably prefix private paths with an underscore as titles cannot start with it (and REST APIs often use it for special resources).
I bet people have said that about single-letter interwikis, but we do have quite a few "<single letter>:" page titles around. Having "<single letter>/" is not unbelievable.
I have found 2476 pages in English Wikipedia that start with '[something]/', including pages starting with '//'. None of them start with a lowercase letter though, for obvious reasons.
The problem with that query is you're searching Wikipedia. Try Wiktionary instead. I found 5 just on the first letter I tested: https://en.wiktionary.org/wiki/Special:PrefixIndex/a/
Also, pages prefixed with "<single letter>/" aren't the only thing that creates conflicts. As far as standard rewrite rules and webservers are concerned, a directory at /a/ and /a are the same thing. See how https://en.wikipedia.org/w is not a 404 pointing to [[w]] like https://en.wikipedia.org/a is, but instead is the same as /w/ and hence /w/index.php. So really, any single-letter article on a root-pathed wiki conflicts with any single-letter root directory. ;) And Wikipedia has a redirect like that for every single letter of the Latin alphabet. (Actually, forget the Latin alphabet; they've practically got most of Unicode there.)
Side topic: https://en.wiktionary.org/w/r/t is messed up: "To check for "r/t" on Wikipedia, see: https://en.wikipedia.org/wiki/r/t"
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://danielfriesen.name/]
On 17/09/13 11:59, Daniel Friesen wrote:
On 2013-09-17 2:29 AM, Nikola Smolenski wrote:
I have found 2476 pages in English Wikipedia that start with '[something]/', including pages starting with '//'. None of them start with a lowercase letter though, for obvious reasons.
The problem with that query is you're searching Wikipedia. Try Wiktionary instead. I found 5 just on the first letter I tested: https://en.wiktionary.org/wiki/Special:PrefixIndex/a/
There are 124, of which 63 start with a lowercase letter.
On 09/17/2013 05:59 AM, Daniel Friesen wrote:
Side topic: https://en.wiktionary.org/w/r/t is messed up: "To check for "r/t" on Wikipedia, see: https://en.wikipedia.org/wiki/r/t"
Good catch, filed: https://bugzilla.wikimedia.org/show_bug.cgi?id=54357
Matt Flaschen
Am 17.09.2013 00:34, schrieb Gabriel Wicke:
There *might* be, in theory. In practice I doubt that there are any articles starting with 'w/'.
I count 10 on en.wiktionary.org:
https://en.wiktionary.org/w/index.php?title=Special%3APrefixIndex&prefix...
To avoid future conflicts, we should probably prefix private paths with an underscore as titles cannot start with it (and REST APIs often use it for special resources).
That would be better.
But still, I think this is a bad idea. Essentially, putting articles at the root of the domain means hogging the domain as a namespace. Depending on what you want to do with your wiki, this is not a good idea.
For instance, Wikidata uses the /entity/ path for URIs representing things, while the documents under /wiki/ are descriptions of these things. If page content was located at the root, we'd have nasty namespace pollution.
Basically: page content is only one of the things a wiki may serve. "Internal" resources like CSS are another. But there may be much more, like structured data. It's good to use prefixes to keep these apart.
-- daniel
On 2013-09-17 2:48 AM, Daniel Kinzler wrote:
To avoid future conflicts, we should probably prefix private paths with an underscore as titles cannot start with it (and REST APIs often use it for special resources).
That would be better.
But still, I think this is a bad idea. Essentially, putting articles at the root of the domain means hogging the domain as a namespace. Depending on what you want to do with your wiki, this is not a good idea.
For instance, Wikidata uses the /entity/ path for URIs representing things, while the documents under /wiki/ are descriptions of these things. If page content was located at the root, we'd have nasty namespace pollution.
Basically: page content is only one of the things a wiki may serve. "Internal" resources like CSS are another. But there may be much more, like structured data. It's good to use prefixes to keep these apart.
-- daniel
+1
We've got others for content-related things too besides ones for internal resources and structured data.
e.g.: https://test2.wikipedia.org/s/85
((And I'll try to resist starting a rant about the knockoff "REST" which is a partial premise here))
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://danielfriesen.name/]
On 09/17/2013 02:48 AM, Daniel Kinzler wrote:
Am 17.09.2013 00:34, schrieb Gabriel Wicke:
There *might* be, in theory. In practice I doubt that there are any articles starting with 'w/'.
I count 10 on en.wiktionary.org:
https://en.wiktionary.org/w/index.php?title=Special%3APrefixIndex&prefix...
The good news is that none of them is /w/{index,api,load}.php ;)
To avoid future conflicts, we should probably prefix private paths with an underscore as titles cannot start with it (and REST APIs often use it for special resources).
That would be better.
But still, I think this is a bad idea. Essentially, putting articles at the root of the domain means hogging the domain as a namespace. Depending on what you want to do with your wiki, this is not a good idea.
I agree that it does not make sense to place the wiki at the root level if you are running (or plan to run) other services on the domain. On Wikipedia, the wiki is the primary use case. Optimizing for the common use case can be a good idea.
Basically: page content is only one of the things a wiki may serve. "Internal" resources like CSS are another. But there may be much more, like structured data. It's good to use prefixes to keep these apart.
For different representations of the same resource there is also much to be said for suffixes, even if some of those representations are not visual. Additionally, we have namespaces as a prefix mechanism within a wiki. There will surely be cases where leaving the wiki makes sense, but I am hesitant to discard the flat wiki namespace all too quickly.
Gabriel
On Tue, 17 Sep 2013, at 7:51, Tyler Romeo wrote:
On Mon, Sep 16, 2013 at 6:12 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
- use simple action urls https://en.wikipedia.org/Foo?action=history instead of https://en.wikipedia.org/w/index.php?title=Foo&action=history
This already works.
I would be concerned about this feature working properly in wikilinks. [[Main Page?action=history|Foo]] makes a broken red link.
On 09/27/2013 06:03 AM, Gryllida wrote:
I would be concerned about this feature working properly in wikilinks. [[Main Page?action=history|Foo]] makes a broken red link.
So does:
[[/w/index.php?title=Main Page|Foo]]
Neither would be expected to work. Anything to the left of the pipe in your example is considered a page title. I don't think anything about wikilink parsing (or any parsing) is proposed to change.
Matt Flaschen
On Mon, Sep 16, 2013 at 3:12 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
Hi,
while tinkering with a RESTful content API I was reminded of an old pet peeve of mine: The URLs we use in Wikimedia projects are relatively long and ugly. I believe that we now have the ability to clean this up if we want to.
It would be nice to
drop the /wiki/ prefix https://en.wikipedia.org/Foo instead of https://en.wikipedia.org/wiki/Foo
use simple action urls https://en.wikipedia.org/Foo?action=history instead of https://en.wikipedia.org/w/index.php?title=Foo&action=history
The details of this proposal are discussed in the following RFC:
https://www.mediawiki.org/wiki/Requests_for_comment/Clean_up_URLs
I'm looking forward to your input!
https://www.mediawiki.org/wiki/Manual:Short_URL#URL_like_-_example.com.2FPag...
"*Warning:* this method may create an unstable URL structure and leave some page names unusable on your wiki. See Manual:Wiki in site root directory <https://www.mediawiki.org/wiki/Manual:Wiki_in_site_root_directory>. Please see the article Cool URIs don't change <http://www.w3.org/Provider/Style/URI> and take a few minutes to devise a stable URL structure for your web site before hopping willy-nilly into rewrites into the URL root."
- Ryan
On 09/16/2013 03:25 PM, Ryan Lane wrote:
https://www.mediawiki.org/wiki/Manual:Short_URL#URL_like_-_example.com.2FPag...
"*Warning:* this method may create an unstable URL structure and leave some page names unusable on your wiki. See Manual:Wiki in site root directory <https://www.mediawiki.org/wiki/Manual:Wiki_in_site_root_directory>. Please see the article Cool URIs don't change <http://www.w3.org/Provider/Style/URI> and take a few minutes to devise a stable URL structure for your web site before hopping willy-nilly into rewrites into the URL root."
That is a very vague warning. So far I have lower-case 'favicon.ico', 'robots.txt' and 'w/' as potential conflicts. Do you see any others?
In general, I see removing /wiki/ as the less important part of the RFC. Using sub-resources rather than the random switch to /w/index.php is more important for caching (promotes deterministic URLs) and does not seem to involve similar trade-offs.
Gabriel
On Mon, Sep 16, 2013 at 4:41 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
On 09/16/2013 03:25 PM, Ryan Lane wrote:
https://www.mediawiki.org/wiki/Manual:Short_URL#URL_like_-_example.com.2FPag...
"*Warning:* this method may create an unstable URL structure and leave some page names unusable on your wiki. See Manual:Wiki in site root directory <https://www.mediawiki.org/wiki/Manual:Wiki_in_site_root_directory>. Please see the article Cool URIs don't change <http://www.w3.org/Provider/Style/URI> and take a few minutes to devise a stable URL structure for your web site before hopping willy-nilly into rewrites into the URL root."
That is a very vague warning. So far I have lower-case 'favicon.ico', 'robots.txt' and 'w/' as potential conflicts. Do you see any others?
Any of the entry points? Any new entry point? Anything we ever want to put into the root?
- Ryan
On 09/16/2013 04:42 PM, Ryan Lane wrote:
On Mon, Sep 16, 2013 at 4:41 PM, Gabriel Wicke <gwicke@wikimedia.org mailto:gwicke@wikimedia.org> wrote:
That is a very vague warning. So far I have lower-case 'favicon.ico', 'robots.txt' and 'w/' as potential conflicts. Do you see any others?
Any of the entry points? Any new entry point? Anything we ever want to put into the root?
We should be able to avoid most conflicts by picking prefixed entry points. However, as we can't drop the clashing /w/api.php any time soon, I have removed the /wiki/ part from the RFC:
https://www.mediawiki.org/wiki/Requests_for_comment/Clean_up_URLs
So now only the conversion from
/w/index.php?title=foo&action=history to /foo?action=history
is under discussion.
Gabriel
On 2013-09-16 7:09 PM, Gabriel Wicke wrote:
Any of the entry points? Any new entry point? Anything we ever want to put into the root?
We should be able to avoid most conflicts by picking prefixed entry points. However, as we can't drop the clashing /w/api.php any time soon, I have removed the /wiki/ part from the RFC:
https://www.mediawiki.org/wiki/Requests_for_comment/Clean_up_URLs
So now only the conversion from
/w/index.php?title=foo&action=history to /foo?action=history
is under discussion.
Gabriel
Has the practice of disallowing /w/ or /index.php inside robots.txt, to force search engines to completely ignore search, edit pages, exponential pagination, etc., been considered?
Btw, side note on root urls. We still have an open bug allowing attacks on wikis using root paths: https://bugzilla.wikimedia.org/show_bug.cgi?id=38048
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://danielfriesen.name/]
On 09/16/2013 07:24 PM, Daniel Friesen wrote:
On 2013-09-16 7:09 PM, Gabriel Wicke wrote:
Any of the entry points? Any new entry point? Anything we ever want to put into the root?
We should be able to avoid most conflicts by picking prefixed entry points. However, as we can't drop the clashing /w/api.php any time soon, I have removed the /wiki/ part from the RFC:
https://www.mediawiki.org/wiki/Requests_for_comment/Clean_up_URLs
So now only the conversion from
/w/index.php?title=foo&action=history to /foo?action=history
is under discussion.
Gabriel
Has the practice of disallowing /w/ or /index.php inside robots.txt, to force search engines to completely ignore search, edit pages, exponential pagination, etc., been considered?
See https://www.mediawiki.org/wiki/Requests_for_comment/Clean_up_URLs#Migration
Btw, side note on root urls. We still have an open bug allowing attacks on wikis using root paths: https://bugzilla.wikimedia.org/show_bug.cgi?i
That looks like a fixable bug. In Parsoid, for example, all internal links are relative, which avoids the protocol-relative URL issue you reported there.
Gabriel
I would suggest taking a look at the number of 404s caused by people trying to access pages without the wiki prefix... This would be interesting data to go alongside this interesting proposal...
On 16 Sep 2013 20:01, "Gabriel Wicke" gwicke@wikimedia.org wrote:
On 09/16/2013 07:24 PM, Daniel Friesen wrote:
On 2013-09-16 7:09 PM, Gabriel Wicke wrote:
Any of the entry points? Any new entry point? Anything we ever want to put into the root?
We should be able to avoid most conflicts by picking prefixed entry points. However, as we can't drop the clashing /w/api.php any time soon, I have removed the /wiki/ part from the RFC:
https://www.mediawiki.org/wiki/Requests_for_comment/Clean_up_URLs
So now only the conversion from
/w/index.php?title=foo&action=history to /foo?action=history
is under discussion.
Gabriel
Has the practice of disallowing /w/ or /index.php inside robots.txt, to force search engines to completely ignore search, edit pages, exponential pagination, etc., been considered?
See https://www.mediawiki.org/wiki/Requests_for_comment/Clean_up_URLs#Migration
Btw, side note on root urls. We still have an open bug allowing attacks on wikis using root paths: https://bugzilla.wikimedia.org/show_bug.cgi?i
That looks like a fixable bug. In Parsoid, for example, all internal links are relative, which avoids the protocol-relative URL issue you reported there.
Gabriel
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On 17/09/13 13:59, Jon Robson wrote:
I would suggest taking a look at the number of 404s caused by people trying to access pages without the wiki prefix.... This would be interesting data to go alongside this interesting proposal...
There are lots of different sorts of 404s, so it's necessary to do some filtering. For example:
* double-slashes, due to bug 52253
* sitemap.xml
* Apple touch icons
* bullet.gif in various directories
* vulnerability scanning, e.g. xmlrpc.php
* BlueCoat verify/notify, as described in http://www.webmasterworld.com/search_engine_spiders/3859463.htm
* Serial numbers like http://en.wikipedia.org/B008NAYASM
I filtered out everything with a dot or slash in the prospective article title, as well as the BlueCoat URLs and the UAs responsible for serial number URLs. To simplify analysis, I took log lines from the English Wikipedia only.
Most of the remaining log entries were search engine crawlers, so I took those out too.
The result was 149 log entries at a 1/1000 sample rate, for the week of September 8-14, implying a request rate of about 639,000 per month. This is about 0.006% of the English Wikipedia's page view rate.
The 149 URLs are at http://paste.tstarling.com/p/uhtFqg.html
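The filtering steps above can be sketched in a few lines of Python. This is a hypothetical reconstruction, not Tim's actual script; the log format (method, path, status, quoted user agent) and the UA keywords are assumptions:

```python
import re

SAMPLE_RATE = 1000  # the logs are sampled at 1/1000

def candidate_article_404s(lines):
    """Keep only 404s whose path could plausibly be an article title:
    no dot or slash in the title, and not from crawler-like UAs."""
    out = []
    for line in lines:
        m = re.match(r'\S+ /([^ ?]*) 404 "([^"]*)"', line)
        if not m:
            continue
        title, ua = m.groups()
        if "." in title or "/" in title:   # files, double-slashes, etc.
            continue
        if "bluecoat" in ua.lower() or "bot" in ua.lower():
            continue
        out.append(title)
    return out

# 149 sampled hits over one week scale to roughly
# 149 * 1000 * 30 / 7 ≈ 639,000 requests per month, as stated above.
```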
-- Tim Starling
Note also that zhwiki (and others?) profitably uses the first part of the path to do variant selection.
https://zh.wikipedia.org/wiki/User:Cscott uses the wiki default variant (if logged in, the variant from the user's preferences).
https://zh.wikipedia.org/zh-hans/User:Cscott, https://zh.wikipedia.org/zh-hk/User:Cscott, etc. use the specified variant.
I have a dream to eventually enable https://en.wikipedia.org/en-gb/Football in a similar fashion. --scott
Thanks, Tim, for running those data. That seems to suggest the URL structure works in most cases.
On Wed, Sep 18, 2013 at 12:07 AM, Tim Starling tstarling@wikimedia.org wrote:
On 17/09/13 13:59, Jon Robson wrote:
I would suggest taking a look at the number of 404s caused by people trying to access pages without the wiki prefix.... This would be interesting data to go alongside this interesting proposal...
There are lots of different sorts of 404s, so it's necessary to do some filtering. For example:
- double-slashes, due to bug 52253
- sitemap.xml
- Apple touch icons
- bullet.gif in various directories
- vulnerability scanning, e.g. xmlrpc.php
- BlueCoat verify/notify, as described in
http://www.webmasterworld.com/search_engine_spiders/3859463.htm
- Serial numbers like http://en.wikipedia.org/B008NAYASM .
I filtered out everything with a dot or slash in the prospective article title, as well as the BlueCoat URLs and the UAs responsible for serial number URLs. To simplify analysis, I took log lines from the English Wikipedia only.
Most of the remaining log entries were search engine crawlers, so I took those out too.
The result was 149 log entries at a 1/1000 sample rate, for the week of September 8-14, implying a request rate of about 639,000 per month. This is about 0.006% of the English Wikipedia's page view rate.
The 149 URLs are at http://paste.tstarling.com/p/uhtFqg.html
-- Tim Starling
On 20/09/13 03:04, Jon Robson wrote:
Thanks, Tim, for running those data. That seems to suggest the URL structure works in most cases.
I think the request rate for actual articles in the root is very, very low. And if you look at the paste I gave earlier:
http://paste.tstarling.com/p/uhtFqg.html
there's reason to think that the amount of traffic that comes from naive readers typing URLs and expecting an article is much smaller than even 149k per week. A naive user would be more likely to type a URL starting with a lower-case letter, and if you take those entries, and filter out the obvious client bugs and typos, that leaves only 39 log entries. If we filter out some more log entries that are unlikely search terms for Wikipedia articles ("enregistrement-audio-musique", "is", "unlimited_data_plan", etc.), that leaves maybe 30. http://paste.tstarling.com/p/KWuHif.html
Of these, only 12 actually correspond to Wikipedia articles or redirects:
abolition, addicting_games, apple_inc, carnaval, dreamshade, facade, girls, insidious, karthik, online_coupons, snam, walkabout
So the number of naive readers actually helped by our 404 Refresh to /wiki/ is probably closer to 12k per week than 149k per week.
Personally, I think the refresh is annoying, since it makes it much more difficult to correct typos in manually-typed URLs. If you actually meant to type some non-article URL like a CSS resource, and make a typo which causes it to hit the refresh, the URL you typed is erased from your browser's address bar and history, making correction of the typo much more difficult. Maybe we should just include a link to the search page, rather than redirect or refresh.
-- Tim Starling
Tim Starling wrote:
Personally, I think the refresh is annoying, since it makes it much more difficult to correct typos in manually-typed URLs. If you actually meant to type some non-article URL like a CSS resource, and make a typo which causes it to hit the refresh, the URL you typed is erased from your browser's address bar and history, making correction of the typo much more difficult. Maybe we should just include a link to the search page, rather than redirect or refresh.
Mark Ryan redesigned the 404 page in 2009 and specifically removed the meta refresh tag (cf. https://bugs.wikimedia.org/17316#c0).
The redesigned page eventually got deployed, but the client-side refresh very sneakily moved from the HTML output to a Refresh header (cf. https://bugs.wikimedia.org/35052#c0).
Neither bug is resolved, if anyone is interested in helping out. :-)
MZMcBride
On 19 Sep 2013 18:23, "Tim Starling" tstarling@wikimedia.org wrote:
On 20/09/13 03:04, Jon Robson wrote:
Thanks, Tim, for running those data. That seems to suggest the URL structure works in most cases.
I think the request rate for actual articles in the root is very, very low.
I agree. Sorry, I guess my message wasn't so clear. I meant the "existing" URL structure :)
And if you look at the paste I gave earlier:
http://paste.tstarling.com/p/uhtFqg.html
there's reason to think that the amount of traffic that comes from naive readers typing URLs and expecting an article is much smaller than even 149k per week. A naive user would be more likely to type a URL starting with a lower-case letter, and if you take those entries, and filter out the obvious client bugs and typos, that leaves only 39 log entries. If we filter out some more log entries that are unlikely search terms for Wikipedia articles ("enregistrement-audio-musique", "is", "unlimited_data_plan", etc.), that leaves maybe 30. http://paste.tstarling.com/p/KWuHif.html
Of these, only 12 actually correspond to Wikipedia articles or redirects:
abolition, addicting_games, apple_inc, carnaval, dreamshade, facade, girls, insidious, karthik, online_coupons, snam, walkabout
So the number of naive readers actually helped by our 404 Refresh to /wiki/ is probably closer to 12k per week than 149k per week.
Personally, I think the refresh is annoying, since it makes it much more difficult to correct typos in manually-typed URLs. If you actually meant to type some non-article URL like a CSS resource, and make a typo which causes it to hit the refresh, the URL you typed is erased from your browser's address bar and history, making correction of the typo much more difficult. Maybe we should just include a link to the search page, rather than redirect or refresh.
-- Tim Starling
On 09/19/2013 10:04 AM, Jon Robson wrote:
Thanks, Tim, for running those data. That seems to suggest the URL structure works in most cases.
It certainly confirms that search engines link to working links, and that users typing URLs manually are rare and (eventually) learn to prefix /wiki/. I am not that convinced that the current number of 404s says much about the user-friendliness or aesthetics of different URL schemes, but that is beside the point (and subjective).
I see /w/index.php?title=.. as the more important clean-up, which is why the RFC is only about that aspect.
Gabriel
On 2013-09-16 8:01 PM, Gabriel Wicke wrote:
On 09/16/2013 07:24 PM, Daniel Friesen wrote:
On 2013-09-16 7:09 PM, Gabriel Wicke wrote:
Any of the entry points? Any new entry point? Anything we ever want to put into the root?
We should be able to avoid most conflicts by picking prefixed entry points. However, as we can't drop the clashing /w/api.php any time soon, I have removed the /wiki/ part from the RFC:
https://www.mediawiki.org/wiki/Requests_for_comment/Clean_up_URLs
So now only the conversion from
/w/index.php?title=foo&action=history to /foo?action=history
is under discussion.
Gabriel
Has the practice of disallowing /w/ or /index.php inside robots.txt, to force search engines to completely ignore search, edit pages, exponential pagination, etc., been considered?
See https://www.mediawiki.org/wiki/Requests_for_comment/Clean_up_URLs#Migration
Ok. Though even assuming the * and Allow: non-standard features are supported by all the bots we want to target, I actually don't like the idea of blacklisting /wiki/*? in this way.
I don't think that every URL with a query in it qualifies as something we want to blacklist from search engines. There are plenty that do, but sometimes there is content served with a query that it could be a good idea to index.
For example, the non-first pages of long categories and Special:Allpages' pagination. The latter has robots=noindex (though I think we may want to reconsider that), but the former is not noindexed and, with the introduction of rel="next", etc., would be pretty reasonable to index, yet is currently blacklisted by robots.txt. Additionally, while we normally want to noindex edit pages, this isn't true of redlinks in every case. Take redlinked category links, for example. These link to an action=edit&redlink=1 URL which, for a search engine, would then redirect back to the pretty URL for the category. But because of robots.txt this link is masked, because the intermediate redirect cannot be read by the search engine.
The idea I had to fix that naturally was to make MediaWiki aware of this and, whether by a new routing system or simply by filters for specific simple queries, make it output /wiki/title?query URLs for those cases where it's a query we would want indexed, and leave robots-blacklisted stuff under /w/ (though I did also consider a separate short URL path like /w/page/$1 to make internal/robots-blacklisted URLs pretty). However, adding Disallow: /wiki/*? to robots.txt would preclude the ability to do that.
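That routing idea could be sketched as a simple whitelist of indexable query parameters. Everything here is hypothetical: the parameter names and the function are illustration only, not an actual MediaWiki feature:

```python
from urllib.parse import urlencode

# Hypothetical whitelist of query parameters worth indexing
# (e.g. category pagination); all other queries stay under the
# robots-blacklisted /w/ path.
INDEXABLE_PARAMS = {"pagefrom", "pageuntil"}

def url_for(title: str, query: dict) -> str:
    """Emit a pretty /wiki/ URL when every query parameter is
    whitelisted; otherwise fall back to /w/index.php."""
    if set(query) <= INDEXABLE_PARAMS:
        suffix = f"?{urlencode(query)}" if query else ""
        return f"/wiki/{title}{suffix}"
    return f"/w/index.php?{urlencode({'title': title, **query})}"
```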
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://danielfriesen.name/]
On Tue, Sep 17, 2013 at 2:09 AM, Gabriel Wicke gwicke@wikimedia.org wrote:
So now only the conversion from
/w/index.php?title=foo&action=history to /foo?action=history
Do you mean:
to /wiki/foo?action=history
?
is under discussion.
See also https://gerrit.wikimedia.org/r/51595 and RT# 864 (aka https://bugzilla.wikimedia.org/21919 ) which all seem to prefer docroot verification rather than DNS.
-Jeremy
On 09/16/2013 08:48 PM, Jeremy Baron wrote:
On Tue, Sep 17, 2013 at 2:09 AM, Gabriel Wicke gwicke@wikimedia.org wrote:
/w/index.php?title=foo&action=history to /foo?action=history
Do you mean:
to /wiki/foo?action=history
Yes, sorry. The RFC had it right, in case you read that ;)
Gabriel
On Mon, Sep 16, 2013 at 7:41 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
Using sub-resources rather than the random switch to /w/index.php is more important for caching (promotes deterministic URLs) and does not seem to involve similar trade-offs.
Note that "promotes deterministic URLs" applies only to cases where only one parameter other than 'title' is provided to index.php (usually this parameter is 'action'). If the URL has more than one parameter other than 'title', you're still out of luck.
"But you can turn on $wgActionPaths to remove 'action' from the query string too!" you say? But then you're still stuck if the URL has two parameters other than 'action' and 'title'. Such as "offset" and "limit", for example.
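The "canonical ordering" idea that comes up later in this thread can be sketched as a URL normalizer (a sketch, not anything MediaWiki or Varnish actually does): sort the query string so that any parameter order maps to a single deterministic cache key.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def canonical_url(url):
    """Normalize a URL into a deterministic form usable as a cache key:
    sort query parameters by name so ?offset=20&limit=50 and
    ?limit=50&offset=20 map to the same cache entry."""
    parts = urlsplit(url)
    pairs = sorted(parse_qsl(parts.query))
    return urlunsplit(parts._replace(query=urlencode(pairs)))

a = canonical_url("/wiki/Foo?offset=20&limit=50")
b = canonical_url("/wiki/Foo?limit=50&offset=20")
print(a == b)  # True
```

This is exactly the normalization a cache frontend would have to perform (or a redirect would have to enforce) if multi-parameter URLs are to stay cacheable.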
On 09/17/2013 08:40 AM, Brad Jorsch (Anomie) wrote:
On Mon, Sep 16, 2013 at 7:41 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
Using sub-resources rather than the random switch to /w/index.php is more important for caching (promotes deterministic URLs) and does not seem to involve similar trade-offs.
Note that "promotes deterministic URLs" applies only to cases where only one parameter other than 'title' is provided to index.php (usually this parameter is 'action'). If the URL has more than one parameter other than 'title', you're still out of luck.
An end point that wants to be cacheable should only use one query parameter, which might well be a path. Hypothetical examples:
http://wiki.org/wiki/Foo?r=latest/html http://wiki.org/wiki/Foo?r=123456/wikitext
An alternative solution would be to specify a list of required query parameters and a canonical ordering, and to reject (or redirect) requests not conforming to this spec. The problem I see with this approach is that many client libraries don't provide control over the order of query parameters, which would make such an interface hard to use.
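The single-parameter scheme in the hypothetical examples above could be parsed like this. Note the 'r' parameter name and the revision/format split are taken from those examples, which Gabriel explicitly labels hypothetical; this is not a real endpoint.

```python
from urllib.parse import urlsplit, parse_qs

def parse_subresource(url):
    """Split a ?r=<revision>/<format> query into its two parts.
    'latest' is a symbolic revision; anything else is expected
    to be a numeric revision id."""
    query = parse_qs(urlsplit(url).query)
    rev, _, fmt = query["r"][0].partition("/")
    return rev, fmt

print(parse_subresource("http://wiki.org/wiki/Foo?r=latest/html"))     # ('latest', 'html')
print(parse_subresource("http://wiki.org/wiki/Foo?r=123456/wikitext")) # ('123456', 'wikitext')
```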
Gabriel
On Tue, Sep 17, 2013 at 12:27 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
An end point that wants to be cacheable should only use one query parameter, which might well be a path. Hypothetical examples:
http://wiki.org/wiki/Foo?r=latest/html http://wiki.org/wiki/Foo?r=123456/wikitext
So now you're cramming multiple parameters, ordered, into one parameter? Why not go all the way and do http://wiki.org/wiki/123456/wikitext/Foo then?
But IMO, that's ridiculous.
An alternative solution would be to specify a list of required query parameters and a canonical ordering, and to reject (or redirect) requests not conforming to this spec.
"reject" is even more ridiculous. "redirect" is less ridiculous, but is strange and will increase latency and number-of-requests for clients that don't know the magic order.
What is the actual benefit we're trying to get here? All I've gotten so far along those lines is "improve cacheability", but it doesn't seem to have been established whether caching even needs improving in this area.
On 09/17/2013 11:24 AM, Brad Jorsch (Anomie) wrote:
On Tue, Sep 17, 2013 at 12:27 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
An end point that wants to be cacheable should only use one query parameter, which might well be a path. Hypothetical examples:
http://wiki.org/wiki/Foo?r=latest/html http://wiki.org/wiki/Foo?r=123456/wikitext
So now you're cramming multiple parameters, ordered, into one parameter? Why not go all the way and do http://wiki.org/wiki/123456/wikitext/Foo then?
I consider the article to be the main resource we are interested in, with a revision and then a specific part (format) of that revision as a sub-resource. As our titles can contain slashes we need to delimit the main resource from the sub-resource part. A single query parameter that specifies the sub-resource path achieves that.
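To make the slash problem concrete: in a pure path scheme, /wiki/Foo/bar/latest/html could mean the article "Foo" with sub-resource "bar/latest/html" or the article "Foo/bar" with sub-resource "latest/html", and there is no way to tell. With the query form, the '?' unambiguously delimits title from sub-resource. A sketch (URLs hypothetical, as above):

```python
from urllib.parse import urlsplit, parse_qs

def split_title(url):
    """Separate the (possibly slash-containing) title from the
    sub-resource path; the '?' makes the split unambiguous."""
    parts = urlsplit(url)
    title = parts.path[len("/wiki/"):]
    sub = parse_qs(parts.query).get("r", [None])[0]
    return title, sub

print(split_title("http://wiki.org/wiki/Foo/bar?r=latest/html"))  # ('Foo/bar', 'latest/html')
```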
What is the actual benefit we're trying to get here? All I've gotten so far along those lines is "improve cacheability", but it doesn't seem to have been established whether caching even needs improving in this area.
A heavily-used content API will perform better and use fewer resources when it is cacheable. This will become more important over time, so I believe it is worth spending a small amount of effort on now.
Gabriel
Gabriel Wicke wrote:
A heavily-used content API will perform better and use fewer resources when it is cacheable. This will become more important over time, so I believe it is worth spending a small amount of effort on now.
Sure, I think everyone agrees that a heavily used Web resource will perform better with caching. I'm just not sure futzing around with path names is the best way to try to ensure sustainable cacheability.
Is there a breakdown of what in a typical MediaWiki API request takes the most time or uses the most resources (i.e., profiling a local request)? I imagine there are multiple caching opportunities at other layers that don't rely on path name, but it's difficult to say where you might see the most gains without further data.
MZMcBride
On Mon, Sep 16, 2013 at 3:12 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
Hi,
while tinkering with a RESTful content API I was reminded of an old pet peeve of mine: The URLs we use in Wikimedia projects are relatively long and ugly. I believe that we now have the ability to clean this up if we want to.
It would be nice to
drop the /wiki/ prefix https://en.wikipedia.org/Foo instead of https://en.wikipedia.org/wiki/Foo
use simple action urls https://en.wikipedia.org/Foo?action=history instead of https://en.wikipedia.org/w/index.php?title=Foo&action=history
The details of this proposal are discussed in the following RFC:
https://www.mediawiki.org/wiki/Requests_for_comment/Clean_up_URLs
I'm looking forward to your input!
Even better would be getting rid of action urls entirely.
-Chad
Chad wrote:
On Mon, Sep 16, 2013 at 3:12 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
It would be nice to
drop the /wiki/ prefix https://en.wikipedia.org/Foo instead of https://en.wikipedia.org/wiki/Foo
use simple action urls https://en.wikipedia.org/Foo?action=history instead of https://en.wikipedia.org/w/index.php?title=Foo&action=history
The details of this proposal are discussed in the following RFC:
https://www.mediawiki.org/wiki/Requests_for_comment/Clean_up_URLs
Even better would be getting rid of action urls entirely.
In favor of what? Special page URLs?
A variant on https://en.wikipedia.org/Foo?action=history is https://en.wikipedia.org/history/Foo (using $wgActionPaths).
The RFC currently seems to gloss over what problem is attempting to be solved here and what benefits a new URL structure might bring. I'd like to see a clearer statement of a problem and benefits to a switch, taking into account, for example, the overarching goal of making URLs fully localized.
MZMcBride
----- Original Message -----
From: "MZMcBride" z@mzmcbride.com
The RFC currently seems to gloss over what problem is attempting to be solved here and what benefits a new URL structure might bring. I'd like to see a clearer statement of a problem and benefits to a switch, taking into account, for example, the overarching goal of making URLs fully localized.
Concur, especially in light of the fact that *this does not permit you to break the old URLs*. They are everywhere, *and they must continue to work forever*.
I hope I don't even have to justify why.
Cheers, -- jra
On Mon, Sep 16, 2013 at 3:36 PM, MZMcBride z@mzmcbride.com wrote:
The RFC currently seems to gloss over what problem is attempting to be solved here and what benefits a new URL structure might bring. I'd like to see a clearer statement of a problem and benefits to a switch, taking into account, for example, the overarching goal of making URLs fully localized.
How about the following?
Our current URL structure is extremely obtuse for non-technical users, and generally defies their expectations. To most people, en.wikipedia.org/Dog or even wikipedia.org/Dog should work just fine, not produce a 404.
On Mon, Sep 16, 2013 at 8:20 PM, Steven Walling steven.walling@gmail.com wrote:
Our current URL structure is extremely obtuse for non-technical users, and generally defies their expectations. To most people, en.wikipedia.org/Dog or even wikipedia.org/Dog should work just fine, not produce a 404.
To be fair, both of those links redirect to the proper URL anyway. It wouldn't be hard to just change that from 404 to a redirect. Nonetheless the canonical URI should still be /wiki/Article_title.
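The 404-to-redirect change could be as small as the following sketch (not actual MediaWiki routing code; the reserved-prefix list is an assumption): unknown top-level paths get a 301 to the canonical /wiki/ form.

```python
from urllib.parse import quote

# Hypothetical list of path prefixes that are never article titles.
RESERVED_PREFIXES = ("/wiki/", "/w/", "/static/")

def canonicalize(path):
    """Map a bare /Foo request to its 301 redirect target /wiki/Foo;
    return None for already-canonical or reserved paths."""
    if path == "/" or path.startswith(RESERVED_PREFIXES):
        return None  # no redirect needed
    return "/wiki/" + quote(path.lstrip("/"))

print(canonicalize("/Dog"))       # '/wiki/Dog'
print(canonicalize("/wiki/Dog"))  # None
```

Keeping /wiki/Article_title canonical (as Tyler suggests) means the redirect is purely a convenience layer; nothing needs to change in what pages advertise as their own URL.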
-- Tyler Romeo Stevens Institute of Technology, Class of 2016 Major in Computer Science www.whizkidztech.com | tylerromeo@gmail.com
----- Original Message -----
From: "Steven Walling" steven.walling@gmail.com
How about the following?
Our current URL structure is extremely obtuse for non-technical users, and generally defies their expectations. To most people, en.wikipedia.org/Dogor even wikipedia.org/Dog should work just fine, not produce a 404.
Any collection of "most people" large enough to justify a change like this is, I assert, too technically unsophisticated to be attempting to construct URLs by hand (rather than by copy/pasta).
Do you propose to "fix" also the capitalization and spacing and URL-escaping rules, which are much more complicated than that?
My considered reaction, now after several hours, is that this is fixing a problem which is not really broken for *anyone* except those who are OCD about hiding the "tech-y" look in the Location box. No offense. :-)
Cheers, -- jra
On 2013-09-16 7:12 PM, "Gabriel Wicke" gwicke@wikimedia.org wrote:
Hi,
while tinkering with a RESTful content API I was reminded of an old pet peeve of mine: The URLs we use in Wikimedia projects are relatively long and ugly. I believe that we now have the ability to clean this up if we want to.
It would be nice to
drop the /wiki/ prefix https://en.wikipedia.org/Foo instead of https://en.wikipedia.org/wiki/Foo
use simple action urls https://en.wikipedia.org/Foo?action=history instead of https://en.wikipedia.org/w/index.php?title=Foo&action=history
The details of this proposal are discussed in the following RFC:
I'm looking forward to your input!
Gabriel
While I'm not particularly fond of this idea (probably because I'm stuck in my ways more than anything else), I do think that making en.wikipedia.org/foo an instant HTTP redirect instead of the "did you mean/redirecting in 5 seconds" message we currently have might make sense.
Additionally, there are some security issues in IE6 when doing foo?action=raw, if I recall.
-bawolff
On 09/16/2013 04:34 PM, Brian Wolff wrote:
Additionally, there are some security issues in IE6 when doing foo?action=raw, if I recall.
Yes, IIRC some version of IE disregarded the Content-type header and guessed the content type based on the URL and the content. If the URL contained .php (only outside the query string?), it disabled this behavior.
Tim mentions in https://www.mediawiki.org/wiki/Special:Code/MediaWiki/49833#c3561 that this only applied to IE3 and earlier, and IE4 respects the Content-type header. As the market share of IE <= 3 is probably non-existent we could probably blacklist it from logging in and content API access altogether.
According to [1] and [2] there is also a 'X-Content-Type-Options: nosniff' header that disables this behavior for IE and Chrome. I doubt that it works in IE3 though. Anybody up for some testing with an ancient IE3 install?
Gabriel
[1]: http://msdn.microsoft.com/en-us/library/dd565661(v=vs.85).aspx [2]: https://www.owasp.org/index.php/List_of_useful_HTTP_headers
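For reference, sending that header is a one-liner in any server stack; here is a minimal sketch with Python's standard library (the content type and body are illustrative, not what MediaWiki sends):

```python
from http.server import BaseHTTPRequestHandler

class RawHandler(BaseHTTPRequestHandler):
    """Serve wikitext with an explicit type, and tell browsers that
    honor X-Content-Type-Options (IE8+, Chrome) not to second-guess
    it by sniffing the body for HTML."""
    def do_GET(self):
        body = b"== Not HTML ==\n<script>alert(1)</script>"
        self.send_response(200)
        self.send_header("Content-Type", "text/x-wiki; charset=utf-8")
        self.send_header("X-Content-Type-Options", "nosniff")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

Without nosniff, a sniffing browser could decide the body "looks like" HTML and execute the script despite the declared text type.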
On 17/09/13 11:08, Gabriel Wicke wrote:
On 09/16/2013 04:34 PM, Brian Wolff wrote:
Additionally, there are some security issues in IE6 when doing foo?action=raw, if I recall.
Yes, IIRC some version of IE disregarded the Content-type header and guessed the content type based on the URL and the content. If the URL contained .php (only outside the query string?), it disabled this behavior.
Tim mentions in https://www.mediawiki.org/wiki/Special:Code/MediaWiki/49833#c3561 that this only applied to IE3 and earlier, and IE4 respects the Content-type header. As the market share of IE <= 3 is probably non-existent we could probably blacklist it from logging in and content API access altogether.
This issue affects IE at least up to IE 6, possibly later, see bug 28235.
-- Tim Starling
On 09/16/2013 07:48 PM, Tim Starling wrote:
On 17/09/13 11:08, Gabriel Wicke wrote:
Tim mentions in https://www.mediawiki.org/wiki/Special:Code/MediaWiki/49833#c3561 that this only applied to IE3 and earlier, and IE4 respects the Content-type header. As the market share of IE <= 3 is probably non-existent we could probably blacklist it from logging in and content API access altogether.
This issue affects IE at least up to IE 6, possibly later, see bug 28235.
Thanks for the pointer! It is sad that IE6 (and likely IE7) is still haunting us. IE8+ is covered by the X-Content-Type-Options header.
It sounds like your Content-Disposition solution [1] should still work for IE6/7 where that header is not used otherwise. The existing users of that header all seem to be file-related. Did I miss any use in action handlers?
Gabriel
[1]: https://bugzilla.wikimedia.org/show_bug.cgi?id=28235#c6
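A sketch of what combining the two defenses from this subthread might look like, as a plain header map; the exact values are my assumptions for illustration, not what the bug 28235 patch actually emits:

```python
def raw_output_headers(filename="raw.txt"):
    """Headers for action=raw-style output: an explicit content type,
    nosniff for browsers that honor X-Content-Type-Options (IE8+,
    Chrome), and a Content-Disposition fallback intended to keep
    older IE (6/7) from sniffing the body as HTML."""
    return {
        "Content-Type": "text/x-wiki; charset=utf-8",
        "X-Content-Type-Options": "nosniff",
        "Content-Disposition": 'inline; filename="%s"' % filename,
    }

print(raw_output_headers()["X-Content-Type-Options"])  # nosniff
```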
On 17/09/13 14:01, Gabriel Wicke wrote:
On 09/16/2013 07:48 PM, Tim Starling wrote:
On 17/09/13 11:08, Gabriel Wicke wrote:
Tim mentions in https://www.mediawiki.org/wiki/Special:Code/MediaWiki/49833#c3561 that this only applied to IE3 and earlier, and IE4 respects the Content-type header. As the market share of IE <= 3 is probably non-existent we could probably blacklist it from logging in and content API access altogether.
This issue affects IE at least up to IE 6, possibly later, see bug 28235.
Thanks for the pointer! It is sad that IE6 (and likely IE7) is still haunting us. IE8+ is covered by the X-Content-Type-Options header.
It sounds like your Content-Disposition solution [1] should still work for IE6/7 where that header is not used otherwise. The existing users of that header all seem to be file-related. Did I miss any use in action handlers?
I'm assuming you can grep for Content-Disposition as well as I can. IIRC, the difficulty with Content-Disposition, in the context of a security patch, was the need to abstract handling of the header out of the various places that send it, so that it would be consistent and demonstrably secure. That would have made the security patch larger and more complex than it needed to be, which would have been a problem for backporters. That shouldn't be a concern for your feature.
-- Tim Starling
On 17/09/13 09:34, Brian Wolff wrote:
While I'm not particularly fond of this idea (probably because I'm stuck in my ways more than anything else), I do think that making en.wikipedia.org/foo an instant HTTP redirect instead of the "did you mean/redirecting in 5 seconds" message we currently have might make sense.
The technical situation has not changed since that meta refresh was introduced, and the same rationale still applies. See e.g.
February 2005: http://article.gmane.org/gmane.science.linguistics.wikipedia.technical/15711
August 2006: http://article.gmane.org/gmane.science.linguistics.wikipedia.technical/25605
-- Tim Starling