Hi,
while tinkering with a RESTful content API I was reminded of an old pet peeve of mine: The URLs we use in Wikimedia projects are relatively long and ugly. I believe that we now have the ability to clean this up if we want to.
It would be nice to
* drop the /wiki/ prefix: https://en.wikipedia.org/Foo instead of https://en.wikipedia.org/wiki/Foo
* use simple action URLs: https://en.wikipedia.org/Foo?action=history instead of https://en.wikipedia.org/w/index.php?title=Foo&action=history
The details of this proposal are discussed in the following RFC:
https://www.mediawiki.org/wiki/Requests_for_comment/Clean_up_URLs
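For concreteness, the /wiki/-less scheme could in principle be approximated with standard Apache mod_rewrite rules. This is only a hypothetical sketch, not the RFC's text; the /w/ script path and the exclusion conditions are assumptions:

```apache
RewriteEngine On
# Don't rewrite the script path or real files/directories
# (robots.txt, favicon.ico, etc.).
RewriteCond %{REQUEST_URI} !^/w/
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
# /Foo → index.php?title=Foo; QSA preserves ?action=history etc.
RewriteRule ^/?(.*)$ /w/index.php?title=$1 [L,QSA]
```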
I'm looking forward to your input!
Gabriel
On Mon, Sep 16, 2013 at 6:12 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
- drop the /wiki/ prefix https://en.wikipedia.org/Foo instead of https://en.wikipedia.org/wiki/Foo
Where would we put the API entry point? It can't be at https://en.wikipedia.org/w/api.php because there might be an article named "w/api.php".
- use simple action urls https://en.wikipedia.org/Foo?action=history instead of https://en.wikipedia.org/w/index.php?title=Foo&action=history
This already works.
-- Tyler Romeo
Stevens Institute of Technology, Class of 2016
Major in Computer Science
www.whizkidztech.com | tylerromeo@gmail.com
On 09/16/2013 03:21 PM, Tyler Romeo wrote:
On Mon, Sep 16, 2013 at 6:12 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
- drop the /wiki/ prefix https://en.wikipedia.org/Foo instead of https://en.wikipedia.org/wiki/Foo
Where would we put the API entry point? It can't be at https://en.wikipedia.org/w/api.php because there might be an article named "w/api.php".
There *might* be, in theory. In practice I doubt that there are any articles starting with 'w/'. To avoid future conflicts, we should probably prefix private paths with an underscore as titles cannot start with it (and REST APIs often use it for special resources).
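To illustrate why a leading underscore is safe: MediaWiki treats underscores as spaces during title normalization and strips leading/trailing whitespace, so no valid title can begin with "_". A rough sketch (not the real normalization code, which handles many more cases):

```python
def normalize_title(raw: str) -> str:
    """Toy sketch of MediaWiki title normalization: underscores become
    spaces, surrounding whitespace is stripped, and the first letter is
    uppercased. Hence "_w/api.php" can never be a page title."""
    text = raw.replace("_", " ").strip()
    return text[:1].upper() + text[1:] if text else text
```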
- use simple action urls https://en.wikipedia.org/Foo?action=history instead of https://en.wikipedia.org/w/index.php?title=Foo&action=history
This already works.
Both parts of the proposal have been working for a long time. The RFC is mainly about using the capability in Wikimedia projects.
Gabriel
On Mon, Sep 16, 2013 at 6:34 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
There *might* be, in theory. In practice I doubt that there are any articles starting with 'w/'. To avoid future conflicts, we should probably prefix private paths with an underscore as titles cannot start with it (and REST APIs often use it for special resources).
When talking about URI design and REST, this has nothing to do with functionality, but with organization and logical design. In a URI, the path is considered a hierarchical structure. It doesn't make sense for api.php to be a sub-resource of the wiki itself. Even some sort of underscore design wouldn't make sense, because you'd be implying that the _images/ resource is a sub-resource at the same level as a normal article.
-- Tyler Romeo
Stevens Institute of Technology, Class of 2016
Major in Computer Science
www.whizkidztech.com | tylerromeo@gmail.com
On Tue, Sep 17, 2013 at 12:34 AM, Gabriel Wicke gwicke@wikimedia.org wrote:
In practice I doubt that there are any articles starting with 'w/'.
Actually, there are. Looking at enwiktionary only, there are 10 pages starting with "w/". Some of those are redirects (e.g. "w/r/t"), but others are normal articles (e.g. "w/", "w/e").
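Checks like this can be done with the standard list=allpages API module and its apprefix parameter. A small sketch that only builds the query URL (the endpoint and parameters are the standard API; fetching and counting the results is left out):

```python
from urllib.parse import urlencode

def prefix_query_url(site: str, prefix: str) -> str:
    """Build a MediaWiki API query URL listing all pages whose titles
    start with `prefix`, using the standard list=allpages module."""
    params = {
        "action": "query",
        "list": "allpages",
        "apprefix": prefix,
        "aplimit": "max",
        "format": "json",
    }
    return f"https://{site}/w/api.php?" + urlencode(params)
```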
Petr Onderka [[en:User:Svick]]
On 09/16/2013 04:09 PM, Petr Onderka wrote:
On Tue, Sep 17, 2013 at 12:34 AM, Gabriel Wicke gwicke@wikimedia.org wrote:
In practice I doubt that there are any articles starting with 'w/'.
Actually, there are. Looking at enwiktionary only, there are 10 pages starting with "w/". Some of those are redirects (e.g. "w/r/t"), but others are normal articles (e.g. "w/", "w/e").
Ah, ok. That would make it hard to keep /w/api.php working. /_w/api.php would not suffer from that problem, but then current API users would break.
So I guess that kills the /wiki/ removal in the shorter term. Maybe we should, however, consider using something like /_w/ if we ever introduce a new API entry point, to avoid conflicts with valid article names in the future.
Gabriel
On Mon, Sep 16, 2013 at 7:51 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
Ah, ok. That would make it hard to keep /w/api.php working. /_w/api.php would not suffer from that problem, but then current API users would break.
So I guess that kills the /wiki/ removal in the shorter term. Maybe we should, however, consider using something like /_w/ if we ever introduce a new API entry point, to avoid conflicts with valid article names in the future.
I disagree. Having separate naming conventions for our entry points just makes things more inconsistent. Also, I don't think it's even necessary to get rid of the /wiki/ in the first place. It doesn't look messy at all.
-- Tyler Romeo
Stevens Institute of Technology, Class of 2016
Major in Computer Science
www.whizkidztech.com | tylerromeo@gmail.com
On Tue, Sep 17, 2013 at 8:34 AM, Gabriel Wicke gwicke@wikimedia.org wrote:
There *might* be, in theory. In practice I doubt that there are any articles starting with 'w/'. To avoid future conflicts, we should probably prefix private paths with an underscore as titles cannot start with it (and REST APIs often use it for special resources).
I bet people have said that about single-letter interwikis, but we do have quite a few "<single letter>:" page titles around. Having "<single letter>/" is not unbelievable.
On 17/09/13 10:24, K. Peachey wrote:
On Tue, Sep 17, 2013 at 8:34 AM, Gabriel Wicke gwicke@wikimedia.org wrote:
There *might* be, in theory. In practice I doubt that there are any articles starting with 'w/'. To avoid future conflicts, we should probably prefix private paths with an underscore as titles cannot start with it (and REST APIs often use it for special resources).
I bet people have said that about single-letter interwikis, but we do have quite a few "<single letter>:" page titles around. Having "<single letter>/" is not unbelievable.
I have found 2476 pages in English Wikipedia that start with '[something]/', including pages starting with '//'. None of them start with a lowercase letter though, for obvious reasons.
On 2013-09-17 2:29 AM, Nikola Smolenski wrote:
On 17/09/13 10:24, K. Peachey wrote:
On Tue, Sep 17, 2013 at 8:34 AM, Gabriel Wicke gwicke@wikimedia.org wrote:
There *might* be, in theory. In practice I doubt that there are any articles starting with 'w/'. To avoid future conflicts, we should probably prefix private paths with an underscore as titles cannot start with it (and REST APIs often use it for special resources).
I bet people have said that about single-letter interwikis, but we do have quite a few "<single letter>:" page titles around. Having "<single letter>/" is not unbelievable.
I have found 2476 pages in English Wikipedia that start with '[something]/', including pages starting with '//'. None of them start with a lowercase letter though, for obvious reasons.
The problem with that query is you're searching Wikipedia. Try Wiktionary instead. I found 5 just on the first letter I tested: https://en.wiktionary.org/wiki/Special:PrefixIndex/a/
Also, pages prefixed with "<single letter>/" aren't the only thing that creates conflicts. As far as standard rewrite rules and webservers are concerned, a directory at /a/ and /a are the same thing. See how https://en.wikipedia.org/w is not a 404 pointing to [[w]] like https://en.wikipedia.org/a is, but instead is the same as /w/ and hence /w/index.php. So really, any single-letter article on a root-pathed wiki conflicts with any single-letter root directory. ;) And Wikipedia has a redirect like that for every single letter of the Latin alphabet. (Actually, forget the Latin alphabet; they've practically got most of Unicode there.)
Side topic: https://en.wiktionary.org/w/r/t is messed up: "To check for "r/t" on Wikipedia, see: https://en.wikipedia.org/wiki/r/t"
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://danielfriesen.name/]
On 17/09/13 11:59, Daniel Friesen wrote:
On 2013-09-17 2:29 AM, Nikola Smolenski wrote:
I have found 2476 pages in English Wikipedia that start with '[something]/', including pages starting with '//'. None of them start with a lowercase letter though, for obvious reasons.
The problem with that query is you're searching Wikipedia. Try Wiktionary instead. I found 5 just on the first letter I tested: https://en.wiktionary.org/wiki/Special:PrefixIndex/a/
There are 124, of which 63 start with a lowercase letter.
On 09/17/2013 05:59 AM, Daniel Friesen wrote:
Side topic: https://en.wiktionary.org/w/r/t is messed up: "To check for "r/t" on Wikipedia, see: https://en.wikipedia.org/wiki/r/t"
Good catch, filed: https://bugzilla.wikimedia.org/show_bug.cgi?id=54357
Matt Flaschen
Am 17.09.2013 00:34, schrieb Gabriel Wicke:
There *might* be, in theory. In practice I doubt that there are any articles starting with 'w/'.
I count 10 on en.wiktionary.org:
https://en.wiktionary.org/w/index.php?title=Special%3APrefixIndex&prefix...
To avoid future conflicts, we should probably prefix private paths with an underscore as titles cannot start with it (and REST APIs often use it for special resources).
That would be better.
But still, I think this is a bad idea. Essentially, putting articles at the root of the domain means hogging the domain as a namespace. Depending on what you want to do with your wiki, this is not a good idea.
For instance, Wikidata uses the /entity/ path for URIs representing things, while the documents under /wiki/ are descriptions of these things. If page content was located at the root, we'd have nasty namespace pollution.
Basically: page content is only one of the things a wiki may serve. "Internal" resources like CSS are another. But there may be much more, like structured data. It's good to use prefixes to keep these apart.
-- daniel
On 2013-09-17 2:48 AM, Daniel Kinzler wrote:
To avoid future conflicts, we should probably prefix private paths with an underscore as titles cannot start with it (and REST APIs often use it for special resources).
That would be better.
But still, I think this is a bad idea. Essentially, putting articles at the root of the domain means hogging the domain as a namespace. Depending on what you want to do with your wiki, this is not a good idea.
For instance, Wikidata uses the /entity/ path for URIs representing things, while the documents under /wiki/ are descriptions of these things. If page content was located at the root, we'd have nasty namespace pollution.
Basically: page content is only one of the things a wiki may serve. "Internal" resources like CSS are another. But there may be much more, like structured data. It's good to use prefixes to keep these apart.
-- daniel
+1
We've got others for content-related things too besides ones for internal resources and structured data.
e.g.: https://test2.wikipedia.org/s/85
((And I'll try to resist starting a rant about the knockoff "REST" which is a partial premise here))
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://danielfriesen.name/]
On 09/17/2013 02:48 AM, Daniel Kinzler wrote:
Am 17.09.2013 00:34, schrieb Gabriel Wicke:
There *might* be, in theory. In practice I doubt that there are any articles starting with 'w/'.
I count 10 on en.wiktionary.org:
https://en.wiktionary.org/w/index.php?title=Special%3APrefixIndex&prefix...
The good news is that none of them is /w/{index,api,load}.php ;)
To avoid future conflicts, we should probably prefix private paths with an underscore as titles cannot start with it (and REST APIs often use it for special resources).
That would be better.
But still, I think this is a bad idea. Essentially, putting articles at the root of the domain means hogging the domain as a namespace. Depending on what you want to do with your wiki, this is not a good idea.
I agree that it does not make sense to place the wiki at the root level if you are running (or plan to run) other services on the domain. On Wikipedia, the wiki is the primary use case. Optimizing for the common use case can be a good idea.
Basically: page content is only one of the things a wiki may serve. "Internal" resources like CSS are another. But there may be much more, like structured data. It's good to use prefixes to keep these apart.
For different representations of the same resource there is also much to be said for suffixes, even if some of those representations are not visual. Additionally, we have namespaces as a prefix mechanism within a wiki. There will surely be cases where leaving the wiki makes sense, but I am hesitant to discard the flat wiki namespace all too quickly.
Gabriel
On Tue, 17 Sep 2013, at 7:51, Tyler Romeo wrote:
On Mon, Sep 16, 2013 at 6:12 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
- use simple action urls https://en.wikipedia.org/Foo?action=history instead of https://en.wikipedia.org/w/index.php?title=Foo&action=history
This already works.
I would be concerned about this feature working properly in wikilinks. [[Main Page?action=history|Foo]] makes a broken red link.
On 09/27/2013 06:03 AM, Gryllida wrote:
I would be concerned about this feature working properly in wikilinks. [[Main Page?action=history|Foo]] makes a broken red link.
So does:
[[/w/index.php?title=Main Page|Foo]]
Neither would be expected to work. Anything to the left of the pipe in your example is considered a page title. I don't think anything about wikilink parsing (or any parsing) is proposed to change.
Matt Flaschen
On Mon, Sep 16, 2013 at 3:12 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
Hi,
while tinkering with a RESTful content API I was reminded of an old pet peeve of mine: The URLs we use in Wikimedia projects are relatively long and ugly. I believe that we now have the ability to clean this up if we want to.
It would be nice to
drop the /wiki/ prefix https://en.wikipedia.org/Foo instead of https://en.wikipedia.org/wiki/Foo
use simple action urls https://en.wikipedia.org/Foo?action=history instead of https://en.wikipedia.org/w/index.php?title=Foo&action=history
The details of this proposal are discussed in the following RFC:
https://www.mediawiki.org/wiki/Requests_for_comment/Clean_up_URLs
I'm looking forward to your input!
https://www.mediawiki.org/wiki/Manual:Short_URL#URL_like_-_example.com.2FPag...
"*Warning:* this method may create an unstable URL structure and leave some page names unusable on your wiki. See Manual:Wiki in site root directory <https://www.mediawiki.org/wiki/Manual:Wiki_in_site_root_directory>. Please see the article Cool URIs don't change <http://www.w3.org/Provider/Style/URI> and take a few minutes to devise a stable URL structure for your web site before hopping willy-nilly into rewrites into the URL root."
- Ryan
On 09/16/2013 03:25 PM, Ryan Lane wrote:
https://www.mediawiki.org/wiki/Manual:Short_URL#URL_like_-_example.com.2FPag...
"*Warning:* this method may create an unstable URL structure and leave some page names unusable on your wiki. See Manual:Wiki in site root directory <https://www.mediawiki.org/wiki/Manual:Wiki_in_site_root_directory>. Please see the article Cool URIs don't change <http://www.w3.org/Provider/Style/URI> and take a few minutes to devise a stable URL structure for your web site before hopping willy-nilly into rewrites into the URL root."
That is a very vague warning. So far I have lower-case 'favicon.ico', 'robots.txt' and 'w/' as potential conflicts. Do you see any others?
In general, I see removing /wiki/ as the less important part of the RFC. Using sub-resources rather than the random switch to /w/index.php is more important for caching (promotes deterministic URLs) and does not seem to involve similar trade-offs.
Gabriel
On Mon, Sep 16, 2013 at 4:41 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
On 09/16/2013 03:25 PM, Ryan Lane wrote:
https://www.mediawiki.org/wiki/Manual:Short_URL#URL_like_-_example.com.2FPag...
"*Warning:* this method may create an unstable URL structure and leave some page names unusable on your wiki. See Manual:Wiki in site root directory <https://www.mediawiki.org/wiki/Manual:Wiki_in_site_root_directory>. Please see the article Cool URIs don't change <http://www.w3.org/Provider/Style/URI> and take a few minutes to devise a stable URL structure for your web site before hopping willy-nilly into rewrites into the URL root."
That is a very vague warning. So far I have lower-case 'favicon.ico', 'robots.txt' and 'w/' as potential conflicts. Do you see any others?
Any of the entry points? Any new entry point? Anything we ever want to put into the root?
- Ryan
On 09/16/2013 04:42 PM, Ryan Lane wrote:
On Mon, Sep 16, 2013 at 4:41 PM, Gabriel Wicke <gwicke@wikimedia.org mailto:gwicke@wikimedia.org> wrote:
That is a very vague warning. So far I have lower-case 'favicon.ico', 'robots.txt' and 'w/' as potential conflicts. Do you see any others?
Any of the entry points? Any new entry point? Anything we ever want to put into the root?
We should be able to avoid most conflicts by picking prefixed entry points. However, as we can't drop the clashing /w/api.php any time soon, I have removed the /wiki/ part from the RFC:
https://www.mediawiki.org/wiki/Requests_for_comment/Clean_up_URLs
So now only the conversion from
/w/index.php?title=foo&action=history to /foo?action=history
is under discussion.
Gabriel
On 2013-09-16 7:09 PM, Gabriel Wicke wrote:
Any of the entry points? Any new entry point? Anything we ever want to put into the root?
We should be able to avoid most conflicts by picking prefixed entry points. However, as we can't drop the clashing /w/api.php any time soon, I have removed the /wiki/ part from the RFC:
https://www.mediawiki.org/wiki/Requests_for_comment/Clean_up_URLs
So now only the conversion from
/w/index.php?title=foo&action=history to /foo?action=history
is under discussion.
Gabriel
Has the practice of disallowing /w/ or /index.php inside robots.txt, to force search engines to completely ignore search, edit pages, exponential pagination, etc., been considered?
Btw, side note on root urls. We still have an open bug allowing attacks on wikis using root paths: https://bugzilla.wikimedia.org/show_bug.cgi?id=38048
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://danielfriesen.name/]
On 09/16/2013 07:24 PM, Daniel Friesen wrote:
On 2013-09-16 7:09 PM, Gabriel Wicke wrote:
Any of the entry points? Any new entry point? Anything we ever want to put into the root?
We should be able to avoid most conflicts by picking prefixed entry points. However, as we can't drop the clashing /w/api.php any time soon, I have removed the /wiki/ part from the RFC:
https://www.mediawiki.org/wiki/Requests_for_comment/Clean_up_URLs
So now only the conversion from
/w/index.php?title=foo&action=history to /foo?action=history
is under discussion.
Gabriel
Has the practice of disallowing /w/ or /index.php inside robots.txt, to force search engines to completely ignore search, edit pages, exponential pagination, etc., been considered?
See https://www.mediawiki.org/wiki/Requests_for_comment/Clean_up_URLs#Migration
Btw, side note on root urls. We still have an open bug allowing attacks on wikis using root paths: https://bugzilla.wikimedia.org/show_bug.cgi?i
That looks like a fixable bug. In Parsoid, for example, all internal links are relative, which avoids the protocol-relative URL issue you reported there.
Gabriel
I would suggest taking a look at the number of 404s caused by people trying to access pages without the wiki prefix... This would be interesting data to go alongside this interesting proposal...
On 16 Sep 2013 20:01, "Gabriel Wicke" gwicke@wikimedia.org wrote:
On 09/16/2013 07:24 PM, Daniel Friesen wrote:
On 2013-09-16 7:09 PM, Gabriel Wicke wrote:
Any of the entry points? Any new entry point? Anything we ever want to put into the root?
We should be able to avoid most conflicts by picking prefixed entry points. However, as we can't drop the clashing /w/api.php any time soon, I have removed the /wiki/ part from the RFC:
https://www.mediawiki.org/wiki/Requests_for_comment/Clean_up_URLs
So now only the conversion from
/w/index.php?title=foo&action=history to /foo?action=history
is under discussion.
Gabriel
Has the practice of disallowing /w/ or /index.php inside robots.txt, to force search engines to completely ignore search, edit pages, exponential pagination, etc., been considered?
See https://www.mediawiki.org/wiki/Requests_for_comment/Clean_up_URLs#Migration
Btw, side note on root urls. We still have an open bug allowing attacks on wikis using root paths: https://bugzilla.wikimedia.org/show_bug.cgi?i
That looks like a fixable bug. In Parsoid, for example, all internal links are relative, which avoids the protocol-relative URL issue you reported there.
Gabriel
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On 17/09/13 13:59, Jon Robson wrote:
I would suggest taking a look at the number of 404s caused by people trying to access pages without the wiki prefix.... This would be interesting data to go alongside this interesting proposal...
There are lots of different sorts of 404s, so it's necessary to do some filtering. For example:
* double-slashes, due to bug 52253
* sitemap.xml
* Apple touch icons
* bullet.gif in various directories
* vulnerability scanning, e.g. xmlrpc.php
* BlueCoat verify/notify, as described in http://www.webmasterworld.com/search_engine_spiders/3859463.htm
* Serial numbers like http://en.wikipedia.org/B008NAYASM
I filtered out everything with a dot or slash in the prospective article title, as well as the BlueCoat URLs and the UAs responsible for serial number URLs. To simplify analysis, I took log lines from the English Wikipedia only.
Most of the remaining log entries were search engine crawlers, so I took those out too.
The result was 149 log entries at a 1/1000 sample rate, for the week of September 8-14, implying a request rate of about 639,000 per month. This is about 0.006% of the English Wikipedia's page view rate.
The 149 URLs are at http://paste.tstarling.com/p/uhtFqg.html
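The filtering steps above can be sketched in a few lines of Python. This is a hypothetical reconstruction, not Tim's actual script; the log format (method, path, status, quoted user agent) and the UA keywords are assumptions:

```python
import re

SAMPLE_RATE = 1000  # the logs are sampled at 1/1000

def candidate_article_404s(lines):
    """Keep only 404s whose path could plausibly be an article title:
    no dot or slash in the title, and not from crawler-like UAs."""
    out = []
    for line in lines:
        m = re.match(r'\S+ /([^ ?]*) 404 "([^"]*)"', line)
        if not m:
            continue
        title, ua = m.groups()
        if "." in title or "/" in title:   # files, double-slashes, etc.
            continue
        if "bluecoat" in ua.lower() or "bot" in ua.lower():
            continue
        out.append(title)
    return out

# 149 sampled hits over one week scale to roughly
# 149 * 1000 * 30 / 7 ≈ 639,000 requests per month, as stated above.
```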
-- Tim Starling
Note also that zhwiki (and others?) profitably uses the first part of the path to do variant selection.
https://zh.wikipedia.org/wiki/User:Cscott uses the wiki default variant (if logged in, the variant from the user's preferences).
https://zh.wikipedia.org/zh-hans/User:Cscott, https://zh.wikipedia.org/zh-hk/User:Cscott, etc. use the specified variant.
I have a dream to eventually enable https://en.wikipedia.org/en-gb/Football in a similar fashion. --scott
Thanks, Tim, for running those data. That seems to suggest the URL structure works in most cases.
On Wed, Sep 18, 2013 at 12:07 AM, Tim Starling tstarling@wikimedia.org wrote:
On 17/09/13 13:59, Jon Robson wrote:
I would suggest taking a look at the number of 404s caused by people trying to access pages without the wiki prefix.... This would be interesting data to go alongside this interesting proposal...
There are lots of different sorts of 404s, so it's necessary to do some filtering. For example:
- double-slashes, due to bug 52253
- sitemap.xml
- Apple touch icons
- bullet.gif in various directories
- vulnerability scanning, e.g. xmlrpc.php
- BlueCoat verify/notify, as described in
http://www.webmasterworld.com/search_engine_spiders/3859463.htm
- Serial numbers like http://en.wikipedia.org/B008NAYASM .
I filtered out everything with a dot or slash in the prospective article title, as well as the BlueCoat URLs and the UAs responsible for serial number URLs. To simplify analysis, I took log lines from the English Wikipedia only.
Most of the remaining log entries were search engine crawlers, so I took those out too.
The result was 149 log entries at a 1/1000 sample rate, for the week of September 8-14, implying a request rate of about 639,000 per month. This is about 0.006% of the English Wikipedia's page view rate.
The 149 URLs are at http://paste.tstarling.com/p/uhtFqg.html
-- Tim Starling
On 20/09/13 03:04, Jon Robson wrote:
Thanks, Tim, for running those data. That seems to suggest the URL structure works in most cases.
I think the request rate for actual articles in the root is very, very low. And if you look at the paste I gave earlier:
http://paste.tstarling.com/p/uhtFqg.html
there's reason to think that the amount of traffic that comes from naive readers typing URLs and expecting an article is much smaller than even 149k per week. A naive user would be more likely to type a URL starting with a lower-case letter, and if you take those entries, and filter out the obvious client bugs and typos, that leaves only 39 log entries. If we filter out some more log entries that are unlikely search terms for Wikipedia articles ("enregistrement-audio-musique", "is", "unlimited_data_plan", etc.), that leaves maybe 30. http://paste.tstarling.com/p/KWuHif.html
Of these, only 12 actually correspond to Wikipedia articles or redirects:
abolition, addicting_games, apple_inc, carnaval, dreamshade, facade, girls, insidious, karthik, online_coupons, snam, walkabout
So the number of naive readers actually helped by our 404 Refresh to /wiki/ is probably closer to 12k per week than 149k per week.
Personally, I think the refresh is annoying, since it makes it much more difficult to correct typos in manually-typed URLs. If you actually meant to type some non-article URL like a CSS resource, and make a typo which causes it to hit the refresh, the URL you typed is erased from your browser's address bar and history, making correction of the typo much more difficult. Maybe we should just include a link to the search page, rather than redirect or refresh.
-- Tim Starling
Tim Starling wrote:
Personally, I think the refresh is annoying, since it makes it much more difficult to correct typos in manually-typed URLs. If you actually meant to type some non-article URL like a CSS resource, and make a typo which causes it to hit the refresh, the URL you typed is erased from your browser's address bar and history, making correction of the typo much more difficult. Maybe we should just include a link to the search page, rather than redirect or refresh.
Mark Ryan redesigned the 404 page in 2009 and specifically removed the meta refresh tag (cf. https://bugs.wikimedia.org/17316#c0).
The redesigned page eventually got deployed, but the client-side refresh very sneakily moved from the HTML output to a Refresh header (cf. https://bugs.wikimedia.org/35052#c0).
Neither bug is resolved, if anyone is interested in helping out. :-)
MZMcBride
On 19 Sep 2013 18:23, "Tim Starling" tstarling@wikimedia.org wrote:
On 20/09/13 03:04, Jon Robson wrote:
Thanks, Tim, for running those data. That seems to suggest the URL structure works in most cases.
I think the request rate for actual articles in the root is very, very low.
I agree. Sorry, I guess my message wasn't so clear. I meant the "existing" URL structure :)
And if you look at the paste I gave earlier:
http://paste.tstarling.com/p/uhtFqg.html
there's reason to think that the amount of traffic that comes from naive readers typing URLs and expecting an article is much smaller than even 149k per week. A naive user would be more likely to type a URL starting with a lower-case letter, and if you take those entries, and filter out the obvious client bugs and typos, that leaves only 39 log entries. If we filter out some more log entries that are unlikely search terms for Wikipedia articles ("enregistrement-audio-musique", "is", "unlimited_data_plan", etc.), that leaves maybe 30. http://paste.tstarling.com/p/KWuHif.html
Of these, only 12 actually correspond to Wikipedia articles or redirects:
abolition, addicting_games, apple_inc, carnaval, dreamshade, facade, girls, insidious, karthik, online_coupons, snam, walkabout
So the number of naive readers actually helped by our 404 Refresh to /wiki/ is probably closer to 12k per week than 149k per week.
Personally, I think the refresh is annoying, since it makes it much more difficult to correct typos in manually-typed URLs. If you actually meant to type some non-article URL like a CSS resource, and make a typo which causes it to hit the refresh, the URL you typed is erased from your browser's address bar and history, making correction of the typo much more difficult. Maybe we should just include a link to the search page, rather than redirect or refresh.
-- Tim Starling
On 09/19/2013 10:04 AM, Jon Robson wrote:
Thanks, Tim, for running those data. That seems to suggest the URL structure works in most cases.
It certainly confirms that search engines link to working links, and that users typing URLs manually are rare and (eventually) learn to prefix /wiki/. I am not that convinced that the current number of 404s says much about the user-friendliness or aesthetics of different URL schemes, but that is beside the point (and subjective).
I see /w/index.php?title=.. as the more important clean-up, which is why the RFC is only about that aspect.
Gabriel
On 2013-09-16 8:01 PM, Gabriel Wicke wrote:
On 09/16/2013 07:24 PM, Daniel Friesen wrote:
On 2013-09-16 7:09 PM, Gabriel Wicke wrote:
Any of the entry points? Any new entry point? Anything we ever want to put into the root?
We should be able to avoid most conflicts by picking prefixed entry points. However, as we can't drop the clashing /w/api.php any time soon, I have removed the /wiki/ part from the RFC:
https://www.mediawiki.org/wiki/Requests_for_comment/Clean_up_URLs
So now only the conversion from
/w/index.php?title=foo&action=history to /foo?action=history
is under discussion.
Gabriel
Has the practice of disallowing /w/ or /index.php inside robots.txt, to force search engines to completely ignore search, edit pages, exponential pagination, etc., been considered?
See https://www.mediawiki.org/wiki/Requests_for_comment/Clean_up_URLs#Migration
Ok. Though even assuming the * and Allow: non-standard features are supported by all the bots we want to target, I actually don't like the idea of blacklisting /wiki/*? in this way.
I don't think that every URL with a query in it qualifies as something we want to blacklist from search engines. There are plenty that do, but sometimes there is content served with a query that it could be a good idea to index.
For example, the non-first pages of long categories and Special:Allpages' pagination. The latter has robots=noindex (though I think we may want to reconsider that), but the former is not noindexed and, with the introduction of rel="next", etc., would be pretty reasonable to index, yet is currently blacklisted by robots.txt. Additionally, while we normally want to noindex edit pages, this isn't true of redlinks in every case. Take redlinked category links, for example. These link to an action=edit&redlink=1 URL which, for a search engine, would then redirect back to the pretty URL for the category. But because of robots.txt this link is masked, because the intermediate redirect cannot be read by the search engine.
The idea I had to fix that naturally was to make MediaWiki aware of this and, whether by a new routing system or simply by filters for specific simple queries, make it output /wiki/title?query URLs for those cases where it's a query we would want indexed, and leave robots-blacklisted stuff under /w/ (though I did also consider a separate short URL path like /w/page/$1 to make internal/robots-blacklisted URLs pretty). However, adding Disallow: /wiki/*? to robots.txt would preclude the ability to do that.
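That routing idea could be sketched as a simple whitelist of indexable query parameters. Everything here is hypothetical: the parameter names and the function are illustration only, not an actual MediaWiki feature:

```python
from urllib.parse import urlencode

# Hypothetical whitelist of query parameters worth indexing
# (e.g. category pagination); all other queries stay under the
# robots-blacklisted /w/ path.
INDEXABLE_PARAMS = {"pagefrom", "pageuntil"}

def url_for(title: str, query: dict) -> str:
    """Emit a pretty /wiki/ URL when every query parameter is
    whitelisted; otherwise fall back to /w/index.php."""
    if set(query) <= INDEXABLE_PARAMS:
        suffix = f"?{urlencode(query)}" if query else ""
        return f"/wiki/{title}{suffix}"
    return f"/w/index.php?{urlencode({'title': title, **query})}"
```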
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://danielfriesen.name/]
On Tue, Sep 17, 2013 at 2:09 AM, Gabriel Wicke gwicke@wikimedia.org wrote:
So now only the conversion from
/w/index.php?title=foo&action=history to /foo?action=history
Do you mean:
to /wiki/foo?action=history
?
is under discussion.
See also https://gerrit.wikimedia.org/r/51595 and RT# 864 (aka https://bugzilla.wikimedia.org/21919 ) which all seem to prefer docroot verification rather than DNS.
-Jeremy
On 09/16/2013 08:48 PM, Jeremy Baron wrote:
On Tue, Sep 17, 2013 at 2:09 AM, Gabriel Wicke gwicke@wikimedia.org wrote:
/w/index.php?title=foo&action=history to /foo?action=history
Do you mean:
to /wiki/foo?action=history
Yes, sorry. The RFC had it right, in case you read that ;)
Gabriel
On Mon, Sep 16, 2013 at 7:41 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
Using sub-resources rather than the random switch to /w/index.php is more important for caching (promotes deterministic URLs) and does not seem to involve similar trade-offs.
Note that "promotes deterministic URLs" applies only to cases where only one parameter other than 'title' is provided to index.php (usually this parameter is 'action'). If the URL has more than one parameter other than 'title', you're still out of luck.
"But you can turn on $wgActionPaths to remove 'action' from the query string too!" you say? But then you're still stuck if the URL has two parameters other than 'action' and 'title'. Such as "offset" and "limit", for example.
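The "canonical ordering" idea that comes up later in this thread can be sketched as a URL normalizer (a sketch, not anything MediaWiki or Varnish actually does): sort the query string so that any parameter order maps to a single deterministic cache key.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def canonical_url(url):
    """Normalize a URL into a deterministic form usable as a cache key:
    sort query parameters by name so ?offset=20&limit=50 and
    ?limit=50&offset=20 map to the same cache entry."""
    parts = urlsplit(url)
    pairs = sorted(parse_qsl(parts.query))
    return urlunsplit(parts._replace(query=urlencode(pairs)))

a = canonical_url("/wiki/Foo?offset=20&limit=50")
b = canonical_url("/wiki/Foo?limit=50&offset=20")
print(a == b)  # True
```

This is exactly the normalization a cache frontend would have to perform (or a redirect would have to enforce) if multi-parameter URLs are to stay cacheable.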
On 09/17/2013 08:40 AM, Brad Jorsch (Anomie) wrote:
On Mon, Sep 16, 2013 at 7:41 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
Using sub-resources rather than the random switch to /w/index.php is more important for caching (promotes deterministic URLs) and does not seem to involve similar trade-offs.
Note that "promotes deterministic URLs" applies only to cases where only one parameter other than 'title' is provided to index.php (usually this parameter is 'action'). If the URL has more than one parameter other than 'title', you're still out of luck.
An end point that wants to be cacheable should only use one query parameter, which might well be a path. Hypothetical examples:
http://wiki.org/wiki/Foo?r=latest/html http://wiki.org/wiki/Foo?r=123456/wikitext
An alternative solution would be to specify a list of required query parameters and a canonical ordering, and to reject (or redirect) requests not conforming to this spec. The problem I see with this approach is that many client libraries don't provide control over the order of query parameters, which would make such an interface hard to use.
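The single-parameter scheme in the hypothetical examples above could be parsed like this. Note the 'r' parameter name and the revision/format split are taken from those examples, which Gabriel explicitly labels hypothetical; this is not a real endpoint.

```python
from urllib.parse import urlsplit, parse_qs

def parse_subresource(url):
    """Split a ?r=<revision>/<format> query into its two parts.
    'latest' is a symbolic revision; anything else is expected
    to be a numeric revision id."""
    query = parse_qs(urlsplit(url).query)
    rev, _, fmt = query["r"][0].partition("/")
    return rev, fmt

print(parse_subresource("http://wiki.org/wiki/Foo?r=latest/html"))     # ('latest', 'html')
print(parse_subresource("http://wiki.org/wiki/Foo?r=123456/wikitext")) # ('123456', 'wikitext')
```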
Gabriel
On Tue, Sep 17, 2013 at 12:27 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
An end point that wants to be cacheable should only use one query parameter, which might well be a path. Hypothetical examples:
http://wiki.org/wiki/Foo?r=latest/html http://wiki.org/wiki/Foo?r=123456/wikitext
So now you're cramming multiple parameters, ordered, into one parameter? Why not go all the way and do http://wiki.org/wiki/123456/wikitext/Foo then?
But IMO, that's ridiculous.
An alternative solution would be to specify a list of required query parameters and a canonical ordering, and to reject (or redirect) requests not conforming to this spec.
"reject" is even more ridiculous. "redirect" is less ridiculous, but is strange and will increase latency and number-of-requests for clients that don't know the magic order.
What is the actual benefit we're trying to get here? All I've gotten so far along those lines is "improve cacheability", but it doesn't seem to have been established whether caching even needs improving in this area.
On 09/17/2013 11:24 AM, Brad Jorsch (Anomie) wrote:
On Tue, Sep 17, 2013 at 12:27 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
An end point that wants to be cacheable should only use one query parameter, which might well be a path. Hypothetical examples:
http://wiki.org/wiki/Foo?r=latest/html http://wiki.org/wiki/Foo?r=123456/wikitext
So now you're cramming multiple parameters, ordered, into one parameter? Why not go all the way and do http://wiki.org/wiki/123456/wikitext/Foo then?
I consider the article to be the main resource we are interested in, with a revision and then a specific part (format) of that revision as a sub-resource. As our titles can contain slashes we need to delimit the main resource from the sub-resource part. A single query parameter that specifies the sub-resource path achieves that.
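To make the slash problem concrete: in a pure path scheme, /wiki/Foo/bar/latest/html could mean the article "Foo" with sub-resource "bar/latest/html" or the article "Foo/bar" with sub-resource "latest/html", and there is no way to tell. With the query form, the '?' unambiguously delimits title from sub-resource. A sketch (URLs hypothetical, as above):

```python
from urllib.parse import urlsplit, parse_qs

def split_title(url):
    """Separate the (possibly slash-containing) title from the
    sub-resource path; the '?' makes the split unambiguous."""
    parts = urlsplit(url)
    title = parts.path[len("/wiki/"):]
    sub = parse_qs(parts.query).get("r", [None])[0]
    return title, sub

print(split_title("http://wiki.org/wiki/Foo/bar?r=latest/html"))  # ('Foo/bar', 'latest/html')
```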
What is the actual benefit we're trying to get here? All I've gotten so far along those lines is "improve cacheability", but it doesn't seem to have been established whether caching even needs improving in this area.
A heavily-used content API will perform better and use fewer resources when it is cacheable. This will become more important over time, so I believe it is worth spending a small amount of effort on now.
Gabriel
Gabriel Wicke wrote:
A heavily-used content API will perform better and use fewer resources when it is cacheable. This will become more important over time, so I believe it is worth spending a small amount of effort on now.
Sure, I think everyone agrees that a heavily used Web resource will perform better with caching. I'm just not sure futzing around with path names is the best way to try to ensure sustainable cacheability.
Is there a breakdown of what in a typical MediaWiki API request takes the most time or uses the most resources (i.e., profiling a local request)? I imagine there are multiple caching opportunities at other layers that don't rely on path name, but it's difficult to say where you might see the most gains without further data.
MZMcBride
On Mon, Sep 16, 2013 at 3:12 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
Hi,
while tinkering with a RESTful content API I was reminded of an old pet peeve of mine: The URLs we use in Wikimedia projects are relatively long and ugly. I believe that we now have the ability to clean this up if we want to.
It would be nice to
drop the /wiki/ prefix https://en.wikipedia.org/Foo instead of https://en.wikipedia.org/wiki/Foo
use simple action urls https://en.wikipedia.org/Foo?action=history instead of https://en.wikipedia.org/w/index.php?title=Foo&action=history
The details of this proposal are discussed in the following RFC:
https://www.mediawiki.org/wiki/Requests_for_comment/Clean_up_URLs
I'm looking forward to your input!
Even better would be getting rid of action urls entirely.
-Chad
Chad wrote:
On Mon, Sep 16, 2013 at 3:12 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
It would be nice to
drop the /wiki/ prefix https://en.wikipedia.org/Foo instead of https://en.wikipedia.org/wiki/Foo
use simple action urls https://en.wikipedia.org/Foo?action=history instead of https://en.wikipedia.org/w/index.php?title=Foo&action=history
The details of this proposal are discussed in the following RFC:
https://www.mediawiki.org/wiki/Requests_for_comment/Clean_up_URLs
Even better would be getting rid of action urls entirely.
In favor of what? Special page URLs?
A variant on https://en.wikipedia.org/Foo?action=history is https://en.wikipedia.org/history/Foo (using $wgActionPaths).
The RFC currently seems to gloss over what problem is attempting to be solved here and what benefits a new URL structure might bring. I'd like to see a clearer statement of a problem and benefits to a switch, taking into account, for example, the overarching goal of making URLs fully localized.
MZMcBride
----- Original Message -----
From: "MZMcBride" z@mzmcbride.com
The RFC currently seems to gloss over what problem is attempting to be solved here and what benefits a new URL structure might bring. I'd like to see a clearer statement of a problem and benefits to a switch, taking into account, for example, the overarching goal of making URLs fully localized.
Concur, especially in light of the fact that *this does not permit you to break the old URLs*. They are everywhere, *and they must continue to work forever*.
I hope I don't even have to justify why.
Cheers, -- jra
On Mon, Sep 16, 2013 at 3:36 PM, MZMcBride z@mzmcbride.com wrote:
The RFC currently seems to gloss over what problem is attempting to be solved here and what benefits a new URL structure might bring. I'd like to see a clearer statement of a problem and benefits to a switch, taking into account, for example, the overarching goal of making URLs fully localized.
How about the following?
Our current URL structure is extremely obtuse for non-technical users, and generally defies their expectations. To most people, en.wikipedia.org/Dog or even wikipedia.org/Dog should work just fine, not produce a 404.
On Mon, Sep 16, 2013 at 8:20 PM, Steven Walling steven.walling@gmail.com wrote:
Our current URL structure is extremely obtuse for non-technical users, and generally defies their expectations. To most people, en.wikipedia.org/Dog or even wikipedia.org/Dog should work just fine, not produce a 404.
To be fair, both of those links redirect to the proper URL anyway. It wouldn't be hard to just change that from 404 to a redirect. Nonetheless the canonical URI should still be /wiki/Article_title.
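The 404-to-redirect change could be as small as the following sketch (not actual MediaWiki routing code; the reserved-prefix list is an assumption): unknown top-level paths get a 301 to the canonical /wiki/ form.

```python
from urllib.parse import quote

# Hypothetical list of path prefixes that are never article titles.
RESERVED_PREFIXES = ("/wiki/", "/w/", "/static/")

def canonicalize(path):
    """Map a bare /Foo request to its 301 redirect target /wiki/Foo;
    return None for already-canonical or reserved paths."""
    if path == "/" or path.startswith(RESERVED_PREFIXES):
        return None  # no redirect needed
    return "/wiki/" + quote(path.lstrip("/"))

print(canonicalize("/Dog"))       # '/wiki/Dog'
print(canonicalize("/wiki/Dog"))  # None
```

Keeping /wiki/Article_title canonical (as Tyler suggests) means the redirect is purely a convenience layer; nothing needs to change in what pages advertise as their own URL.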
-- Tyler Romeo Stevens Institute of Technology, Class of 2016 Major in Computer Science www.whizkidztech.com | tylerromeo@gmail.com
----- Original Message -----
From: "Steven Walling" steven.walling@gmail.com
How about the following?
Our current URL structure is extremely obtuse for non-technical users, and generally defies their expectations. To most people, en.wikipedia.org/Dogor even wikipedia.org/Dog should work just fine, not produce a 404.
Any collection of "most people" large enough to justify a change like this is, I assert, too technically unsophisticated to be attempting to construct URLs by hand (rather than by copy/pasta).
Do you propose to "fix" also the capitalization and spacing and URL-escaping rules, which are much more complicated than that?
My considered reaction, now after several hours, is that this is fixing a problem which is not really broken for *anyone* except those who are OCD about hiding the "tech-y" look in the Location box. No offense. :-)
Cheers, -- jra
On 2013-09-16 7:12 PM, "Gabriel Wicke" gwicke@wikimedia.org wrote:
Hi,
while tinkering with a RESTful content API I was reminded of an old pet peeve of mine: The URLs we use in Wikimedia projects are relatively long and ugly. I believe that we now have the ability to clean this up if we want to.
It would be nice to
drop the /wiki/ prefix https://en.wikipedia.org/Foo instead of https://en.wikipedia.org/wiki/Foo
use simple action urls https://en.wikipedia.org/Foo?action=history instead of https://en.wikipedia.org/w/index.php?title=Foo&action=history
The details of this proposal are discussed in the following RFC:
I'm looking forward to your input!
Gabriel
While I'm not particularly fond of this idea (probably because I'm stuck in my ways more than anything else), I do think that making en.wikipedia.org/foo an instant HTTP redirect instead of the "did you mean/redirecting in 5 seconds" message we currently have might make sense.
Additionally, there are some security issues in IE6 when doing foo?action=raw, if I recall.
-bawolff
On 09/16/2013 04:34 PM, Brian Wolff wrote:
Additionally, there are some security issues in IE6 when doing foo?action=raw, if I recall.
Yes, IIRC some version of IE disregarded the Content-type header and guessed the content type based on the URL and the content. If the URL contained .php (only outside the query string?), it disabled this behavior.
Tim mentions in https://www.mediawiki.org/wiki/Special:Code/MediaWiki/49833#c3561 that this only applied to IE3 and earlier, and IE4 respects the Content-type header. As the market share of IE <= 3 is probably non-existent we could probably blacklist it from logging in and content API access altogether.
According to [1] and [2] there is also a 'X-Content-Type-Options: nosniff' header that disables this behavior for IE and Chrome. I doubt that it works in IE3 though. Anybody up for some testing with an ancient IE3 install?
Gabriel
[1]: http://msdn.microsoft.com/en-us/library/dd565661(v=vs.85).aspx [2]: https://www.owasp.org/index.php/List_of_useful_HTTP_headers
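For reference, sending that header is a one-liner in any server stack; here is a minimal sketch with Python's standard library (the content type and body are illustrative, not what MediaWiki sends):

```python
from http.server import BaseHTTPRequestHandler

class RawHandler(BaseHTTPRequestHandler):
    """Serve wikitext with an explicit type, and tell browsers that
    honor X-Content-Type-Options (IE8+, Chrome) not to second-guess
    it by sniffing the body for HTML."""
    def do_GET(self):
        body = b"== Not HTML ==\n<script>alert(1)</script>"
        self.send_response(200)
        self.send_header("Content-Type", "text/x-wiki; charset=utf-8")
        self.send_header("X-Content-Type-Options", "nosniff")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

Without nosniff, a sniffing browser could decide the body "looks like" HTML and execute the script despite the declared text type.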
On 17/09/13 11:08, Gabriel Wicke wrote:
On 09/16/2013 04:34 PM, Brian Wolff wrote:
Additionally, there are some security issues in IE6 when doing foo?action=raw, if I recall.
Yes, IIRC some version of IE disregarded the Content-type header and guessed the content type based on the URL and the content. If the URL contained .php (only outside the query string?), it disabled this behavior.
Tim mentions in https://www.mediawiki.org/wiki/Special:Code/MediaWiki/49833#c3561 that this only applied to IE3 and earlier, and IE4 respects the Content-type header. As the market share of IE <= 3 is probably non-existent we could probably blacklist it from logging in and content API access altogether.
This issue affects IE at least up to IE 6, possibly later, see bug 28235.
-- Tim Starling
On 09/16/2013 07:48 PM, Tim Starling wrote:
On 17/09/13 11:08, Gabriel Wicke wrote:
Tim mentions in https://www.mediawiki.org/wiki/Special:Code/MediaWiki/49833#c3561 that this only applied to IE3 and earlier, and IE4 respects the Content-type header. As the market share of IE <= 3 is probably non-existent we could probably blacklist it from logging in and content API access altogether.
This issue affects IE at least up to IE 6, possibly later, see bug 28235.
Thanks for the pointer! It is sad that IE6 (and likely IE7) is still haunting us. IE8+ is covered by the X-Content-Type-Options header.
It sounds like your Content-Disposition solution [1] should still work for IE6/7 where that header is not used otherwise. The existing users of that header all seem to be file-related. Did I miss any use in action handlers?
Gabriel
[1]: https://bugzilla.wikimedia.org/show_bug.cgi?id=28235#c6
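A sketch of what combining the two defenses from this subthread might look like, as a plain header map; the exact values are my assumptions for illustration, not what the bug 28235 patch actually emits:

```python
def raw_output_headers(filename="raw.txt"):
    """Headers for action=raw-style output: an explicit content type,
    nosniff for browsers that honor X-Content-Type-Options (IE8+,
    Chrome), and a Content-Disposition fallback intended to keep
    older IE (6/7) from sniffing the body as HTML."""
    return {
        "Content-Type": "text/x-wiki; charset=utf-8",
        "X-Content-Type-Options": "nosniff",
        "Content-Disposition": 'inline; filename="%s"' % filename,
    }

print(raw_output_headers()["X-Content-Type-Options"])  # nosniff
```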
On 17/09/13 14:01, Gabriel Wicke wrote:
On 09/16/2013 07:48 PM, Tim Starling wrote:
On 17/09/13 11:08, Gabriel Wicke wrote:
Tim mentions in https://www.mediawiki.org/wiki/Special:Code/MediaWiki/49833#c3561 that this only applied to IE3 and earlier, and IE4 respects the Content-type header. As the market share of IE <= 3 is probably non-existent we could probably blacklist it from logging in and content API access altogether.
This issue affects IE at least up to IE 6, possibly later, see bug 28235.
Thanks for the pointer! It is sad that IE6 (and likely IE7) is still haunting us. IE8+ is covered by the X-Content-Type-Options header.
It sounds like your Content-Disposition solution [1] should still work for IE6/7 where that header is not used otherwise. The existing users of that header all seem to be file-related. Did I miss any use in action handlers?
I'm assuming you can grep for Content-Disposition as well as I can. IIRC, the difficulty with Content-Disposition, in the context of a security patch, was the need to abstract handling of the header out of the various places that send it, so that it would be consistent and demonstrably secure. That would have made the security patch larger and more complex than it needed to be, which would have been a problem for backporters. That shouldn't be a concern for your feature.
-- Tim Starling
On 17/09/13 09:34, Brian Wolff wrote:
While I'm not particularly fond of this idea (probably because I'm stuck in my ways more than anything else), I do think that making en.wikipedia.org/foo an instant HTTP redirect instead of the "did you mean/redirecting in 5 seconds" message we currently have might make sense.
The technical situation has not changed since that meta refresh was introduced, and the same rationale still applies. See e.g.
February 2005: http://article.gmane.org/gmane.science.linguistics.wikipedia.technical/15711
August 2006: http://article.gmane.org/gmane.science.linguistics.wikipedia.technical/25605
-- Tim Starling