Hi all,
In diving into a problem with logging[1], we discovered that we were unintentionally treating several special page accesses (in this case, containing included Javascript) as normal pageviews, thus throwing our pageview statistics way off. The proposed solution involves changing the way we access those Javascript requests from this form: http://en.wikipedia.org/wiki/Special:BannerController
...to this form: http://en.wikipedia.org/w/index.php?title=Special:BannerController
I'm assuming this convention isn't documented anywhere (other than earlier today on the wikitech wiki[2]). Before we run off and document this as something code reviewers need to look out for, I'd like to make sure this is really how we'd like to make the distinction.
Is this a sensible convention, or is there a different convention we should implement? Note that any changes to the convention would need to be implemented here: http://svn.wikimedia.org/viewvc/mediawiki/trunk/webstatscollector/filter.c?v...
...so futzing with the convention isn't free, but *may* be worth it if we arrive at a vastly superior convention.
Rob

[1] https://bugzilla.wikimedia.org/show_bug.cgi?id=25564
[2] http://wikitech.wikimedia.org/view/Squid_logging#Inflated_Stats
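For readers following along, the distinction described in this message can be sketched roughly as below. This is only an illustration in Python, not the logic in filter.c; the function name and the assumption that pageviews are recognized purely by the /wiki/ path prefix are mine.

```python
from urllib.parse import urlsplit

PRETTY_PREFIX = "/wiki/"  # assumption: pretty page URLs live under this path


def looks_like_pageview(url: str) -> bool:
    """Hypothetical rule: a request counts as a pageview only if its
    path starts with the pretty-URL prefix."""
    return urlsplit(url).path.startswith(PRETTY_PREFIX)


for url in (
    "http://en.wikipedia.org/wiki/Special:BannerController",
    "http://en.wikipedia.org/w/index.php?title=Special:BannerController",
):
    print(url, looks_like_pageview(url))
```

Under this rule the first form is counted and the second is not, which is why moving the JavaScript request to the index.php form would keep it out of the pageview numbers.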
2010/10/19 Rob Lanphier robla@wikimedia.org:
Is this a sensible convention, or is there a different convention we should implement? Note that any changes to the convention would need to be implemented here: http://svn.wikimedia.org/viewvc/mediawiki/trunk/webstatscollector/filter.c?v...
Never before did we load JS through a special page like that, and with the resource loader coming up it will never be needed again, because we can and will run everything through load.php. It's a one-time anomaly, so no need for any convention.
Roan Kattouw (Catrope)
On 10/19/10 1:29 PM, Roan Kattouw wrote:
Never before did we load JS through a special page like that, and with the resource loader coming up it will never be needed again, because we can and will run everything through load.php. It's a one-time anomaly, so no need for any convention.
Isn't it fairly common to load data or other such page fragments in this way, though? Or does it only seem common to me because I commonly work with Commons?
If I've understood you correctly, your suggestion is that, to make logging easier, we should adopt a convention of how we call certain web resources.
I can't imagine that you will ever be able to get all the programmers to agree not to use URLs that way. It's not like we can mark the URL as being dangerous somehow. As long as the URL works, they'll want to use it... and really, why shouldn't they?
Is there some other way we could achieve those objectives? Are there other patterns that already exist that we could use to notice when it's not a full page request?
On 10/19/10 1:15 PM, Rob Lanphier wrote:
<snip>
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On Tue, Oct 19, 2010 at 1:57 PM, Neil Kandalgaonkar neilk@wikimedia.org wrote:
If I've understood you correctly, your suggestion is that, to make logging easier, we should adopt a convention of how we call certain web resources.
I'm not so much suggesting it as I am stating the status quo, and asking whether we should document it well or change the code.
I can't imagine that you will ever be able to get all the programmers to agree not to use URLs that way. It's not like we can mark the URL as being dangerous somehow. As long as the URL works, they'll want to use it... and really, why shouldn't they?
And they did. See https://bugzilla.wikimedia.org/show_bug.cgi?id=25564
On 10/19/10 1:29 PM, Roan Kattouw wrote:
Never before did we load JS through a special page like that, and with the resource loader coming up it will never be needed again, because we can and will run everything through load.php. It's a one-time anomaly, so no need for any convention.
I guess I'm not quite so confident this problem won't rear its head again, but since it's a theoretical problem at this point, and we have enough actual problems to deal with, I'm happy to drop it for now.
Rob
Rob Lanphier wrote:
I guess I'm not quite so confident this problem won't rear its head again, but since it's a theoretical problem at this point, and we have enough actual problems to deal with, I'm happy to drop it for now.
This thread reminded me of the old Webalizer hack of including "&dontcountme=s" in URLs to avoid things like JavaScript loads inflating the stats it collected. (Or at least I think that's the issue the URL parameter was trying to solve.)
The URL trick crept into all sorts of places, including site-wide JavaScript pages, article text, and even core code.[1]
I think setting a standard is a good idea in general, though I worry about making it prefix-based (like all URLs starting with "http://foo.wikiproject.org/wiki/"). I can see potential problems with counting hits to the secure server and I can see potential problems if the URL structure changes in the future (possibly to something more sensible like "http://foo.wikiproject.org/view/"). These problems might be non-existent or unavoidable, I'm not completely sure.
MZMcBride
[1] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/35103
Assuming that both
http://en.wikipedia.org/wiki/Special:BannerController http://en.wikipedia.org/w/index.php?title=Special:BannerController
will still return the same results, wouldn't it make more sense to teach the stats logger to ignore both? Or is there a reason that we actually want to track one and not the other?
It seems like an awful lot of trouble to teach every software author that they need to follow a particular convention just so the stats engine will work as intended. It would seem like it would be much simpler to teach the stats engine to simply detect and ignore this special case. Or is there a reason that doing so is not possible?
-Robert Rohde
On Tue, Oct 19, 2010 at 1:15 PM, Rob Lanphier robla@wikimedia.org wrote:
<snip>
2010/10/19 Robert Rohde rarohde@gmail.com:
It seems like an awful lot of trouble to teach every software author that they need to follow a particular convention just so the stats engine will work as intended. It would seem like it would be much simpler to teach the stats engine to simply detect and ignore this special case. Or is there a reason that doing so is not possible?
As Domas pointed out to RobLa and myself, special page names are internationalized, so you'd have to teach the stats logger about every translation of it, which is impractical compared to using a different URL.
Roan Kattouw (Catrope)
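To illustrate the internationalization point, here is a small Python sketch of a name-based filter that only knows the English title. The helper names are hypothetical, and the German URL is purely illustrative ("Spezial" is the German name of the Special namespace; whether the page name itself is translated on a given wiki is not something this sketch knows).

```python
from urllib.parse import unquote, urlsplit

# An exclusion list that only knows the English title.
EXCLUDED_TITLES = {"Special:BannerController"}


def title_from_pretty_url(url):
    """Return the title part of a /wiki/ URL, or None for other paths."""
    path = unquote(urlsplit(url).path)
    prefix = "/wiki/"
    return path[len(prefix):] if path.startswith(prefix) else None


def is_excluded(url):
    """True if the URL's title appears in the (English-only) list."""
    return title_from_pretty_url(url) in EXCLUDED_TITLES


print(is_excluded("http://en.wikipedia.org/wiki/Special:BannerController"))
# Illustrative localized form: the English-only list does not match it.
print(is_excluded("http://de.wikipedia.org/wiki/Spezial:BannerController"))
```

The second request slips through, so a name-based filter would need an alias list for every language, whereas a rule based on the URL form needs none.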
Hi!
will still return the same results, wouldn't it make more sense to teach the stats logger to ignore both? Or is there a reason that we actually want to track one and not the other?
Pretty URLs are for being pretty URLs (e.g. in your address bar). That leads to a very easy assumption: if there's a pretty URL, it probably indicates a pageview :-) We quite like other pretty URLs for Special pages, e.g. Watchlist or Recentchanges, as we track their accesses.
It seems like an awful lot of trouble to teach every software author that they need to follow a particular convention just so the stats engine will work as intended. It would seem like it would be much simpler to teach the stats engine to simply detect and ignore this special case. Or is there a reason that doing so is not possible?
Heh, apparently stats became a big deal lately, so one with powers to change that can feel important! ;-)
Anyway, there are a few choices to resolve it on the stats side:
1) Implement pulling of a namespace map for each project, build out an efficient rules engine (in C) for dealing with this (do note, every project will have different namespace for this URL). Also, make it extensible, so each developer tells about which names will be not-a-pageview ;-) There's nothing as fun as writing that kind of code, and do note, it won't be just five (or fifty) lines.
2) Add an additional internal header (X-Pageview: true!) that would be logged by squids inside the stream :) That probably calls for a large review inside MediaWiki, as well as squid code changes (and of course, rollout of a new binary). Would be a nice inter-group effort.
3) Not care about inflated per-project numbers, or have people adjust the numbers, as the source data is there (They can filter out banner loader themselves!)
You can pick any of these, make sure it gets into strategy plan, as we don't decide things on wikitech-l anymore :) I prefer, hehehe, not doing anything, and just having pretty URLs just for pageviews ;-)
Domas
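As a rough picture of what option 2 would mean for whoever consumes the logs, here is a Python sketch. Everything about the record layout is an assumption: the log stream is modeled as dictionaries, and the "x_pageview" field is a hypothetical stand-in for wherever the X-Pageview header would end up being logged.

```python
def count_pageviews(records):
    """Count records explicitly marked as pageviews.

    Assumes each record is a dict and that the application's X-Pageview
    header was logged into a hypothetical 'x_pageview' field.
    """
    return sum(1 for record in records if record.get("x_pageview") == "true")


# Made-up sample records, for illustration only.
sample = [
    {"url": "/wiki/Main_Page", "x_pageview": "true"},
    {"url": "/wiki/Special:BannerController"},  # no marker: not counted
    {"url": "/w/index.php?title=Foo&action=edit", "x_pageview": "false"},
]
print(count_pageviews(sample))
```

With an explicit marker the counter no longer needs to infer anything from the URL, which is the appeal of this option; the cost is the MediaWiki and squid work described above.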
On Wed, Oct 20, 2010 at 5:51 AM, Domas Mituzas midom.lists@gmail.com wrote:
It seems like an awful lot of trouble to teach every software author that they need to follow a particular convention just so the stats engine will work as intended. It would seem like it would be much simpler to teach the stats engine to simply detect and ignore this special case. Or is there a reason that doing so is not possible?
Heh, apparently stats became a big deal lately, so one with powers to change that can feel important! ;-)
Anyway, there are a few choices to resolve it on the stats side:
- Implement pulling of a namespace map for each project, build out an efficient rules engine (in C) for dealing with this (do note, every project will have different namespace for this URL). Also, make it extensible, so each developer tells about which names will be not-a-pageview ;-) There's nothing as fun as writing that kind of code, and do note, it won't be just five (or fifty) lines.
<snip>
- Not care about inflated per-project numbers, or have people adjust the numbers, as the source data is there (They can filter out banner loader themselves!)
I think my comment about "stats engine" may have been confusing. I tend to think of the entire process chain as part of the stats engine, even though it is implemented as distinct collection and interpretation bits.
There is no reason that the filtering has to be done in the stats collector. It could be done there, but given the language variants that is likely to be hard to code and slow, as you rightly point out. I think I had more in mind that it be filtered at the interpretation side of the stats process. In other words, that Zachte (or whoever) generate a list of pages that are ignored for the purposes of counting stats. That would seem to be an easier place to deal with an exclusion list and to pull all language versions of those page names, and such. Having such an exclusion list for interpretation will be necessary anyway if we plan to reprocess the existing logs that don't follow the suggested convention. (I'm assuming we don't want to simply throw out three weeks of logs.)
-Robert Rohde
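A minimal Python sketch of the interpretation-side exclusion list I have in mind follows. The data layout (per-project, per-title hit counts), the function name, and the sample numbers are all made up for illustration, and the German entry is only an example of a localized name that such a list would have to carry.

```python
from collections import Counter

# Hypothetical per-project exclusion list maintained by whoever
# interprets the stats.
EXCLUSIONS = {
    "en": {"Special:BannerController"},
    "de": {"Spezial:BannerController"},  # illustrative localized entry
}


def filtered_totals(counts):
    """Sum hits per project, skipping titles on the exclusion list.

    `counts` maps (project, title) to a raw hit count.
    """
    totals = Counter()
    for (project, title), hits in counts.items():
        if title in EXCLUSIONS.get(project, set()):
            continue
        totals[project] += hits
    return totals


# Made-up numbers, for illustration only.
raw = {
    ("en", "Main_Page"): 10,
    ("en", "Special:BannerController"): 500,
    ("de", "Hauptseite"): 7,
}
print(filtered_totals(raw))
```

Because this runs over already-collected counts, the same list could be applied when reprocessing the existing logs.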
Rob Lanphier wrote:
Hi all,
In diving into a problem with logging[1], we discovered that we were unintentionally treating several special page accesses (in this case, containing included Javascript) as normal pageviews, thus throwing our pageview statistics way off. The proposed solution involves changing the way we access those Javascript requests from this form: http://en.wikipedia.org/wiki/Special:BannerController
...to this form: http://en.wikipedia.org/w/index.php?title=Special:BannerController
I'm assuming this convention isn't documented anywhere (other than earlier today on the wikitech wiki[2]). Before we run off and document this as something code reviewers need to look out for, I'd like to make sure this is really how we'd like to make the distinction.
I think the anomaly is to have a Special page that is javascript.
A special page should look like a wiki page.
In your case, I would append ctype=text/javascript to the query string, so it a) Looks more like something that will give out javascript. b) Forces it to use the long style.
On Tue, Oct 19, 2010 at 11:41 PM, Platonides Platonides@gmail.com wrote:
I think the anomaly is to have a Special page that is javascript.
A special page should look like a wiki page.
In your case, I would append ctype=text/javascript to the query string, so it a) Looks more like something that will give out javascript. b) Forces it to use the long style.
Nope, appending parameters works also in the short form: http://en.wikipedia.org/wiki/Special:BannerController?ctype=text/javascript
Works also for ?action=edit etc.
Marco
On 20 Oct 2010, at 00:09, Marco Schuster wrote:
Nope, appending parameters works also in the short form: http://en.wikipedia.org/wiki/Special:BannerController?ctype=text/javascript
Works also for ?action=edit etc.
Marco
But the short version without /w/index.php but with direct ?parameters doesn't work for action=raw (&ctype=text/javascript)
See the error on: http://meta.wikimedia.org/wiki/User:Krinkle/global.js?action=raw
Nor does (or at least did) the software ever point to a non-viewing page in the short form.
-- Krinkle
On Wed, Oct 20, 2010 at 12:49 AM, Krinkle krinklemail@gmail.com wrote:
But the short version without /w/index.php but with direct ?parameters doesn't work for action=raw (&ctype=text/javascript)
See the error on: http://meta.wikimedia.org/wiki/User:Krinkle/global.js?action=raw
Strange. I'm sure this is to prevent users from using Wikipedia as spy-javascript-hoster, but why does http://meta.wikimedia.org/w/index.php?title=User:Krinkle/global.js&actio... work then?
Marco
On Tue, Oct 19, 2010 at 4:15 PM, Marco Schuster <marco@harddisk.is-a-geek.org> wrote:
Strange. I'm sure this is to prevent users from using Wikipedia as spy-javascript-hoster, but why does http://meta.wikimedia.org/w/index.php?title=User:Krinkle/global.js&actio... work then?
Internet Explorer, at least until recently (might finally be fixed?), would sometimes interpret "file extensions" on the end of a URL's path component as if they were meaningful file type information, especially when combined with actual content-type headers it considered "ambiguous".
A pretty URL such as "http://meta.wikimedia.org/wiki/Something.html?action=raw" would thus be dangerous, as the ".html" on the end of the wiki page -- a completely meaningless piece of an opaque URL path -- could trigger interpretation of the file's content as actual HTML, etc., and thus become a vector for JavaScript injection into the wiki's same-origin security context.
To keep that nailed down, we forbade access to action=raw unless the URL's path portion matched the wiki's core entry point exactly. There may be nicer ways to do this now. :)
Back to the original issue -- I agree with Roan that the best way to go is to make sure most such things as the BannerLoader get converted to use the ResourceLoader interface, which eliminates the need to create and manage as many JS/CSS special-page points like this.
I think BannerLoader is part of CentralNotice, which is Scary Code and may or may not fit in nicely though. *shudder* If making short-term tweaks to it without redoing it, be very careful about caching!
-- brion
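The action=raw restriction described above can be sketched in Python as follows. This is not MediaWiki's actual code; the function name is hypothetical, the entry point path is assumed from the URLs in this thread, and the example.org URLs are placeholders.

```python
from urllib.parse import parse_qs, urlsplit

ENTRY_POINT = "/w/index.php"  # assumption: the wiki's core entry point


def raw_allowed(url: str) -> bool:
    """Allow action=raw only when the path is exactly the entry point,
    so no fake "file extension" can appear at the end of the path."""
    parts = urlsplit(url)
    action = parse_qs(parts.query).get("action", ["view"])[0]
    if action != "raw":
        return True  # the restriction only concerns raw output
    return parts.path == ENTRY_POINT


print(raw_allowed("http://example.org/wiki/Something.html?action=raw"))
print(raw_allowed("http://example.org/w/index.php?title=Something.html&action=raw"))
```

In the second form the ".html" sits in the query string rather than at the end of the path, so the path itself always ends in "index.php".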
Marco Schuster wrote:
In your case, I would append ctype=text/javascript to the query string, so it a) Looks more like something that will give out javascript. b) Forces it to use the long style.
Nope, appending parameters works also in the short form: http://en.wikipedia.org/wiki/Special:BannerController?ctype=text/javascript
Works also for ?action=edit etc.
Marco
You could do that. But if you use the appropriate functions for creating the link, you will be given the "ugly URL". That's what I referred to.
Rob Lanphier <robla <at> wikimedia.org> writes:
In diving into a problem with logging[1], we discovered that we were unintentionally treating several special page accesses (in this case, containing included Javascript) as normal pageviews, thus throwing our pageview statistics way off. The proposed solution involves changing the way we access those Javascript requests from this form: http://en.wikipedia.org/wiki/Special:BannerController
...to this form: http://en.wikipedia.org/w/index.php?title=Special:BannerController
The problem with that is that most of the time, URLs like that *should* be logged - they are simply the result of someone using a special page. For example, search page loads (about 3% of all page loads!) go completely under the radar this way, and while some Wikipedias use hacks like [1] to avoid that, it really isn't an ideal situation. Also, page edits and other actions are not logged, nor are page loads for old versions of pages, for pages linked from recentchanges, or for unstable versions where FlaggedRevs is enabled.
[1] http://de.wiktionary.org/w/index.php?title=MediaWiki:If-search.js&action...