Date: Thu, 4 Feb 2016 08:22:01 +0100
From: "Federico Leva (Nemo)" <nemowiki(a)gmail.com>
To: A mailing list for the Analytics Team at WMF and everybody who has
an interest in Wikipedia and "analytics."
<analytics(a)lists.wikimedia.org>
Subject: Re: [Analytics] Pagecounts dumps page title UTF-8 escaping
Message-ID: <56B2FC19.6090105(a)gmail.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Bo Han, 04/02/2016 00:40:
Is the logic for the escaping available somewhere?
MediaWiki API does
https://phabricator.wikimedia.org/T29849
For the new pageviews API I got this reply on Unicode normalisation:
https://phabricator.wikimedia.org/T44259#1351880
(Phabricator is down right now; wait a couple hours or check
web.archive.org.)
Nemo
Thanks for the reply, Nemo. I read over the two links but am still a
little confused about the case of "Мстители (фильм, 2012)" on domain
ru, which shows up in the dumps in three differently escaped forms:
"%D0%9C%D1%81%D1%82%D0%B8%D1%82%D0%B5%D0%BB%D0%B8_%28%D1%84%D0%B8%D0%BB%D1%8C%D0%BC,_2012%29"
(everything but comma escaped)
"%D0%9C%D1%81%D1%82%D0%B8%D1%82%D0%B5%D0%BB%D0%B8_(%D1%84%D0%B8%D0%BB%D1%8C%D0%BC,_2012)"
(everything but comma+parens escaped)
"Мстители_(фильм,_2012)" (nothing escaped)
Shouldn't the comma and parens be escaped as well, or is there a
special case for reserved characters? If so, why are parens sometimes
escaped and sometimes not? Maybe some of the variation has to do with
how browsers encode/send the request?
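For what it's worth, here is a minimal Python sketch of how I would
reproduce the first two forms and collapse all three into a single title
when aggregating counts. It assumes the variants differ only in which
reserved characters the client left unescaped; canonical_title is a
hypothetical helper of mine, not anything taken from the dumps pipeline
or MediaWiki.

```python
from urllib.parse import quote, unquote

# Title with spaces already replaced by underscores, as in the dumps.
TITLE = "Мстители_(фильм,_2012)"

# quote() percent-encodes the UTF-8 bytes of every character except
# ASCII letters, digits, "_.-~" and whatever is listed in `safe`.
comma_kept = quote(TITLE, safe=",")           # parens escaped, comma kept
comma_parens_kept = quote(TITLE, safe=",()")  # only the Cyrillic escaped


def canonical_title(raw: str) -> str:
    """Hypothetical helper: collapse any escaped variant to one form."""
    # Decode percent-escapes as UTF-8, then normalise spaces to underscores.
    return unquote(raw).replace(" ", "_")


variants = [comma_kept, comma_parens_kept, TITLE]
for v in variants:
    # Each variant should decode back to the same unescaped title.
    print(canonical_title(v) == TITLE, v)
```

Decoding before counting sidesteps the question of which client escaped
what, though it does not explain where the differences come from.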
Bo