Hello,
I have a question about how page titles are escaped in the pagecounts dumps as found at http://dumps.wikimedia.org/other/pagecounts-all-sites/ and http://dumps.wikimedia.org/other/pagecounts-raw/.
For a particular page title, what is the set of escaped page titles in the dumps that I should look for? To find out, I searched for all possible combinations of unescaped and escaped characters in the page title and kept the ones with non-zero counts.
The examples below are from the 2015-01-01T02 dump.
On the "ru" domain, the page "Путин, Владимир Владимирович" has one encoded page title:
"%D0%9F%D1%83%D1%82%D0%B8%D0%BD,_%D0%92%D0%BB%D0%B0%D0%B4%D0%B8%D0%BC%D0%B8%D1%80_%D0%92%D0%BB%D0%B0%D0%B4%D0%B8%D0%BC%D0%B8%D1%80%D0%BE%D0%B2%D0%B8%D1%87" (underscores replace spaces, and every character but the comma is escaped).
On the "ru" domain, the page "Мстители (фильм, 2012)" has three escaped page titles:
"%D0%9C%D1%81%D1%82%D0%B8%D1%82%D0%B5%D0%BB%D0%B8_%28%D1%84%D0%B8%D0%BB%D1%8C%D0%BC,_2012%29" (everything except the comma escaped)
"%D0%9C%D1%81%D1%82%D0%B8%D1%82%D0%B5%D0%BB%D0%B8_(%D1%84%D0%B8%D0%BB%D1%8C%D0%BC,_2012)" (everything except the comma and parens escaped)
"Мстители_(фильм,_2012)" (nothing escaped)
On the "en" domain, the page "Spider-Man (2002 film)" has two escaped page titles:
"Spider-Man_%282002_film%29" (only the parens escaped)
"Spider-Man_(2002_film)" (nothing escaped)
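For illustration, the variants listed above can be reproduced with a short Python sketch. It is not the actual escaping logic behind the dumps, which is the open question here; it only assumes that the variants differ in which punctuation is left literal. The helper names `escaped_variants` and `canonical_title` are hypothetical.

```python
from urllib.parse import quote, unquote


def canonical_title(raw: str) -> str:
    """Undo percent-escaping and use underscores for spaces, so that
    every variant of a title collapses to the same key."""
    return unquote(raw, encoding="utf-8", errors="replace").replace(" ", "_")


def escaped_variants(title: str) -> set:
    """Candidate encodings of one title. The three 'safe' sets mirror the
    variants seen in the 2015-01-01T02 dump; other combinations may exist."""
    base = title.replace(" ", "_")
    variants = {base}  # nothing escaped
    for safe in ("_", "_,", "_,()"):
        # quote() percent-encodes the UTF-8 bytes of every character
        # that is not a letter, digit, '_.-~' or listed in `safe`.
        variants.add(quote(base, safe=safe))
    return variants


if __name__ == "__main__":
    for variant in sorted(escaped_variants("Spider-Man (2002 film)")):
        print(variant, "->", canonical_title(variant))
```

Going the other way may be simpler in practice: apply `canonical_title` to every title in a dump file and sum the counts per canonical key, instead of enumerating variants up front.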
Is the logic for the escaping available somewhere?
Thanks, Bo
Bo Han, 04/02/2016 00:40:
Is the logic for the escaping available somewhere?
MediaWiki API does: https://phabricator.wikimedia.org/T29849
For the new pageviews API I got this reply on Unicode normalisation: https://phabricator.wikimedia.org/T44259#1351880
(Phabricator is down right now; wait a couple hours or check web.archive.org.)
Nemo
Hi all,
I have a similar question: why do the MediaWiki API and the new pageviews API send different Content-Type response headers? The MediaWiki API sends 'content-type: text/html; charset=UTF-8', while the new pageviews API sends only 'content-type: application/json' without explicitly setting UTF-8. For example, in my Chrome browser I see "Goiânia_accident" displayed correctly in MediaWiki responses but garbled as "GoiÃ¢nia_accident" in pageviews API responses. Was this done intentionally, or is it a bug?
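As a rough illustration of the symptom, this Python sketch shows the garbling that results when UTF-8 bytes are decoded as Latin-1, and a client-side fallback that treats a JSON body as UTF-8 when the header names no charset. The payload and the helper name `decode_body` are made up for the example, and nothing here claims to show what either API actually does. For background, RFC 8259 defines no charset parameter for application/json and requires UTF-8 for JSON exchanged between systems.

```python
import json
from email.message import Message


def decode_body(body: bytes, content_type: str) -> str:
    """Decode a response body using the charset from the Content-Type
    header, falling back to UTF-8 when none is given."""
    header = Message()
    header["Content-Type"] = content_type
    charset = header.get_content_charset() or "utf-8"
    return body.decode(charset)


# Hypothetical response body, encoded as UTF-8 on the wire.
payload = json.dumps({"article": "Goiânia_accident"}, ensure_ascii=False).encode("utf-8")

# A viewer that guesses Latin-1 shows each UTF-8 byte as its own character.
garbled = payload.decode("latin-1")

# Falling back to UTF-8 recovers the title.
text = decode_body(payload, "application/json")
article = json.loads(text)["article"]

if __name__ == "__main__":
    print(garbled)
    print(article)
```

A script consuming the API can therefore decode the body as UTF-8 itself, whatever a browser chooses to display.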
Thanks in advance! Alex
On Thu, Feb 4, 2016 at 8:22 AM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
Bo Han, 04/02/2016 00:40:
Is the logic for the escaping available somewhere?
MediaWiki API does: https://phabricator.wikimedia.org/T29849
For the new pageviews API I got this reply on Unicode normalisation: https://phabricator.wikimedia.org/T44259#1351880
(Phabricator is down right now; wait a couple hours or check web.archive.org.)
Nemo
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics