Pagecounts dumps page title UTF-8 escaping - Analytics

3 Feb 2016

Hello,

I have a question about how page titles are escaped in the pagecounts
dumps as found at
http://dumps.wikimedia.org/other/pagecounts-all-sites/ and
http://dumps.wikimedia.org/other/pagecounts-raw/.

For a particular page title, what is the set of escaped page titles
in the dumps that I should look for? To find out, I searched the dumps
for all possible combinations of unescaped and escaped characters in
the page title and kept the ones with non-zero counts.
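
An alternative to enumerating escaped forms would be to decode every
title in the dump and group on the decoded form. A minimal sketch of
that idea, assuming each dump line has the four space-separated fields
"project title count bytes" (the function name counts_by_decoded_title
is hypothetical, and this is not necessarily how the dumps are meant
to be consumed):

```python
from collections import defaultdict
from urllib.parse import unquote


def counts_by_decoded_title(lines, project):
    """Sum view counts per percent-decoded title for one project.

    Assumes each line looks like "project title count bytes".
    """
    totals = defaultdict(int)
    for line in lines:
        fields = line.split()
        # Skip malformed lines and lines for other projects.
        if len(fields) != 4 or fields[0] != project:
            continue
        _, title, count, _ = fields
        # unquote() decodes %XX sequences as UTF-8, so escaped and
        # unescaped spellings of the same title share one key.
        totals[unquote(title)] += int(count)
    return dict(totals)
```

With this, "Spider-Man_%282002_film%29" and "Spider-Man_(2002_film)"
would be counted under the same key.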

The examples below are from the 2015-01-01T02 dump.

On the "ru" domain, the page "Путин, Владимир Владимирович" has an
encoded page title:
"%D0%9F%D1%83%D1%82%D0%B8%D0%BD,_%D0%92%D0%BB%D0%B0%D0%B4%D0%B8%D0%BC%D0%B8%D1%80_%D0%92%D0%BB%D0%B0%D0%B4%D0%B8%D0%BC%D0%B8%D1%80%D0%BE%D0%B2%D0%B8%D1%87"
(underscores replace spaces and every character but the comma is
escaped).

On the "ru" domain, the page "Мстители (фильм, 2012)" has escaped page
titles:
"%D0%9C%D1%81%D1%82%D0%B8%D1%82%D0%B5%D0%BB%D0%B8_%28%D1%84%D0%B8%D0%BB%D1%8C%D0%BC,_2012%29"
(everything except comma escaped)
"%D0%9C%D1%81%D1%82%D0%B8%D1%82%D0%B5%D0%BB%D0%B8_(%D1%84%D0%B8%D0%BB%D1%8C%D0%BC,_2012)"
(everything except comma and parens escaped)
"Мстители_(фильм,_2012)" (nothing escaped).

On the "en" domain, the page "Spider-Man (2002 film)" has escaped page
titles:
"Spider-Man_%282002_film%29" (only the parens escaped)
"Spider-Man_(2002_film)" (nothing escaped)
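
The variants above can be reproduced with Python's
urllib.parse.quote by changing which characters are treated as safe.
This is only a sketch of the patterns I observed, not the actual
escaping logic behind the dumps; the function name candidate_titles is
hypothetical.

```python
from urllib.parse import quote


def candidate_titles(title):
    """Return the escaped spellings seen in the examples, without duplicates."""
    underscored = title.replace(" ", "_")
    variants = [
        quote(underscored, safe=","),    # everything except the comma escaped
        quote(underscored, safe=",()"),  # comma and parens left unescaped
        underscored,                     # nothing escaped
    ]
    # dict.fromkeys drops duplicates while preserving order.
    return list(dict.fromkeys(variants))


for variant in candidate_titles("Spider-Man (2002 film)"):
    print(variant)
```

quote() always leaves ASCII letters, digits and "_.-~" unescaped, which
is why the hyphen and underscores survive in every variant.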

Is the logic for the escaping available somewhere?

Thanks,
Bo