Hello,
I have a question about how page titles are escaped in the pagecounts dumps as found at http://dumps.wikimedia.org/other/pagecounts-all-sites/ and http://dumps.wikimedia.org/other/pagecounts-raw/.
For a given page title, what is the full set of escaped page titles that I should look for in the dumps? To find out empirically, I searched for all possible combinations of unescaped and escaped characters in the page title and kept the ones with non-zero counts.
The examples below are from the 2015-01-01T02 dump.
On the "ru" domain, the page "Путин, Владимир Владимирович" has one encoded page title: "%D0%9F%D1%83%D1%82%D0%B8%D0%BD,_%D0%92%D0%BB%D0%B0%D0%B4%D0%B8%D0%BC%D0%B8%D1%80_%D0%92%D0%BB%D0%B0%D0%B4%D0%B8%D0%BC%D0%B8%D1%80%D0%BE%D0%B2%D0%B8%D1%87" (underscores replace spaces and every character but the comma is escaped).
On the "ru" domain, the page "Мстители (фильм, 2012)" has three escaped page titles:
"%D0%9C%D1%81%D1%82%D0%B8%D1%82%D0%B5%D0%BB%D0%B8_%28%D1%84%D0%B8%D0%BB%D1%8C%D0%BC,_2012%29" (everything except comma escaped)
"%D0%9C%D1%81%D1%82%D0%B8%D1%82%D0%B5%D0%BB%D0%B8_(%D1%84%D0%B8%D0%BB%D1%8C%D0%BC,_2012)" (everything except comma and parens escaped)
"Мстители_(фильм,_2012)" (nothing escaped)
On the "en" domain, the page "Spider-Man (2002 film)" has two escaped page titles:
"Spider-Man_%282002_film%29" (parens escaped)
"Spider-Man_(2002_film)" (nothing escaped)
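For reference, the variants above can be reproduced with a short Python sketch. It assumes the escaping is ordinary UTF-8 percent-encoding with different sets of characters left unescaped. The helper name candidate_titles is hypothetical, and the three rules are only what I observed in this one dump, not the actual logic behind the dumps.

```python
from urllib.parse import quote


def candidate_titles(title):
    """Return escaped forms of `title` matching the patterns observed above.

    Hypothetical helper: the three rules are guesses based on one dump,
    not the documented escaping logic.
    """
    # Spaces become underscores in every variant seen so far.
    base = title.replace(" ", "_")
    variants = [
        quote(base, safe=","),    # everything except the comma escaped
        quote(base, safe=",()"),  # everything except comma and parens escaped
        base,                     # nothing escaped
    ]
    # Drop duplicates while keeping order (ASCII titles collapse to fewer forms).
    return list(dict.fromkeys(variants))


if __name__ == "__main__":
    for t in ("Мстители (фильм, 2012)", "Spider-Man (2002 film)"):
        for candidate in candidate_titles(t):
            print(candidate)
```

Note that quote() never escapes letters, digits, or the characters "_.-~", which is consistent with the underscores and the hyphen in "Spider-Man" staying as they are in every variant.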
Is the logic for the escaping available somewhere?
Thanks, Bo