[Analytics] Pagecounts dumps page title UTF-8 escaping

4 Feb 2016


Hello,
I have a question about how page titles are escaped in the pagecounts
dumps as found at
http://dumps.wikimedia.org/other/pagecounts-all-sites/ and
http://dumps.wikimedia.org/other/pagecounts-raw/.
For a given page title, which escaped forms of that title should I look
for in the dumps? To investigate, I searched the dumps for all possible
combinations of unescaped and escaped characters in the page title and
kept the ones with non-zero counts.
The examples below are from the 2015-01-01T02 dump.
On the "ru" domain, the page "Путин, Владимир Владимирович" has an
encoded page title:
"%D0%9F%D1%83%D1%82%D0%B8%D0%BD,_%D0%92%D0%BB%D0%B0%D0%B4%D0%B8%D0%BC%D0%B8%D1%80_%D0%92%D0%BB%D0%B0%D0%B4%D0%B8%D0%BC%D0%B8%D1%80%D0%BE%D0%B2%D0%B8%D1%87"
(underscores replace spaces and every character but the comma is
escaped).
On the "ru" domain, the page "Мстители (фильм, 2012)" has escaped page titles:
"%D0%9C%D1%81%D1%82%D0%B8%D1%82%D0%B5%D0%BB%D0%B8_%28%D1%84%D0%B8%D0%BB%D1%8C%D0%BC,_2012%29"
(everything except comma escaped)
"%D0%9C%D1%81%D1%82%D0%B8%D1%82%D0%B5%D0%BB%D0%B8_(%D1%84%D0%B8%D0%BB%D1%8C%D0%BC,_2012)"
(everything except comma and parens escaped)
"Мстители_(фильм,_2012)" (nothing escaped)
On the "en" domain, the page "Spider-Man (2002 film)" has escaped page titles:
"Spider-Man_%282002_film%29" (only parens escaped)
"Spider-Man_(2002_film)" (nothing escaped)
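For reference, the kind of search described above can be sketched in Python roughly as follows. This is only an illustration, not the escaping logic used to produce the dumps: the function name `candidate_titles` and its `optional` parameter are made up for this example, and it assumes that escaping is decided per character class (all non-ASCII characters together, and each of comma, "(" and ")" separately) rather than per individual occurrence. Other ASCII characters are left untouched.

```python
from itertools import product
from urllib.parse import quote


def candidate_titles(title, optional=",()"):
    """Return the set of candidate escaped forms of a page title.

    Hypothetical helper: spaces become underscores, then every combination
    of "escaped" / "raw" is tried for the non-ASCII characters (as one
    group) and for each character of `optional` that occurs in the title.
    """
    title = title.replace(" ", "_")
    present = [c for c in optional if c in title]
    candidates = set()
    for escape_non_ascii in (True, False):
        for mask in product((True, False), repeat=len(present)):
            # Punctuation characters to percent-encode in this variant.
            escaped = {c for c, on in zip(present, mask) if on}
            out = []
            for ch in title:
                if ord(ch) > 127:
                    # quote() encodes the character as UTF-8 bytes, %XX each.
                    out.append(quote(ch, safe="") if escape_non_ascii else ch)
                elif ch in escaped:
                    out.append(quote(ch, safe=""))
                else:
                    out.append(ch)
            candidates.add("".join(out))
    return candidates


if __name__ == "__main__":
    for variant in sorted(candidate_titles("Spider-Man (2002 film)")):
        print(variant)
```

Each candidate can then be looked up in an hourly dump file, keeping only those with a non-zero count. The number of candidates doubles with every extra character class, so the `optional` set should stay small.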
Is the logic for the escaping available somewhere?
Thanks,
Bo
