Hi Bruno,
Actually, I'm going to leave your question to others who have developed tools to parse the pagecount files, but while we're on the topic I just wanted to point out redirects and title changes. This is something that a good number of people who work with the viewership data overlook. If the title of a page is changed, the history of the page is moved under the new title and the old title (normally) becomes a redirect page. But the viewership data will be split. So if you want to know, for example, the viewership of a page with current title B and old title A, you have to add up the viewership of both titles within the period under study. Just something to note... and sorry if you're already doing this!
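To make the point concrete, here is a minimal sketch of the summing step. The `counts` dictionary and the function name `total_views` are made up for illustration; in practice the per-title counts would come from whatever tool parses the pagecount files for the period under study.

```python
def total_views(counts, titles):
    """Sum the views recorded under every title a page has carried.

    counts: mapping of title -> views in the period (hypothetical data).
    titles: the current title plus any old titles that now redirect to it.
    """
    return sum(counts.get(title, 0) for title in titles)


# Hypothetical counts: the page was renamed from "A" to "B" mid-period.
counts = {"A": 120, "B": 300, "C": 7}
views_for_page = total_views(counts, ["A", "B"])
```

Titles that never appear in the counts simply contribute zero, so a full rename history can be passed in without checking it first.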
Good luck, Taha
On Thu, Jul 28, 2016 at 9:00 PM, Bruno Goncalves bgoncalves@gmail.com wrote:
Hi,
I've been trying to match edit activity with pagecounts, but I've encountered a couple of problems. The amazing pagecounts dumps (https://dumps.wikimedia.org/other/pagecounts-raw/) use the page URL to identify the individual page:
fr.b Special:Recherche/Achille_Baraguey_d%5C%27Hilliers 1 624
while the stub-meta-history uses the "raw" title:
<page> <title>Wikipedia:Community Portal</title> <ns>4</ns> <id>1270</id>
so I need an easy way to map title to URL. I imagine there are some rules on how this "translation" is done? My google-fu has failed to find them.
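One possible approach, sketched here under assumptions rather than taken from any official rule set: MediaWiki URLs use underscores in place of spaces, and other characters are percent-encoded. Since the pagecount files record URLs as they were requested, it may be more robust to decode the pagecount side and compare on the raw title. The function names and the choice of `safe` characters below are illustrative guesses.

```python
from urllib.parse import quote, unquote


def title_to_url_form(title):
    """Approximate the URL form of a raw dump title.

    Assumption: spaces become underscores and ':' and '/' stay unencoded.
    The real set of unencoded characters may differ.
    """
    return quote(title.replace(" ", "_"), safe=":/")


def pagecount_to_title(url_title):
    """Decode a pagecount entry back to a raw-title-like string."""
    return unquote(url_title).replace("_", " ")


url_form = title_to_url_form("Wikipedia:Community Portal")
raw_title = pagecount_to_title("Wikipedia:Community_Portal")
```

This does not handle first-letter capitalization, namespace aliases, or redirects, so it should be treated as a starting point for matching rather than a complete mapping.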
Also, is the timezone used in the meta-history files:
<timestamp>2006-02-18T19:29:10Z</timestamp>
the same as the one used in the pagecount filenames:
pagecounts-20140725-070000.gz
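For comparing the two, both can be parsed into timezone-aware datetimes. The trailing "Z" in the dump timestamp denotes UTC. Treating the pagecount filename as UTC is only an assumption in this sketch (hence the `tz` parameter), and the function names are made up for illustration.

```python
from datetime import datetime, timezone


def parse_dump_timestamp(ts):
    """Parse a meta-history timestamp; the trailing 'Z' means UTC."""
    parsed = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def parse_pagecount_filename(name, tz=timezone.utc):
    """Parse the hour encoded in a pagecount filename.

    Assumption: the filename time is in `tz` (UTC by default, unverified).
    """
    parsed = datetime.strptime(name, "pagecounts-%Y%m%d-%H%M%S.gz")
    return parsed.replace(tzinfo=tz)


edit_time = parse_dump_timestamp("2006-02-18T19:29:10Z")
file_time = parse_pagecount_filename("pagecounts-20140725-070000.gz")
```

If the filenames turn out to use a different timezone, only the `tz` argument needs to change, and comparisons between the two datetimes remain valid.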
Best,
B
Bruno Miguel Tavares Gonçalves, PhD Homepage: www.bgoncalves.com Email: bgoncalves@gmail.com
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l