Hi,
I've been trying to match edit activity with pagecounts but I've encountered a couple of problems. The amazing pagecounts dumps (https://dumps.wikimedia.org/other/pagecounts-raw/) use the page URL to identify the individual page:
fr.b Special:Recherche/Achille_Baraguey_d%5C%27Hilliers 1 624
while the stub-meta-history uses the "raw" title:
<page> <title>Wikipedia:Community Portal</title> <ns>4</ns> <id>1270</id>
so I need an easy way to map title to URL. I imagine there are some rules on how this "translation" is done? My google-fu has failed to find them.
Also, is the timezone used in the meta-history files:
<timestamp>2006-02-18T19:29:10Z</timestamp>
the same as the one used in the pagecount filenames:
pagecounts-20140725-070000.gz
Best,
B
******************************************* Bruno Miguel Tavares Gonçalves, PhD Homepage: www.bgoncalves.com Email: bgoncalves@gmail.com *******************************************
Hi Bruno,
Actually, I'm not going to answer your question; I'll leave that to others who have developed tools to parse the pagecount files. But while we're on the topic, I just wanted to point out redirects and title changes, which a good number of people who work with the viewership data overlook. If the title of a page is changed, the history of the page is moved under the new title and the old title (normally) becomes a redirect page. But the viewership data will be split. So if you want to know, for example, the viewership of a page with current title B and old title A, you have to add up the viewership of both titles within the period under study. Just something to note... and sorry if you're already doing this!
Good luck, Taha
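A minimal sketch of the summing Taha describes, assuming the per-title request counts have already been aggregated into a dictionary. The function name `total_views` and the sample numbers are made up for illustration; they do not come from the thread or from any dump.

```python
def total_views(views_by_title, titles):
    """Sum request counts over every title a page has had in the study period."""
    # Titles with no recorded requests simply contribute zero.
    return sum(views_by_title.get(title, 0) for title in titles)


# Hypothetical counts keyed by title (illustrative numbers only).
views = {"A": 120, "B": 45, "C": 7}

# The page is currently titled "B" and used to be titled "A".
print(total_views(views, ["A", "B"]))  # 165
```

The list of titles per page has to come from somewhere else, for example the page's move history or the redirect data mentioned later in the thread.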
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Definitely consider the redirect :) https://mako.cc/copyrighteous/consider-the-redirect
Bruno Goncalves, 28/07/2016 22:00:
I've been trying to match edit activity with pagecounts
The first question is how much data you need. If a few months are enough, https://wikitech.wikimedia.org/wiki/Pageviews_API may be easier.
Otherwise... https://www.mediawiki.org/wiki/Manual:PAGENAMEE_encoding
Nemo
Thank you for the heads up, Taha and Federico. I'm not entirely sure redirects will be a big factor in what I'm playing with, but I'll definitely keep an eye out for them. From what I gather, there is no simple way to check if there is a redirect page pointing to my page of interest?
I have about 400k pages I'm looking at across multiple editions so Pageviews_API might not be enough :) I'll just try my hand at implementing a PAGENAMEE encoding in Python unless there's some library out there that I'm missing.
Best,
B
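A rough Python sketch of the title/URL mapping discussed above, not a definitive implementation. It assumes that titles are URL-encoded as UTF-8 after spaces are turned into underscores, and that the characters in `SAFE` are left unescaped; the manual page Nemo linked is the place to check those rules. It also assumes each pagecounts line has four space-separated fields (project, requested title, request count, bytes). All function names here are hypothetical. Because pagecounts-raw records titles as they were requested, decoding the request and comparing decoded forms may be more robust than encoding the dump title.

```python
from urllib.parse import quote, unquote

# Characters assumed to stay unescaped in MediaWiki title URLs (an assumption;
# see Manual:PAGENAMEE_encoding for the actual rules).
SAFE = ";@$!*(),/~:"


def title_to_url(title):
    """Encode a dump <title> the way it would appear in a page URL."""
    return quote(title.replace(" ", "_"), safe=SAFE)


def request_to_key(path):
    """Decode a requested title into a comparable key (underscores, no escapes)."""
    return unquote(path).replace(" ", "_")


def parse_pagecounts_line(line):
    """Split one pagecounts line; raises ValueError if it lacks four fields."""
    project, path, count, size = line.split()
    return project, request_to_key(path), int(count), int(size)


print(title_to_url("Wikipedia:Community Portal"))
print(parse_pagecounts_line(
    "fr.b Special:Recherche/Achille_Baraguey_d%5C%27Hilliers 1 624"))
```

Malformed lines are left to raise an exception here; with a dataset of this size it is probably worth catching and counting them instead.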
To de-mystify the dumps: pagecounts-raw was a very buggy dataset that had data loss, didn't filter out hits from spiders, etc. The pageviews dataset takes care of those problems and more, and is the same as the data behind the pageview API.
Hi Bruno, look into the redirect table. It's dumped alongside stub-meta-history. You'll need to join it with the page table if you want a title -> title edge list.
I can share code if you need.
Cheers, Giovanni
-- Typed with my thumbs
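A small sketch of the join Giovanni describes, assuming the two tables have already been loaded and reduced to tuples: `(page_id, page_namespace, page_title)` for page and `(rd_from, rd_namespace, rd_title)` for redirect, where `rd_from` is the page id of the redirecting page. The function name and sample rows are hypothetical, and this is not the code Giovanni offered to share.

```python
def redirect_edges(page_rows, redirect_rows):
    """Return ((ns, source_title), (ns, target_title)) pairs for each redirect."""
    # Index pages by id so each redirect row can be resolved to its source title.
    title_by_id = {page_id: (ns, title) for page_id, ns, title in page_rows}
    edges = []
    for rd_from, rd_ns, rd_title in redirect_rows:
        source = title_by_id.get(rd_from)
        if source is not None:  # skip redirects whose source page is not in the dump
            edges.append((source, (rd_ns, rd_title)))
    return edges


# Illustrative rows only: page 1 ("A") redirects to "B".
pages = [(1, 0, "A"), (2, 0, "B")]
redirects = [(1, 0, "B")]
print(redirect_edges(pages, redirects))  # [((0, 'A'), (0, 'B'))]
```

Inverting the edge list gives, for each page of interest, the redirect titles whose views would need to be added to its own.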
Hi Dan and Giovanni,
Thank you for the pointers. I think I figured most of it out. The only remaining question is about the timezone consistency:
Also, is the timezone used in the meta-history files:
<timestamp>2006-02-18T19:29:10Z</timestamp>
the same as the one used in the pagecount filenames:
pagecounts-20140725-070000.gz
Best,
B
I believe all timestamps in the database dumps are GMT.
Regarding the timestamps of pageview dumps, I have always assumed they would be GMT too, and what I suppose is the source code of the old dumper seems to confirm my intuition:
https://phabricator.wikimedia.org/diffusion/ANWC/browse/master/collector.c;8...
G
Giovanni Luca Ciampaglia http://glciampaglia.com ∙ Assistant Research Scientist, Indiana University
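If both timestamps are indeed GMT/UTC, as suggested above, lining up an edit with an hourly pagecounts file could look roughly like the sketch below. The function names are hypothetical. Whether the time in a filename marks the start or the end of the hour it covers is not settled in this thread, so the sketch only truncates to the hour and leaves that offset to the reader.

```python
from datetime import datetime, timezone


def parse_dump_timestamp(text):
    """Parse a <timestamp> value such as 2006-02-18T19:29:10Z as UTC."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def parse_pagecounts_filename(name):
    """Parse a filename such as pagecounts-20140725-070000.gz, assuming UTC."""
    parsed = datetime.strptime(name, "pagecounts-%Y%m%d-%H%M%S.gz")
    return parsed.replace(tzinfo=timezone.utc)


def hour_of(timestamp):
    """Truncate a datetime to the start of its hour."""
    return timestamp.replace(minute=0, second=0, microsecond=0)


edit = parse_dump_timestamp("2006-02-18T19:29:10Z")
print(hour_of(edit).isoformat())
print(parse_pagecounts_filename("pagecounts-20140725-070000.gz").isoformat())
```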