Hi there,
I hope this email finds you well. My name is Nima Dashtban and I'm a computer science student at Ca' Foscari University of Venice, Italy.
I am investigating these access logs of Wikipedia pages: https://dumps.wikimedia.org/other/pagecounts-raw/
In particular, I would like to build a database of the time series of accesses to (Italian) Wikipedia pages that have a GPS position, i.e. Wikipedia pages that refer to geographical points of interest. I think such data could be useful as a predictive signal of the interest of potential visitors to these places.
Any help, even just telling me whether this is possible or not, would mean a lot to me.
Sincerely and Regards, Nima Dashtban
Hi Nima,
It should be possible, and it is interesting to merge geodata with pageviews. Newer pageview data may be easier to work with: https://dumps.wikimedia.org/other/analytics/
I wonder whether the time at which GPS data was added to an article has any impact on its pageviews. It may be easier to assume it does not, so you don't have to look at each article's history as well.
Wikidata will also be an easy way to query for GPS data. Check out this mapping of data with coordinates: https://ddll.inf.tu-dresden.de/web/Wikidata/Maps-06-2015/en
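A rough sketch of the Wikidata route, assuming the public SPARQL endpoint (https://query.wikidata.org/sparql) and the coordinate-location property P625. The function names and the LIMIT are illustrative only, and the HTTP request itself is left out; the code only shows the query text and how a JSON result could be turned into (title, lat, lon) rows.

```python
import re
from urllib.parse import unquote

# Items that have a coordinate location (P625) and an Italian Wikipedia article.
ITWIKI_COORDS_QUERY = """
SELECT ?item ?article ?coord WHERE {
  ?item wdt:P625 ?coord .
  ?article schema:about ?item ;
           schema:isPartOf <https://it.wikipedia.org/> .
}
LIMIT 1000
"""

_POINT_RE = re.compile(r"Point\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)")


def parse_wkt_point(literal):
    """Return (lat, lon) from a WKT literal such as 'Point(12.48 41.90)'.

    WKT lists longitude first, so the two numbers are swapped on return.
    """
    match = _POINT_RE.fullmatch(literal.strip())
    if match is None:
        raise ValueError("not a plain WKT point: %r" % literal)
    lon, lat = float(match.group(1)), float(match.group(2))
    return lat, lon


def title_from_article_url(url):
    """Extract the page title (underscore form) from a sitelink URL."""
    return unquote(url.rsplit("/wiki/", 1)[1])


def rows_from_sparql_json(result):
    """Turn a SPARQL JSON result into a list of (title, lat, lon) tuples."""
    rows = []
    for binding in result["results"]["bindings"]:
        try:
            lat, lon = parse_wkt_point(binding["coord"]["value"])
        except ValueError:
            continue  # skip literals this sketch does not understand
        rows.append((title_from_article_url(binding["article"]["value"]), lat, lon))
    return rows
```

The query would need to be paged or narrowed for a full extraction, since the endpoint applies time limits to large queries.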
On Tue, Apr 5, 2016 at 4:07 AM, Nima Dashtban nima.dashtban@gmail.com wrote:
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Nuria/Kevin,
If I understand the request correctly, it seems to be asking for data of this form for pages in the Italian wikipedia that are about places of interest.
*Timestamp, Geo data of the request - Country, city etc obtained from geolocating the IP, Page Title.*
Nima, if I understand this right, we have this data available in the internal webrequest logs. However, it is highly sensitive and I don't think it can be published as a public dataset. Getting access to work with this type of data (involving geo-data) usually involves an NDA process, which I'm not an expert on, so I will let others who know better help with that.
On Thu, Apr 7, 2016 at 9:16 AM, Kevin Leduc kevin@wikimedia.org wrote:
I think Nima was referring to articles of monuments / places of interest that have GPS coordinates in them. For example, the Trevi Fountain is at these coordinates: 41.902773°N 12.485952°E
By joining pageviews and coordinate data, you could create heat maps that may correlate with actual tourist traffic.
[1] https://it.wikipedia.org/wiki/Trevi_(rione_di_Roma)
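As an illustration of the join described above, here is a minimal sketch that sums pageviews per latitude/longitude grid cell. The two input dictionaries, the function names and the 0.01 degree cell size are assumptions made for this example, not something taken from an existing tool.

```python
import math
from collections import defaultdict


def grid_cell(lat, lon, cell_deg=0.01):
    """Map a coordinate to an integer (row, col) cell of size cell_deg degrees."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def heat_grid(coords, views, cell_deg=0.01):
    """Sum views per grid cell.

    coords: {page_title: (lat, lon)}
    views:  {page_title: total views over the period of interest}
    Pages missing from either input are ignored (an inner join on title).
    """
    grid = defaultdict(int)
    for title, (lat, lon) in coords.items():
        if title in views:
            grid[grid_cell(lat, lon, cell_deg)] += views[title]
    return dict(grid)
```

The resulting cell totals could then be passed to any plotting library to draw the heat map; a finer or coarser `cell_deg` trades spatial detail against noise.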
I second Kevin's understanding of the problem. I think one approach could be:
- Parse the current version of the Italian Wikipedia dump (no need to go through the revision history; the current version should be enough) and extract the info (id and title) of the pages that contain GPS info. (Since I don't know how GPS coordinates are represented in wiki pages, I can't really help on that side.)
- Once the list (page_title - GPS point) is built, depending on the size of the list, either query the pageview API or ask the analytics team for a data extraction for the given pages over a time period.
Cheers
Joseph
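For the second step, a minimal sketch of how the per-article endpoint of the pageview API could be addressed for each title in the list. The helper names and default arguments are assumptions for illustration, and the HTTP request itself is omitted; only the URL construction and the reading of a JSON response are shown.

```python
from urllib.parse import quote

PAGEVIEWS_BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def per_article_url(title, start, end, project="it.wikipedia",
                    access="all-access", agent="user", granularity="daily"):
    """Build the per-article pageviews URL for one page title.

    start and end are date strings in YYYYMMDD form.
    Spaces become underscores and the title is percent-encoded.
    """
    article = quote(title.replace(" ", "_"), safe="")
    return "/".join([PAGEVIEWS_BASE, project, access, agent,
                     article, granularity, start, end])


def daily_series(response_json):
    """Return [(YYYYMMDD, views), ...] from a decoded JSON response."""
    return [(item["timestamp"][:8], item["views"])
            for item in response_json["items"]]
```

This means one request per page and time range, so for a very long list the data extraction by the analytics team mentioned above may be the more practical option.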
On Tue, Apr 12, 2016 at 1:21 AM, Kevin Leduc kevin@wikimedia.org wrote:
I believe there's a parsed dump which already covers this -
https://dumps.wikimedia.org/itwiki/20160407/itwiki-20160407-geo_tags.sql.gz
It seems to have ~260k items with 'earth' coordinates, which is about one in five pages on itwp.
You can use this to skip the first step and go straight to matching up with the pageview API.
This will only catch pages where the coordinates are recorded on itwp; if you wanted to be clever, you could also pull the comparable dump for Wikidata, match it up with Italian articles, and find any where coordinates are known on Wikidata but not yet recorded on Wikipedia. (There are bound to be a few.)
Andrew.
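A hedged sketch of how the geo_tags dump might be read without loading it into MySQL. It ASSUMES the column order gt_id, gt_page_id, gt_globe, gt_primary, gt_lat, gt_lon, followed by further columns; this should be checked against the CREATE TABLE statement at the top of the dump. The function names are made up for this example, and the tuple parser is deliberately simple (escape sequences such as \n are not decoded properly).

```python
import gzip


def iter_sql_rows(line):
    """Yield each (...) tuple of a mysqldump INSERT line as a list of raw strings."""
    start = line.find(" VALUES ")
    if not line.startswith("INSERT INTO") or start == -1:
        return
    fields, buf, in_str, in_row = [], [], False, False
    i = start + len(" VALUES ")
    while i < len(line):
        ch = line[i]
        if in_str:
            if ch == "\\" and i + 1 < len(line):
                buf.append(line[i + 1])  # keep the escaped character as-is
                i += 1
            elif ch == "'":
                in_str = False
            else:
                buf.append(ch)
        elif ch == "'":
            in_str = True
        elif ch == "(" and not in_row:
            in_row, fields, buf = True, [], []
        elif ch == "," and in_row:
            fields.append("".join(buf))
            buf = []
        elif ch == ")" and in_row:
            fields.append("".join(buf))
            in_row = False
            yield fields
        elif in_row:
            buf.append(ch)
        i += 1


def earth_coords(lines):
    """Map page id -> (lat, lon) for primary tags on the 'earth' globe.

    Relies on the ASSUMED column order described above.
    """
    coords = {}
    for line in lines:
        for row in iter_sql_rows(line):
            if row[2] == "earth" and row[3] == "1" and "NULL" not in row[4:6]:
                coords[int(row[1])] = (float(row[4]), float(row[5]))
    return coords


def read_geo_tags(path):
    """Read a gzipped geo_tags SQL dump from a local path."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        return earth_coords(fh)
```

If the rows are keyed by page id as assumed here, the titles needed for the pageview API would have to come from a second source, for example the page table dump of the same date.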
On 12 April 2016 at 11:25, Joseph Allemandou jallemandou@wikimedia.org wrote:
-- Joseph Allemandou Data Engineer @ Wikimedia Foundation IRC: joal