+Analytics list so they can comment.
I don't have such a script. It's a pretty intensive job to compile top articles especially over a month. The pageview API was supposed to have top articles per month per wiki but the job is so massive that it failed to run in Hive. Analytics knows there are better algorithms out there to solve this problem. So the pageview API just has top per day per wiki.
I imagine that you are looking at some very specific wikis and countries... not all of them. Maybe someone on the list can make an example hive script (given a wiki and country) that gives the top for a day.
On Wed, Jan 20, 2016 at 12:23 PM, Dan Foy dfoy@wikimedia.org wrote:
Hi Kevin,
In your collection of scripts for Hive, do you have one that can act as a starting point for me to get the top N articles / URLs for Wikipedia in a country?
Thanks, Dan
Below is an example Hive query yielding the 50 most viewed pages in India during December 2015. It took less than 10 minutes of wall clock time to complete.
SELECT CONCAT('https://', project, '.org/wiki/', page_title), SUM(view_count) AS views
FROM wmf.pageview_hourly
WHERE year = 2015 AND month = 12 AND country = "India" AND agent_type = "user"
GROUP BY project, page_title
ORDER BY views DESC
LIMIT 50;
...
Total MapReduce CPU Time Spent: 0 days 19 hours 13 minutes 2 seconds 930 msec
OK
_c0	views
https://en.wikipedia.org/wiki/Main_Page	43515253
https://en.wikipedia.org/wiki/Special:Search	4818687
https://en.wikipedia.org/wiki/-	2650346
https://en.wikipedia.org/wiki/Bajirao_I	1414810
https://en.wikipedia.org/wiki/Dilwale_(2015_film)	1410015
https://en.wikipedia.org/wiki/Mastani	1232964
https://en.wikipedia.org/wiki/Bajirao_Mastani_(film)	1133261
https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2015	632890
https://en.wikipedia.org/wiki/Hate_Story_3	582816
https://en.wikipedia.org/wiki/Special:MobileMenu	499379
https://en.wikipedia.org/wiki/Star_Wars:_The_Force_Awakens	438113
https://en.wikipedia.org/wiki/Tamasha_(film)	390519
https://en.wikipedia.org/wiki/Prem_Ratan_Dhan_Payo	378133
https://en.wikipedia.org/wiki/India	368946
https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2016	335547
https://en.wikipedia.org/wiki/Star_Wars	334326
https://en.wikipedia.org/wiki/Sunny_Leone	333848
https://en.wikipedia.org/wiki/Sundar_Pichai	329264
https://en.wikipedia.org/wiki/Special:Book	324255
https://en.wikipedia.org/wiki/List_of_highest-grossing_Bollywood_films	321418
https://en.wikipedia.org/wiki/Salman_Khan	309113
https://en.wikipedia.org/wiki/%27Tis_the_Season	308221
https://en.wikipedia.org/wiki/Mandana_Karimi	289662
https://en.wikipedia.org/wiki/Kyaa_Kool_Hain_Hum_3	281801
https://en.wikipedia.org/wiki/Kashibai	272673
https://en.wikipedia.org/wiki/Bigg_Boss_9	272203
https://en.wikipedia.org/wiki/Kriti_Sanon	266773
https://en.wikipedia.org/wiki/2012_Delhi_gang_rape	265296
https://en.wikipedia.org/wiki/Shah_Rukh_Khan	263729
https://en.wikipedia.org/wiki/Neerja_Bhanot	259410
https://en.wikipedia.org/wiki/Nora_Fatehi	252085
https://en.wikipedia.org/wiki/Ashoka	250255
https://en.wikipedia.org/wiki/B._K._S._Iyengar	248422
https://en.wikipedia.org/wiki/2015_South_Indian_floods	246377
https://en.wikipedia.org/wiki/Baahubali:_The_Beginning	244281
https://en.wikipedia.org/wiki/Shamsher_Bahadur_I_(Krishna_Rao)	232122
https://en.wikipedia.org/wiki/Christmas	228278
https://en.wikipedia.org/wiki/Thanga_Magan_(2015_film)	222373
https://en.wikipedia.org/wiki/Ranveer_Singh	221010
https://en.wikipedia.org/wiki/A._P._J._Abdul_Kalam	220612
https://en.wikipedia.org/wiki/Shivaji	218245
https://en.wikipedia.org/wiki/Deepika_Padukone	218242
https://en.wikipedia.org/wiki/TLC:_Tables,_Ladders_and_Chairs_(2015)	211920
https://en.wikipedia.org/wiki/Gizele_Thakral	206585
https://en.wikipedia.org/wiki/Urvashi_Rautela	204305
https://en.wikipedia.org/wiki/Peshwa	194957
https://en.wikipedia.org/wiki/Kajol	192044
https://hi.wikipedia.org/wiki/%E0%A4%AE%E0%A5%81%E0%A4%96%E0%A4%AA%E0%A5%83%...	184274
https://en.wikipedia.org/wiki/Quantico_(TV_series)	183112
https://en.wikipedia.org/wiki/Mahatma_Gandhi	182336
Time taken: 562.621 seconds, Fetched: 50 row(s)
See also the discussion at https://phabricator.wikimedia.org/T120113 (As mentioned there, a while ago I retrieved the global top 200 pages for a timespan of almost six months, with some wait time but no major issues. It's not quite clear to me why the "brute force" approach mentioned in the ticket failed, but I guess it had to do with the difficulty of repeating such a query for all projects - or countries - to generate top lists for every one of them.)
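For the "top for a day" variant asked about earlier in the thread, the same aggregation can be narrowed with a day filter. The following is only a sketch and is not run against Hive: it uses Python's built-in sqlite3 module with an in-memory table as a stand-in for wmf.pageview_hourly, the `day` column is an assumption (it does not appear in the query above), the `top_pages` helper is hypothetical, and every row is made-up toy data.

```python
import sqlite3

# Stand-in for wmf.pageview_hourly with only the columns the query needs.
# The `day` column is an assumption, and all rows are invented toy data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pageview_hourly ("
    "project TEXT, page_title TEXT, country TEXT, agent_type TEXT, "
    "year INTEGER, month INTEGER, day INTEGER, view_count INTEGER)"
)
toy_rows = [
    ("en.wikipedia", "Main_Page", "India", "user", 2015, 12, 1, 100),
    ("en.wikipedia", "Main_Page", "India", "user", 2015, 12, 1, 50),
    ("en.wikipedia", "India", "India", "user", 2015, 12, 1, 30),
    ("hi.wikipedia", "India", "India", "user", 2015, 12, 1, 40),
    ("en.wikipedia", "India", "India", "spider", 2015, 12, 1, 999),
    ("en.wikipedia", "India", "France", "user", 2015, 12, 1, 500),
    ("en.wikipedia", "Christmas", "India", "user", 2015, 12, 2, 700),
]
conn.executemany("INSERT INTO pageview_hourly VALUES (?,?,?,?,?,?,?,?)", toy_rows)

# Same shape as the monthly query, plus a day filter. SQLite concatenates
# strings with ||, where the Hive query uses CONCAT.
TOP_FOR_DAY = """
SELECT 'https://' || project || '.org/wiki/' || page_title AS url,
       SUM(view_count) AS views
FROM pageview_hourly
WHERE year = ? AND month = ? AND day = ?
  AND country = ? AND agent_type = 'user'
GROUP BY project, page_title
ORDER BY views DESC
LIMIT ?
"""

def top_pages(country, year, month, day, limit=50):
    """Return (url, views) rows for one country and one day, most viewed first."""
    return conn.execute(TOP_FOR_DAY, (year, month, day, country, limit)).fetchall()

top = top_pages("India", 2015, 12, 1, limit=3)
```

Restricting the scan to a single day should touch far less data than a whole month, which is presumably why a per-day list is the cheaper thing to compute.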
On Wed, Jan 20, 2016 at 12:42 PM, Kevin Leduc kevin@wikimedia.org wrote:
+Analytics list so they can comment.
I don't have such a script. It's a pretty intensive job to compile top articles especially over a month. The pageview API was supposed to have top articles per month per wiki but the job is so massive that it failed to run in Hive. Analytics knows there are better algorithms out there to solve this problem. So the pageview API just has top per day per wiki.
I imagine that you are looking at some very specific wikis and countries... not all of them. Maybe someone on the list can make an example hive script (given a wiki and country) that gives the top for a day.
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Any idea why the most popular article in India is "-"? CCing Dan Garry of Discovery team.
On Fri, Jan 22, 2016 at 5:13 PM, Tilman Bayer tbayer@wikimedia.org wrote:
Below is an example Hive query yielding the 50 most viewed pages in India during December 2015. It took less than 10 minutes of wall clock time to complete.
-- Tilman Bayer Senior Analyst Wikimedia Foundation IRC (Freenode): HaeB
See also https://phabricator.wikimedia.org/T117945 and https://phabricator.wikimedia.org/T108867 for possibly related oddities in the top viewed pages. (And https://phabricator.wikimedia.org/T104755 : "Wikimedia's URL-routing logic straddles five layers ...")
(switching CC to the intended Dan)
On Fri, Jan 22, 2016 at 3:17 PM, Ryan Kaldari rkaldari@wikimedia.org wrote:
Any idea why the most popular article in India is "-"? CCing Dan Garry of Discovery team.
On 22 January 2016 at 15:17, Ryan Kaldari rkaldari@wikimedia.org wrote:
Any idea why the most popular article in India is "-"?
That specific article often sees a lot of traffic. This is normally caused by a bot, spider, or other automaton. Unfortunately, by definition no method of detecting automated traffic is perfect, so things like this often slip through.
Dan
Actually, "-" is Hadoop's "nothing was provided in this field!" placeholder, which makes it doubly confusing :/
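If the "-" rows are a placeholder rather than a real page, one option is to set them aside (and, optionally, Special: pages too) before reading a top list. A minimal Python sketch, assuming the query results have been exported as (url, views) pairs; the helper names and the choice of what to exclude are illustrative, not something decided in this thread.

```python
from urllib.parse import urlsplit

# Titles treated as "not a real article" in this sketch. Treating "-" as a
# placeholder follows the suggestion above; it is an assumption, not a rule.
PLACEHOLDER_TITLES = {"-"}

def page_title(url):
    """Extract the title part after /wiki/ from a URL like the ones above."""
    path = urlsplit(url).path
    prefix = "/wiki/"
    return path[len(prefix):] if path.startswith(prefix) else path

def drop_placeholders(rows, skip_special=False):
    """Keep (url, views) rows whose title is not a placeholder.

    With skip_special=True, rows for Special: pages are dropped as well.
    """
    kept = []
    for url, views in rows:
        title = page_title(url)
        if title in PLACEHOLDER_TITLES:
            continue
        if skip_special and title.startswith("Special:"):
            continue
        kept.append((url, views))
    return kept

# First four rows of the December 2015 result quoted earlier in the thread.
rows = [
    ("https://en.wikipedia.org/wiki/Main_Page", 43515253),
    ("https://en.wikipedia.org/wiki/Special:Search", 4818687),
    ("https://en.wikipedia.org/wiki/-", 2650346),
    ("https://en.wikipedia.org/wiki/Bajirao_I", 1414810),
]
cleaned = drop_placeholders(rows, skip_special=True)
```

Filtering after the fact keeps the raw counts available, so the placeholder rows can still be inspected separately if their volume looks suspicious.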
On 22 January 2016 at 22:06, Dan Garry dgarry@wikimedia.org wrote:
That specific article often sees a lot of traffic. This is normally caused by a bot, spider, or other automaton. Unfortunately, by definition no method of detecting automated traffic is perfect, so things like this often slip through.
Dan
-- Dan Garry Lead Product Manager, Discovery Wikimedia Foundation
Is that a bug in the ETL?
On Friday, January 22, 2016, Oliver Keyes okeyes@wikimedia.org wrote:
Actually, "-" is Hadoop's "nothing was provided in this field!" placeholder, which makes it doubly confusing :/
-- Oliver Keyes Count Logula Wikimedia Foundation
Yes and no, it kind of depends whether we want to lose data. We've been talking about better ways to say "Unknown" but /wiki/Unknown is a page too :) We're just not focusing on this level of detail yet, bigger fish to fry, caveat emptor, etc.
On Saturday, January 23, 2016, Toby Negrin tnegrin@wikimedia.org wrote:
Is that a bug in the ETL?
Thanks Dan -- I'm just concerned that we might be missing something (like the central notice banners back in the day) with a fairly large magnitude.
-Toby
On Sat, Jan 23, 2016 at 3:36 AM, Dan Andreescu dandreescu@wikimedia.org wrote:
Yes and no, it kind of depends whether we want to lose data. We've been talking about better ways to say "Unknown" but /wiki/Unknown is a page too :) We're just not focusing on this level of detail yet, bigger fish to fry, caveat emptor, etc.
+1. Could we look at the pageIDs rather than titles? Is that being passed through uniformly yet?
On 23 January 2016 at 13:08, Toby Negrin tnegrin@wikimedia.org wrote:
Thanks Dan -- I'm just concerned that we might be missing something (like the central notice banners back in the day) with a fairly large magnitude.
-Toby
Page IDs are still not coming in uniformly, but in this case there should be enough of them to figure out what "-" is, maybe. That's a good idea.
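One way to act on this would be to split the views recorded under the "-" title by page ID. The sketch below again uses an in-memory SQLite table in place of the Hive table; the `page_id` column name, the `views_by_page_id` helper, and all rows are assumptions for illustration, with NULL standing in for requests where no page ID came through.

```python
import sqlite3

# In-memory stand-in for the pageview table. The page_id column and every
# row below are assumptions for illustration; NULL models a missing page ID.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pageview_hourly ("
    "project TEXT, page_title TEXT, page_id INTEGER, view_count INTEGER)"
)
conn.executemany(
    "INSERT INTO pageview_hourly VALUES (?,?,?,?)",
    [
        ("en.wikipedia", "-", 111, 20),
        ("en.wikipedia", "-", 111, 5),
        ("en.wikipedia", "-", None, 60),
        ("en.wikipedia", "-", 222, 10),
        ("en.wikipedia", "India", 333, 90),
    ],
)

BY_PAGE_ID = """
SELECT page_id, SUM(view_count) AS views
FROM pageview_hourly
WHERE project = ? AND page_title = ?
GROUP BY page_id
ORDER BY views DESC
"""

def views_by_page_id(project, title):
    """Split the views recorded under one title by page ID (None = missing)."""
    return conn.execute(BY_PAGE_ID, (project, title)).fetchall()

breakdown = views_by_page_id("en.wikipedia", "-")
```

If most of the views under "-" carried no page ID, that would be consistent with the placeholder explanation; a concentration on a single ID would instead suggest a real page receiving the traffic.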
On Saturday, January 23, 2016, Oliver Keyes okeyes@wikimedia.org wrote:
+1. Could we look at the pageIDs rather than titles? Is that being passed through uniformly yet?