Adding Simon back, as he might not be on the list.
On Mon, May 13, 2019 at 5:58 PM Joseph Allemandou <jallemandou(a)wikimedia.org>
wrote:
Hi Simon,
Thanks for reaching out :)
I tried a similar analysis on our cluster, using the same original files
as the ones on dumps.wikimedia.org and Spark to speed up the computation.
I ended up with coherent results for both the examples you gave:
Sum of counts per month:

date     Climate_change --> Global_warming  Global_warming --> Climate_change  Total
2017-11                               3904                                950   4854
2017-12                               3549                                780   4329
2018-01                               4508                               1011   5519
2018-02                               3548                                998   4546
2018-03                               3462                                745   4207
2018-04                               3726                                755   4481
2018-05                               3730                                810   4540
2018-06                               2971                                862   3833
2018-07                               3500                               1602   5102
2018-08                               4546                               1644   6190
2018-09                               3962                               1472   5434
2018-10                               6155                               3048   9203
2018-11                               5865                               2617   8482
2018-12                               5491                               2227   7718
2019-01                               5774                               2911   8685
2019-02                               6311                               2845   9156
2019-03                               6858                               2514   9372
2019-04                               6824                               2199   9023

Sum of counts per month:

date     Air_pollution --> Smog  Smog --> Air_pollution  Total
2017-11                      82                     263    345
2017-12                     200                     184    384
2018-01                      65                     140    205
2018-02                      82                      98    180
2018-03                     418                     149    567
2018-04                     295                     137    432
2018-05                     215                      95    310
2018-06                     245                      85    330
2018-07                     233                      70    303
2018-08                      36                      62     98
2018-09                      45                      81    126
2018-10                      66                      96    162
2018-11                     128                     135    263
2018-12                      50                      90    140
2019-01                      68                      92    160
2019-02                      50                      68    118
2019-03                      49                      72    121
2019-04                      33                      51     84
Total                      2360                    1968   4328

Maybe there is an issue in the way you process the data?
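For cross-checking the filtering step, here is a minimal Python sketch
(not the Spark job used for the numbers above) that sums clicks for one
article pair in both directions. It assumes each row of a monthly dump is
tab-separated as prev, curr, type, n, matching the columns shown in this
thread. The function name pair_counts and the third sample row are made up
for illustration. Rows are split on tabs only, so quote characters in
titles stay literal text.

```python
from collections import defaultdict


def pair_counts(lines, a, b):
    """Sum clicks between articles a and b, keyed by (prev, curr).

    `lines` is any iterable of text rows. Assumed layout (not verified
    against every dump release): prev <TAB> curr <TAB> type <TAB> n.
    """
    counts = defaultdict(int)
    for line in lines:
        # Split on tabs only; quotes in titles are kept as plain text.
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip rows that do not have exactly four columns
        prev, curr, _link_type, n = fields
        if (prev, curr) in ((a, b), (b, a)):
            counts[(prev, curr)] += int(n)
    return dict(counts)


# In-memory sample: the first two rows reuse the 2018-05 values quoted in
# this thread; the third row is a hypothetical unrelated row.
sample = [
    "Climate_change\tGlobal_warming\tlink\t3730\n",
    "Global_warming\tClimate_change\tlink\t810\n",
    "other-search\tClimate_change\texternal\t100\n",
]

result = pair_counts(sample, "Climate_change", "Global_warming")
print(result)
print("total:", sum(result.values()))
```

To run it on a real month, pass an opened dump file instead of the sample
list, for example one opened with gzip.open(path, "rt", encoding="utf-8").
The path, and the files being gzip-compressed, are assumptions to check
against the dumps directory.
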
Best
Joseph
On Mon, May 13, 2019 at 3:38 PM Simon Munzert <simon.munzert(a)googlemail.com> wrote:
Hi all,
I've got a question on the completeness of the clickstream dataset. I
downloaded the dumps for 2018 from
https://dumps.wikimedia.org/other/clickstream/ (English Wikipedia only).
When I filter for the article pair "Climate change" and "Global warming"
(either one being either prev or curr) for all of 2018, this is what I get:
  prev           curr           type      n month
  <chr>          <chr>          <chr> <dbl> <chr>
1 Global_warming Climate_change link    755 2018-04
2 Global_warming Climate_change link    810 2018-05
3 Climate_change Global_warming link   3730 2018-05
4 Climate_change Global_warming link   3962 2018-09
5 Climate_change Global_warming link   5865 2018-11
6 Climate_change Global_warming link   5491 2018-12
7 Global_warming Climate_change link   2227 2018-12
The visit numbers seem plausible. But why is there no data for, e.g.,
January to March? And why is there data for both directions in May and
December, but not in the other months? This seems implausible given the
popularity of the articles.
Here's another example:
  prev          curr          type      n month
  <chr>         <chr>         <chr> <dbl> <chr>
1 Smog          Air_pollution link    140 2018-01
2 Air_pollution Smog          link     82 2018-02
3 Air_pollution Smog          link    295 2018-04
4 Air_pollution Smog          link    215 2018-05
5 Smog          Air_pollution link     85 2018-06
6 Air_pollution Smog          link    233 2018-07
7 Air_pollution Smog          link     45 2018-09
8 Smog          Air_pollution link     96 2018-10
9 Smog          Air_pollution link     90 2018-12
Am I missing something here?
Thanks in advance,
Simon
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
--
Joseph Allemandou (joal) (he / him)
Sr Data Engineer
Wikimedia Foundation