Hi Simon,

Thanks for reaching out :)

I tried a similar analysis on our cluster with the same original files as the ones in dumps.wikimedia.org, using Spark to speed up computation. I ended up with coherent results for both the examples you gave:

Sum - count
date     Climate_change --> Global_warming  Global_warming --> Climate_change  Total Result
2017-11  3904                               950                                4854
2017-12  3549                               780                                4329
2018-01  4508                               1011                               5519
2018-02  3548                               998                                4546
2018-03  3462                               745                                4207
2018-04  3726                               755                                4481
2018-05  3730                               810                                4540
2018-06  2971                               862                                3833
2018-07  3500                               1602                               5102
2018-08  4546                               1644                               6190
2018-09  3962                               1472                               5434
2018-10  6155                               3048                               9203
2018-11  5865                               2617                               8482
2018-12  5491                               2227                               7718
2019-01  5774                               2911                               8685
2019-02  6311                               2845                               9156
2019-03  6858                               2514                               9372
2019-04  6824                               2199                               9023

Sum - count
date          Air_pollution --> Smog  Smog --> Air_pollution  Total Result
2017-11       82                      263                     345
2017-12       200                     184                     384
2018-01       65                      140                     205
2018-02       82                      98                      180
2018-03       418                     149                     567
2018-04       295                     137                     432
2018-05       215                     95                      310
2018-06       245                     85                      330
2018-07       233                     70                      303
2018-08       36                      62                      98
2018-09       45                      81                      126
2018-10       66                      96                      162
2018-11       128                     135                     263
2018-12       50                      90                      140
2019-01       68                      92                      160
2019-02       50                      68                      118
2019-03       49                      72                      121
2019-04       33                      51                      84
Total Result  2360                    1968                    4328

Maybe there is an issue in the way you process the data?

Best
Joseph

On Mon, May 13, 2019 at 3:38 PM Simon Munzert <simon.munzert@googlemail.com> wrote:

Hi all,
I've got a question on the completeness of the clickstream dataset. I downloaded the dumps for 2018 from https://dumps.wikimedia.org/other/clickstream/ (English Wikipedia only). When I filter for the article pair "Climate change" and "Global warming" (either one being either prev or curr) for all of 2018, this is what I get:
prev curr type n month
<chr> <chr> <chr> <dbl> <chr>
1 Global_warming Climate_change link 755 2018-04
2 Global_warming Climate_change link 810 2018-05
3 Climate_change Global_warming link 3730 2018-05
4 Climate_change Global_warming link 3962 2018-09
5 Climate_change Global_warming link 5865 2018-11
6 Climate_change Global_warming link 5491 2018-12
7 Global_warming Climate_change link 2227 2018-12
The visit numbers seem plausible. But why is there no data on, e.g., January to March? And why is there data for both directions in May and December, but not for the others? This seems implausible given the popularity of the articles.
Here's another example:
prev curr type n month
<chr> <chr> <chr> <dbl> <chr>
1 Smog Air_pollution link 140 2018-01
2 Air_pollution Smog link 82 2018-02
3 Air_pollution Smog link 295 2018-04
4 Air_pollution Smog link 215 2018-05
5 Smog Air_pollution link 85 2018-06
6 Air_pollution Smog link 233 2018-07
7 Air_pollution Smog link 45 2018-09
8 Smog Air_pollution link 96 2018-10
9 Smog Air_pollution link 90 2018-12
Am I missing something here?
Thanks in advance,
Simon
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
--
Joseph Allemandou (joal) (he / him)
Sr Data Engineer
Wikimedia Foundation
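
For reference, the pair filter discussed in this thread can be sketched in plain Python. This is only an illustration, not the Spark job or the R code that either of us ran. It assumes the monthly dump files are gzip-compressed, tab-separated, and have the four columns prev, curr, type, n shown in the output above. The function names and the paths_by_month mapping are made up for the example. Quote handling is switched off on purpose, since article titles can contain quote characters and a parser that interprets them as field quoting may merge or drop rows.

```python
import csv
import gzip
from collections import defaultdict


def pair_counts(lines, a, b):
    """Sum n for a -> b and b -> a over tab-separated rows (prev, curr, type, n).

    `lines` is any iterable of text lines, e.g. an open file handle.
    """
    totals = defaultdict(int)
    # QUOTE_NONE: treat quote characters in titles as ordinary text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # skip rows that do not match the assumed 4-column layout
        prev, curr, _link_type, n = row
        if (prev, curr) in ((a, b), (b, a)):
            totals[(prev, curr)] += int(n)
    return dict(totals)


def monthly_pair_counts(paths_by_month, a, b):
    """Apply pair_counts to one file per month.

    `paths_by_month` is a hypothetical mapping such as
    {"2018-05": "/path/to/that/month.tsv.gz"}; fill in your own local paths.
    """
    result = {}
    for month, path in sorted(paths_by_month.items()):
        with gzip.open(path, "rt", encoding="utf-8", newline="") as fh:
            result[month] = pair_counts(fh, a, b)
    return result
```

Listing the result per month and per direction makes a missing month or direction easy to spot, and easy to compare against the tables above.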