Hi Simon,
Thanks for reaching out :)

I ran a similar analysis on our cluster, using the same source files as the ones published on dumps.wikimedia.org and Spark to speed up the computation.
I got consistent results for both of the examples you gave, with counts in both directions for every month:

Sum of count, by month and direction

date     Climate_change --> Global_warming  Global_warming --> Climate_change  Total Result
2017-11                               3904                                950          4854
2017-12                               3549                                780          4329
2018-01                               4508                               1011          5519
2018-02                               3548                                998          4546
2018-03                               3462                                745          4207
2018-04                               3726                                755          4481
2018-05                               3730                                810          4540
2018-06                               2971                                862          3833
2018-07                               3500                               1602          5102
2018-08                               4546                               1644          6190
2018-09                               3962                               1472          5434
2018-10                               6155                               3048          9203
2018-11                               5865                               2617          8482
2018-12                               5491                               2227          7718
2019-01                               5774                               2911          8685
2019-02                               6311                               2845          9156
2019-03                               6858                               2514          9372
2019-04                               6824                               2199          9023


Sum of count, by month and direction

date          Air_pollution --> Smog  Smog --> Air_pollution  Total Result
2017-11                           82                     263           345
2017-12                          200                     184           384
2018-01                           65                     140           205
2018-02                           82                      98           180
2018-03                          418                     149           567
2018-04                          295                     137           432
2018-05                          215                      95           310
2018-06                          245                      85           330
2018-07                          233                      70           303
2018-08                           36                      62            98
2018-09                           45                      81           126
2018-10                           66                      96           162
2018-11                          128                     135           263
2018-12                           50                      90           140
2019-01                           68                      92           160
2019-02                           50                      68           118
2019-03                           49                      72           121
2019-04                           33                      51            84
Total Result                    2360                    1968          4328

Maybe there is an issue in the way you process the data?
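
One thing that may be worth ruling out (a guess, not something verified against your code) is quote handling when the TSV files are parsed: some article titles contain double-quote characters, and a parser that treats them as quoting characters can merge or drop rows without any error. Below is a minimal Python sketch of a filter that treats quotes as ordinary characters. It assumes the dump files are tab-separated with four columns (prev, curr, type, n) and no header row. The function names and the file name in the comment are illustrative, and the sample row containing a quote is made up.

```python
import csv
import gzip
from collections import defaultdict


def pair_counts(lines, a, b):
    """Sum n over rows whose (prev, curr) is (a, b) or (b, a)."""
    totals = defaultdict(int)
    # QUOTE_NONE: a double quote inside a title is kept as a literal
    # character instead of opening a quoted field.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # skip malformed rows rather than guessing
        prev, curr, _type, n = row
        if (prev, curr) in ((a, b), (b, a)):
            totals[(prev, curr)] += int(n)
    return dict(totals)


def pair_counts_from_file(path, a, b):
    """Same as pair_counts, reading a gzipped monthly dump file."""
    # Assumed file name pattern: clickstream-enwiki-2018-01.tsv.gz
    with gzip.open(path, "rt", encoding="utf-8", newline="") as fh:
        return pair_counts(fh, a, b)


if __name__ == "__main__":
    sample = [
        # Made-up row with an unbalanced quote in the title.
        'other-search\t"Hypothetical_title\texternal\t10\n',
        # Counts taken from the 2017-11 row of the first table above.
        "Climate_change\tGlobal_warming\tlink\t3904\n",
        "Global_warming\tClimate_change\tlink\t950\n",
    ]
    print(pair_counts(sample, "Climate_change", "Global_warming"))
    # {('Climate_change', 'Global_warming'): 3904, ('Global_warming', 'Climate_change'): 950}
```

If a quote-aware reader is run on the same sample, the row with the unbalanced quote can swallow the lines that follow it, which would look like missing months for otherwise popular pairs.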
Best
Joseph




On Mon, May 13, 2019 at 3:38 PM Simon Munzert <simon.munzert@googlemail.com> wrote:
Hi all,

I've got a question on the completeness of the clickstream dataset. I downloaded the dumps for 2018 from https://dumps.wikimedia.org/other/clickstream/ (English Wikipedia only). When I filter for the article pair "Climate change" and "Global warming" (either one being either prev or curr) for all of 2018, this is what I get:

  prev           curr           type      n month 
  <chr>          <chr>          <chr> <dbl> <chr> 
1 Global_warming Climate_change link    755 2018-04
2 Global_warming Climate_change link    810 2018-05
3 Climate_change Global_warming link   3730 2018-05
4 Climate_change Global_warming link   3962 2018-09
5 Climate_change Global_warming link   5865 2018-11
6 Climate_change Global_warming link   5491 2018-12
7 Global_warming Climate_change link   2227 2018-12

The visit numbers seem plausible. But why is there no data on, e.g., January to March? And why is there data for both directions in May and December, but not for the others? This seems implausible given the popularity of the articles.

Here's another example:

  prev          curr          type      n month 
  <chr>         <chr>         <chr> <dbl> <chr> 
1 Smog          Air_pollution link    140 2018-01
2 Air_pollution Smog          link     82 2018-02
3 Air_pollution Smog          link    295 2018-04
4 Air_pollution Smog          link    215 2018-05
5 Smog          Air_pollution link     85 2018-06
6 Air_pollution Smog          link    233 2018-07
7 Air_pollution Smog          link     45 2018-09
8 Smog          Air_pollution link     96 2018-10
9 Smog          Air_pollution link     90 2018-12

Am I missing something here?

Thanks in advance,
Simon


--
Joseph Allemandou (joal) (he / him)
Sr Data Engineer
Wikimedia Foundation