Thank you for your email and thoughtful analysis. I just wanted to say that I saw it but got buried with other work. I'll try to reply early next week.

On Thu, Mar 11, 2021 at 03:50 Ogier Maitre <ogier.maitre@unil.ch> wrote:
Hello everybody,

We are currently working on a Wikipedia visualisation tool (presented here: http://www.wikimaps.io/). We use several pageview statistics datasets (pagecounts, pageviews and pageview_complete) to generate a time series for each page from 2008 to 2020. The last format is much better suited to our work than the previous ones, and we use it for our data from 2016 to 2020 (thanks to the analytics team for that).

We aggregate redirects into a single page, identified by the page_id (as is done in the pageview_complete files).
But when we compare with the Wikimedia API, we see some small differences.

I think this problem comes from the fact that the Wikimedia API (and pageviews.toolforge.org) uses page_title to get the time series, and I saw that the pageview_complete files contain entries where the page_title is missing (replaced by a "-"). As we use page_id for the aggregation whenever possible, we include these "-" entries in the total, but pageviews.toolforge.org probably does not.

For example, for the page Barack_Obama in French and the file `pageviews-20200112-user.bz2`, I get several relevant entries:


fr.wikipedia - 167398 mobile-web 1 B1
fr.wikipedia Barack 167398 mobile-web 1 X1
fr.wikipedia Barack_Hussein_Obama 167398 mobile-web 1 J1
fr.wikipedia Barack_Obama 167398 desktop 748 A18B10C5D8E3F3G8H6I18J36K41L37M35N37O55P76Q65R57S48T29U56V42W23X32
fr.wikipedia Barack_Obama 167398 mobile-app 10 A1L1O1Q1T3U2V1
fr.wikipedia Barack_Obama 167398 mobile-web 1732 A62B38C28D17E24F10G16H43I40J56K65L78M87N100O95P100Q93R127S84T128U124V184W84X49
fr.wikipedia Natasha_Obama 167398 desktop 3 Q1R2
fr.wikipedia Obama 167398 desktop 11 J2K1M1O1Q2R1S1U1W1
fr.wikipedia Obama 167398 mobile-web 2 R1V1
fr.wikipedia Obama_Barack 167398 desktop 3 N1P2
fr.wikipedia Sacha_Obama 167398 desktop 3 J1O2
fr.wikipedia Sacha_Obama 167398 mobile-web 1 C1

fr.wikipedia Barack_Obama mobile-app 29 B1C1H4J1L1M2N3O3P1R3S5V1W2X1


That is 12 entries that use the page_id, and one that does not.
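
To make the difference concrete, here is a minimal Python sketch that sums the daily totals of the entries above in two ways: by page_id (what we do) and by page_title (what I assume the API does). The column layout is an assumption taken from the listing as shown here: six whitespace-separated fields when the page_id is present and five when it is missing. The names `parse_entry` and `aggregate` are only illustrative.

```python
from collections import defaultdict

# Entries copied from the listing above (pageviews-20200112-user.bz2).
SAMPLE = """\
fr.wikipedia - 167398 mobile-web 1 B1
fr.wikipedia Barack 167398 mobile-web 1 X1
fr.wikipedia Barack_Hussein_Obama 167398 mobile-web 1 J1
fr.wikipedia Barack_Obama 167398 desktop 748 A18B10C5D8E3F3G8H6I18J36K41L37M35N37O55P76Q65R57S48T29U56V42W23X32
fr.wikipedia Barack_Obama 167398 mobile-app 10 A1L1O1Q1T3U2V1
fr.wikipedia Barack_Obama 167398 mobile-web 1732 A62B38C28D17E24F10G16H43I40J56K65L78M87N100O95P100Q93R127S84T128U124V184W84X49
fr.wikipedia Natasha_Obama 167398 desktop 3 Q1R2
fr.wikipedia Obama 167398 desktop 11 J2K1M1O1Q2R1S1U1W1
fr.wikipedia Obama 167398 mobile-web 2 R1V1
fr.wikipedia Obama_Barack 167398 desktop 3 N1P2
fr.wikipedia Sacha_Obama 167398 desktop 3 J1O2
fr.wikipedia Sacha_Obama 167398 mobile-web 1 C1
fr.wikipedia Barack_Obama mobile-app 29 B1C1H4J1L1M2N3O3P1R3S5V1W2X1
"""


def parse_entry(line):
    """Split one entry into (title, page_id, daily_total).

    Assumed layout: project, title, [page_id], access method,
    daily total, hourly breakdown. page_id is None when absent.
    """
    fields = line.split()
    if len(fields) == 6:
        _project, title, page_id, _access, total, _hourly = fields
    elif len(fields) == 5:
        _project, title, _access, total, _hourly = fields
        page_id = None
    else:
        raise ValueError(f"unexpected entry: {line!r}")
    return title, page_id, int(total)


def aggregate(lines):
    """Return (totals keyed by page_id, totals keyed by page_title)."""
    by_id = defaultdict(int)
    by_title = defaultdict(int)
    for line in lines:
        if not line.strip():
            continue
        title, page_id, total = parse_entry(line)
        by_title[title] += total
        if page_id is not None:
            by_id[page_id] += total
    return dict(by_id), dict(by_title)


by_id, by_title = aggregate(SAMPLE.splitlines())
print("page_id 167398:", by_id["167398"])
print("title Barack_Obama:", by_title["Barack_Obama"])
```

On these 13 entries, the page_id aggregation gives 2516 views, while the title "Barack_Obama" alone gives 2519 (2490 from the entries with a page_id plus 29 from the entry without one). The "-" entry contributes 1 view to the page_id total only.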

I have two questions about that result.

What kind of query can cause these "-" entries?
Why does the entry "Barack_Obama mobile-app" appear twice?

Sorry for the long introduction and thank you for your time.

Regards,
Ogier
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics