Hello Joseph,
Thank you for your detailed response. We suspected that curid could be part of the equation, and it is nice to have that confirmed (at least for part of the answer).
> The entry appears two times because for one of them there is no page_id defined in the request, therefore it is categorised as different from the one having a page_id defined.
I don't fully understand the part about the page_id being defined in the request. I thought the page_id was "resolved" from the page_title in the uri_query. But this is mostly to satisfy my curiosity, as I am currently bundling these entries with the ones that have a page_id, thanks to the page.sql table. I was mainly asking in the hope of seeing this kind of entry disappear in the future, which would simplify my aggregation process.
Thank you again for your answer.
Regards,
Ogier
On 15 March 2021 at 14:10, Joseph Allemandou <jallemandou@wikimedia.org> wrote:
Hello Ogier,
Thank you very much for the wikimaps work and your thorough analysis of the pageviews :)
Here is what I found on your two questions, after investigating one day of recent `user` pageview data (we keep detailed data for 90 days only, and I needed that level of detail for the analysis).
> What kind of query can cause these "-" entries ?
Pages with a defined page_id and an undefined title ('-') accounted for 0.04%, a bit more than 227k hits.
Among those, 152k requests had a `curid=NUMBER` in their uri_query (meaning they specified the page to view only by id, and we don't extract the page_title from ids).
More than 65k don't have any page title or page id specified in the URL, but have one specified in HTTP headers. This looks like either a bug or unexpected user behavior.
And more than 10k use a `diff=` URI pattern, which provides a diff between revisions of a given page without providing the page in the URL.
I also found, for mobile-app cases, that some page titles were incorrectly rejected as invalid for Chinese Wikipedia. This happens on a very small number of lines (fewer than 10 per day from my findings).
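As a rough illustration of the URL patterns listed above, the sketch below sorts a uri_query string into three buckets. It is not the pipeline's actual logic: the function name `classify_uri_query` and the bucket labels are hypothetical, and the header-only case cannot be detected from the query string alone.

```python
from urllib.parse import parse_qs

def classify_uri_query(uri_query):
    """Sort a uri_query into 'curid', 'diff' or 'other' (hypothetical labels)."""
    # keep_blank_values=True so that a bare "diff=" is still seen as a parameter.
    params = parse_qs(uri_query.lstrip("?"), keep_blank_values=True)
    curid_values = params.get("curid", [])
    if any(value.isdigit() for value in curid_values):
        # The page is requested by id only, e.g. "?curid=167398".
        return "curid"
    if "diff" in params:
        # A diff between revisions, e.g. "?diff=prev&oldid=123".
        return "diff"
    return "other"
```

For example, `classify_uri_query("?curid=167398")` falls in the `curid` bucket, while a query carrying only `title=` falls in `other`.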
> Why the entry "Barack_Obama mobile-app" appears two times ?
The entry appears two times because for one of them there is no page_id defined in the request, therefore it is categorised as different from the one having a page_id defined. While it would be possible to bundle all rows with the same title under one page_id if one of the rows has the page_id defined, we could also run into problems for hours where a rename occurs (two different page_ids for the same title). I'll bring the concern to the team, but given the relatively small number of views impacted by this case, chances are we will not prioritise it soon.
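The bundling idea and its rename caveat can be sketched as follows. This is only an illustration under stated assumptions, not something the analytics pipeline does: the function name `bundle_by_title` and the `(title, page_id, views)` row shape are hypothetical. Rows without a page_id are attached to the id seen for the same title, unless the title maps to zero or several ids, in which case they are left unresolved.

```python
from collections import defaultdict

def bundle_by_title(rows):
    """rows: iterable of (title, page_id or None, views).

    Returns (views per page_id, list of unresolved (title, views) rows).
    """
    rows = list(rows)  # iterated twice below

    # First pass: collect every page_id observed for each title.
    ids_for_title = defaultdict(set)
    for title, page_id, _views in rows:
        if page_id is not None:
            ids_for_title[title].add(page_id)

    # Second pass: sum views, borrowing the id only when it is unambiguous.
    totals = defaultdict(int)
    unresolved = []
    for title, page_id, views in rows:
        if page_id is None:
            candidates = ids_for_title.get(title, set())
            if len(candidates) != 1:
                # No known id, or a rename gave the title two ids.
                unresolved.append((title, views))
                continue
            page_id = next(iter(candidates))
        totals[page_id] += views
    return dict(totals), unresolved
```

With the two Barack_Obama mobile-app rows from the original message (10 views with page_id 167398, 29 views without), this yields 39 views for page_id 167398 and nothing unresolved.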
Please let us know if you have other questions :)
Best
Joseph
On Sun, Mar 14, 2021 at 1:53 AM Dan Andreescu <dandreescu@wikimedia.org> wrote:
Thank you for your email and thoughtful analysis, I just wanted to say I saw it but got buried with other work. I'll try and reply early next week.
On Thu, Mar 11, 2021 at 03:50 Ogier Maitre <ogier.maitre@unil.ch> wrote:
Hello everybody,
We are currently working on a Wikipedia visualisation tool (presented here: http://www.wikimaps.io/). We use several pageview statistics to generate time series for each page from 2008 to 2020 (pagecounts, pageviews and pageview_complete). This last format is great for our work compared to the previous formats, and we use it for our data from 2016 to 2020 (thanks to the analytics team for that).
We aggregate redirections as one page, identified by the page_id (as it is done in the pageview_complete files).
But when we compare with the Wikimedia API, we see some small differences.
I think this problem comes from the fact that the Wikimedia API (and pageviews.toolforge.org) uses page_title to get the time series, and I saw that pageview_complete files contain entries where the page_title is missing (replaced by a "-"). As we use page_id to do the aggregation whenever possible, we aggregate these "-" entries, but pageviews.toolforge.org probably does not.
For example, for the page Barack_Obama on French Wikipedia and the file `pageviews-20200112-user.bz2`, I get several relevant entries:
fr.wikipedia - 167398 mobile-web 1 B1
fr.wikipedia Barack 167398 mobile-web 1 X1
fr.wikipedia Barack_Hussein_Obama 167398 mobile-web 1 J1
fr.wikipedia Barack_Obama 167398 desktop 748 A18B10C5D8E3F3G8H6I18J36K41L37M35N37O55P76Q65R57S48T29U56V42W23X32
fr.wikipedia Barack_Obama 167398 mobile-app 10 A1L1O1Q1T3U2V1
fr.wikipedia Barack_Obama 167398 mobile-web 1732 A62B38C28D17E24F10G16H43I40J56K65L78M87N100O95P100Q93R127S84T128U124V184W84X49
fr.wikipedia Natasha_Obama 167398 desktop 3 Q1R2
fr.wikipedia Obama 167398 desktop 11 J2K1M1O1Q2R1S1U1W1
fr.wikipedia Obama 167398 mobile-web 2 R1V1
fr.wikipedia Obama_Barack 167398 desktop 3 N1P2
fr.wikipedia Sacha_Obama 167398 desktop 3 J1O2
fr.wikipedia Sacha_Obama 167398 mobile-web 1 C1
fr.wikipedia Barack_Obama mobile-app 29 B1C1H4J1L1M2N3O3P1R3S5V1W2X1
That is 12 entries that use the page_id, and one that does not.
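A minimal sketch of how such lines could be parsed and summed per page_id is given below. It is not the wikimaps code: the function names are hypothetical, the field order is inferred from the sample lines, and the hourly encoding (letter A for hour 0 through X for hour 23, each followed by a count) is an assumption, although it is consistent with the daily totals in the sample. The five-field case mirrors how the last sample line appears in this message; the actual files may mark a missing page_id differently.

```python
import re
from collections import defaultdict

HOURLY = re.compile(r"([A-X])(\d+)")

def decode_hourly(encoded):
    """Return {hour: views}, assuming 'A' is hour 0 and 'X' is hour 23."""
    return {ord(letter) - ord("A"): int(count)
            for letter, count in HOURLY.findall(encoded)}

def parse_line(line):
    """Parse one whitespace-separated line into a dict."""
    parts = line.split()
    if len(parts) == 6:
        project, title, raw_id, access, total, hourly = parts
        # Anything that is not a plain number is treated as "no page_id".
        page_id = int(raw_id) if raw_id.isdigit() else None
    elif len(parts) == 5:
        # The page_id field is absent, as in the last sample line.
        project, title, access, total, hourly = parts
        page_id = None
    else:
        raise ValueError(f"unexpected number of fields: {line!r}")
    return {"project": project, "title": title, "page_id": page_id,
            "access": access, "total": int(total),
            "hourly": decode_hourly(hourly)}

def aggregate_by_page_id(lines):
    """Sum daily totals per (project, page_id); rows without an id share the key None."""
    totals = defaultdict(int)
    for line in lines:
        row = parse_line(line)
        totals[(row["project"], row["page_id"])] += row["total"]
    return dict(totals)
```

Rows whose title is "-" still carry the page_id, so they are counted with the rest of the page, while the row without a page_id ends up under the key `("fr.wikipedia", None)` until it is resolved some other way (for instance through the page.sql table).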
I have two questions about that result.
What kind of query can cause these "-" entries?
Why does the entry "Barack_Obama mobile-app" appear twice?
Sorry for the long introduction and thank you for your time.
Regards,
Ogier
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
--
Joseph Allemandou (joal) (he / him)
Staff Data Engineer
Wikimedia Foundation