Hi Joseph,
I see. Obviously the process is more complex than I thought.
Thank you for the help.
Regards,
Ogier
On 15 March 2021 at 18:09, Joseph Allemandou
<jallemandou@wikimedia.org> wrote:
Hi again Ogier,
I don't exactly understand the part about the
page_id being defined in the request. I thought the page_id was "resolved" based
on the page_title being in the uri_query.
This is not how the page_id is set in our traffic datasets :)
We receive the page_id in an HTTP header, set by the UIs.
We have historically received the values for `desktop` and `mobile-web` pretty
consistently, but the fact that we receive them for `mobile-app` is new to me :)
I assume that getting data consistently will then be a matter of mobile-app updates.
I hope this helps :)
Cheers
Joseph
On Mon, Mar 15, 2021 at 4:29 PM Ogier Maitre
<ogier.maitre@unil.ch> wrote:
Hello Joseph,
Thank you for your detailed response.
We suspected curid could be part of the equation here, but it is nice to have it confirmed
here (at least for a part of the answer).
The entry appears two times because for one of them
there is no page_id defined in the request, therefore it is categorised as different from
the one having a page_id defined.
I don't exactly understand the part about the page_id being defined in the request. I
thought the page_id was "resolved" based on the page_title being in the
uri_query.
But this is more to satisfy my curiosity, as I'm currently bundling these entries with
the ones having a page_id, thanks to the page.sql table. I was mainly asking in the hope
of seeing this kind of entry disappear in the future, which could simplify my aggregation
process.
Thank you again for your answer.
Regards,
Ogier
On 15 March 2021 at 14:10, Joseph Allemandou
<jallemandou@wikimedia.org> wrote:
Hello Ogier,
Thank you a lot for the wikimaps work, and your thorough analysis on the pageviews :)
Here is what I found on your two questions after investigating one day of recent `user`
pageview data (we keep detailed data for 90 days only, and I needed that level of detail
for the analysis).
What kind of query can cause these "-"
entries?
Pages with a defined page_id and an undefined title ('-') accounted for
0.04%, a bit more than 227k hits.
Among those, 152K requests had a `curid=NUMBER` in their uri_query (meaning they
specified the page to view only by id, and we don't extract page_title from
ids).
More than 65K have neither a page title nor a page id specified in the URL, but have
one specified in the HTTP headers. This feels like either a bug or unexpected user
behavior.
And more than 10k use a `diff=` uri pattern, which provides a diff between revisions of a
given page without providing the page in the URL.
I also found, for `mobile-app` cases, that some page titles were incorrectly rejected
as invalid for Chinese Wikipedia. This happens on a very small number of lines (fewer than
10 per day from my findings).
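The URL patterns described above can be told apart with a few lines of Python. This is only a minimal sketch and not the code used in the Wikimedia pipeline: the function name, the returned labels, and the `header_page_id` argument (standing in for whatever page id arrived via the HTTP headers) are all hypothetical, and it assumes a title, when present, is passed as a `title=` query parameter. It looks only at the query string, not at the URL path.

```python
from urllib.parse import parse_qs


def classify_untitled_request(uri_query, header_page_id=None):
    """Label why a request might carry a page_id but no usable page title.

    Hypothetical helper: the labels and the header_page_id argument are
    illustrative assumptions, not part of any Wikimedia codebase.
    """
    params = parse_qs(uri_query.lstrip("?"), keep_blank_values=True)
    if "title" in params:
        # Assumption: a title, if any, is passed as a `title=` parameter.
        return "has-title"
    if params.get("curid", [""])[0].isdigit():
        # Page requested by id only, e.g. ?curid=167398
        return "curid"
    if "diff" in params:
        # Diff between revisions, page not named in the URL
        return "diff"
    if header_page_id is not None:
        # Nothing in the URL, but a page id was supplied in the headers
        return "header-only"
    return "unknown"
```

For instance, `classify_untitled_request("?curid=167398")` yields `"curid"`, while an empty query with `header_page_id="167398"` yields `"header-only"`.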
Why does the entry "Barack_Obama mobile-app"
appear twice?
The entry appears twice because for one of the rows there is no page_id defined in
the request, so it is categorised as different from the one that has a page_id
defined. While it would be possible to bundle all rows with the same title under a
page_id if one of the rows has the page_id defined, we could also run into problems for hours
in which a rename occurs (two different page_ids for the same title). I'll bring the
concern to the team, but given the relatively small number of views impacted by this case,
chances are we will not prioritise it soon.
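To make the bundling idea and its rename caveat concrete, here is a small sketch in Python. It assumes rows are plain dictionaries with hypothetical keys `wiki`, `title` and `page_id` (with `None` for a missing id); the function name is made up for illustration. A missing page_id is filled in only when exactly one distinct id is known for the same wiki and title, so a title that maps to two ids (as could happen around a rename) is left untouched.

```python
from collections import defaultdict


def fill_missing_page_ids(rows):
    """Return a copy of rows where a missing page_id is filled in only if
    the (wiki, title) pair maps to exactly one known page_id.

    Hypothetical helper; the dict keys are assumptions for illustration.
    """
    known = defaultdict(set)
    for row in rows:
        if row["page_id"] is not None:
            known[(row["wiki"], row["title"])].add(row["page_id"])

    filled = []
    for row in rows:
        ids = known.get((row["wiki"], row["title"]), set())
        if row["page_id"] is None and len(ids) == 1:
            # Unambiguous: borrow the single known id for this title.
            row = {**row, "page_id": next(iter(ids))}
        # Ambiguous (two ids, e.g. a rename) or unknown titles stay as they are.
        filled.append(row)
    return filled
```

Leaving ambiguous titles alone trades completeness for safety: a few rows stay without an id rather than being attributed to the wrong page.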
Please let us know if you have other questions :)
Best
Joseph
On Sun, Mar 14, 2021 at 1:53 AM Dan Andreescu
<dandreescu@wikimedia.org> wrote:
Thank you for your email and thoughtful analysis, I just wanted to say I saw it but got
buried with other work. I'll try and reply early next week.
On Thu, Mar 11, 2021 at 03:50 Ogier Maitre
<ogier.maitre@unil.ch> wrote:
Hello everybody,
We are currently working on a Wikipedia visualisation tool (presented here:
http://www.wikimaps.io/). We use several pageview statistics to generate time series for
each page from 2008 to 2020 (we use pagecounts, pageviews and pageview_complete). This
last format is great for our work compared to the previous formats, and we use it for our data
from 2016 to 2020 (thanks to the analytics team for that).
We aggregate redirections as one page, identified by the page_id (as it is done in the
pageview_complete files).
But when we compare with the Wikimedia API, we see some small differences.
I think this problem comes from the fact that the Wikimedia API (and
pageviews.toolforge.org) uses page_title to get the
time series, and I saw that pageview_complete files contain entries where the page_title
is missing (replaced by a "-"). As we use page_id to do the aggregation
whenever possible, we aggregate these "-" entries, but
pageviews.toolforge.org probably does not.
For example for the page Barack_Obama in French, and the file
`pageviews-20200112-user.bz2`, I get several relevant entries.
fr.wikipedia - 167398 mobile-web 1 B1
fr.wikipedia Barack 167398 mobile-web 1 X1
fr.wikipedia Barack_Hussein_Obama 167398 mobile-web 1 J1
fr.wikipedia Barack_Obama 167398 desktop 748 A18B10C5D8E3F3G8H6I18J36K41L37M35N37O55P76Q65R57S48T29U56V42W23X32
fr.wikipedia Barack_Obama 167398 mobile-app 10 A1L1O1Q1T3U2V1
fr.wikipedia Barack_Obama 167398 mobile-web 1732 A62B38C28D17E24F10G16H43I40J56K65L78M87N100O95P100Q93R127S84T128U124V184W84X49
fr.wikipedia Natasha_Obama 167398 desktop 3 Q1R2
fr.wikipedia Obama 167398 desktop 11 J2K1M1O1Q2R1S1U1W1
fr.wikipedia Obama 167398 mobile-web 2 R1V1
fr.wikipedia Obama_Barack 167398 desktop 3 N1P2
fr.wikipedia Sacha_Obama 167398 desktop 3 J1O2
fr.wikipedia Sacha_Obama 167398 mobile-web 1 C1
fr.wikipedia Barack_Obama mobile-app 29 B1C1H4J1L1M2N3O3P1R3S5V1W2X1
That is 12 entries that use the page_id, and one that does not.
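For readers who want to reproduce this kind of count, the rows above can be parsed with a short Python sketch. It rests on assumptions: the field order is taken to be wiki, page title, page id, access method, daily total, hourly string; the letters A to X in the hourly string are taken to stand for hours 0 to 23, each followed by that hour's count; and a row lacking a page_id is taken to have only five whitespace-separated fields, as in the excerpt above (the real files may mark a missing id differently). The function names are hypothetical.

```python
import re
from collections import defaultdict


def decode_hourly(encoded):
    """Decode e.g. 'A1L1T3' into {0: 1, 11: 1, 19: 3}.

    Assumption: letters A..X map to hours 0..23.
    """
    return {
        ord(letter) - ord("A"): int(count)
        for letter, count in re.findall(r"([A-X])(\d+)", encoded)
    }


def parse_row(line):
    """Split one row into named fields; page_id is None when absent."""
    parts = line.split()
    if len(parts) == 6:
        wiki, title, page_id, access, total, hourly = parts
    elif len(parts) == 5:
        # Row as shown in the excerpt, with no page_id column at all.
        wiki, title, access, total, hourly = parts
        page_id = None
    else:
        raise ValueError(f"unexpected row: {line!r}")
    return {
        "wiki": wiki,
        "title": title,
        "page_id": page_id,
        "access": access,
        "total": int(total),
        "hourly": decode_hourly(hourly),
    }


def totals_by_page_id(rows):
    """Sum daily totals per page_id (rows without an id go under None)."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["page_id"]] += row["total"]
    return dict(totals)
```

Under these assumptions, the hourly counts of each row should add up to its daily total, which gives a quick sanity check on the parsing.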
I have two questions about that result.
What kind of query can cause these "-" entries?
Why does the entry "Barack_Obama mobile-app" appear twice?
Sorry for the long introduction and thank you for your time.
Regards,
Ogier
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org<mailto:Analytics@lists.wikimedia.org>
https://lists.wikimedia.org/mailman/listinfo/analytics
--
Joseph Allemandou (joal) (he / him)
Staff Data Engineer
Wikimedia Foundation