Hi again Ogier,
> I don't exactly understand the part about the page_id being defined in
> the request. I thought the page_id was "resolved" based on the page_title
> being in the uri_query.
This is not how the page_id is set in our traffic datasets :)
We receive the page_id in an HTTP header, set by the UIs.
We have historically received the values for `desktop` and `mobile-web`
pretty consistently, but the fact that we receive them for `mobile-app` is
new to me :)
I assume that getting data consistently will then be a matter of mobile-app
updates.
I hope this helps :)
Cheers
Joseph
On Mon, Mar 15, 2021 at 4:29 PM Ogier Maitre <ogier.maitre(a)unil.ch> wrote:
> Hello Joseph,
>
> Thank you for your detailed response.
> We suspected curid could be part of the equation here, but it is nice to
> have it confirmed here (at least for a part of the answer).
>
> > The entry appears two times because for one of them there is no page_id
> defined in the request, therefore it is categorised as different from the
> one having a page_id defined.
>
> I don't exactly understand the part about the page_id being defined in
> the request. I thought the page_id was "resolved" based on the page_title
> being in the uri_query.
> But this is more to satisfy my curiosity, as I'm currently bundling these
> entries with the ones having a page_id, thanks to the page.sql table. I was
> mainly asking this in the hope of seeing this kind of entry disappear in the
> future, which could simplify my aggregation process.
>
> Thank you again for your answer.
> Regards,
> Ogier
>
>
> On 15 March 2021 at 14:10, Joseph Allemandou <jallemandou(a)wikimedia.org>
> wrote:
>
> Hello Ogier,
> Thank you a lot for the wikimaps work, and your thorough analysis on the
> pageviews :)
>
> Here is what I found on your two questions, investigating one day of
> recent `user` pageview data (we keep detailed data for 90 days only, and
> I needed that level of detail for the analysis).
>
> > What kind of query can cause these "-" entries?
> Pages with a defined page_id and an undefined title ('-')
> represented 0.04%, a bit more than 227k hits.
> Among those, 152k requests had a `curid=NUMBER` in their uri_query
> (meaning they specified the page to view only by id, and we don't
> extract page_title from ids).
> More than 65k have neither a page title nor a page id specified in the
> URL, but have one specified in HTTP headers. This feels like either a bug
> or an unexpected user behavior.
> And more than 10k use a `diff=` uri pattern, providing a diff between
> revisions of a given page, but not providing the page in the URL.
> I also found, for `mobile-app` cases, that some page titles were
> incorrectly rejected as invalid for Chinese Wikipedia. This happens on a
> very small number of lines (fewer than 10 per day from my findings).
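The URL patterns listed above can be told apart from the query string alone. The following is only a sketch under assumptions: `classify_uri_query` and its labels are hypothetical names chosen for illustration, not the code that builds the traffic datasets, and the header-only case cannot be detected this way because it needs the HTTP headers.

```python
from urllib.parse import parse_qs


def classify_uri_query(uri_query: str) -> str:
    """Label a query string by how it identifies the page (hypothetical labels)."""
    # keep_blank_values=True so that a bare "diff=" still counts as a diff request.
    params = parse_qs(uri_query.lstrip("?"), keep_blank_values=True)
    if "title" in params:
        return "title"
    if any(value.isdigit() for value in params.get("curid", [])):
        return "curid"  # page requested by id only, e.g. ?curid=167398
    if "diff" in params:
        return "diff"  # diff between revisions, page not named in the URL
    return "no-page-in-url"


print(classify_uri_query("?curid=167398"))
print(classify_uri_query("?diff=123&oldid=456"))
```

Counting these labels over one day of requests would give a breakdown comparable in shape to the one above, although the exact numbers depend on how the real pipeline filters pageviews.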
>
> > Why does the entry "Barack_Obama mobile-app" appear two times?
> The entry appears two times because for one of them there is no page_id
> defined in the request, therefore it is categorised as different from the
> one having a page_id defined. While it would be possible to bundle all rows
> with the same title under a page_id if one of the rows has the page_id
> defined, we could also run into problems for hours where a rename occurs
> (two different page_ids for the same title). I'll bring the concern to the
> team, but given the relatively small number of views impacted by this case,
> chances are we will not prioritise it soon.
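The bundling idea and its rename pitfall can be sketched as follows. This is an illustration under assumptions, not something the pipeline does: `fill_missing_page_ids` and the row layout (dicts with `wiki`, `title`, `page_id`, `views`) are hypothetical. A row without a page_id inherits one only when exactly one page_id was seen for the same wiki and title; if two ids were seen (the rename case), the row is left untouched.

```python
from collections import defaultdict


def fill_missing_page_ids(rows):
    """Return a copy of rows where id-less rows get a page_id when it is unambiguous."""
    ids_by_title = defaultdict(set)
    for row in rows:
        if row["page_id"] is not None:
            ids_by_title[(row["wiki"], row["title"])].add(row["page_id"])

    filled = []
    for row in rows:
        candidates = ids_by_title.get((row["wiki"], row["title"]), set())
        if row["page_id"] is None and len(candidates) == 1:
            # Exactly one page_id seen for this title: safe to bundle.
            row = {**row, "page_id": next(iter(candidates))}
        # Zero or several candidates (e.g. a rename): leave the row as it is.
        filled.append(row)
    return filled


rows = [
    {"wiki": "fr.wikipedia", "title": "Barack_Obama", "page_id": 167398, "views": 10},
    {"wiki": "fr.wikipedia", "title": "Barack_Obama", "page_id": None, "views": 29},
]
print(fill_missing_page_ids(rows))
```

The narrower the time window the rows cover, the less likely a rename falls inside it, which is why the guard on ambiguous titles matters more for daily files than for hourly ones.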
>
> Please let us know if you have other questions :)
> Best
> Joseph
>
>
>
>
>
> On Sun, Mar 14, 2021 at 1:53 AM Dan Andreescu <dandreescu(a)wikimedia.org>
> wrote:
>
>> Thank you for your email and thoughtful analysis, I just wanted to say I
>> saw it but got buried with other work. I'll try and reply early next week.
>>
>> On Thu, Mar 11, 2021 at 03:50 Ogier Maitre <ogier.maitre(a)unil.ch> wrote:
>>
>>> Hello everybody,
>>>
>>> We are currently working on a Wikipedia visualisation tool (which is
>>> presented here: http://www.wikimaps.io/). We use several pageview
>>> statistics to generate time series for each page from 2008 to 2020 (we use
>>> pagecounts, pageviews and pageview_complete). This last format is great for
>>> our work compared to the previous formats, and we use it for our data from
>>> 2016 to 2020 (thanks to the analytics team for that).
>>>
>>> We aggregate redirections as one page, identified by the page_id (as it
>>> is done in the pageview_complete files).
>>> But when we compare with the Wikimedia API, we have some small
>>> differences.
>>>
>>> I think this problem comes from the fact that the Wikimedia API (and
>>> pageviews.toolforge.org) uses page_title to get the time series, and I
>>> saw that pageview_complete files contain entries where the page_title is
>>> missing (replaced by a "-"). As we are using page_id to do the aggregation
>>> whenever it is possible, we aggregate these "-" entries, but
>>> pageviews.toolforge.org probably does not.
>>>
>>> For example for the page Barack_Obama in French, and the file
>>> `pageviews-20200112-user.bz2`, I get several relevant entries.
>>>
>>>
>>> fr.wikipedia - 167398 mobile-web 1 B1
>>> fr.wikipedia Barack 167398 mobile-web 1 X1
>>> fr.wikipedia Barack_Hussein_Obama 167398 mobile-web 1 J1
>>> fr.wikipedia Barack_Obama 167398 desktop 748 A18B10C5D8E3F3G8H6I18J36K41L37M35N37O55P76Q65R57S48T29U56V42W23X32
>>> fr.wikipedia Barack_Obama 167398 mobile-app 10 A1L1O1Q1T3U2V1
>>> fr.wikipedia Barack_Obama 167398 mobile-web 1732 A62B38C28D17E24F10G16H43I40J56K65L78M87N100O95P100Q93R127S84T128U124V184W84X49
>>> fr.wikipedia Natasha_Obama 167398 desktop 3 Q1R2
>>> fr.wikipedia Obama 167398 desktop 11 J2K1M1O1Q2R1S1U1W1
>>> fr.wikipedia Obama 167398 mobile-web 2 R1V1
>>> fr.wikipedia Obama_Barack 167398 desktop 3 N1P2
>>> fr.wikipedia Sacha_Obama 167398 desktop 3 J1O2
>>> fr.wikipedia Sacha_Obama 167398 mobile-web 1 C1
>>>
>>> fr.wikipedia Barack_Obama mobile-app 29 B1C1H4J1L1M2N3O3P1R3S5V1W2X1
>>>
>>>
>>> That is 12 entries that use the page_id, and one that does not.
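The rows above can be parsed and summed per page_id with a short script. This is a sketch under stated assumptions: the letters A to X are taken to stand for hours 0 to 23 (consistent with the rows above, where the hourly counts add up to the daily total, e.g. `A1L1O1Q1T3U2V1` sums to 10), and the id-less row is taken to simply lack the page_id column, as it appears in this email. The function names are hypothetical.

```python
import re
from collections import defaultdict

HOURLY = re.compile(r"([A-X])(\d+)")


def decode_hourly(encoded):
    """Return {hour: views}, assuming letters A..X stand for hours 0..23."""
    return {ord(letter) - ord("A"): int(n) for letter, n in HOURLY.findall(encoded)}


def parse_row(line):
    """Parse one row as pasted in this thread (quote prefix and wrapping removed)."""
    fields = line.split()
    if len(fields) == 5:
        # Assumption: the row without a page_id has no third column at all.
        fields.insert(2, "")
    wiki, title, page_id, access, total, hourly = fields
    return {
        "wiki": wiki,
        "title": title,
        "page_id": int(page_id) if page_id.isdigit() else None,
        "access": access,
        "views": int(total),
        "hourly": decode_hourly(hourly),
    }


def views_by_page_id(lines):
    """Sum daily views per (wiki, page_id); id-less rows end up under None."""
    totals = defaultdict(int)
    for line in lines:
        row = parse_row(line)
        totals[(row["wiki"], row["page_id"])] += row["views"]
    return dict(totals)


sample = [
    "fr.wikipedia - 167398 mobile-web 1 B1",
    "fr.wikipedia Barack_Obama 167398 mobile-app 10 A1L1O1Q1T3U2V1",
    "fr.wikipedia Barack_Obama mobile-app 29 B1C1H4J1L1M2N3O3P1R3S5V1W2X1",
]
print(views_by_page_id(sample))
```

Grouping on page_id keeps the "-" row with the rest of the page, while the row without a page_id stays separate until it is matched by title.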
>>>
>>> I have two questions about that result.
>>>
>>> What kind of query can cause these "-" entries?
>>> Why does the entry "Barack_Obama mobile-app" appear two times?
>>>
>>> Sorry for the long introduction and thank you for your time.
>>>
>>> Regards,
>>> Ogier
>>> _______________________________________________
>>> Analytics mailing list
>>> Analytics(a)lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>
>
> --
> Joseph Allemandou (joal) (he / him)
> Staff Data Engineer
> Wikimedia Foundation
--
Joseph Allemandou (joal) (he / him)
Staff Data Engineer
Wikimedia Foundation