_______________________________________________
Hm, you know, maybe it's not such a great idea to show all these small files in the mediarequests/top endpoint. I imagine everyone trying to use it would have the same problems you're having. Maybe we can brainstorm together on a way to filter out results you might not want. If that top 1000 list included only images you found interesting, would that solve your problem? If so, let's brainstorm.

So the schema of the data we have available is this.

base_name string COMMENT 'Base name of media file',
media_classification string COMMENT 'General classification of media (image, video, audio, data, document or other)',
file_type string COMMENT 'Extension or suffix of the file (e.g. jpg, wav, pdf)',
total_bytes bigint COMMENT 'Total number of bytes',
request_count bigint COMMENT 'Total number of requests',
transcoding string COMMENT 'Transcoding that the file was requested with, e.g. resized photo or image preview of a video',
agent_type string COMMENT 'Agent accessing the media files, can be spider or user',
referer string COMMENT 'Wiki project that the request was refered from. If project is not available, it will be either internal, external, or unknown',
dt string COMMENT 'UTC timestamp in ISO 8601 format (e.g. 2019-08-27T14:00:00Z)'

And here's some sample data (request count > 50000 for privacy).

"/wikipedia/commons/c/ca/Wiki_Loves_Monuments_Logo_notext.svg","image","svg","486642310","119697","image_0_199","user","en.wikipedia","2022-09-09T06:00:00Z","2022","9","9","6"
"/wikipedia/commons/d/d4/Button_hide.png","image","png","26477640","93145","original","user","en.wikipedia","2022-09-09T23:00:00Z","2022","9","9","23"
"/wikipedia/commons/c/ca/Wiki_Loves_Monuments_Logo_notext.svg","image","svg","300264742","73620","image_0_199","user","en.wikipedia","2022-09-09T05:00:00Z","2022","9","9","5"
"/wikipedia/commons/2/23/Icons-mini-file_acrobat.gif","image","gif","27279795","93779","original","user","ja.wikipedia","2022-09-09T03:00:00Z","2022","9","9","3"
"/wikipedia/en/a/ae/Flag_of_the_United_Kingdom.svg","image","svg","86260254","130257","image_0_199","user","en.wikipedia","2022-09-09T03:00:00Z","2022","9","9","3"
"/wikipedia/commons/f/fa/Wikiquote-logo.svg","image","svg","254832231","83127","image_0_199","user","en.wikipedia","2022-09-09T03:00:00Z","2022","9","9","3"
"/wikipedia/en/a/a4/Flag_of_the_United_States.svg","image","svg","76327061","90739","image_0_199","user","en.wikipedia","2022-09-09T03:00:00Z","2022","9","9","3"
"/wikipedia/commons/b/b6/Queen_Elizabeth_II_in_March_2015.jpg","image","jpeg","1156030104","58651","image_200_399","user","en.wikipedia","2022-09-09T05:00:00Z","2022","9","9","5"
"/wikipedia/commons/2/28/Aaj_tak_logo.png","image","png","57716837856","469335","original","user","external","2022-09-09T02:00:00Z","2022","9","9","2"
"/wikipedia/commons/c/ca/Wiki_Loves_Monuments_Logo_notext.svg","image","svg","682088336","168655","image_0_199","user","en.wikipedia","2022-09-09T22:00:00Z","2022","9","9","22"

Can you do some poking around to see if there's a size in bytes that would be a good threshold, or a standard transcoding that is most used on articles, or anything that would allow us to filter to only the kinds of images you're interested in? If we find that, my thought is we can just update the data behind the top 1000 endpoint. Then, if people want it unfiltered, they can download the dumps, but that seems like the exceptional case.

(Note: divide total_bytes by request_count to get the average size of the file as served.)

On Fri, Nov 4, 2022 at 11:10 AM Michele Mauri via Analytics <analytics@lists.wikimedia.org> wrote:
_______________________________________________
Hi! Yes, I already tested those two ways. I used the mediarequests API (https://wikimedia.org/api/rest_v1/metrics/mediarequests/top/en.wikipedia.org/image/2022/05/all-days), but since it returns only the first 1000 files, most of them are icons, buttons, etc., while I’d like to focus on the images that illustrate an article.
I wrote a script to download all the dumps, then open, sort, and filter them to get a longer list, but it’s very time-consuming.
In the past I used article popularity as a proxy, but I was looking for a more granular approach that also considers the usage of images across different language versions.
Best
Michele
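
To make the size-threshold idea from the message above concrete, here is a minimal Python sketch that divides total_bytes by request_count for a few of the sample rows quoted above and keeps only the rows above a cutoff. The column order is assumed from the schema and sample rows in this thread (the trailing year, month, day, and hour columns appear only in the sample), and the 10,000-byte cutoff is purely hypothetical; nobody in the thread has settled on a threshold.

```python
import csv
import io

# Column order assumed from the schema and sample rows quoted in this thread.
COLUMNS = [
    "base_name", "media_classification", "file_type", "total_bytes",
    "request_count", "transcoding", "agent_type", "referer", "dt",
    "year", "month", "day", "hour",
]

# Three rows copied from the sample data in the thread.
SAMPLE = '''"/wikipedia/commons/d/d4/Button_hide.png","image","png","26477640","93145","original","user","en.wikipedia","2022-09-09T23:00:00Z","2022","9","9","23"
"/wikipedia/commons/b/b6/Queen_Elizabeth_II_in_March_2015.jpg","image","jpeg","1156030104","58651","image_200_399","user","en.wikipedia","2022-09-09T05:00:00Z","2022","9","9","5"
"/wikipedia/commons/2/28/Aaj_tak_logo.png","image","png","57716837856","469335","original","user","external","2022-09-09T02:00:00Z","2022","9","9","2"
'''


def avg_bytes_per_request(row):
    """Average bytes served per request for one hourly row."""
    return int(row["total_bytes"]) / int(row["request_count"])


def filter_by_size(csv_text, min_bytes):
    """Return (base_name, transcoding, avg_bytes) for rows at or above min_bytes."""
    reader = csv.DictReader(io.StringIO(csv_text), fieldnames=COLUMNS)
    kept = []
    for row in reader:
        size = avg_bytes_per_request(row)
        if size >= min_bytes:
            kept.append((row["base_name"], row["transcoding"], round(size)))
    return kept


if __name__ == "__main__":
    # 10_000 is a hypothetical cutoff chosen only for illustration.
    for name, transcoding, size in filter_by_size(SAMPLE, min_bytes=10_000):
        print(f"{size:>8} bytes  {transcoding:<14} {name}")
```

On these three rows the hypothetical cutoff drops Button_hide.png but keeps both the portrait and Aaj_tak_logo.png, so a size cutoff alone may not be enough to separate logos from article illustrations; combining it with the transcoding or referer fields might help.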
From: Dan Andreescu <dandreescu@wikimedia.org>
Date: Friday, 4 November 2022 at 15:17
To: Michele Mauri <michele.mauri@polimi.it>
Cc: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics. <analytics@lists.wikimedia.org>
Subject: [Analytics] Re: Mediacounts fields

I see. In practice, the mediaviewer instrumentation also had some inaccuracies. For example, the code pre-fetched certain images when opening a gallery even if the viewer never ended up looking at them. I think they adjusted the instrumentation to account for that, but I don't remember the details.
One thought I had is, have you checked the mediarequests API? It's used to power metrics like top media requests (per project per month). And you can query it directly for specific images. It's backed by the same mediacounts data, so you're right, it counts all transfers. But that's a pretty good proxy for what was seen by a user. If you look at the top 1000 files requested I linked, you'll see a lot of icons and flags at the top, which makes sense. But in between all that you'll see real images like Liz Truss's portrait and Socrates and all that. You could filter to only larger images by downloading the image and checking its size.
Or you can go another way and look at the top 1000 articles on a wiki, find all their images, and analyze those.
Take a look around at the APIs and see if there's a way forward through that data (the stats.wikimedia.org site queries the API directly on the client-side, so if you open up your browser's developer tools you can discover the API that way. You can of course also browse the dynamic docs :))
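
As a rough illustration of querying the top endpoint mentioned above, here is a small Python sketch. The URL pattern is taken from the link quoted elsewhere in this thread; the response shape (an "items" list whose entries hold a "files" list with "file_path", "requests", and "rank" keys) is an assumption that should be checked against the API docs, and the function names and User-Agent string are placeholders.

```python
import json
from urllib.request import Request, urlopen

BASE = "https://wikimedia.org/api/rest_v1/metrics/mediarequests/top"


def top_url(referer, media_type, year, month, day="all-days"):
    """Build the top-files URL, following the pattern of the link in this thread."""
    return f"{BASE}/{referer}/{media_type}/{year}/{month:02d}/{day}"


def extract_files(payload):
    """Flatten the response into (rank, file_path, requests) tuples.

    Assumes the shape {"items": [{"files": [{...}, ...]}]}; verify against
    the live API before relying on it.
    """
    files = []
    for item in payload.get("items", []):
        for entry in item.get("files", []):
            files.append(
                (entry.get("rank"), entry.get("file_path"), entry.get("requests"))
            )
    return files


def fetch_top(referer, media_type, year, month, day="all-days"):
    """Fetch and flatten one top-files listing (performs a network call)."""
    request = Request(
        top_url(referer, media_type, year, month, day),
        # Placeholder User-Agent; replace with your own contact details.
        headers={"User-Agent": "mediarequests-example/0.1 (contact: you@example.org)"},
    )
    with urlopen(request, timeout=30) as response:
        return extract_files(json.load(response))


if __name__ == "__main__":
    for rank, path, requests in fetch_top("en.wikipedia.org", "image", 2022, 5)[:20]:
        print(rank, requests, path)
```

Keeping URL building and parsing separate from the network call makes it easy to test the parsing on a saved response before hitting the API.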
On Thu, Nov 3, 2022 at 5:52 PM Michele Mauri <michele.mauri@polimi.it> wrote:
Thanks. My goal is to understand which are the most viewed images on Commons through Wikipedia. From the mediacounts description, it is possible to get the number of transfers. But if I understood correctly, it counts all the images transferred to the user, making it difficult to understand which have really been “seen” by the user. Furthermore, it includes all the interface images and icons, making it difficult to filter down to the images used to illustrate the article.
Focusing only on media viewer clicks seemed like a possible solution to those issues. If you have other suggestions, they are welcome!
Best
Michele
From: Dan Andreescu <dandreescu@wikimedia.org>
Date: Thursday, 3 November 2022 at 22:30
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics. <analytics@lists.wikimedia.org>
Cc: Michele Mauri <michele.mauri@polimi.it>
Subject: Re: [Analytics] Mediacounts fields

We don't have any public data on media viewer interactions specifically. We used to have instrumentation on that feature but we haven't tracked it since last year. To get access to some of the old sanitized data that was retained for research purposes, you'd have to file a formal research proposal, and it doesn't seem likely to get approved, but maybe tell us more about what you're trying to do?
What questions are you hoping to answer? Maybe there's another way or another kind of dataset that would serve more use cases.
On Thu, Nov 3, 2022 at 4:12 PM Michele Mauri via Analytics <analytics@lists.wikimedia.org> wrote:
Hello,
For an academic research project, I'd like to see which images are the most viewed through the "media viewer".
Do you know if it’s possible to get this information? I looked on the wikitech portal, but I found only the mediacounts dataset (https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Mediacounts), which is not what I’m looking for.
Thank you
Michele
_______________________________________________
Analytics mailing list -- analytics@lists.wikimedia.org
To unsubscribe send an email to analytics-leave@lists.wikimedia.org