Separate thread. I'm not sure which list is appropriate.
The annual community wishlist survey (implemented by a small team, possibly
in isolation?) may not be the right mechanism for prioritizing large
changes, but those large changes also deserve a community-curated priority
queue, to complement the staff-maintained priorities in phab.
For core challenges (like Commons stability and capacity), I'd be surprised
if the bottleneck were people or budget. We do need a shared understanding
of what issues are most important and most urgent, and how to solve them.
For instance, a way to turn Amir's recent email about the problem (and the
related phab tickets) into a family of persistent, implementable specs and
proposals, with their obstacles clearly articulated.
An issue tracker like phab is good for tracking the progress and
dependencies of agreed-upon tasks, but weak for discussing what is
important, what we know about it, and how to address it. It is also weak
for discussing ecosystem-design issues that are important and need
persistent updating, but don't reduce to a simple checklist of steps.
So where is the best current place to discuss scaling Commons, and all that
entails? Some examples from recent discussions (most from the wm-l thread):
- *Uploads*: Support for large file uploads / keeping bulk upload tools
working
- *Video*: Debugging + rolling out the videojs player
- *Formats*: Adding support for CML
<https://phabricator.wikimedia.org/T18491> and dozens of other common
high-demand file formats <https://phabricator.wikimedia.org/T297514>
- *Thumbs*: Updating thumbor <https://phabricator.wikimedia.org/T216815>
and librsvg <https://phabricator.wikimedia.org/T193352>
- *Search*: WCQS still down <https://phabricator.wikimedia.org/T297454>;
a noauth option <https://phabricator.wikimedia.org/T297995> wanted for tools
- *General*: Finish implementing redesign
<https://phabricator.wikimedia.org/T28741> of the image table
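(On the WCQS item above: a tiny sketch of what a tool-side query looks
like. The endpoint URL is the real one; the P180 query is only an
illustration, and today the request would still need an authenticated
session, which is exactly what T297995 asks to make optional.)

```python
import urllib.parse

# Public WCQS SPARQL endpoint (currently requires an authenticated session).
WCQS_ENDPOINT = "https://commons-query.wikimedia.org/sparql"

# Illustrative structured-data query: files depicting a house cat (wd:Q146).
QUERY = """
SELECT ?file WHERE {
  ?file wdt:P180 wd:Q146 .
} LIMIT 5
"""

def build_request_url(query: str) -> str:
    """Return the GET URL a tool would fetch; auth is a separate step."""
    return WCQS_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"}
    )

print(build_request_url(QUERY))
```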
On Wed, Dec 29, 2021 at 6:26 AM Amir Sarabadani <ladsgroup(a)gmail.com> wrote:
> I'm not debating your note. It is very valid that we lack proper support
> for the multimedia stack. I myself wrote a detailed rant on how broken it
> is, but three notes:
> - Fixing something like this takes time: you need to assign the budget
> for it (which means it has to be done during annual planning); if it
> gets approved, you need to start it with the fiscal year (meaning July
> 2022) and then hire (meaning: write a JD, do recruitment, interview lots
> of people, get them hired), which can take from several months to years.
> Once they are hired, you need to onboard them and let them learn about
> our technical infrastructure, which takes at least two good months.
> Software engineering is not magic; it takes time, blood, and sweat.
> - Making another team focus on multimedia requires changes in planning,
> budget, OKRs, etc. Are we sure moving the focus of teams is a good
> idea? Most teams are already focusing on vital parts of Wikimedia, and
> changing their focus will turn this into a whack-a-mole game where we fix
> multimedia but then have critical issues in security or performance.
> - The wishlist survey vote is a good band-aid in the meantime, to at
> least address the worst parts for now.
> I don't understand your point, tbh: either you think it's a good idea to
> make requests for improvements in multimedia in the wishlist survey, or
> you think it's not. If you think it's not, then it's off-topic for this
> thread. There is a classic book on this topic called "The Mythical
> Man-Month".
> On Wed, Dec 29, 2021 at 11:41 AM Gnangarra <gnangarra(a)gmail.com> wrote:
>> we have to vote for regular maintenance and support for
>> essential functions like uploading files which is the core mission of
>> Wikimedia Commons
This season of giving, consider giving the truest gift of all: file format
support.
Considering how much time we spend with document and data file formats, I'd
like to see those supported at least well enough to have their own place in
the search filters.
I've compiled an umbrella ticket <https://phabricator.wikimedia.org/T297514>
for related open issues, including those that have been discussed but
perhaps never filed as tickets before. Please weigh in + add those that
I've missed.
w:user:sj +1 617 529 4266
Where might I get or mirror a dump of Commons media files?
> It seems worth mentioning on the front page of
> It looks like the compressed XML of the ~50M description pages is ~25GB.
> It looks like WikiTeam set up a dump script that posted monthly dumps to
the Internet Archive; in 2013 it stopped including the month+year in the
title; in 2016 it stopped altogether.
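(For anyone wanting the description pages in the meantime: the XML dumps
follow a predictable URL layout on dumps.wikimedia.org. A minimal sketch,
assuming the standard layout; note these dumps contain page text only, not
the media files themselves.)

```python
DUMPS_BASE = "https://dumps.wikimedia.org"

def pages_dump_url(wiki: str = "commonswiki", date: str = "latest") -> str:
    """URL of the compressed XML dump of all description pages
    (~25 GB compressed for Commons, per the figure above)."""
    return f"{DUMPS_BASE}/{wiki}/{date}/{wiki}-{date}-pages-articles.xml.bz2"

print(pages_dump_url())
```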
Sorry I missed this :) Excited about WCQS and wanted to play with it today,
but it seems to be down?
Is it being updated // are there mirrors // what's the current plan for
federation of endpoints like this?
On Tue, Nov 30, 2021 at 11:13 AM Trey Jones <tjones(a)wikimedia.org> wrote:
> Hi Everyone,
> The Search Platform Team
> <https://www.mediawiki.org/wiki/Wikimedia_Search_Platform> usually holds
> office hours the first Wednesday of each month. Come talk to us about
> anything related to Wikimedia search, Wikidata Query Service, Wikimedia
> Commons Query Service, etc.!
> Feel free to add your items to the Etherpad Agenda for the next meeting.
> Details for our next meeting:
> Date: Wednesday, December 1st, 2021
> Time: 16:00-17:00 GMT / 08:00-09:00 PST / 11:00-12:00 EST / 17:00-18:00
> CET & WAT
> Etherpad: https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
> Google Meet link: https://meet.google.com/vgj-bbeb-uyi
> Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927
Sorry for not communicating earlier; all my work happened in the open, but
I didn't want to make any public announcements until there was a 100%
completed run! :-)
> I have the feeling the bulk of Commons media (~300 TB in all) is not
mirrored anywhere right now
> I saw something related mentioned on phab? within the last year, but
can't find it now.
So this was/is the state of multimedia storage at the moment:
* There are 3 copies of each file on the live OpenStack Swift cluster in
WMF's eqiad datacenter in Virginia
* There is near-real-time replication of eqiad's multimedia cluster into
the codfw datacenter in Texas, with its own 3 separate copies
* Images can be, and regularly are, served from both datacenters,
protecting against local disasters like floods or earthquakes
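(A toy illustration of the redundancy described above, with made-up file
names and hashes: six live copies of each file in total, plus the kind of
cross-datacenter consistency check that replication implies.)

```python
# Hypothetical per-datacenter object listings: file name -> content hash.
eqiad = {"Example.jpg": "abc123", "Map.png": "def456"}
codfw = {"Example.jpg": "abc123", "Map.png": "d3ad00"}  # one drifted copy

COPIES_PER_DC = 3
DATACENTERS = 2
total_live_copies = COPIES_PER_DC * DATACENTERS  # 6 live copies per file

def divergent(a: dict, b: dict) -> list:
    """Files present in both listings whose hashes disagree."""
    return sorted(f for f in a.keys() & b.keys() if a[f] != b[f])

print(total_live_copies, divergent(eqiad, codfw))  # → 6 ['Map.png']
```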
That has been the case for a few years already; the following is new! :-)
I (with the assistance of many other WMF engineers) started working on an
offline/offsite backup solution for all multimedia files at the end of
2020, one that would protect against application bugs, operator mistakes,
or potential ill-intentioned unauthorized users. The system required a
completely different backup workflow from that of our regular backups
(wikitext or otherwise), due to the nature and size of multimedia files (a
large append-only store). We were also hit with long hardware delays due
to supplier shortages for a while.
*I advocated at first for solving multimedia backups and dumps at the same
time, but this was not possible*, because of how wiki file permissions are
currently handled in the MediaWiki software: it is not just a question of
"creating bundles of images". MediaWiki image storage lacks basic features
like a unique identifier for each uploaded file, and still uses SHA-1
hashing, which has known collision attacks. This doesn't impact full
backups, which are just "copying everything" privately (although I had to
reimplement some of that functionality myself), but it makes it hard to
identify individual files when updating the status of already publicly
available files.
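(A small sketch of the hashing point: MediaWiki's image table stores each
file's SHA-1 base-36 encoded, with no separate stable per-upload
identifier. Below, that digest next to a SHA-256 of the same bytes, the
kind of "modern hashing" meant here; the file bytes are made up.)

```python
import hashlib

def base36(n: int) -> str:
    """Base-36 encoding, as used for img_sha1 in the image table."""
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = digits[r] + out
    return out or "0"

content = b"example file bytes"  # stand-in for an uploaded file's contents
sha1_b36 = base36(int(hashlib.sha1(content).hexdigest(), 16))  # stored today
sha256_hex = hashlib.sha256(content).hexdigest()  # a candidate replacement
print(sha1_b36, sha256_hex)
```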
Because of that, we (the Data Persistence team) decided to solve backups
first; it will then be possible to use the backup metadata to generate
dumps in the future (reusing much of the work already done). My team is
not in charge of xmldumps, so maybe a workmate will be able to update you
more accurately on the priority of that, but I really think the work I've
done will speed up dump production by a lot; e.g. dumps could (maybe?) be
generated more easily from the backup data.
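(To illustrate the idea with entirely hypothetical field names, since the
real backup metadata schema is internal: once per-file metadata exists,
generating a public dump manifest is mostly a filtering problem.)

```python
# Hypothetical backup-metadata records; the real schema is WMF-internal.
backup_metadata = [
    {"title": "Example.jpg", "bytes": 123456, "public": True,  "container": "c1"},
    {"title": "Hidden.png",  "bytes": 999,    "public": False, "container": "c1"},
    {"title": "Map.svg",     "bytes": 4321,   "public": True,  "container": "c2"},
]

def dump_manifest(records):
    """Public files only: what a media-dump bundler would be allowed to ship."""
    return [(r["title"], r["bytes"]) for r in records if r["public"]]

print(dump_manifest(backup_metadata))
# → [('Example.jpg', 123456), ('Map.svg', 4321)]
```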
So I can announce that *the first full (non-public) offline backup of
Commons on eqiad datacenter finished in September* (it took around 20 days
to run), and *a second offline and remote copy is happening right now on
codfw datacenter* and will likely finish before the end of this year. You
can see the hosts containing the backup here:     These hosts
are not connected to the wikis/Internet, so if a vulnerability caused data
loss on Swift, we will be able to recover from the backups.
Because of privacy and latency (fast recovery) reasons, those copies are
hosted within WMF infrastructure (but geographically separated from each
other); however, *an extra offsite copy, not hosted on WMF hardware, is
also planned for the near future*. More work will be needed on
fast-recovery tooling, as well as incremental/streaming backups. More
information about this will be documented on-wiki soon.
Those copies cannot be shared "as is", as they have been optimized for fast
recovery to production, not for separation of public and private files
(like the rest of our backups).
So if the question is "what is the main blocker for faster image dumps?",
I would say it is the lack of a modern metadata storage model for images,
one where there is a unique identifier for each uploaded image or a modern
hashing method (sha256) is used. There are also some additional legal and
technical considerations around making regular public image datasets; they
are not impossible to solve, but they require some work. I am also
personally heavily delayed by the lack of a dedicated Multimedia Team (I
am a system administrator / Site Reliability Engineer in charge of data
recovery, not a MediaWiki developer) that could support all the bugs and
corruption I find along the way. It is my understanding that, at the
moment, there is no MediaWiki developer in charge of file-management code
maintenance.