Separate thread. I'm not sure which list is appropriate.
The annual community wishlist survey (implemented by a small team, possibly
in isolation?) may not be the right mechanism for prioritizing large
changes, but those large changes also deserve a community-curated priority
queue, to complement the staff-maintained priorities in phab.
For core challenges (like Commons stability and capacity), I'd be surprised
if the bottleneck were people or budget. We do need a shared understanding
of what issues are most important and most urgent, and how to solve them.
For instance, a way to turn Amir's recent email about the problem (and the
related phab tickets) into a family of persistent, implementable specs and
proposals, with their obstacles clearly articulated.
An issue tracker like phab is good for tracking the progress and
dependencies of agreed-upon tasks, but weak for discussing what is
important, what we know about it, and how to address it. It is also weak
for discussing ecosystem-design issues that are important and need
persistent updating, but don't reduce to a simple checklist of steps.
So where is the best current place to discuss scaling Commons, and all that
entails? Some examples from recent discussions (most from the wm-l thread):
- *Uploads*: Support for large file uploads / keeping bulk upload tools
working
- *Video*: Debugging + rolling out the videojs player
- *Formats*: Adding support for CML
<https://phabricator.wikimedia.org/T18491> and dozens of other common
high-demand file formats <https://phabricator.wikimedia.org/T297514>
- *Thumbs*: Updating thumbor <https://phabricator.wikimedia.org/T216815>
and librsvg <https://phabricator.wikimedia.org/T193352>
- *Search*: WCQS still down <https://phabricator.wikimedia.org/T297454>;
a noauth option <https://phabricator.wikimedia.org/T297995> wanted for tools
- *General*: Finish implementing redesign
<https://phabricator.wikimedia.org/T28741> of the image table
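(On the WCQS item above: a tiny sketch of what a tool-side query looks
like. The endpoint URL is the real one; the P180 query is only an
illustration, and today the request would still need an authenticated
session, which is exactly what T297995 asks to make optional.)

```python
import urllib.parse

# Public WCQS SPARQL endpoint (currently requires an authenticated session).
WCQS_ENDPOINT = "https://commons-query.wikimedia.org/sparql"

# Illustrative structured-data query: files depicting a house cat (wd:Q146).
QUERY = """
SELECT ?file WHERE {
  ?file wdt:P180 wd:Q146 .
} LIMIT 5
"""

def build_request_url(query: str) -> str:
    """Return the GET URL a tool would fetch; auth is a separate step."""
    return WCQS_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"}
    )

print(build_request_url(QUERY))
```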
On Wed, Dec 29, 2021 at 6:26 AM Amir Sarabadani <ladsgroup(a)gmail.com> wrote:
> I'm not debating your note. It is very valid that we lack proper support
> for the multimedia stack. I myself wrote a detailed rant on how broken it
> is, but three notes:
> - Fixing something like this takes time: you need to assign the budget
> for it (which means it has to be done during annual planning); if it
> gets approved, you need to start it with the fiscal year (meaning July
> 2022) and then hire (meaning: write a JD, do recruitment, interview lots
> of people, get them hired), which can take from several months to years.
> Once they are hired, you need to onboard them and let them learn about
> our technical infrastructure, which takes at least two good months.
> Software engineering is not magic; it takes time, blood, and sweat.
> - Making another team focus on multimedia requires changes in planning,
> budget, OKRs, etc. Are we sure moving the focus of teams is a good
> idea? Most teams are already focusing on vital parts of Wikimedia, and
> changing their focus will turn this into a whack-a-mole game where we fix
> multimedia but then have critical issues in security or performance.
> - The wishlist survey vote is a good band-aid in the meantime, to at
> least address the worst parts for now.
> I don't understand your point, tbh: either you think it's a good idea to
> make requests for improvements in multimedia in the wishlist survey, or
> you think it's not. If you think it's not, then it's off-topic for this
> thread. There is a classic book on this topic called "The Mythical
> Man-Month".
> On Wed, Dec 29, 2021 at 11:41 AM Gnangarra <gnangarra(a)gmail.com> wrote:
>> we have to vote for regular maintenance and support for
>> essential functions like uploading files which is the core mission of
>> Wikimedia Commons
This season of giving, consider giving the truest gift of all: file format
support.
Considering how much time we spend with document and data file formats, I'd
like to see those supported at least well enough to have their own place in
the search filters.
I've compiled an umbrella ticket <https://phabricator.wikimedia.org/T297514>
for related open issues, including those that have been discussed but
perhaps never filed as tickets before. Please weigh in + add those that
I've missed.
w:user:sj +1 617 529 4266
Where might I get or mirror a dump of Commons media files?
> It seems worth mentioning on the front page of
> It looks like the compressed XML of the ~50M description pages is ~25GB.
> It looks like WikiTeam set up a dump script that posted monthly dumps to
the Internet Archive; in 2013 it stopped including the month+year in the
title; in 2016 it stopped altogether.
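(For anyone wanting the description pages in the meantime: the XML dumps
follow a predictable URL layout on dumps.wikimedia.org. A minimal sketch,
assuming the standard layout; note these dumps contain page text only, not
the media files themselves.)

```python
DUMPS_BASE = "https://dumps.wikimedia.org"

def pages_dump_url(wiki: str = "commonswiki", date: str = "latest") -> str:
    """URL of the compressed XML dump of all description pages
    (~25 GB compressed for Commons, per the figure above)."""
    return f"{DUMPS_BASE}/{wiki}/{date}/{wiki}-{date}-pages-articles.xml.bz2"

print(pages_dump_url())
```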
Sorry I missed this :) Excited about WCQS and wanted to play with it today,
but it seems to be down?
Is it being updated // are there mirrors // what's the current plan for
federation of endpoints like this?
On Tue, Nov 30, 2021 at 11:13 AM Trey Jones <tjones(a)wikimedia.org> wrote:
> Hi Everyone,
> The Search Platform Team
> <https://www.mediawiki.org/wiki/Wikimedia_Search_Platform> usually holds
> office hours the first Wednesday of each month. Come talk to us about
> anything related to Wikimedia search, Wikidata Query Service, Wikimedia
> Commons Query Service, etc.!
> Feel free to add your items to the Etherpad Agenda for the next meeting.
> Details for our next meeting:
> Date: Wednesday, December 1st, 2021
> Time: 16:00-17:00 GMT / 08:00-09:00 PST / 11:00-12:00 EST / 17:00-18:00
> CET & WAT
> Etherpad: https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
> Google Meet link: https://meet.google.com/vgj-bbeb-uyi
> Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927
Sorry for not communicating earlier; all my work happened in the open, but
I didn't want to make any public announcements until there was a 100%
completed run! :-)
> I have the feeling the bulk of Commons media (~300 TB in all) is not
mirrored anywhere right now
> I saw something related mentioned on phab? within the last year, but
can't find it now.
So this was/is the state of multimedia storage at the moment:
* There are 3 copies of each file on the live OpenStack Swift cluster in
WMF's eqiad datacenter in Virginia
* There is near-real-time replication of eqiad's multimedia cluster into
the codfw datacenter in Texas, with its own 3 separate copies
* Images can be, and regularly are, served from both datacenters,
protecting against local disasters like floods or earthquakes
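(A toy illustration of the redundancy described above, with made-up file
names and hashes: six live copies of each file in total, plus the kind of
cross-datacenter consistency check that replication implies.)

```python
# Hypothetical per-datacenter object listings: file name -> content hash.
eqiad = {"Example.jpg": "abc123", "Map.png": "def456"}
codfw = {"Example.jpg": "abc123", "Map.png": "d3ad00"}  # one drifted copy

COPIES_PER_DC = 3
DATACENTERS = 2
total_live_copies = COPIES_PER_DC * DATACENTERS  # 6 live copies per file

def divergent(a: dict, b: dict) -> list:
    """Files present in both listings whose hashes disagree."""
    return sorted(f for f in a.keys() & b.keys() if a[f] != b[f])

print(total_live_copies, divergent(eqiad, codfw))  # → 6 ['Map.png']
```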
That has been the case for a few years already; the following is new! :-)
I (with the assistance of many other WMF engineers) started working on an
offline/offsite backup solution for all multimedia files at the end of
2020, one that would protect against application bugs, operator mistakes,
or potential ill-intentioned unauthorized users. The system required a
completely different backup workflow from that of our regular backups
(wikitext or otherwise), due to the nature and size of multimedia files (a
large append-only store). We were also hit with long hardware delays due
to supplier shortages for a while.
*I advocated at first for solving multimedia backups and dumps at the same
time, but this was not possible*, because of how wiki file permissions are
currently handled in the MediaWiki software: it is not just a question of
"creating bundles of images". MediaWiki image storage lacks basic features
like a unique identifier for each uploaded file, and still uses SHA-1
hashing, which has known collision attacks. This doesn't impact full
backups, which are just "copying everything" privately (although I had to
reimplement some of that functionality myself), but it makes it hard to
identify individual files when updating the status of already publicly
available files.
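(A small sketch of the hashing point: MediaWiki's image table stores each
file's SHA-1 base-36 encoded, with no separate stable per-upload
identifier. Below, that digest next to a SHA-256 of the same bytes, the
kind of "modern hashing" meant here; the file bytes are made up.)

```python
import hashlib

def base36(n: int) -> str:
    """Base-36 encoding, as used for img_sha1 in the image table."""
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = digits[r] + out
    return out or "0"

content = b"example file bytes"  # stand-in for an uploaded file's contents
sha1_b36 = base36(int(hashlib.sha1(content).hexdigest(), 16))  # stored today
sha256_hex = hashlib.sha256(content).hexdigest()  # a candidate replacement
print(sha1_b36, sha256_hex)
```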
Because of that, we (the Data Persistence team) decided to solve backups
first; it will then be possible to use the backup metadata to generate
dumps in the future (reusing much of the work already done). My team is
not in charge of xmldumps, so maybe a workmate will be able to update you
more accurately on the priority of that, but I really think the work I've
done will speed up dump production by a lot; e.g. dumps could (maybe?) be
generated more easily from the backup data.
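(To illustrate the idea with entirely hypothetical field names, since the
real backup metadata schema is internal: once per-file metadata exists,
generating a public dump manifest is mostly a filtering problem.)

```python
# Hypothetical backup-metadata records; the real schema is WMF-internal.
backup_metadata = [
    {"title": "Example.jpg", "bytes": 123456, "public": True,  "container": "c1"},
    {"title": "Hidden.png",  "bytes": 999,    "public": False, "container": "c1"},
    {"title": "Map.svg",     "bytes": 4321,   "public": True,  "container": "c2"},
]

def dump_manifest(records):
    """Public files only: what a media-dump bundler would be allowed to ship."""
    return [(r["title"], r["bytes"]) for r in records if r["public"]]

print(dump_manifest(backup_metadata))
# → [('Example.jpg', 123456), ('Map.svg', 4321)]
```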
So I can announce that *the first full (non-public) offline backup of
Commons on eqiad datacenter finished in September* (it took around 20 days
to run), and *a second offline and remote copy is happening right now on
codfw datacenter* and will likely finish before the end of this year. You
can see the hosts containing the backup here:     These hosts
are not connected to the wikis/Internet, so if a vulnerability caused data
loss on Swift, we will be able to recover from the backups.
Because of privacy and latency (fast recovery) reasons, those copies are
hosted within WMF infrastructure (but geographically separated from each
other); however, *an extra offsite copy, not hosted on WMF hardware, is
also planned for the near future*. More work will be needed on
fast-recovery tooling, as well as incremental/streaming backups. More
information about this will be documented on-wiki soon.
Those copies cannot be shared "as is", as they have been optimized for fast
recovery to production, not for separation of public and private files
(like the rest of our backups).
So if the question is "what is the main blocker for faster image dumps?",
I would say it is the lack of a modern metadata storage model for images,
one where there is a unique identifier for each uploaded image or a modern
hashing method (sha256) is used. There are also some additional legal and
technical considerations around making regular public image datasets; they
are not impossible to solve, but they require some work. I am also
personally heavily delayed by the lack of a dedicated Multimedia Team (I
am a system administrator / Site Reliability Engineer in charge of data
recovery, not a MediaWiki developer) that could support all the bugs and
corruption I find along the way. It is my understanding that, at the
moment, there is no MediaWiki developer in charge of file-management code
maintenance.