I think there are different aspects discussed here:
Overriding key behaviour
I think it is legitimate to override the browser behaviour if doing so aligns with user goals and stays consistent with the default behaviours. For example, in the Gmail list of emails the up/down keys are overridden to select individual messages instead of scrolling (while the other ways of scrolling keep their default behaviour).
Similarly, many carousels let users move through a sequence of images with both the right/left arrows and touch gestures, as if the user were scrolling. However, the right/left keys jump to the next image rather than producing an increment of horizontal scroll, and that feels natural.
Relying on scroll to support the current behaviour is just an implementation choice. I think it is a good choice since the behaviours align well, but that should not mean that the UI has to expose verbatim the exact behaviour of its underlying components. The fact that it does not feel natural for some users highlights a different problem (more on this later).
Our current metaphor
Our current metaphor brings us good things such as minimising the moving parts: when you access the metadata the whole screen is not moving, just the metadata info you are interested in. That keeps the image (and controls such as close or fullscreen) anchored and easy to go back to.
An alternative approach where the whole page moves (as in the Medium and old Flickr examples Fabrice mentioned) would make the whole thing feel like a heavier single page.
Up or down
From my point of view, the big question is: do the up/down arrows align with user expectations? This brings us back to the question of which is the right direction to scroll, or what you are acting on (the viewport or the content). There are many advantages in following a direct manipulation model, but it is true that both scrolling directions (and thus both ways to understand scroll) still exist nowadays.
I think that if we reverse the directions, that will become confusing for other users. So what I initially proposed was to make the panel open by pressing either the up or the down key. That way, users with different mental models will just get the expected outcome. I still think it is worth a try to check how it feels (we may want to try it with a common.js override).
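To make the idea concrete, here is a rough sketch of what such an override could look like. It is only an illustration under assumptions: shouldOpenPanel, attachPanelKeys and the panel object (with isOpen() and open()) are hypothetical names I made up, not the viewer's actual API. Only addEventListener, event.key and preventDefault are standard browser calls.

```javascript
// Hypothetical sketch: none of these names come from the real viewer code.

// Decide whether a key press should open the metadata panel.
// Both ArrowUp and ArrowDown open it, so users with either scrolling
// mental model get the outcome they expect.
function shouldOpenPanel(key, panelIsOpen) {
  if (panelIsOpen) {
    return false; // once open, leave the keys to their default behaviour
  }
  return key === 'ArrowUp' || key === 'ArrowDown';
}

// Wire the decision to keydown events.
// `target` is anything with addEventListener (e.g. document);
// `panel` is an assumed object exposing isOpen() and open().
function attachPanelKeys(target, panel) {
  target.addEventListener('keydown', function (event) {
    if (shouldOpenPanel(event.key, panel.isOpen())) {
      event.preventDefault(); // suppress the default scroll for this press
      panel.open();
    }
  });
}
```

In a common.js experiment this would be called once with document and whatever object controls the panel. All other keys and scrolling methods are left untouched, which keeps the override consistent with the default behaviours.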
I'm open to exploring other affordances for opening the panel, but I think the problem is not the general metaphor as a whole.
Pau