Option A - Shows only a single image at a time and has the user swipe between images. This keeps the user in the context of the image, which is crucial while describing it.

Option B - Shows multiple images on a single screen. The user can tap an image to see it better.

I kinda like B, but I worry it's going to get very cluttered. A leaves plenty of screen space for the input fields, and will stay uncluttered as we add things like geolocation options.

Categories and description get copied across images, and the user is free to edit them.

Can you clarify this a bit?

I would imagine something like this:
* image 1: title, description, categories are empty
** user inputs a title, a description, and some categories
** user goes on to next image
* image 2: title is empty, but description and categories are set to what image 1 had
** user inputs a title, optionally edits description and categories
** user goes on to next image
* image 3: title is empty, but description and categories are set to what image 2 had
** ....
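
In rough Python terms, the flow above might look like the sketch below. All the names here (<code>ImageMeta</code>, <code>next_image_defaults</code>) are made up for illustration and don't come from any existing code; the only thing it's meant to show is that each new image starts with an empty title and a copy of the previous image's description and categories.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ImageMeta:
    """Hypothetical per-image form data."""
    title: str = ""
    description: str = ""
    categories: List[str] = field(default_factory=list)


def next_image_defaults(previous: ImageMeta) -> ImageMeta:
    """Prefill the next image's form from the previous image.

    The title always starts empty; description and categories are
    copied so the user can keep them or edit them.
    """
    return ImageMeta(
        title="",
        description=previous.description,
        # Copy the list so edits on the next image don't change the previous one.
        categories=list(previous.categories),
    )


# Image 1: user fills in everything.
first = ImageMeta("Sunset", "Beach trip", ["Beaches", "Sunsets"])

# Image 2: title is empty, the rest is carried over; user edits the categories.
second = next_image_defaults(first)
second.title = "Pier"
second.categories.append("Piers")

# Image 3: carries over whatever image 2 ended up with.
third = next_image_defaults(second)
```

Copying the category list (rather than sharing it) is the one detail that matters here: otherwise editing image 2 would silently change image 1 as well.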

Or are you thinking something else?

-- brion