On 12/05/2011 08:07 PM, Brion Vibber wrote:
> If extracted page text is stored in a better key-value store, we should make sure it doesn't get pulled in to backwards-compatible metadata blobs (if we keep em around as they are now) -- but they should be accessible through some API.
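To make the idea concrete, here is a minimal sketch of what such a key-value store could look like, with extracted text keyed per file and page and reachable through its own accessor rather than the legacy metadata blob. All names here (FileTextStore, put, get) are invented for illustration, not an existing MediaWiki interface.

```python
class FileTextStore:
    """Hypothetical key-value store: (file name, page number) -> extracted text."""

    def __init__(self):
        self._store = {}

    def put(self, file_name, page, text):
        # Store text separately from the backwards-compatible metadata blob.
        self._store[(file_name, page)] = text

    def get(self, file_name, page):
        # API access point; returns empty string for pages with no text.
        return self._store.get((file_name, page), "")

store = FileTextStore()
store.put("Example.djvu", 1, "First page of extracted text.")
print(store.get("Example.djvu", 1))
```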
One thing to consider is what happens when a user edits the metadata, e.g. re-adds EXIF data that was lost in cropping, or when a new (cropped) version of the image is uploaded under the same name.
Another thing is image annotations, which today are always added as plain text on the image description page, http://commons.wikimedia.org/wiki/Commons:Image_annotations
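For reference, the annotation gadget stores each note as wikitext on the description page, roughly in this shape (parameter names from memory, so treat the details as approximate):

```
{{ImageNote|id=1|x=120|y=80|w=200|h=150|dimx=1024|dimy=768}}
A lighthouse on the headland.
{{ImageNoteEnd|id=1}}
```

Note that the rectangle coordinates live inside the wikitext itself, mixed in with the page prose, rather than in any structured store.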
A third thing is timed text (video subtitles), which today is added on separate subpages, one per language, http://commons.wikimedia.org/wiki/Commons:Timed_Text
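For context, each of those TimedText subpages holds subtitle cues in SRT form, so every piece of text is tied to a time range in the video, e.g.:

```
1
00:00:01,000 --> 00:00:04,000
Welcome to this video.

2
00:00:04,500 --> 00:00:07,000
It was uploaded to Commons.
```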
A fourth thing is proofreading: if OCR text was extracted from a PDF or DjVu file and then proofread on Wikisource, shouldn't the next person who downloads the file get the corrected text?
Perhaps a system for managing image + text, including wiki editing, could address all four things above? In particular, image annotations and OCR text are both tied to coordinates in the image, and timed text is tied to a time position in a video stream. So why are they separate systems?
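One way to see the common shape is that each item is wiki-editable text plus an anchor into the file, spatial for images and temporal for video. A speculative sketch, with all names invented for illustration:

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Anchor:
    # Spatial anchor: (x, y, width, height) in image pixels, or None.
    region: Optional[Tuple[int, int, int, int]] = None
    # Temporal anchor: (start, end) in seconds, or None.
    span: Optional[Tuple[float, float]] = None


@dataclass
class TextItem:
    kind: str       # e.g. "annotation", "subtitle", "ocr"
    lang: str       # language code, since timed text is per-language
    text: str       # ordinary wikitext, editable like any page
    anchor: Anchor  # where in the file this text belongs


# An image annotation and a subtitle cue, expressed in the same model:
note = TextItem("annotation", "en", "A lighthouse.", Anchor(region=(10, 20, 100, 80)))
cue = TextItem("subtitle", "sv", "Hej!", Anchor(span=(1.0, 4.0)))
print(note.kind, cue.anchor.span)
```

Under this model, proofread OCR would just be a third kind of TextItem whose anchor is a region on a given page, and all of them would share one editing and history mechanism.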