On Thu, Dec 1, 2011 at 8:49 PM, bawolff bawolff+wn@gmail.com wrote:
Thus, just storing a table of key/value pairs is kind of problematic: how do you store an "array" value? Additionally, you have to consider finding info. You probably want to be able to search efficiently through lang values in a specific language, or for a specific property regardless of language.
Two easiest things, based on my previous experience: 1) separate values with \x00, making them easy to split after extracting a row; 2) store multiple entries with an index field, making it easy to query for potentially multiple values.
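A rough sketch of both options, in Python with SQLite purely for illustration (MediaWiki itself is PHP on MySQL, and the file_metadata table and fm_* column names are made up for this example, not an existing schema):

```python
import sqlite3

SEP = "\x00"  # NUL separator; assumes values never contain NUL themselves


def pack_values(values):
    """Option 1: join several values into a single field, NUL-separated."""
    if any(SEP in v for v in values):
        raise ValueError("values must not contain the NUL separator")
    return SEP.join(values)


def unpack_values(field):
    """Split a NUL-separated field back into a list of values.

    Note: an empty list and a single empty string both pack to "".
    """
    return field.split(SEP)


def make_indexed_table(conn):
    """Option 2: one row per value, ordered by an index column."""
    conn.execute(
        "CREATE TABLE file_metadata ("
        " fm_file TEXT NOT NULL,"
        " fm_key TEXT NOT NULL,"
        " fm_lang TEXT NOT NULL DEFAULT '',"
        " fm_index INTEGER NOT NULL,"
        " fm_value TEXT NOT NULL,"
        " PRIMARY KEY (fm_file, fm_key, fm_lang, fm_index))"
    )
    # Supports lookups by property, or by property plus language.
    conn.execute("CREATE INDEX fm_key_lang ON file_metadata (fm_key, fm_lang)")


def store_values(conn, file, key, values, lang=""):
    """Insert each value as its own row, numbered in list order."""
    conn.executemany(
        "INSERT INTO file_metadata VALUES (?, ?, ?, ?, ?)",
        [(file, key, lang, i, v) for i, v in enumerate(values)],
    )


def load_values(conn, file, key, lang=""):
    """Read the values for one file/key/language back in stored order."""
    rows = conn.execute(
        "SELECT fm_value FROM file_metadata"
        " WHERE fm_file = ? AND fm_key = ? AND fm_lang = ?"
        " ORDER BY fm_index",
        (file, key, lang),
    )
    return [r[0] for r in rows]
```

Option 1 keeps one row per key but the individual values can't be queried in SQL; option 2 costs more rows but lets the database do the filtering by key and language.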
Also consider how big a metadata field can get. Theoretically it's not really limited; while I don't expect it to be huge, > 255 bytes of UTF-8 seems a totally reasonable size for a value of a metadata field.
Last of all, you have to keep in mind that all sorts of stuff is stored in img_metadata. This includes things like the text layer of DjVu files (although arguably that shouldn't be stored there...) and other handler-specific things (OggHandler stores some very complex structures in img_metadata). Of course, we could just keep the img_metadata blob there, and simply stop using it for "exif-like" data, but continue using it for handler-specific ugly metadata that's generally invisible to the user [probably a good idea. The two types of data are actually quite different].
On text: DjVu and PDF files can optionally contain flattened searchable text, which we extract so it can be used for things like Extension:ProofreadPage and, potentially, search indexing:
https://bugzilla.wikimedia.org/showdependencytree.cgi?id=21061&hide_reso...
Currently this gets stuffed into the metadata blob along with the exif data etc, and can make metadata blobs *very* large if there are hundreds of pages of text.
If extracted page text is stored in a better key-value store, we should make sure it doesn't get pulled into backwards-compatible metadata blobs (if we keep 'em around as they are now) -- but it should be accessible through some API.
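Something along these lines, as a rough Python sketch only: the "page_text" key, the function names, and the use of JSON are all assumptions for illustration (the real blob format and storage layout would differ).

```python
import json

# Hypothetical keys holding bulky extracted text that should stay
# out of the backwards-compatible blob.
TEXT_KEYS = frozenset({"page_text"})


def legacy_blob(metadata):
    """Build the compatibility blob from everything except extracted text."""
    small = {k: v for k, v in metadata.items() if k not in TEXT_KEYS}
    return json.dumps(small, sort_keys=True)


def page_text(metadata, page):
    """Separate accessor for the text of one page (1-based), or None."""
    pages = metadata.get("page_text", [])
    if 1 <= page <= len(pages):
        return pages[page - 1]
    return None
```

The blob stays small no matter how many pages of text a file has, while callers that actually want the text ask for it page by page.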
One issue to consider is the file archive. Should we replicate the metadata table for file archive? Or serialize the data and store it in a new table (something like fa_metadata)?
Honestly, I wouldn't worry about that, especially in the beginning. As far as I know, the only place fa_metadata/oi_metadata is used is that you can request it via the API (I suppose it's copied over during file reverts as well). I don't think anyone really uses that field on archived images. (Maybe one day bug 26741 will be fixed and this would be less of a concern.)
That reminds me: ForeignAPIRepo (InstantCommons) wants to be able to transfer the metadata at least for current versions; API formats should remain compatible if possible in order for data to continue to be transferred to clients running old versions.
-- brion