Hi all,
For those who don't know me, I'm one of the GSOC students this year. My mentor is ^demon, and my project is to enhance support for metadata in uploaded files. Similar to the recent thread on interwiki transclusions, I'd thought I'd ask for comments about what I propose to do.
Currently metadata is stored in img_metadata field of the image table as a serialized php array. Well this works fine for the primary use case - listing the metadata in a little box on the image description page, its not very flexible. Its impossible to do queries like get a list of images with some specific metadata property equal to some specific value, or get a list of images ordered by what software edited them.
So as part of my project I would like to move the metadata to its own table. However I think the structure of the table will need to be a little more complicated then just <page id>, <name>, <value> triples, since ideally it would be able to store XMP metadata, which can contain nested structures. XMP metadata is pretty much the most complex metadata format currently popular (for metadata stored inside images anyways), and can store pretty much all other types of metadata. Its also the only format that can store multi-lingual content, which is a definite plus as those commons folks love their languages. Thus I think it would be wise to make the table store information in a manner that is rather close to the XMP data model.
So basically my proposed metadata table looks like:
*meta_id - primary key, auto-incrementing integer *meta_page - foreign key for page_id - what image is this for *meta_type - type of entry - simple value or some sort of compound structure. XMP supports ordered/unordered lists, associative array type structures, alternate array's (things like arrays listing the value of the property in different languages). *meta_schema - xmp uses different namespaces to prevent name collisions. exif properties have their own namespace, IPTC properties have their own namespace, etc *meta_name - The name of the property *meta_value - the value of the property (or null for some compound things, see below) *meta_ref - a reference to a meta_id of a different row for nested structures, or null if not applicable (or 0 perhaps) *meta_qualifies - boolean to denote if this property is a qualifier (in XMP there are normal properties and qualifiers)
(see http://www.mediawiki.org/wiki/User:Bawolff/metadata_table for a longer explanation of the table structure)
Now, before everyone says eww nested structures in a db are inefficient and what not, I don't think its that bad (however I'm new to the whole scalability thing, so hopefully someone more knowledgeable than me will confirm or deny that).
The XMP specification specifically says that there is no artificial limit on nesting depth, however in general practise its not nested very deeply. Furthermore in most cases the tree structure can be safely ignored. Consider: *Use-case 1 (primary usecase), displaying a metadata info box on an image page. Most of the time that'd be translating specific name and values into html table cells. The tree structure is totally unnecessary. for example the exif property DateTimeOriginal can only appear once per image (also it can only appear at the root of the tree structure but thats beside the point). There is no need to reconstruct the tree, just look through all the props for the one you need. If the tree structure is important it can be reconstructed on the php side, and would typically be only the part of the tree that is relevant, not the entire nested structure. *Use-case 2 (secondary usecase). Get list of images ordered by some property starting at foo. or get list of images where property bar = baz. In this case its a simple select. It does not matter where in the tree structure the property is.
Thus, all the nestedness of XMP is preserved (So we could re-output it into xmp form if we so desired), and there is no evil joining the metadata table with itself over and over again (or at all), which from what i understand, self-joining to reconstruct nested structures is what makes them inefficient in databases.
I also think this schema would be future proof because it can store pretty much all metadata we can think of. We can also extend it with custom properties we make up that are guaranteed to not conflict with anything (The X in xmp is for extensible).
As a side-note, based on my rather informal survey of commons (aka the couple people who happened to be on #wikimedia-commons at that moment) another use-case people think would be cool and useful is metadata intersections, and metadata-category intersections. I'm not planning to do this as part of my project, as I believe that would have performance issues. However doing a metadata table like this does leave the possibility open for people to do such intersection things on the toolserver or in a DPL-like extension.
I'd love to get some feedback on this. Is this a reasonable approach for me to take on this.
Thanks for reading.
-- -bawolff
Hi bawolff,
thanks for your work. I'm not very happy about the name "metadata" for the table. As far as I understand it, this is about file metadata. "metadata" suggests it contains information on pages (e.g. statistics). Please consider using a name that contains 'file', e.g. file_metadata.
Thanks, Church of emacs
On Fri, May 28, 2010 at 10:12 AM, church.of.emacs.ml church.of.emacs.ml@googlemail.com wrote:
Hi bawolff,
thanks for your work. I'm not very happy about the name "metadata" for the table. As far as I understand it, this is about file metadata. "metadata" suggests it contains information on pages (e.g. statistics). Please consider using a name that contains 'file', e.g. file_metadata.
Thanks, Church of emacs
Hi.
Thanks for your response. You make a very good point. Now that you mentioned it, I can very easily see that being confusing. I definitely agree that either file_metadata or image_metadata would be better. (file_metadata would be good because it contains metadata about files that aren't image, and is consistent with the renaming of image namespace to file, but image_metadata is more consistent with the db table naming scheme as other table is the image table. I guess in the end it doesn't really matter either way as long as its clear the tables about uploaded media).
cheers, -bawolff
On 05/28/2010 08:03 AM, bawolff wrote:
Hi all,
For those who don't know me, I'm one of the GSOC students this year. My mentor is ^demon, and my project is to enhance support for metadata in uploaded files. Similar to the recent thread on interwiki transclusions, I'd thought I'd ask for comments about what I propose to do.
Excellent! We're glad to have you on board. (FWIW I'm working on Multimedia Usability, so I'll be watching what you come up with closely).
So as part of my project I would like to move the metadata to its own table.
Great, although perhaps we could consider other things too (see below).
ideally it would be able to store XMP metadata, which can contain nested structures.
Now, before everyone says eww nested structures in a db are inefficient and what not, I don't think its that bad (however I'm new to the whole scalability thing, so hopefully someone more knowledgeable than me will confirm or deny that).
Okay, I just wrote a little novel here, but please take it as just opening a discussion. I think you should try for a simpler design, but I'm open to discussion.
I'm familiar with how MySQL scales (particularly for large image collections). Commons has a respectable collection of > 6.6 million media files, but just to put that into perspective, Facebook gets that in just a few hours. If we're successful in improving Commons' usability we'll probably get some multiple of our current intake rate. So we have to plan for hundreds of millions of media files, at least.
POTENTIAL ISSUES WITH TREE STRUCTURES
Tree structures in MySQL can be deceptively fast especially on single-machine tests. But it tends to be a nightmare in production environments. Turning what could be one query into eight or nine isn't so bad on a single machine, but consider when the database and web server are relatively far apart or loaded and thus and have high latency.
Also, due to its crappy locking, MySQL sucks at keeping tree structures consistent. If we were doing this on Oracle it would be a different story -- they have some fancy features that make trees easy -- but we're on MySQL.
The most scalable architectures use MySQL's strengths. MySQL is weak at storing trees. It's good at querying really simple, flat, schema.
BLOBS OF SERIALIZED PHP ARE GOOD
You should not be afraid of storing (some) data as serialized PHP, *especially* if it's a complex data structure. If the database doesn't need to query or index on a particular field, then it's a huge win NOT to parse it out into columns and reassemble it into PHP data structures on every access.
GO FOR MEANINGFUL DATA, NOT DATA PRESERVATION
Okay onto the next topic -- how you want to parse XMP out into a flat structure, with links between them. I think you were clever in how you tried to make the cost of storing the tree relatively minimal, but I just question whether it's necessary to store it at all, and whether this meets our needs.
It seems to me (correct me if I'm wrong) that your structure is two steps beyond id-key-val in abstractness: it's id-schema-key-type-val. So the meaning of keys depends on the schema? So we might have a set of rows like this, for two images, id 1234 and 5678:
id: 1234 schema: AcmeMetadata key: cameraType type: string val: canon-digital-rebel-xt
id: 1234 schema: AcmeMetadata key: resolution type: string val: 1600x1200
id: 5678 schema: SomeOtherSchema key: recordingDevice type: string val: Canon Digital Rebel XT
id: 5678 schema: SomeOtherSchema key: width type: int val: 1600
id:5678 schema: SomeOtherSchema key: height type: int val: 1200
The point is that between schemas, we'd use different keys and values to represent the same thing. While you've done a good job of preserving the exact data we received, this makes for an impossibly complicated query if we want to learn anything.
When you find yourself defining a 'type' column you should be wary, because you're engaging in Inner-Platform Effect. MySQL has types already.
It seems to me that we have no requirement to preserve exact key and value names in our database. What we need is *meaning* not data.
So we shouldn't attempt to make a meta-metadata-format that has all the features of all possible metadata formats. Instead we should just standardize on one, hardcoded, metadata format that's useful for our purposes, and then translate other formats to that format. The simplest thing is just a flat series of columns. In other words, something like this:
id: 1234 cameraType: canon-digital-rebel-xt width: 1600 height: 1200
id: 5678 cameraType: canon-digital-rebel-xt width: 1600 height: 1200
WHY EVEN HAVE A SEPARATE TABLE?
If we're doing one row per media file (the ideal!) then there is no reason why you can't simply append these new metadata columns onto the existing image table. This would make querying REALLY easy, and it would simplify database management.
And of course metadata formats differ, and not all metadata fields need to be queryable or indexable. It would be perfectly acceptable to parse out some common interesting metadata into columns, and leave all the other random stuff in a serialized PHP blob, much as we have today. That structure could be recursive or whatever floats your boat.
I would even argue that you should combine multilingual descriptions into the same column, possibly as serialized PHP or maybe just a simple string format. At Wikimedia, we do not do MySQL-based text search any more -- and for other MediaWiki installations that do, combining all description strings into one field is arguably a benefit. For Wikimedia projects, we can deserialize that field and break it out into languages when we feed it to Lucene.
There could be good reasons to have separate tables for metadata, but I think we should carefully consider the costs.
another use-case people think would be cool and useful is metadata intersections, and metadata-category intersections.
Sounds like a lot of work to make it efficient. Probably shouldn't be a high priority.
Thanks a lot for presenting your design here in detail. If you want to take it to a wiki I can reiterate some of this debate on your design's talk page.
I'm going to use this message to respond to several people with this email, hopefully it doesn't become confusing.
Markus wrote: [snip]
(1) You use mediumblob for values.
I'll be honest, I chose a type at random for that field. It needed to be long since it should be able to store rather long strings since some metadata formats don't have length limits on strings. (in that version of the new table plan anyway. based on feedback, I think I'll try to make my plan for tables much simpler)
Each row in your table specifies...meta_qualifies
In xmp you can have special type of properties that instead of being a property of the image, modify the meaning of another property. The example given in the spec was if you have a creator property, you could have a qualifer for that property named role that denotes if that author proerty is the singer, the writer, or whatever. Its most common use seems to be in if multiple thumbnails of the image are stored in xmp at different resolutions, it uses qualifiers to specify the resolutions of the different choices (which is a kind of moot example for us, as i don't think we want to be storing embeded thumbnails of the image in the db). the column was meant to be boolean flag to say if this property was a sub-property of the parent, or if it modified the meaning of the parent.
But overall, I am quite excited to see this project progressing. Maybe we could have some more alignment between the projects later on (How about combining image metadata and custom wiki metadata about image pages in queries? :-) but for GSoC you should definitely focus on your core goals and solve this task as good as possible.
Based on the comments I recieved I might be moving towards a more simple table layout which will probably be less aligned with SMW_light's goals, but I'd love to see more alignment where it fits into the goals of my project. Personally I've always thought that a lot of the smw stuff was rather cool.
On Fri, May 28, 2010 at 3:28 PM, Neil Kandalgaonkar neilk@wikimedia.org wrote:
[snip]
Okay, I just wrote a little novel here, but please take it as just opening a discussion. I think you should try for a simpler design, but I'm open to discussion.
After reading the comments so far I tend to agree that perhaps my original design was a bit more complicated than it needed to be. Scalability is pretty much the number one concern, so the simpler the better
BLOBS OF SERIALIZED PHP ARE GOOD
You should not be afraid of storing (some) data as serialized PHP, *especially* if it's a complex data structure. If the database doesn't need to query or index on a particular field, then it's a huge win NOT to parse it out into columns and reassemble it into PHP data structures on every access.
GO FOR MEANINGFUL DATA, NOT DATA PRESERVATION
Okay onto the next topic -- how you want to parse XMP out into a flat structure, with links between them. I think you were clever in how you tried to make the cost of storing the tree relatively minimal, but I just question whether it's necessary to store it at all, and whether this meets our needs.
[snip]
So we shouldn't attempt to make a meta-metadata-format that has all the features of all possible metadata formats. Instead we should just standardize on one, hardcoded, metadata format that's useful for our purposes, and then translate other formats to that format. The simplest thing is just a flat series of columns. In other words, something like this:
[snip]
And of course metadata formats differ, and not all metadata fields need to be queryable or indexable. It would be perfectly acceptable to parse out some common interesting metadata into columns, and leave all the other random stuff in a serialized PHP blob, much as we have today. That structure could be recursive or whatever floats your boat.
Hmm, I like the idea of using the serialized blobs generally, and then exposing some special few interesting properties into another table. I was actually thinking that perhaps page_props could be used for this. Currently all it contains is the hidden category listings (well and theoretically any extension can house stuff there using $wgPagePropLinkInvalidations, but i have yet to see an extension use that, which is a little surprising as it seems like a wonderful way to make really cool extensions really easily). Although it seems as if that table is more meant for properties that change the behaviour of the page they belong to in some way (like __HIDDENCAT__), any metadata stored there would still be a "property", so I don't think thats too abusing its purpose too much. Really there seems no reason to create a new table if that one will do fine.
Thanks a lot for presenting your design here in detail. If you want to take it to a wiki I can reiterate some of this debate on your design's talk page.
Thank you for responding, your post has given me a lot to think about. I still have a lot to learn about databases, and especially scalable databases, and I really appreciate all the comments that you and everyone else on this list have gave me.
Platonides wrote:
Since you are storing in the db the metadata of the images, try to make the schema able to store metadata coming from the page, so it can be used to implement bug 8298 or extensions like ImageFilter.
I think the page_props table would be the best way to implement bug 8298. Actually i was reading up on the page_props table the other day, and I believe that in the commit implementing that table, bug 8298 was given as an example of something cool the table could be used to implement.
However, if I do implement a new table as part of this, it will probably use page_ids to identify the image - I don't see any reason to artificially restrict it to just the file namespace.
Thanks again everyone for all the comments. I really appreciate the great response :) -- -bawolff
2010/5/31 bawolff <bawolff+wn@gmail.com bawolff%2Bwn@gmail.com>
I'm going to use this message to respond to several people with this email, hopefully it doesn't become confusing.
Since you're managinf EXIF data, consider too the possibility to *add* to them some relevant Commons: metadata (name of the File: page, copyright, categories). It would be great, if possible, that downoloaded image would contain them.
I'm far from deep into the matter, it is only an old, layman idea I get when approaching to the magics of EXIF.
Alex
Hi Alex. Thats actually on my list of to do if I have time. Building a metadata editor for files on the wiki (probably in the form of an extension) would be in phase 2 of my project. (In my project proposal it was on the list of things to do if I have extra time).
Cheers, bawolff
On Mon, May 31, 2010 at 5:31 AM, Alex Brollo alex.brollo@gmail.com wrote:
2010/5/31 bawolff bawolff+wn@gmail.com
I'm going to use this message to respond to several people with this email, hopefully it doesn't become confusing.
Since you're managinf EXIF data, consider too the possibility to add to them some relevant Commons: metadata (name of the File: page, copyright, categories). It would be great, if possible, that downoloaded image would contain them.
I'm far from deep into the matter, it is only an old, layman idea I get when approaching to the magics of EXIF.
Alex
On 31 May 2010 14:14, bawolff bawolff+wn@gmail.com wrote:
I'm going to use this message to respond to several people with this email, hopefully it doesn't become confusing.
Hmm, I like the idea of using the serialized blobs generally, and then exposing some special few interesting properties into another table. I was actually thinking that perhaps page_props could be used for this. Currently all it contains is the hidden category listings (well and theoretically any extension can house stuff there using $wgPagePropLinkInvalidations, but i have yet to see an extension use that, which is a little surprising as it seems like a wonderful way to make really cool extensions really easily). Although it seems as if that table is more meant for properties that change the behaviour of the page they belong to in some way (like __HIDDENCAT__), any metadata stored there would still be a "property", so I don't think thats too abusing its purpose too much. Really there seems no reason to create a new table if that one will do fine.
[...]
I think the page_props table would be the best way to implement bug 8298. Actually i was reading up on the page_props table the other day, and I believe that in the commit implementing that table, bug 8298 was given as an example of something cool the table could be used to implement.
I tried to use page_props once. I did end up using my own table, since the parser thinks it owns the page_props table and when page is parsed it happily deletes all values stored in page_props it doesn't know about.
-Niklas
2010/5/31 Niklas Laxström niklas.laxstrom@gmail.com:
I tried to use page_props once. I did end up using my own table, since the parser thinks it owns the page_props table and when page is parsed it happily deletes all values stored in page_props it doesn't know about.
The parser does indeed own the page_props table. It's intended for storing properties that can be derived at parse time and set by the parser itself or a parser hook through $parserOutput->setProperty($name, $value) .
Roan Kattouw (Catrope)
On Mon, May 31, 2010 at 7:04 AM, Roan Kattouw roan.kattouw@gmail.com wrote:
2010/5/31 Niklas Laxström niklas.laxstrom@gmail.com:
I tried to use page_props once. I did end up using my own table, since the parser thinks it owns the page_props table and when page is parsed it happily deletes all values stored in page_props it doesn't know about.
The parser does indeed own the page_props table. It's intended for storing properties that can be derived at parse time and set by the parser itself or a parser hook through $parserOutput->setProperty($name, $value) .
Roan Kattouw (Catrope)
Ah, I was thinking that looked like something too perfectly fitted to the situation to be true. However I still think it may be a good approach to generally keep the metadata as a serialized php blob, and have another table, similar looking to page_props to store specific metadata values of interest.
cheers, bawolff
bawolff wrote:
However, if I do implement a new table as part of this, it will probably use page_ids to identify the image - I don't see any reason to artificially restrict it to just the file namespace.
Do remember that we'll also want to store metadata for old versions of images. And yes, I'm aware that our database schema isn't likely to lend itself to easy and elegant solutions to that. (My advice: try to go for simplicity at the expense of elegance, as long as it doesn't end up _too_ ugly.)
Yes, I'll definitly keep that in mind. Currently I'm thinking that for deleted images and old versions of images I'll keep all the metadata inside img_metadata as serialized php, since most of the use cases for having metadata in a different table really only apply to the current version of images.
Cheers, bawolff
On Tue, Jun 1, 2010 at 1:23 AM, Ilmari Karonen nospam@vyznev.net wrote:
bawolff wrote:
However, if I do implement a new table as part of this, it will probably use page_ids to identify the image - I don't see any reason to artificially restrict it to just the file namespace.
Do remember that we'll also want to store metadata for old versions of images. And yes, I'm aware that our database schema isn't likely to lend itself to easy and elegant solutions to that. (My advice: try to go for simplicity at the expense of elegance, as long as it doesn't end up _too_ ugly.)
-- Ilmari Karonen
Since you are storing in the db the metadata of the images, try to make the schema able to store metadata coming from the page, so it can be used to implement bug 8298 or extensions like ImageFilter.
wikitech-l@lists.wikimedia.org