On 05/28/2010 08:03 AM, bawolff wrote:
Hi all,
For those who don't know me, I'm one of the GSOC students this year. My mentor is ^demon, and my project is to enhance support for metadata in uploaded files. Similar to the recent thread on interwiki transclusions, I'd thought I'd ask for comments about what I propose to do.
Excellent! We're glad to have you on board. (FWIW I'm working on Multimedia Usability, so I'll be watching what you come up with closely).
So as part of my project I would like to move the metadata to its own table.
Great, although perhaps we could consider other things too (see below).
ideally it would be able to store XMP metadata, which can contain nested structures.
Now, before everyone says eww nested structures in a db are inefficient and what not, I don't think its that bad (however I'm new to the whole scalability thing, so hopefully someone more knowledgeable than me will confirm or deny that).
Okay, I just wrote a little novel here, but please take it as just opening a discussion. I think you should try for a simpler design, but I'm open to discussion.
I'm familiar with how MySQL scales (particularly for large image collections). Commons has a respectable collection of > 6.6 million media files, but just to put that into perspective, Facebook gets that in just a few hours. If we're successful in improving Commons' usability we'll probably get some multiple of our current intake rate. So we have to plan for hundreds of millions of media files, at least.
POTENTIAL ISSUES WITH TREE STRUCTURES
Tree structures in MySQL can be deceptively fast especially on single-machine tests. But it tends to be a nightmare in production environments. Turning what could be one query into eight or nine isn't so bad on a single machine, but consider when the database and web server are relatively far apart or loaded and thus and have high latency.
Also, due to its crappy locking, MySQL sucks at keeping tree structures consistent. If we were doing this on Oracle it would be a different story -- they have some fancy features that make trees easy -- but we're on MySQL.
The most scalable architectures use MySQL's strengths. MySQL is weak at storing trees. It's good at querying really simple, flat, schema.
BLOBS OF SERIALIZED PHP ARE GOOD
You should not be afraid of storing (some) data as serialized PHP, *especially* if it's a complex data structure. If the database doesn't need to query or index on a particular field, then it's a huge win NOT to parse it out into columns and reassemble it into PHP data structures on every access.
GO FOR MEANINGFUL DATA, NOT DATA PRESERVATION
Okay onto the next topic -- how you want to parse XMP out into a flat structure, with links between them. I think you were clever in how you tried to make the cost of storing the tree relatively minimal, but I just question whether it's necessary to store it at all, and whether this meets our needs.
It seems to me (correct me if I'm wrong) that your structure is two steps beyond id-key-val in abstractness: it's id-schema-key-type-val. So the meaning of keys depends on the schema? So we might have a set of rows like this, for two images, id 1234 and 5678:
id: 1234 schema: AcmeMetadata key: cameraType type: string val: canon-digital-rebel-xt
id: 1234 schema: AcmeMetadata key: resolution type: string val: 1600x1200
id: 5678 schema: SomeOtherSchema key: recordingDevice type: string val: Canon Digital Rebel XT
id: 5678 schema: SomeOtherSchema key: width type: int val: 1600
id:5678 schema: SomeOtherSchema key: height type: int val: 1200
The point is that between schemas, we'd use different keys and values to represent the same thing. While you've done a good job of preserving the exact data we received, this makes for an impossibly complicated query if we want to learn anything.
When you find yourself defining a 'type' column you should be wary, because you're engaging in Inner-Platform Effect. MySQL has types already.
It seems to me that we have no requirement to preserve exact key and value names in our database. What we need is *meaning* not data.
So we shouldn't attempt to make a meta-metadata-format that has all the features of all possible metadata formats. Instead we should just standardize on one, hardcoded, metadata format that's useful for our purposes, and then translate other formats to that format. The simplest thing is just a flat series of columns. In other words, something like this:
id: 1234 cameraType: canon-digital-rebel-xt width: 1600 height: 1200
id: 5678 cameraType: canon-digital-rebel-xt width: 1600 height: 1200
WHY EVEN HAVE A SEPARATE TABLE?
If we're doing one row per media file (the ideal!) then there is no reason why you can't simply append these new metadata columns onto the existing image table. This would make querying REALLY easy, and it would simplify database management.
And of course metadata formats differ, and not all metadata fields need to be queryable or indexable. It would be perfectly acceptable to parse out some common interesting metadata into columns, and leave all the other random stuff in a serialized PHP blob, much as we have today. That structure could be recursive or whatever floats your boat.
I would even argue that you should combine multilingual descriptions into the same column, possibly as serialized PHP or maybe just a simple string format. At Wikimedia, we do not do MySQL-based text search any more -- and for other MediaWiki installations that do, combining all description strings into one field is arguably a benefit. For Wikimedia projects, we can deserialize that field and break it out into languages when we feed it to Lucene.
There could be good reasons to have separate tables for metadata, but I think we should carefully consider the costs.
another use-case people think would be cool and useful is metadata intersections, and metadata-category intersections.
Sounds like a lot of work to make it efficient. Probably shouldn't be a high priority.
Thanks a lot for presenting your design here in detail. If you want to take it to a wiki I can reiterate some of this debate on your design's talk page.