I'm going to use this one email to respond to several people; hopefully it doesn't become confusing.
Markus wrote: [snip]
(1) You use mediumblob for values.
I'll be honest, I chose a type at random for that field. It needed to be long, since some metadata formats don't put length limits on strings. (In that version of the new table plan, anyway. Based on feedback, I think I'll try to make my plan for the tables much simpler.)
Each row in your table specifies...meta_qualifies
In XMP you can have a special type of property that, instead of being a property of the image, modifies the meaning of another property. The example given in the spec was that if you have a creator property, you could have a qualifier for that property named role that denotes whether that author is the singer, the writer, or whatever. Its most common use seems to be when multiple thumbnails of the image are stored in XMP at different resolutions: qualifiers specify the resolution of each choice (which is kind of a moot example for us, as I don't think we want to be storing embedded thumbnails of the image in the db). The column was meant to be a boolean flag saying whether this property was a sub-property of its parent or modified the meaning of its parent.
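As a rough sketch of what that flag was meant to express, here is a flat list of rows in Python. Only meta_qualifies comes from the table plan discussed above; the class, the other field names, and the sample values are made up for illustration and are not the actual schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MetaRow:
    # Hypothetical flat row; only meta_qualifies is taken from the discussion.
    row_id: int
    parent_id: Optional[int]   # None for a top-level property of the image
    name: str
    value: Optional[str]
    meta_qualifies: bool       # True: modifies the parent; False: sub-property


# Illustrative data: a creator with a "role" qualifier, and a made-up
# "contact" structure with an ordinary sub-property.
rows: List[MetaRow] = [
    MetaRow(1, None, "creator", "Some Person", False),
    MetaRow(2, 1, "role", "singer", True),
    MetaRow(3, None, "contact", None, False),
    MetaRow(4, 3, "city", "Somewhere", False),
]


def qualifiers_of(all_rows: List[MetaRow], parent_id: int) -> Dict[str, Optional[str]]:
    """Rows that modify the meaning of the given parent property."""
    return {r.name: r.value for r in all_rows
            if r.parent_id == parent_id and r.meta_qualifies}


def sub_properties_of(all_rows: List[MetaRow], parent_id: int) -> Dict[str, Optional[str]]:
    """Rows that are plain children (fields) of the given parent property."""
    return {r.name: r.value for r in all_rows
            if r.parent_id == parent_id and not r.meta_qualifies}
```

The point of the flag is only to tell the two kinds of child row apart when the tree is rebuilt from the flat table.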
But overall, I am quite excited to see this project progressing. Maybe we could have some more alignment between the projects later on (How about combining image metadata and custom wiki metadata about image pages in queries? :-) but for GSoC you should definitely focus on your core goals and solve this task as good as possible.
Based on the comments I received, I might be moving towards a simpler table layout, which will probably be less aligned with SMW_light's goals, but I'd love to see more alignment where it fits into the goals of my project. Personally, I've always thought that a lot of the SMW stuff was rather cool.
On Fri, May 28, 2010 at 3:28 PM, Neil Kandalgaonkar neilk@wikimedia.org wrote:
[snip]
Okay, I just wrote a little novel here, but please take it as just opening a discussion. I think you should try for a simpler design, but I'm open to discussion.
After reading the comments so far, I tend to agree that perhaps my original design was a bit more complicated than it needed to be. Scalability is pretty much the number one concern, so the simpler the better.
BLOBS OF SERIALIZED PHP ARE GOOD
You should not be afraid of storing (some) data as serialized PHP, *especially* if it's a complex data structure. If the database doesn't need to query or index on a particular field, then it's a huge win NOT to parse it out into columns and reassemble it into PHP data structures on every access.
GO FOR MEANINGFUL DATA, NOT DATA PRESERVATION
Okay onto the next topic -- how you want to parse XMP out into a flat structure, with links between them. I think you were clever in how you tried to make the cost of storing the tree relatively minimal, but I just question whether it's necessary to store it at all, and whether this meets our needs.
[snip]
So we shouldn't attempt to make a meta-metadata-format that has all the features of all possible metadata formats. Instead we should just standardize on one, hardcoded, metadata format that's useful for our purposes, and then translate other formats to that format. The simplest thing is just a flat series of columns. In other words, something like this:
[snip]
And of course metadata formats differ, and not all metadata fields need to be queryable or indexable. It would be perfectly acceptable to parse out some common interesting metadata into columns, and leave all the other random stuff in a serialized PHP blob, much as we have today. That structure could be recursive or whatever floats your boat.
Hmm, I like the idea of using the serialized blobs generally, and then exposing a few specially interesting properties in another table. I was actually thinking that perhaps page_props could be used for this. Currently all it contains is the hidden category listings (well, theoretically any extension can house stuff there using $wgPagePropLinkInvalidations, but I have yet to see an extension use that, which is a little surprising, as it seems like a wonderful way to make really cool extensions really easily). Although it seems as if that table is meant more for properties that change the behaviour of the page they belong to in some way (like __HIDDENCAT__), any metadata stored there would still be a "property", so I don't think that's abusing its purpose too much. Really, there seems to be no reason to create a new table if that one will do fine.
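A minimal sketch of that split, using Python with an in-memory SQLite database purely as a stand-in: the full metadata stays in one opaque blob, and a handful of properties are copied into a page_props-style table so they can be queried. The assumptions are loud ones: JSON is swapped in for serialized PHP, the image_meta table and the exposed property names are hypothetical, and only the pp_page / pp_propname / pp_value column names are borrowed from page_props.

```python
import json
import sqlite3

# Hypothetical choice of which properties are worth exposing for queries.
EXPOSED = ("camera_model", "date_taken")


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # Hypothetical table holding the whole metadata structure as one blob.
    conn.execute(
        "CREATE TABLE image_meta (page_id INTEGER PRIMARY KEY, meta_blob TEXT NOT NULL)"
    )
    # Stand-in for page_props: one row per (page, property name).
    conn.execute(
        "CREATE TABLE page_props ("
        " pp_page INTEGER NOT NULL,"
        " pp_propname TEXT NOT NULL,"
        " pp_value TEXT NOT NULL,"
        " PRIMARY KEY (pp_page, pp_propname))"
    )
    return conn


def store_metadata(conn: sqlite3.Connection, page_id: int, metadata: dict) -> None:
    # The blob is never parsed by the database; JSON replaces PHP serialize() here.
    conn.execute(
        "INSERT OR REPLACE INTO image_meta VALUES (?, ?)",
        (page_id, json.dumps(metadata)),
    )
    # Refresh only the exposed properties, leaving any other props of the page alone.
    for name in EXPOSED:
        conn.execute(
            "DELETE FROM page_props WHERE pp_page = ? AND pp_propname = ?",
            (page_id, name),
        )
        if name in metadata:
            conn.execute(
                "INSERT INTO page_props VALUES (?, ?, ?)",
                (page_id, name, str(metadata[name])),
            )


def load_metadata(conn: sqlite3.Connection, page_id: int):
    row = conn.execute(
        "SELECT meta_blob FROM image_meta WHERE page_id = ?", (page_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


def pages_with(conn: sqlite3.Connection, name: str, value: str) -> list:
    cur = conn.execute(
        "SELECT pp_page FROM page_props"
        " WHERE pp_propname = ? AND pp_value = ? ORDER BY pp_page",
        (name, value),
    )
    return [r[0] for r in cur]
```

Anything nested or rarely needed (thumbnails, maker notes, and so on) only ever lives in the blob, so the queryable table stays small and flat.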
Thanks a lot for presenting your design here in detail. If you want to take it to a wiki I can reiterate some of this debate on your design's talk page.
Thank you for responding; your post has given me a lot to think about. I still have a lot to learn about databases, and especially scalable databases, and I really appreciate all the comments that you and everyone else on this list have given me.
Platonides wrote:
Since you are storing in the db the metadata of the images, try to make the schema able to store metadata coming from the page, so it can be used to implement bug 8298 or extensions like ImageFilter.
I think the page_props table would be the best way to implement bug 8298. Actually, I was reading up on the page_props table the other day, and I believe that in the commit implementing that table, bug 8298 was given as an example of something cool the table could be used to implement.
However, if I do implement a new table as part of this, it will probably use page_ids to identify the image; I don't see any reason to artificially restrict it to just the file namespace.
Thanks again everyone for all the comments. I really appreciate the great response :) -- -bawolff