Re: [Wikitech-l] [gsoc] splitting the img_metadata field into a new table

31 May 2010


I'm going to use this email to respond to several people at once;
hopefully it doesn't become confusing.
Markus wrote:
[snip]
...
(1) You use mediumblob for values.
I'll be honest, I chose a type at random for that field. It needed to
be long, since some metadata formats don't put length limits on
strings and the field has to be able to store rather long ones. (That
applies to that version of the new table plan, anyway. Based on
feedback, I think I'll try to make my plan for tables much simpler.)
...
Each row in your table specifies...meta_qualifies
In XMP you can have a special type of property that, instead of being a
property of the image, modifies the meaning of another property. The
example given in the spec was that if you have a creator property, you
could have a qualifier for that property named role that denotes
whether that author is the singer, the writer, or whatever. Its most
common use seems to be when multiple thumbnails of the image are
stored in XMP at different resolutions: qualifiers specify the
resolution of each choice (which is kind of a moot example for us, as
I don't think we want to be storing embedded thumbnails of the image
in the db). The column was meant to be a boolean flag saying whether
this property was a sub-property of the parent or modified the
meaning of the parent.
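To make the qualifier idea concrete, here is a minimal sketch in Python of rows carrying such a flag. Only the name meta_qualifies comes from the table plan discussed above; the other field names (meta_id, meta_parent, meta_name, meta_value) and the sample values are assumptions made up for illustration, not the actual proposed schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetaRow:
    # Hypothetical row layout; only meta_qualifies is named in the thread.
    meta_id: int
    meta_parent: Optional[int]    # id of the parent property, if any
    meta_name: str
    meta_value: str
    meta_qualifies: bool = False  # True: modifies the parent's meaning


rows = [
    MetaRow(1, None, "creator", "Jane Doe"),
    # A qualifier: says something about the creator property itself.
    MetaRow(2, 1, "role", "singer", meta_qualifies=True),
    MetaRow(3, None, "location", ""),
    # An ordinary sub-property: part of the parent's value.
    MetaRow(4, 3, "city", "Paris"),
]


def qualifiers_of(parent_id, rows):
    """Rows that modify the meaning of the given parent property."""
    return {r.meta_name: r.meta_value
            for r in rows
            if r.meta_parent == parent_id and r.meta_qualifies}


def children_of(parent_id, rows):
    """Rows that are plain sub-properties of the given parent property."""
    return {r.meta_name: r.meta_value
            for r in rows
            if r.meta_parent == parent_id and not r.meta_qualifies}
```

With these sample rows, qualifiers_of(1, rows) picks out the role of the creator, while children_of(3, rows) picks out the city that belongs to the location.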
...
But overall, I am quite excited to see this project progressing. Maybe we
could have some more alignment between the projects later on (How about
combining image metadata and custom wiki metadata about image pages in
queries? :-) but for GSoC you should definitely focus on your core goals and
solve this task as good as possible.
Based on the comments I received, I might be moving towards a simpler
table layout, which will probably be less aligned with SMW_light's
goals, but I'd love to see more alignment where it fits into the
goals of my project. Personally, I've always thought that a lot of
the SMW stuff was rather cool.
On Fri, May 28, 2010 at 3:28 PM, Neil Kandalgaonkar neilk@wikimedia.org wrote:
[snip]
...
...
Okay, I just wrote a little novel here, but please take it as just
opening a discussion. I think you should try for a simpler design, but
I'm open to discussion.
After reading the comments so far, I tend to agree that my original
design was more complicated than it needed to be. Scalability is
pretty much the number one concern, so the simpler the better.
...
BLOBS OF SERIALIZED PHP ARE GOOD
You should not be afraid of storing (some) data as serialized PHP,
*especially* if it's a complex data structure. If the database doesn't
need to query or index on a particular field, then it's a huge win NOT
to parse it out into columns and reassemble it into PHP data structures
on every access.
GO FOR MEANINGFUL DATA, NOT DATA PRESERVATION
Okay onto the next topic -- how you want to parse XMP out into a flat
structure, with links between them. I think you were clever in how you
tried to make the cost of storing the tree relatively minimal, but I
just question whether it's necessary to store it at all, and whether
this meets our needs.
[snip]
...
So we shouldn't attempt to make a meta-metadata-format that has all the
features of all possible metadata formats. Instead we should just
standardize on one, hardcoded, metadata format that's useful for our
purposes, and then translate other formats to that format. The simplest
thing is just a flat series of columns. In other words, something like this:
[snip]
...
And of course metadata formats differ, and not all metadata fields need
to be queryable or indexable. It would be perfectly acceptable to parse
out some common interesting metadata into columns, and leave all the
other random stuff in a serialized PHP blob, much as we have today. That
structure could be recursive or whatever floats your boat.
Hmm, I like the idea of using the serialized blobs generally, and then
exposing a few interesting properties in another table. I was
actually thinking that perhaps page_props could be used for this.
Currently all it contains is the hidden category listings (well, and
theoretically any extension can house stuff there using
$wgPagePropLinkInvalidations, but I have yet to see an extension use
that, which is a little surprising, as it seems like a wonderful way to
make really cool extensions really easily). Although that table seems
to be meant more for properties that change the behaviour of the page
they belong to in some way (like __HIDDENCAT__), any metadata stored
there would still be a "property", so I don't think that's abusing
its purpose too much. Really, there seems to be no reason to create a
new table if that one will do fine.
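As a rough sketch of that hybrid (everything kept in one blob, a few chosen properties copied into page_props), the following Python uses an in-memory SQLite database. It is an illustration under stated assumptions, not MediaWiki code: JSON stands in for serialized PHP, the table definitions are cut down to the columns needed here, and the exposed property names (camera_model, date_taken) and the helpers save_metadata and pages_with are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE image (
    img_name     TEXT PRIMARY KEY,
    img_metadata BLOB                 -- whole metadata tree, never queried
);
CREATE TABLE page_props (
    pp_page     INTEGER NOT NULL,
    pp_propname TEXT    NOT NULL,
    pp_value    BLOB    NOT NULL,
    PRIMARY KEY (pp_page, pp_propname)
);
""")

# Hypothetical choice of "interesting" properties worth querying on.
EXPOSED = ("camera_model", "date_taken")


def save_metadata(conn, page_id, img_name, metadata):
    """Store the full tree as one blob; copy only EXPOSED keys to page_props."""
    conn.execute("INSERT OR REPLACE INTO image VALUES (?, ?)",
                 (img_name, json.dumps(metadata)))
    for name in EXPOSED:
        if name in metadata:
            conn.execute("INSERT OR REPLACE INTO page_props VALUES (?, ?, ?)",
                         (page_id, name, str(metadata[name])))
    conn.commit()


def pages_with(conn, propname, value):
    """Page ids whose exposed property has the given value."""
    cur = conn.execute(
        "SELECT pp_page FROM page_props "
        "WHERE pp_propname = ? AND pp_value = ? ORDER BY pp_page",
        (propname, value))
    return [row[0] for row in cur]


save_metadata(conn, 1, "A.jpg",
              {"camera_model": "X100", "thumbnails": [{"width": 64}]})
save_metadata(conn, 2, "B.jpg", {"camera_model": "D40"})
```

Here pages_with(conn, "camera_model", "X100") can be answered from page_props alone, while the nested thumbnails entry stays in the blob and is only decoded when a single image's metadata is actually read.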
...
Thanks a lot for presenting your design here in detail. If you want to
take it to a wiki I can reiterate some of this debate on your design's
talk page.
Thank you for responding; your post has given me a lot to think about.
I still have a lot to learn about databases, and especially scalable
databases, and I really appreciate all the comments that you and
everyone else on this list have given me.
Platonides wrote:
...
Since you are storing in the db the metadata of the images, try to make
the schema able to store metadata coming from the page, so it can be
used to implement bug 8298 or extensions like ImageFilter.
I think the page_props table would be the best way to implement bug
8298. Actually, I was reading up on the page_props table the other day,
and I believe that in the commit implementing that table, bug 8298 was
given as an example of something cool the table could be used to
implement.
However, if I do implement a new table as part of this, it will
probably use page_ids to identify the image - I don't see any reason
to artificially restrict it to just the file namespace.
Thanks again everyone for all the comments. I really appreciate the
great response :)
--
-bawolff
