Re: [Wikitech-l] [gsoc] splitting the img_metadata field into a new table

28 May 2010

(This gets a little bit off the topic, but it should still be helpful for the 
current discussion. But if we want to discuss a more general data management 
architecture for MW, then it might be sensible to make a new thread ;-)

On Friday, 28 May 2010, Michael Dale wrote:
...
  More important than file_metadata and page asset metadata working with
 the same db table backend, it's important that you can query/export all
 the properties in the same way.

 Within SMW you already have some "special" properties like pagelinks,
 langlinks, category properties etc, that are not stored the same as the
 other SMW page properties ...  The SMW system should name-space all
 these file_metadata properties along with all the other structured data
 available and enable universal querying / RDF exporting all the
 structured wiki data. This way file_metadata would just be one more
 special data type with its own independent tables. ...

 SMW should abstract the data store so it works with the existing
 structured tables. I know this was already done for categories correct? 
More recent versions of SMW actually no longer use MW's category table for 
this, mostly to improve query performance.
[In a nutshell: SMW properties can refer to non-existing pages, and the full 
version of SMW therefore has its own independent page id management (because 
we want to use numerical IDs for all pages that are used as property values, 
whether or not they exist). Using IDs everywhere improves our query 
performance and reduces SQL query size, but it creates a barrier for including 
MW table data since more joins would be needed to translate between IDs. This 
is one reason SMW Light will not support queries: it uses a much simpler DB 
layout and less code, but the resulting DB is not as suitable for querying.]
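
To make the ID idea concrete, here is a minimal sketch using Python's sqlite3 module. The table and function names (ids, props, get_id, set_property, subjects_with) are hypothetical and are not SMW's actual schema; the point is only that every title used as a subject, property, or value gets a numeric ID whether or not a page with that title exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical layout, not SMW's real tables.
CREATE TABLE ids (id INTEGER PRIMARY KEY, title TEXT UNIQUE NOT NULL);
CREATE TABLE props (subject_id INTEGER, property_id INTEGER, value_id INTEGER);
""")

def get_id(conn, title):
    """Return the numeric ID for a title, creating one if needed.

    No check is made that a wiki page with this title exists.
    """
    conn.execute("INSERT OR IGNORE INTO ids (title) VALUES (?)", (title,))
    return conn.execute(
        "SELECT id FROM ids WHERE title = ?", (title,)
    ).fetchone()[0]

def set_property(conn, subject, prop, value):
    """Store one property-value pair using IDs only."""
    conn.execute(
        "INSERT INTO props VALUES (?, ?, ?)",
        (get_id(conn, subject), get_id(conn, prop), get_id(conn, value)),
    )

def subjects_with(conn, prop, value):
    """Find subjects by comparing integer IDs; join once to get titles back."""
    sql = (
        "SELECT s.title FROM props p JOIN ids s ON s.id = p.subject_id "
        "WHERE p.property_id = ? AND p.value_id = ? ORDER BY s.title"
    )
    return [r[0] for r in conn.execute(sql, (get_id(conn, prop), get_id(conn, value)))]

# "Germany" is only a property value here; it needs no page of its own.
set_property(conn, "Berlin", "Capital of", "Germany")
set_property(conn, "Bonn", "Located in", "Germany")
```

The extra join in subjects_with is the translation cost mentioned above: any data that lives in tables keyed by something other than these IDs needs a similar join to be included.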

...
  Was enabling this for all the other links and usage tables explored?
Having a unified view on the variety of MediaWiki data (page metadata, user-
edited content data, file metadata, ...) would of course be great. But 
accomplishing this would require a more extensive effort than our little SMW 
extension. What SMW tries to provide for now is just a way of storing user-
edited data in a wiki (and also for displaying/exporting it).

Of course SMW already has a PHP abstraction for handling the property-value 
pairs that were added to some page, and this abstraction layer completely 
hides the underlying DB tables. This allows us to make more data accessible 
even if it is in other tables, and even to change the DB layout of our custom 
tables if required. You are right that such an abstraction could be extended 
to cover more of the native data of MediaWiki, so that data dumps can include 
it as well.

I think this idea is realistic, and I hope that SMW helps to accomplish this 
in the future. Yet this is not a small endeavour, given that not even the most 
basic data management features are deployed on the big Wikimedia projects 
today. To get there, we first need a code review regarding security and 
performance, and so for the moment we are pressed to reduce features and to 
shrink our code base. This is why we are currently building the "Light" 
version that only covers data input (without link syntax extensions), storage, 
look-up, and basic export/dump. For this step, I really think that sharing a 
data table with the EXIF extension would make sense, since the data looks very 
similar and a more complex DB layout is not necessary for the initial goals. 
We can always consider using more tables if the need arises.

But I would be very happy if there were more people who want to make concrete 
progress toward the goal you describe. Meanwhile, we are planning in smaller 
steps ;-)

...

 This also makes sense from an architecture perspective, where
 file_metadata is tied to the file asset and SMW properties are tied to
 the asset wiki description page.  This way you know you don't have to
 think about that subset of metadata properties on page updates since
 they are tied to the file asset, not to the wiki page properties driven by
 structured user input. Likewise, uploading a new version of the file
 would not touch the page data tables.
Right, it might be useful to distinguish the internal handles (and external 
URIs) of the Image page and of the image file. But having a dedicated 
meta_schema value for user-edited properties of the page might suffice to 
accomplish this on the DB level. I am fairly agnostic about the details, but I 
would tend to wait before developing a more sophisticated DB layout until 
we have some usage statistics from the initial deployment to guide us.

-- Markus

...

 Markus Krötzsch wrote:
  Hi Bawolff,

 interesting project! I am currently preparing a "light" version of SMW
 that does something very similar, but using wiki-defined properties for
 adding metadata to normal pages (in essence, SMW is an extension to store
 and retrieve page metadata for properties defined in the wiki -- like XMP
 for MW pages; though our data model is not quite as sophisticated ;-).

 The use cases for this light version are just what you describe: simple
 retrieval (select) and basic inverse searches. The idea is to thus have a
 solid foundation for editing and viewing data, so that more complex
 functions like category intersections or arbitrary metadata conjunctive
 queries would be done on external servers based on some data dump.

 It would be great if the table you design could be used for such metadata
 as well. As you say, XMP already requires extensibility by design, so it
 might not be too much work to achieve this. SMW properties are usually
 identified by pages in the wiki (like categories), so page titles can be
 used to refer to them. This just requires that the meta_name field is
 long enough to hold MW page title names. Your meta_schema could be used
 to separate wiki properties from other XMP properties. SMW Light does not
 require nested structures, but they could be interesting for possible
 extensions (the full SMW does support one level of nesting for making
 compound values).

 Two things about your design I did not completely understand (maybe just
 because I don't know much about XMP):

 (1) You use mediumblob for values. This excludes range searches for
 numerical image properties ("Show all images of height 1000px or more")
 which do not seem to be overly costly if a suitable schema were used. If
 XMP has a typing scheme for property values anyway, then I guess one
 could find the numbers and simply put them in a table where the value
 field is a number. Is this use case out of scope for you, or do you think
 the cost of reading from two tables is too high? One could also have an
 optional helper field "meta_numvalue" used for sorting/range-SELECT when
 it is known from the input that the values that are searched for are
 numbers.

 (2) Each row in your table specifies property (name and schema), type,
 and the additional meta_qualifies. Does this mean that one XMP property
 can have values of many different types and with different flags for
 meta_qualifies? Otherwise it seems like a lot of redundant data. Also,
 one could put stuff like type and qualifies into the mediumblob value
 field if they are closely tied together (I guess, when searching for some
 value, you implicitly specify what type the data you search for has, so
 it is not problematic to search for the value + type data at once). Maybe
 such considerations could simplify the table layout, and also make it
 less specific to XMP.
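
The helper-field idea from point (1) can be sketched as follows with Python's sqlite3 module. This is only an illustration under assumptions: the column meta_numvalue is the optional field suggested above, while the property name "Height", the reduced table layout, and the function names are hypothetical. It also shows why a range search on the blob column alone does not work: blobs compare byte by byte, not numerically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadata ("
    " meta_page INTEGER, meta_name TEXT,"
    " meta_value BLOB,"        # raw value, as in the proposal
    " meta_numvalue REAL)"     # optional helper, NULL for non-numbers
)

def to_number(text):
    """Return the value as a float, or None if it is not numeric."""
    try:
        return float(text)
    except ValueError:
        return None

def add(conn, page, name, value):
    conn.execute(
        "INSERT INTO metadata VALUES (?, ?, ?, ?)",
        (page, name, value.encode("utf-8"), to_number(value)),
    )

for page, height in [(1, "800"), (2, "1000"), (3, "2400"), (4, "tall")]:
    add(conn, page, "Height", height)

def pages_with_min(conn, name, minimum):
    """Range search on the numeric helper column."""
    sql = ("SELECT meta_page FROM metadata"
           " WHERE meta_name = ? AND meta_numvalue >= ?"
           " ORDER BY meta_numvalue")
    return [r[0] for r in conn.execute(sql, (name, minimum))]

def pages_with_min_blob(conn, name, minimum):
    """Same search on the blob column: bytewise comparison, so wrong results."""
    sql = ("SELECT meta_page FROM metadata"
           " WHERE meta_name = ? AND meta_value >= ?"
           " ORDER BY meta_page")
    return [r[0] for r in conn.execute(sql, (name, str(minimum).encode("utf-8")))]

print(pages_with_min(conn, "Height", 1000))       # numeric comparison
print(pages_with_min_blob(conn, "Height", 1000))  # bytewise comparison
```

With the helper column, "height 1000px or more" returns pages 2 and 3; the blob comparison also matches "800" and "tall", because their first bytes sort after "1".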

 But overall, I am quite excited to see this project progressing. Maybe we
 could have some more alignment between the projects later on (How about
 combining image metadata and custom wiki metadata about image pages in
 queries? :-) but for GSoC you should definitely focus on your core goals
 and solve this task as well as possible.

 Best regards,

 Markus

 On Friday, 28 May 2010, bawolff wrote:
  Hi all,

 For those who don't know me, I'm one of the GSOC students this year.
 My mentor is ^demon, and my project is to enhance support for metadata
 in uploaded files. Similar to the recent thread on interwiki
 transclusions, I thought I'd ask for comments about what I propose
 to do.

 Currently metadata is stored in the img_metadata field of the image table
 as a serialized PHP array. While this works fine for the primary use
 case - listing the metadata in a little box on the image description
 page - it's not very flexible. It's impossible to do queries like get a
 list of images with some specific metadata property equal to some
 specific value, or get a list of images ordered by what software
 edited them.

 So as part of my project I would like to move the metadata to its own
 table. However I think the structure of the table will need to be a
 little more complicated than just <page id>, <name>, <value> triples,
 since ideally it would be able to store XMP metadata, which can
 contain nested structures. XMP metadata is pretty much the most
 complex metadata format currently popular (for metadata stored inside
 images anyways), and can store pretty much all other types of
 metadata. It's also the only format that can store multi-lingual
 content, which is a definite plus as those commons folks love their
 languages. Thus I think it would be wise to make the table store
 information in a manner that is rather close to the XMP data model.

 So basically my proposed metadata table looks like:

 *meta_id - primary key, auto-incrementing integer
 *meta_page - foreign key for page_id - what image is this for
 *meta_type - type of entry - simple value or some sort of compound
 structure. XMP supports ordered/unordered lists, associative array
 type structures, alternative arrays (things like arrays listing the
 value of the property in different languages).
 *meta_schema - xmp uses different namespaces to prevent name
 collisions. exif properties have their own namespace, IPTC properties
 have their own namespace, etc
 *meta_name - The name of the property
 *meta_value - the value of the property (or null for some compound
 things, see below)
 *meta_ref - a reference to a meta_id of a different row for nested
 structures, or null if not applicable (or 0 perhaps)
 *meta_qualifies - boolean to denote if this property is a qualifier
 (in XMP there are normal properties and qualifiers)

 (see http://www.mediawiki.org/wiki/User:Bawolff/metadata_table for a
 longer explanation of the table structure)
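
As a rough illustration, the column list above could be written down like this, using Python's sqlite3 module. Everything beyond the column names is an assumption: the SQL types, the index, the meta_type labels ('simple', 'alt', ...), the helper add_entry, and in particular the reading that meta_ref on a nested row points at its parent row.

```python
import sqlite3

# Hypothetical rendering of the proposed columns; types and index are guesses.
SCHEMA = """
CREATE TABLE metadata (
    meta_id        INTEGER PRIMARY KEY AUTOINCREMENT,
    meta_page      INTEGER NOT NULL,   -- page_id of the image
    meta_type      TEXT    NOT NULL,   -- assumed labels: simple, seq, bag, alt, struct
    meta_schema    TEXT    NOT NULL,   -- XMP namespace
    meta_name      TEXT    NOT NULL,   -- property name
    meta_value     BLOB,               -- NULL for some compound entries
    meta_ref       INTEGER REFERENCES metadata(meta_id),  -- assumed: parent row
    meta_qualifies INTEGER NOT NULL DEFAULT 0             -- 1 if a qualifier
);
CREATE INDEX meta_lookup ON metadata (meta_schema, meta_name, meta_value);
"""

def create_metadata_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn

def add_entry(conn, page, mtype, schema, name, value=None, ref=None,
              qualifies=False):
    """Insert one row and return its meta_id."""
    cur = conn.execute(
        "INSERT INTO metadata (meta_page, meta_type, meta_schema, meta_name,"
        " meta_value, meta_ref, meta_qualifies) VALUES (?, ?, ?, ?, ?, ?, ?)",
        (page, mtype, schema, name, value, ref, int(qualifies)),
    )
    return cur.lastrowid

conn = create_metadata_db()
# A compound property: one parent row with no value, one nested row under it.
title = add_entry(conn, 42, "alt", "dc", "title")
add_entry(conn, 42, "simple", "dc", "title", "Sunset", ref=title)
```

The nested row carries the actual value, so a lookup by name and value finds it without touching the parent row.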

 Now, before everyone says eww nested structures in a db are
 inefficient and what not, I don't think it's that bad (however I'm new
 to the whole scalability thing, so hopefully someone more
 knowledgeable than me will confirm or deny that).

 The XMP specification specifically says that there is no artificial
 limit on nesting depth, however in general practice it's not nested
 very deeply. Furthermore in most cases the tree structure can be
 safely ignored. Consider:
 *Use-case 1 (primary usecase), displaying a metadata info box on an
 image page. Most of the time that'd be translating specific name and
 values into html table cells. The tree structure is totally
 unnecessary. For example, the exif property DateTimeOriginal can only
 appear once per image (also it can only appear at the root of the tree
 structure but that's beside the point). There is no need to reconstruct
 the tree, just look through all the props for the one you need. If the
 tree structure is important, it can be reconstructed on the php side,
 and would typically be only the part of the tree that is relevant, not
 the entire nested structure.
 *Use-case 2 (secondary usecase). Get list of images ordered by some
 property starting at foo, or get list of images where property bar =
 baz. In this case it's a simple select. It does not matter where in the
 tree structure the property is.
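
Use-case 2 can be sketched as two plain selects, here with Python's sqlite3 module. The table is a reduced, hypothetical version of the proposed one (text values, no meta_type or meta_qualifies), and the sample rows and function names are made up for illustration. Neither query joins the table with itself, and a nested row is found the same way as a root-level one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadata (meta_id INTEGER PRIMARY KEY, meta_page INTEGER,"
    " meta_schema TEXT, meta_name TEXT, meta_value TEXT, meta_ref INTEGER)"
)
rows = [
    (1, 10, "exif", "Software", "GIMP", None),
    (2, 11, "exif", "Software", "Inkscape", None),
    (3, 12, "exif", "Software", "GIMP", None),
    (4, 12, "dc", "creator", None, None),   # compound entry, no value
    (5, 12, "dc", "creator", "Alice", 4),   # nested under row 4
]
conn.executemany("INSERT INTO metadata VALUES (?, ?, ?, ?, ?, ?)", rows)

def pages_where(conn, schema, name, value):
    """List images where property bar = baz, at any nesting depth."""
    sql = ("SELECT meta_page FROM metadata"
           " WHERE meta_schema = ? AND meta_name = ? AND meta_value = ?"
           " ORDER BY meta_page")
    return [r[0] for r in conn.execute(sql, (schema, name, value))]

def pages_ordered_by(conn, schema, name, start):
    """List images ordered by a property, starting at a given value."""
    sql = ("SELECT meta_page, meta_value FROM metadata"
           " WHERE meta_schema = ? AND meta_name = ? AND meta_value >= ?"
           " ORDER BY meta_value, meta_page")
    return list(conn.execute(sql, (schema, name, start)))

print(pages_where(conn, "exif", "Software", "GIMP"))
print(pages_where(conn, "dc", "creator", "Alice"))
print(pages_ordered_by(conn, "exif", "Software", "H"))
```

The second call finds page 12 even though the matching row sits below a compound entry, since the query never looks at meta_ref.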

 Thus, all the nestedness of XMP is preserved (So we could re-output it
 into XMP form if we so desired), and there is no evil joining the
 metadata table with itself over and over again (or at all). From what
 I understand, self-joining to reconstruct nested structures is
 what makes them inefficient in databases.

 I also think this schema would be future proof because it can store
 pretty much all metadata we can think of. We can also extend it with
 custom properties we make up that are guaranteed to not conflict with
 anything (the X in XMP is for extensible).

 As a side-note, based on my rather informal survey of commons (aka the
 couple people who happened to be on #wikimedia-commons at that moment)
 another use-case people think would be cool and useful is metadata
 intersections, and metadata-category intersections. I'm not planning
 to do this as part of my project, as I believe that would have
 performance issues. However doing a metadata table like this does
 leave the possibility open for people to do such intersection things
 on the toolserver or in a DPL-like extension.

 I'd love to get some feedback on this. Is this a reasonable approach
 for me to take?

 Thanks for reading.

 --
 -bawolff

 _______________________________________________
 Wikitech-l mailing list
 Wikitech-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l 

-- 
Markus Krötzsch  <markus(a)semantic-mediawiki.org>
* Personal page: http://korrekt.org
* Semantic MediaWiki: http://semantic-mediawiki.org
* Semantic Web textbook: http://semantic-web-book.org
--
