context
-------
i’m working on a mediawiki extension, http://www.mediawiki.org/wiki/Extension:GWToolset, one of whose goals is the ability to upload media files to a wiki. the extension, among other tasks, will process an XML file that contains a list of urls to media files and upload those media files to the wiki along with the metadata contained within the XML file. our ideal goal is to have this extension run on http://commons.wikimedia.org/.
background
----------
http://commons.wikimedia.org/wiki/Commons:GLAMToolset_project/Request_for_Comments/Technical_Architecture
Metadata Set Repo
-----------------
one of the goals of the project is to store Metadata Sets, such as XML, under some type of version control. those Metadata Sets need to be accessible so that the extension can grab their content and process it. processing involves iterating over the entire Metadata Set and creating Jobs for the Job Queue, each of which will upload an individual media file and its metadata into a media file page using a MediaWiki template format, such as Artwork.
some initial requirements

• File sizes
  • can range from a few kilobytes to several megabytes.
  • max file-size is 100mb.
• XML Schema - not required.
• XML DTD - not required.
• When metadata is in XML format, each record must consist of a single parent with many children.
• XML attribute lang= is the only one currently used, and without user interaction.
• There is no need to display the Metadata sets in the wiki.
• There is no need to edit the Metadata sets in the wiki.
we initially developed the extension to store the files in the File: namespace, but we were told by the Foundation that we should use ContentHandler instead. unfortunately there is an issue with storing content > 1mb in the db so we need to find another solution.
1. any suggestions?
Mapping
-------
a mapping is a json that maps a metadata set to a mediawiki template. we’re currently storing those as Content in the namespace GWToolset. an entry might be in GWToolset:Metadata_Mappings/Dan-nl/Rijkmuseum.

1. does that namespace make sense?
   a. if not, what namespace would you recommend?
2. does this concept make sense?
   a. if not, what would you recommend?
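As an illustration of the mapping idea, here is a minimal sketch in Python. The thread does not show the actual JSON layout of a mapping, so the flat "template parameter to metadata field" shape, the field names, and the sample record below are all assumptions made up for the example.

```python
import json

# Hypothetical mapping layout: template parameter -> field name in the metadata record.
MAPPING_JSON = '{"artist": "creator", "title": "title", "date": "created"}'


def render_template(template_name, mapping, record):
    """Render one metadata record as a MediaWiki template call.

    Parameters whose source field is missing or empty are skipped.
    Values are inserted as-is; escaping of "|" or "}}" is not handled here.
    """
    lines = ["{{" + template_name]
    for param, source_field in mapping.items():
        value = record.get(source_field, "").strip()
        if value:
            lines.append(" |{} = {}".format(param, value))
    lines.append("}}")
    return "\n".join(lines)


mapping = json.loads(MAPPING_JSON)
# Made-up sample record; "extra" has no mapping entry and is ignored.
record = {"creator": "Example Artist", "title": "Example Title",
          "created": "1900", "extra": "ignored"}
wikitext = render_template("Artwork", mapping, record)
print(wikitext)
```

Keeping the mapping as plain data means two people can maintain different mappings for the same institution without touching the code that applies them.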
Maintaining Original Metadata Snippet & Mapping
-----------------------------------------------
another goal is to link or somehow connect the original metadata used to create the media file:

• metadata set
• metadata snippet
• metadata mapping

the current thought is to insert these items as comments within the wiki text of the media file page.

1. does that make sense?
   a. if not, what would you recommend doing?
2. is there a better way to do this?
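To make the "comments within the wiki text" idea concrete, here is a rough Python sketch. The marker name and the JSON payload layout are hypothetical. The one real constraint it handles is that a wikitext comment ends at the first "-->", so the payload must never contain that sequence; escaping every ">" as a JSON unicode escape is one way to guarantee it.

```python
import json
import re

MARKER = "GWToolset-metadata"  # hypothetical marker name, not taken from the extension


def embed_metadata(wikitext, metadata):
    """Append the metadata to the page text as a single HTML comment."""
    # ">" only occurs inside JSON strings, where the escape \u003e is equivalent,
    # so the payload stays valid JSON and can never contain "-->".
    payload = json.dumps(metadata, sort_keys=True).replace(">", "\\u003e")
    return "{}\n<!-- {}: {} -->".format(wikitext, MARKER, payload)


def extract_metadata(wikitext):
    """Return the embedded metadata, or None if the page has no such comment."""
    pattern = r"<!-- " + re.escape(MARKER) + r": (.*?) -->"
    match = re.search(pattern, wikitext, re.DOTALL)
    return json.loads(match.group(1)) if match else None


original = {
    "set": "Example_set.xml",
    "snippet": "<record><note>a --> b</note></record>",
    "mapping": "Example_mapping",
}
page = embed_metadata("{{Artwork}}", original)
print(page)
```

A drawback of this approach is that the comment is invisible to readers and easy to delete or damage during normal editing, which is worth weighing against storing the link elsewhere.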
mediawiki template parameters
-----------------------------
the application needs to know what mediawiki template parameters exist and are available to use for mapping media file metadata to the mediawiki templates. for the moment we are hard-coding these parameters in a db table and sometimes in the code. this is not ideal. i have briefly looked at TemplateData, but haven’t had enough time to see if it would address our needs.

1. is there a way to programmatically discover the available parameters for a mediawiki template?
thanks in advance for your help!
dan
On Wed, Jul 24, 2013 at 11:59 AM, dan entous <dan.entous.wikimedia@gmail.com> wrote:
context
i’m working on a mediawiki extension, http://www.mediawiki.org/wiki/Extension:GWToolset, which has as one of its goals, the ability to upload media files to a wiki. the extension, among other tasks, will process an XML file that has a list of urls to media files and upload those media files to the wiki along with metadata contained within the XML file. our ideal goal is to have this extension run on http://commons.wikimedia.org/.
Check out the 'DataPages' subdirectory in the mediawiki/extensions/examples repository (<https://git.wikimedia.org/summary/mediawiki%2Fextensions%2Fexamples.git>). It was designed to showcase how to work with ContentHandler, and it does so by implementing an XML content type and namespace.
we initially developed the extension to store the files in the File: namespace, but we were told by the Foundation that we should use ContentHandler instead. unfortunately there is an issue with storing content > 1mb in the db so we need to find another solution.
That's a lot of XML! You can gzip page content, FWIW.
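A quick way to check whether compression alone would bring a given file under the limit is sketched below in Python. This uses Python's standard gzip module purely as an estimate; it is not how MediaWiki compresses page content, and treating the limit as exactly 1 MiB is an assumption. The sample data is artificially repetitive, so real metadata will compress less well.

```python
import gzip

# The thread mentions a limit of roughly 1mb; 1 MiB is assumed here for illustration.
LIMIT_BYTES = 1024 * 1024


def gzipped_size(xml_bytes):
    """Size in bytes of the gzip-compressed data."""
    return len(gzip.compress(xml_bytes))


def fits_after_gzip(xml_bytes, limit=LIMIT_BYTES):
    """True if the compressed data is no larger than the limit."""
    return gzipped_size(xml_bytes) <= limit


# Made-up, highly repetitive sample of about 1.5 MB.
record = b"<record><title>Example</title><url>http://example.org/a.jpg</url></record>\n"
sample = b"<records>\n" + record * 20000 + b"</records>\n"

print(len(sample), gzipped_size(sample), fits_after_gzip(sample))
```

Even if typical files compress well, a 100mb upper bound on the input means some files could still exceed the limit after compression, so this only narrows the problem.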
On Wed, Jul 24, 2013 at 9:12 PM, Ori Livneh ori@wikimedia.org wrote:
thanks for the reply Ori.
in our last meeting with the foundation, in july, we were asked to move away from ContentHandler since there is a potential for XML files to exceed a 1mb limit. at the moment, the extension is using ContentHandler and DOMDocument to read the XML Content because in june we were asked to use ContentHandler; we originally planned to read the XML as a file and use XMLReader, which would be more efficient.
in a subsequent reply to this thread, Brian offers a potential way of dealing with this issue. i’ll be able to take a look at his approach later this month, but if anyone can prove the concept beforehand or refer me to some code that has already done so, that would be great.
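The difference between loading the whole document (as DOMDocument does) and reading it record by record (as XMLReader does) can be sketched in Python with the standard library's incremental parser. This is an analogy only, not the extension's PHP code; the record tag name and the sample data are assumptions.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, record_tag):
    """Yield one dict per record element while the file is still being parsed.

    Assumes each record is a single parent whose children are simple text
    elements; if a child tag repeats, the last value wins.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            yield {child.tag: (child.text or "").strip() for child in elem}
            # Drop the finished record's children so they can be garbage collected.
            # The emptied record element itself stays attached to the root.
            elem.clear()


# Made-up sample input; a real Metadata Set would be read from a file instead.
sample = (
    b"<records>"
    b"<record><title>A</title><url>http://example.org/a.jpg</url></record>"
    b"<record><title>B</title><url>http://example.org/b.jpg</url></record>"
    b"</records>"
)

records = list(iter_records(io.BytesIO(sample), "record"))
for rec in records:
    print(rec)
```

In the extension, each yielded record would correspond to one Job for the Job Queue, so the full Metadata Set never has to be held in memory at once.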
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On Wed, Jul 24, 2013 at 08:59:25PM +0200, dan entous wrote:
Mapping
a mapping is a json that maps a metadata set to a mediawiki template. we’re currently storing those as Content in the namespace GWToolset. an entry might be in GWToolset:Metadata_Mappings/Dan-nl/Rijkmuseum.
- does that namespace make sense?
a. if not, what namespace would you recommend?
I'd say that the example you gave should give a better hint about what the namespace should be called: Metadata mapping. /wiki/Metadata_mapping:Rijkmuseum makes a lot more sense from a resource/ subresource perspective, since "Metadata mappings" wouldn't be a resource on its own, just a parent directory for other resources. And per-user directories probably wouldn't make much sense, IMO.
mediawiki template parameters
the application needs to know what mediawiki template parameters exist and are available to use for mapping media file metadata to the mediawiki templates. for the moment we are hard-coding these parameters in a db table and sometimes in the code. this is not ideal. i have briefly seen TemplateData, but haven’t had enough time to see if it would address our needs.
- is there a way to programmatically discover the available parameters for a mediawiki template?
TemplateData is, in fact, exactly what you need for that.
On Wed, Jul 24, 2013 at 9:45 PM, Mark Holmquist mtraceur@member.fsf.org wrote:
thanks for the reply Mark.
the mappings will serve a specific purpose. they will map potentially unique XML metadata formats and standard XML metadata formats to mediawiki template parameters. would the namespace Metadata_mappings (i prefer the plural because there will be many mappings) be too generic, or would that suffice for everyone?
i still believe that the use of the user name is important. two or more people could come up with their own version of how to map Rijksmuseum metadata with mediawiki template parameters, so if we continue with this namespacing concept the potential title would be : Metadata_mappings:Dan-nl/Rijksmusem.
one thing i forgot to mention was the addition of an extension to the title to help identify the format of the content of the title. we were planning to use .json, so the end title would be : Metadata_mappings:Dan-nl/Rijksmusem.json. would that make sense to everyone?
i took a look at the current implementation of TemplateData on commons and have not seen it used for the templates we're currently looking at. for now i will look into coding the use of TemplateData if present and, if not, falling back to our current db look-up implementation.

http://commons.wikimedia.org/w/api.php?action=templatedata&titles=Templa...
http://commons.wikimedia.org/w/api.php?action=templatedata&titles=Templa...
http://commons.wikimedia.org/w/api.php?action=templatedata&titles=Templa...
http://commons.wikimedia.org/w/api.php?action=templatedata&titles=Templa...
http://commons.wikimedia.org/w/api.php?action=templatedata&titles=Templa...
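The "use TemplateData if present, otherwise fall back" logic could look roughly like the Python sketch below. It works on an already decoded JSON response from the action=templatedata API rather than making a network call. The response shape (a "pages" object whose entries carry "title" and "params") is assumed, and the sample response and the fallback table are invented stand-ins, not real Commons data or the extension's db table.

```python
import json

# Hypothetical stand-in for the extension's current db look-up.
FALLBACK_PARAMS = {"Template:Artwork": ["artist", "title", "date"]}

# Made-up response in the assumed shape of action=templatedata with format=json.
SAMPLE_RESPONSE = json.loads("""
{"pages": {"1": {"title": "Template:Information",
                 "params": {"description": {}, "date": {}, "source": {}, "author": {}}}}}
""")


def template_params(api_response, title, fallback=FALLBACK_PARAMS):
    """Return parameter names for a template.

    Prefers TemplateData from the API response; if the template has none,
    falls back to the hard-coded table, and finally to an empty list.
    """
    for page in api_response.get("pages", {}).values():
        if page.get("title") == title and page.get("params"):
            return sorted(page["params"])
    return list(fallback.get(title, []))


print(template_params(SAMPLE_RESPONSE, "Template:Information"))
print(template_params(SAMPLE_RESPONSE, "Template:Artwork"))
```

With this arrangement the hard-coded table can shrink over time as TemplateData is added to the templates in question.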
-- Mark Holmquist Software Engineer, Multimedia Wikimedia Foundation mtraceur@member.fsf.org https://wikimediafoundation.org/wiki/User:MHolmquist
On Fri, Aug 02, 2013 at 11:57:36AM +0200, dan entous wrote:
the mappings will serve a specific purpose. they will map potentially unique XML metadata formats and standard XML metadata formats to mediawiki template parameters. would the namespace Metadata_mappings, i prefer plural because there will be many mappings, be too generic or would that suffice for everyone?
There are many articles. We use Article:. There are many users. We use User:. It makes little sense to depart from established practice.
i still believe that the use of the user name is important. two or more people could come up with their own version of how to map Rijksmuseum metadata with mediawiki template parameters, so if we continue with this namespacing concept the potential title would be : Metadata_mappings:Dan-nl/Rijksmusem.
But the Rijksmusem isn't a subresource of you. If anything I would suggest having a "base" and enabling subpages so users could add their own mappings, hopefully with more informative titles than just their usernames, like Metadata_mapping:Rijksmusem/No_publication_date or something. (admittedly I made something up but you get the idea)
one thing i forgot to mention was the addition of an extension to the title to help identify the format of the content of the title. we were planning to use .json, so the end title would be : Metadata_mappings:Dan-nl/Rijksmusem.json. would that make sense to everyone?
There's no need for this. Everything in this namespace would be JSON, so putting that information in the title twice would be silly.
i took a look at the current implementation of TemplateData on commons and have not seen it used for the templates we're currently looking at. for now i will look into coding the use of the TemplateData if present and if not, fallback to our current db look-up implementation.
http://commons.wikimedia.org/w/api.php?action=templatedata&titles=Templa... http://commons.wikimedia.org/w/api.php?action=templatedata&titles=Templa... http://commons.wikimedia.org/w/api.php?action=templatedata&titles=Templa... http://commons.wikimedia.org/w/api.php?action=templatedata&titles=Templa... http://commons.wikimedia.org/w/api.php?action=templatedata&titles=Templa...
It wouldn't be hard to add to these templates, and I've already done it for the Information template, so this would be a good idea to do now-or-soon. Interface with Nazmul, who is Rasel160, who's been working on auto-generating forms for Commons templates in UploadWizard, and see if you can't work together on this :)
Ta,
Dan,
Great move with the extension, I think it is a good way to integrate the GWT into Commons. I would like to make you aware of this proposal to move Commons towards linked data: http://commons.wikimedia.org/wiki/Commons:Wikidata_for_media_info
Obviously this move would simplify the work for you since you would know beforehand what are the properties available and use them as you please (unlike templates). Some other aspects would be perhaps more complicated, like determining which Wikidata item to link for a given author.
This move in Commons is supposed to happen "some time next year" (according to the talk page), so in the meantime using TemplateData, as Mark suggested, seems a good idea.
David
we initially developed the extension to store the files in the File: namespace, but we were told by the Foundation that we should use ContentHandler instead. unfortunately there is an issue with storing content > 1mb in the db so we need to find another solution.
- any suggestions?
What I would suggest is a hybrid approach. The metadata file gets uploaded and is stored using the FileBackend class. (There are a couple of extensions that store "files" without them being a file page. For example, the Score extension stores the rendered files on the server, but they are not attached to any file page.) Once the XML file is on the server, use ContentHandler to make a new content type that stores a reference to the file [instead of the original file] (probably in the form of a mediawiki virtual file url).
--bawolff
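The hybrid approach can be modelled in a few lines of Python to show why it sidesteps the size limit. This is a conceptual sketch only: the FileStore class is a directory-backed stand-in and not MediaWiki's FileBackend API, and the URL prefix, backend name, and JSON layout of the reference are hypothetical.

```python
import hashlib
import json
import os
import tempfile


class FileStore:
    """Directory-backed stand-in for a file backend (not MediaWiki's FileBackend)."""

    def __init__(self, root, prefix="mwstore://example-backend/metadata-sets/"):
        self.root = root
        self.prefix = prefix  # hypothetical virtual URL prefix

    def put(self, data):
        """Store the bytes under a content-derived name and return a virtual URL."""
        name = hashlib.sha1(data).hexdigest() + ".xml"
        with open(os.path.join(self.root, name), "wb") as fh:
            fh.write(data)
        return self.prefix + name

    def get(self, url):
        """Resolve a virtual URL issued by put() back to the stored bytes."""
        if not url.startswith(self.prefix):
            raise ValueError("unknown storage url: " + url)
        name = os.path.basename(url[len(self.prefix):])
        with open(os.path.join(self.root, name), "rb") as fh:
            return fh.read()


def make_reference_content(store, xml_bytes):
    """Return the small JSON text that would be saved as the page content."""
    reference = {"url": store.put(xml_bytes), "size": len(xml_bytes)}
    return json.dumps(reference, sort_keys=True)


def load_metadata_set(store, content_text):
    """Follow the stored reference and return the original XML bytes."""
    return store.get(json.loads(content_text)["url"])


store = FileStore(tempfile.mkdtemp())
xml_bytes = b"<records>" + b"<record/>" * 200000 + b"</records>"  # about 1.8 MB
content = make_reference_content(store, xml_bytes)
print(len(xml_bytes), len(content), content)
```

Because each page revision holds only the short reference, the page history can still act as version control over the Metadata Sets while the large files live outside the database.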
On Wed, Jul 31, 2013 at 7:19 PM, Brian Wolff bawolff@gmail.com wrote:
thanks for the response brian. i’ll be able to take a look at this at the end of the month. if you or anyone else has the time to prototype this, prove the concept and refer me to the code that would be great.
1. has anyone been able to prototype brian’s concept?
2. if we get this concept to work, is it an acceptable manner in which to store XML data in the foundation cluster?
   a. if not, any other suggestions on what we can do instead?