On 09/04/2013 09:59 AM, Brian Wolff wrote:
This [1] looks quite acrobatic indeed. Can’t we make better use of the machine-readable markings provided by templates? https://commons.wikimedia.org/wiki/Commons:Machine-readable_data
[1] https://gerrit.wikimedia.org/r/#/c/80403/4/CommonsMetadata_body.php
It is using the machine-readable data from that page. (Although it's debatable how machine-readable "look for a <td> with this id, and then look at the contents of the next sibling <td> you encounter" really is.)
I'm somewhat of a newb, though, at extracting microformat-style metadata, so it's quite possible there is a better way, or some higher-level parsing library I could use (something like XPath maybe, although it's not really XML I'm looking at).
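
For illustration, the "find the <td> with this id, then read the next <td>" approach described above could be sketched roughly as follows. This is only a sketch in Python using the standard library's tolerant HTML parser, not the code from the Gerrit change. The class and function names and the sample markup are made up for the example, and it assumes the marker cell does not itself contain a nested table.

```python
from html.parser import HTMLParser


class CellAfterId(HTMLParser):
    """Collect the text of the first <td> that follows a <td> with a given id.

    Hypothetical helper for illustration; assumes the marker cell has no
    nested table inside it.
    """

    def __init__(self, marker_id):
        super().__init__()
        self.marker_id = marker_id
        # searching -> marker_seen -> in_value -> done
        self.state = "searching"
        self.depth = 0      # nesting depth of <td> inside the value cell
        self.chunks = []    # text fragments found inside the value cell

    def handle_starttag(self, tag, attrs):
        if self.state == "in_value":
            if tag == "td":
                self.depth += 1
            return
        if tag != "td":
            return
        if self.state == "searching" and dict(attrs).get("id") == self.marker_id:
            self.state = "marker_seen"
        elif self.state == "marker_seen":
            self.state = "in_value"
            self.depth = 1

    def handle_endtag(self, tag):
        if self.state == "in_value" and tag == "td":
            self.depth -= 1
            if self.depth == 0:
                self.state = "done"

    def handle_data(self, data):
        if self.state == "in_value":
            self.chunks.append(data)


def value_after(markup, marker_id):
    """Return the whitespace-normalised text of the cell after the marker, or None."""
    parser = CellAfterId(marker_id)
    parser.feed(markup)
    parser.close()
    text = " ".join("".join(parser.chunks).split())
    return text or None


# Made-up sample in the style of the Commons information table.
SAMPLE = """
<table class="fileinfotpl">
<tr><td id="fileinfotpl_aut">Author</td>
<td><a href="/wiki/User:Example">Example</a> and others</td></tr>
</table>
"""

print(value_after(SAMPLE, "fileinfotpl_aut"))
```

A tolerant event-based parser sidesteps the "it's not really XML" problem, at the cost of hand-written state tracking; an XPath query over a DOM built by a lenient HTML loader would express the same lookup more compactly.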
Parsoid might be able to help you with access to template parameters along with the fully expanded HTML that was produced from them. See [1].
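
As a rough sketch of what reading template parameters from Parsoid output might look like: the DOM spec linked as [1] marks transcluded content with typeof="mw:Transclusion" and carries the template target and parameters as JSON in a data-mw attribute. The exact JSON layout assumed below ("parts", "template", "target", "params", "wt") should be checked against the spec version in use; the helper names and the sample markup are invented for the example.

```python
import html
import json
from html.parser import HTMLParser


class TransclusionParams(HTMLParser):
    """Collect (template name, {parameter: wikitext}) pairs from data-mw attributes."""

    def __init__(self):
        super().__init__()
        self.templates = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # typeof may hold several space-separated values.
        if "mw:Transclusion" not in (attrs.get("typeof") or "").split():
            return
        data_mw = json.loads(attrs.get("data-mw") or "{}")
        for part in data_mw.get("parts", []):
            # Assumed layout: parts may also be plain strings of wikitext.
            if not isinstance(part, dict) or "template" not in part:
                continue
            tpl = part["template"]
            params = {name: value.get("wt")
                      for name, value in tpl.get("params", {}).items()}
            self.templates.append((tpl["target"]["wt"].strip(), params))


def template_params(markup):
    """Return every transcluded template found in the markup, in document order."""
    parser = TransclusionParams()
    parser.feed(markup)
    parser.close()
    return parser.templates


# Made-up sample: one {{Information}} transclusion with two parameters.
_DATA_MW = {
    "parts": [{
        "template": {
            "target": {"wt": "Information", "href": "./Template:Information"},
            "params": {"author": {"wt": "Example"}, "source": {"wt": "Own work"}},
            "i": 0,
        }
    }]
}
SAMPLE = (
    '<table about="#mwt1" typeof="mw:Transclusion" data-mw="{}">'
    "<tr><td>Example</td></tr></table>"
).format(html.escape(json.dumps(_DATA_MW), quote=True))

print(template_params(SAMPLE))
```

Compared with scraping rendered table cells, this reads the parameters as the editor wrote them, so it does not depend on how the template lays out its HTML.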
We are going to work on page metadata storage as well, see [2] and [3]. Maybe our storage work could eventually also provide a backend for you.
Gabriel
[1]: https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec#Template_content
[2]: https://bugzilla.wikimedia.org/show_bug.cgi?id=53508
[3]: https://bugzilla.wikimedia.org/show_bug.cgi?id=49143