On Tue, May 3, 2016 at 4:34 PM, Gergo Tisza <gtisza@wikimedia.org> wrote:
> There aren't many options other than content-scraping if you want to transform Wikipedia articles into some semblance of structured data. We even do it ourselves, for media metadata (and use an XML parser for it).
Actually, the XML parser was replaced with DOMDocument a while ago, which can handle HTML5 fine. But the point stands: HTML scraping is hardly an unusual requirement for reusers of our content.