On 01/12/2012 05:37 AM, Sebastian Hellmann wrote:
Hello all, is there a query language for wiki syntax? (NOTE: I really do not mean the Wikipedia API here.)
I am looking for an easy way to scrape data from Wiki pages. In this way, we could apply a crowd-sourcing approach to knowledge extraction from Wikis.
There must be thousands of data scraping approaches. But is there one amongst them that has developed a "wiki scraper language"? Maybe with some sort of fuzziness involved, if the pages are too messy. I have not yet worked with the XML transformation of the wiki markup:
* action=expandtemplates
** generatexml - Generate XML parse tree
Is it any good for issuing XPath queries?
You could use an HTML parser to produce a DOM of the rendered document, and then process that using plain DOM methods or jQuery. An example would be the 'html5' node.js module, which produces a DOM compatible with jQuery. There are also more specialized HTML scraping libraries available in various languages.
Rendered HTML obviously misses some of the information available in the wiki source, so you might have to rely on CSS class / tag pairs to identify template output.
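
As a rough illustration of the class / tag idea, here is a minimal sketch in Python using only the standard library's html.parser rather than the node.js module mentioned above. The names ClassTagScraper and scrape, the "infobox" class, and the sample markup are all made up for this example; real template output will differ from wiki to wiki, so treat this as a starting point under those assumptions.

```python
from html.parser import HTMLParser


class ClassTagScraper(HTMLParser):
    """Collect the text inside every <tag class="... css_class ..."> element.

    Hypothetical helper for illustration; not part of any wiki tooling.
    """

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0      # nesting depth of `tag` inside the current match
        self.chunks = []    # text pieces of the current match
        self.results = []   # one string per matched element

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        if self.depth > 0:
            # Same tag nested inside a match: track it so we find the right end tag.
            self.depth += 1
        elif self.css_class in (dict(attrs).get("class") or "").split():
            # New match: the class attribute contains the wanted class name.
            self.depth = 1
            self.chunks = []

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append(" ".join(self.chunks))

    def handle_data(self, data):
        # Keep only non-whitespace text that falls inside a matched element.
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def scrape(html, tag, css_class):
    """Return the text of each element matching the tag / class pair."""
    parser = ClassTagScraper(tag, css_class)
    parser.feed(html)
    parser.close()
    return parser.results


# Assumed sample of rendered template output; not taken from a real page.
SAMPLE = """
<p>Intro text</p>
<table class="infobox vcard">
  <tr><th>Founded</th><td>1409</td></tr>
  <tr><th>City</th><td>Leipzig</td></tr>
</table>
"""

print(scrape(SAMPLE, "table", "infobox"))
```

This only recovers flat text per matched element. Anything that needs the row and cell structure, or tolerance for badly broken markup, is probably better served by a full DOM-building parser as described above.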
Gabriel