Dear all, I had a misconfigured mail client and did not receive any of your answers in January, so I wrongly concluded that the mailing list was inactive. I really have to apologize for not replying to your answers.
Since we assumed that nobody had replied, we went ahead and developed a generic, configurable scraper and used it on the English and German Wiktionary. The config files and data can be found here (it is part of DBpedia): [1] [2] [3]. We hope that it is generic enough to be applied to all language editions of Wiktionary and that it can also be used on other MediaWikis (e.g. travelwiki.org). Normally such a transformation is done by an Extract-Transform-Load (ETL) process. The E (extract) step can also be seen as a "select" or "query" procedure, hence my initial question about the "Wiki Query Language". If you have a good language for E, then T and L are easy ;)
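To make the ETL framing concrete, here is a minimal sketch in Python of how E can be seen as a "select" over wiki pages; the function bodies and the use of the query API are illustrative assumptions, not the actual DBpedia Wiktionary code:

# Minimal ETL sketch: E selects wiki markup, T turns it into records, L stores them.
# All names here are illustrative, not the actual DBpedia Wiktionary configuration.
import urllib.parse
import urllib.request


def extract(page_title):
    """E: 'select' the raw wiki markup of one page via the MediaWiki API."""
    url = ("https://en.wiktionary.org/w/api.php?action=query&prop=revisions"
           "&rvprop=content&format=json&titles=" + urllib.parse.quote(page_title))
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def transform(raw_json):
    """T: turn the selected markup into records (here: just count template calls)."""
    return {"template_calls": raw_json.count("{{")}


def load(record):
    """L: persist the transformed data (here: print instead of writing RDF)."""
    print(record)


load(transform(extract("suus")))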
One of the main problems that is still unsolved is scraping information from templates: to build a truly generic scraper, it would have to be able to "interpret" templates correctly. Templates are a good way to structure information and are, technically speaking, easy to scrape. The problem is rather that you need one config file per template to get "good" data. In Wikipedia, infoboxes can all be parsed with the same algorithm, but in DBpedia we still have to write so-called "mappings" to get good data: http://mappings.dbpedia.org/ Infoboxes are a special case, however, as they are all structured in a similar way, so the "mapping solution" only works for infoboxes.
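To illustrate what such a per-template config has to capture, here is a hypothetical mapping sketched in Python; the property names and the assumption that the first parameter of la-decl-1&2 is the stem are mine for illustration, not the actual DBpedia mappings:

# Hypothetical per-template mapping: which template parameter maps to which property.
# The semantics assumed here (first parameter = stem) are an illustration only.
TEMPLATE_MAPPINGS = {
    "la-decl-1&2": {          # Latin first/second declension template
        "param_1": "stem",    # {{la-decl-1&2|su}} -> stem = "su"
    },
    "infobox person": {       # Wikipedia-style infobox with named parameters
        "birth_date": "dateOfBirth",
        "birth_place": "placeOfBirth",
    },
}


def map_template(name, params):
    """Apply the mapping for one template call; unknown templates are skipped."""
    mapping = TEMPLATE_MAPPINGS.get(name)
    if mapping is None:
        return {}
    result = {}
    for key, value in params.items():
        prop = mapping.get(key)
        if prop is not None:
            result[prop] = value
    return result


print(map_template("la-decl-1&2", {"param_1": "su"}))   # {'stem': 'su'}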
It comes down to these two options:
a) create one scraper configuration per template, which captures the intention of its creator and makes it possible to "correctly" scrape the data from all pages
b) load all necessary template definitions into MediaWiki, transform the page to HTML or XML, and use XPath (or jQuery)
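For option b), one does not even need a local MediaWiki installation: the public parse API already does the expansion, and the rendered HTML can then be queried with XPath. A rough sketch; the XPath expression is an assumption about how the declension template renders and would have to be adapted per template:

# Option b) sketched against the live API: let MediaWiki expand the templates,
# then query the rendered HTML with XPath.
import json
import urllib.request

from lxml import html  # third-party, pip install lxml

url = ("https://en.wiktionary.org/w/api.php?action=parse&page=suus"
       "&prop=text&format=json")
with urllib.request.urlopen(url) as response:
    data = json.loads(response.read().decode("utf-8"))

rendered = data["parse"]["text"]["*"]          # the expanded HTML of the page
tree = html.fromstring(rendered)

# The XPath below is an assumption about the markup the declension template
# produces; in practice it has to be adjusted per template / per wiki.
cells = tree.xpath("//table//td//text()")
print(cells[:10])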
On 01/12/2012 03:38 PM, Oren Bochman wrote:
- the only application which (correctly!?) expands templates is MediaWiki itself.
(Thanks for your answer.) I agree that only MediaWiki can "correctly" expand templates, as it can interpret the code on the template pages. The MediaWiki parser can transform wiki markup into XML and HTML. (I am currently not aware of any other transformation options.)
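The cleanest way I see to delegate the expansion to MediaWiki without rendering a whole page is the expandtemplates API. A small sketch; the exact response layout differs between MediaWiki versions, so the key handling below is deliberately defensive:

# Let MediaWiki itself expand a template call via action=expandtemplates.
import json
import urllib.parse
import urllib.request

wikitext = "{{la-decl-1&2|su}}"
url = ("https://en.wiktionary.org/w/api.php?action=expandtemplates&format=json"
       "&text=" + urllib.parse.quote(wikitext))
with urllib.request.urlopen(url) as response:
    data = json.loads(response.read().decode("utf-8"))

# Older API versions return the expansion under expandtemplates["*"],
# newer ones under expandtemplates["wikitext"]; handle both.
expanded = (data["expandtemplates"].get("wikitext")
            or data["expandtemplates"].get("*"))
print(expanded)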
On 01/12/2012 07:06 PM, Gabriel Wicke wrote:
Rendered HTML obviously misses some of the information available in the wiki source, so you might have to rely on CSS class / tag pairs to identify template output.
(Thanks for your answer.) It misses some information, but on the other hand it also gains some. A good example is the inflection of the Latin word "suus" on Wiktionary: http://en.wiktionary.org/wiki/suus#Latin The wiki source contains only the heading and the template call, while the rendered HTML contains the fully expanded declension table:
====Inflection====
{{la-decl-1&2|su}}
To ask more precisely: Is there a best practice for scraping data from Wikipedia? What is the smartest way to resolve templates for scraping? Is there a third option I am not seeing?
On 01/12/2012 06:56 PM, Platonides wrote:
I don't think so. I think the most similar piece used are applying regex to the page. Which you may find too powerful/low-level.
Regex is effective, but it has its limits; we included it as one of our tools.
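For completeness, this is roughly what the regex tool does on the wiki source, using the suus example from above; the pattern is simplified and ignores nested templates and named parameters:

# Simplified regex extraction of template calls from wiki markup.
# Nested templates and named parameters are deliberately ignored here.
import re

wikitext = "====Inflection====\n{{la-decl-1&2|su}}"

# Matches {{name|param1|param2|...}} without nesting.
pattern = re.compile(r"\{\{([^|{}]+)\|([^{}]*)\}\}")

for match in pattern.finditer(wikitext):
    name = match.group(1).strip()
    params = [p.strip() for p in match.group(2).split("|")]
    print(name, params)        # la-decl-1&2 ['su']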
I hope this has not been TL;DR. Thanks again for your answers.

All the best,
Sebastian
[1] http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/extraction_framework/file/f4...
[2] http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/extraction_framework/file/f4...
[3] http://downloads.dbpedia.org/wiktionary/