Dear all, I had a misconfigured mail client and did not receive any of your answers in January, so I wrongly concluded that the mailing list was inactive. I really have to apologize for not replying to your answers.
Since we assumed that nobody had replied, we went ahead and developed a generic, configurable scraper and used it on the English and German Wiktionary. The config files and data can be found here (it is part of DBpedia): [1] [2] [3]. We hope that it is generic enough to be applied to all language editions of Wiktionary and that it can also be used on other MediaWikis (e.g. travelwiki.org). Normally such a transformation is done by an Extract-Transform-Load (ETL) process. The E (extract) step can also be seen as a "select" or "query" procedure, hence my initial question about the "Wiki Query Language". If you have a good language for E, then T and L are easy ;)
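To make the ETL framing concrete, here is a minimal sketch in Python of how E can be seen as a "select" over wiki pages; the function bodies and the use of the query API are illustrative assumptions, not the actual DBpedia Wiktionary code:

# Minimal ETL sketch: E selects wiki markup, T turns it into records, L stores them.
# All names here are illustrative, not the actual DBpedia Wiktionary configuration.
import urllib.parse
import urllib.request


def extract(page_title):
    """E: 'select' the raw wiki markup of one page via the MediaWiki API."""
    url = ("https://en.wiktionary.org/w/api.php?action=query&prop=revisions"
           "&rvprop=content&format=json&titles=" + urllib.parse.quote(page_title))
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def transform(raw_json):
    """T: turn the selected markup into records (here: just count template calls)."""
    return {"template_calls": raw_json.count("{{")}


def load(record):
    """L: persist the transformed data (here: print instead of writing RDF)."""
    print(record)


load(transform(extract("suus")))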
One of the main problems that is still unsolved is scraping information from templates: to build a truly generic scraper, it would have to be able to "interpret" templates correctly. Templates are a good way to structure information and are, technically speaking, easy to scrape. The problem is rather that you need one config file per template to get "good" data. In Wikipedia, infoboxes can all be parsed with the same algorithm, but in DBpedia we still have to write so-called "mappings" to get good data: http://mappings.dbpedia.org/ Infoboxes are a special case, however, as they are all structured in a similar way, so the "mapping solution" only works for infoboxes.
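To illustrate what such a per-template config has to capture, here is a hypothetical mapping sketched in Python; the property names and the assumption that the first parameter of la-decl-1&2 is the stem are mine for illustration, not the actual DBpedia mappings:

# Hypothetical per-template mapping: which template parameter maps to which property.
# The semantics assumed here (first parameter = stem) are an illustration only.
TEMPLATE_MAPPINGS = {
    "la-decl-1&2": {          # Latin first/second declension template
        "param_1": "stem",    # {{la-decl-1&2|su}} -> stem = "su"
    },
    "infobox person": {       # Wikipedia-style infobox with named parameters
        "birth_date": "dateOfBirth",
        "birth_place": "placeOfBirth",
    },
}


def map_template(name, params):
    """Apply the mapping for one template call; unknown templates are skipped."""
    mapping = TEMPLATE_MAPPINGS.get(name)
    if mapping is None:
        return {}
    result = {}
    for key, value in params.items():
        prop = mapping.get(key)
        if prop is not None:
            result[prop] = value
    return result


print(map_template("la-decl-1&2", {"param_1": "su"}))   # {'stem': 'su'}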
It comes down to these two options:
a) create one scraper configuration per template, which captures the intention of its creator and makes it possible to "correctly" scrape the data from all pages
b) load all necessary template definitions into MediaWiki, transform the page to HTML or XML, and use XPath (or jQuery)
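For option b), one does not even need a local MediaWiki installation: the public parse API already does the expansion, and the rendered HTML can then be queried with XPath. A rough sketch; the XPath expression is an assumption about how the declension template renders and would have to be adapted per template:

# Option b) sketched against the live API: let MediaWiki expand the templates,
# then query the rendered HTML with XPath.
import json
import urllib.request

from lxml import html  # third-party, pip install lxml

url = ("https://en.wiktionary.org/w/api.php?action=parse&page=suus"
       "&prop=text&format=json")
with urllib.request.urlopen(url) as response:
    data = json.loads(response.read().decode("utf-8"))

rendered = data["parse"]["text"]["*"]          # the expanded HTML of the page
tree = html.fromstring(rendered)

# The XPath below is an assumption about the markup the declension template
# produces; in practice it has to be adjusted per template / per wiki.
cells = tree.xpath("//table//td//text()")
print(cells[:10])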
On 01/12/2012 03:38 PM, Oren Bochman wrote:
- the only application which (correctly!?) expands templates is MediaWiki itself.
(Thanks for your answer.) I agree that only MediaWiki can "correctly" expand templates, as it can interpret the code on the template pages. The MediaWiki parser can transform wiki markup into XML and HTML. (I am currently not aware of any other transformation options.)
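The cleanest way I see to delegate the expansion to MediaWiki without rendering a whole page is the expandtemplates API. A small sketch; the exact response layout differs between MediaWiki versions, so the key handling below is deliberately defensive:

# Let MediaWiki itself expand a template call via action=expandtemplates.
import json
import urllib.parse
import urllib.request

wikitext = "{{la-decl-1&2|su}}"
url = ("https://en.wiktionary.org/w/api.php?action=expandtemplates&format=json"
       "&text=" + urllib.parse.quote(wikitext))
with urllib.request.urlopen(url) as response:
    data = json.loads(response.read().decode("utf-8"))

# Older API versions return the expansion under expandtemplates["*"],
# newer ones under expandtemplates["wikitext"]; handle both.
expanded = (data["expandtemplates"].get("wikitext")
            or data["expandtemplates"].get("*"))
print(expanded)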
On 01/12/2012 07:06 PM, Gabriel Wicke wrote:
Rendered HTML obviously misses some of the information available in the wiki source, so you might have to rely on CSS class / tag pairs to identify template output.
(Thanks for your answer.) It misses some information, but on the other hand it also gains some. A good example is the inflection of the Latin word "suus" on Wiktionary: http://en.wiktionary.org/wiki/suus#Latin The wiki source contains only the heading and the template call, while the rendered HTML contains the fully expanded declension table:
====Inflection====
{{la-decl-1&2|su}}
To ask more precisely: Is there a best practice for scraping data from Wikipedia? What is the smartest way to resolve templates for scraping? Is there a third option I am not seeing?
On 01/12/2012 06:56 PM, Platonides wrote:
I don't think so. I think the most similar piece used are applying regex to the page. Which you may find too powerful/low-level.
Regex is effective, but it has its limits; we included it as one of our tools.
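For completeness, this is roughly what the regex tool does on the wiki source, using the suus example from above; the pattern is simplified and ignores nested templates and named parameters:

# Simplified regex extraction of template calls from wiki markup.
# Nested templates and named parameters are deliberately ignored here.
import re

wikitext = "====Inflection====\n{{la-decl-1&2|su}}"

# Matches {{name|param1|param2|...}} without nesting.
pattern = re.compile(r"\{\{([^|{}]+)\|([^{}]*)\}\}")

for match in pattern.finditer(wikitext):
    name = match.group(1).strip()
    params = [p.strip() for p in match.group(2).split("|")]
    print(name, params)        # la-decl-1&2 ['su']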
I hope this has not been TL;DR. Thanks again for your answers.

All the best,
Sebastian
[1] http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/extraction_framework/file/f4...
[2] http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/extraction_framework/file/f4...
[3] http://downloads.dbpedia.org/wiktionary/