Hello all, is there a query language for wiki syntax? (NOTE: I really do not mean the Wikipedia API here.)
I am looking for an easy way to scrape data from Wiki pages. In this way, we could apply a crowd-sourcing approach to knowledge extraction from Wikis.
There must be thousands of data scraping approaches. But is there one amongst them that has developed a "wiki scraper language"? Maybe with some sort of fuzziness involved, if the pages are too messy. I have not yet worked with the XML transformation of the wiki markup:
* action=expandtemplates
  ** generatexml - Generate XML parse tree
Is it any good for issuing XPath queries?
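Roughly, I imagine something like the following untested sketch; the response layout (a parsetree element carrying the tree as an XML string) and the element names are my assumptions based on the preprocessor XML format:

# Untested sketch: ask the API for the XML parse tree of a snippet of wikitext
# and query it. The <parsetree>, <template> and <title> element names are
# assumptions, not something confirmed by the API documentation quoted above.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

API = "https://en.wiktionary.org/w/api.php"

def parse_tree(wikitext):
    params = urllib.parse.urlencode({
        "action": "expandtemplates",
        "text": wikitext,
        "generatexml": 1,
        "format": "xml",
    })
    with urllib.request.urlopen(API + "?" + params) as resp:
        api_xml = ET.fromstring(resp.read())
    # The parse tree itself arrives as an escaped XML string inside the response.
    return ET.fromstring(api_xml.findtext(".//parsetree"))

tree = parse_tree("{{la-decl-1&2|su}}")
# XPath-like query: list the titles of all templates in the snippet.
print([t.findtext("title") for t in tree.findall(".//template")])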
Thank you very much, Sebastian
Hi
Your question is too vaguely formulated - please correct it
On Thu, Jan 12, 2012 at 2:37 PM, Sebastian Hellmann <hellmann@informatik.uni-leipzig.de> wrote:
Hello all, is there a query language for wiki syntax? (NOTE: I really do not mean the Wikipedia API here.)
I am looking for an easy way to scrape data from Wiki pages. In this way, we could apply a crowd-sourcing approach to knowledge extraction from Wikis.
There must be thousands of data scraping approaches. But is there one amongst them that has developed a "wiki scraper language"? Maybe with some sort of fuzziness involved, if the pages are too messy. I have not yet worked with the XML transformation of the wiki markup:
* action=expandtemplates
  ** generatexml - Generate XML parse tree
Is it any good for issuing XPath queries?
1. XPath requires XML; MediaWiki markup is not XML.
2. The only application which (correctly!?) expands templates is MediaWiki itself.
3. You neglected to explain what you are trying to scrape and what constitutes a messy page.
Thank you very much, Sebastian
-- Dipl. Inf. Sebastian Hellmann Department of Computer Science, University of Leipzig Projects: http://nlp2rdf.org , http://dbpedia.org Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann Research Group: http://aksw.org
Wikitext-l mailing list Wikitext-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitext-l
Dear all, I had a misconfigured mail client and did not receive any of your answers in January. I concluded that the mailing list was simply not active. I really have to apologize for not replying to your answers.
Since we assumed that nobody had replied, we already started to develop a generic, configurable scraper and used it on the English and German Wiktionary. The config files and data can be found here (it is part of DBpedia): [1] [2] [3]. We hope that it is generic enough to be applied to all languages of Wiktionary and that it can also be used on other MediaWikis (e.g. travelwiki.org).

Normally a transformation is done by an Extract-Transform-Load (ETL) process. Generally the E (extract) can also be considered a "select" or "query" procedure. Hence my initial question about the "Wiki Query Language". If you have a good language for E, then T and L are easy ;)
One of the main still-unsolved problems is scraping information from templates: to build an effective generic scraper, you would need to be able to "interpret" templates correctly. Templates are a good way to structure information, and are easy to scrape (technically speaking). The problem is rather that you would need one config file per template to get "good" data. In Wikipedia, infoboxes can all be parsed with the same algorithm, but in DBpedia we still have to do so-called "mappings" to get good data: http://mappings.dbpedia.org/ Infoboxes are a special case, however, as they are all structured in a similar way. So the "mapping solution" only works for infoboxes.
It comes down to these two options:
a) create one scraper configuration for each template, which captures the intention of the template's creator and allows the data to be "correctly" scraped from all pages.
b) load all necessary template definitions into MediaWiki, transform the result to HTML or XML, and use XPath (or jQuery).
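To make (b) concrete, here is a rough, untested sketch of what I mean, using action=parse against the live wiki instead of a local MediaWiki install; the "inflection-table" CSS class is just a guess:

# Rough sketch of option (b): let MediaWiki expand everything, then query the
# rendered HTML with XPath. Uses the third-party lxml package; the CSS class
# used below is a guess, not something the wiki guarantees.
import json
import urllib.parse
import urllib.request
from lxml import html

API = "https://en.wiktionary.org/w/api.php"
params = urllib.parse.urlencode({
    "action": "parse",
    "page": "suus",
    "prop": "text",
    "format": "json",
})
with urllib.request.urlopen(API + "?" + params) as resp:
    rendered = json.load(resp)["parse"]["text"]["*"]

doc = html.fromstring(rendered)
# Template output has to be located via CSS classes / tag structure.
for table in doc.xpath("//table[contains(@class, 'inflection-table')]"):
    print(table.text_content()[:200])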
On 01/12/2012 03:38 PM, Oren Bochman wrote:
- the only application which (correctly!?) expands templates is MediaWiki itself.
(Thanks for your answer.) I agree that only MediaWiki can "correctly" expand templates, as it can interpret the code on the template pages. The MediaWiki parser can transform wiki markup into XML and HTML. (I am currently not aware of any other transformation options.)
On 01/12/2012 07:06 PM, Gabriel Wicke wrote:
Rendered HTML obviously misses some of the information available in the wiki source, so you might have to rely on CSS class / tag pairs to identify template output.
(Thanks for your answer.) It misses some information, but on the other hand it also gains some. A good example would be the inflection of the Latin word "suus" in Wiktionary: http://en.wiktionary.org/wiki/suus#Latin
====Inflection====
{{la-decl-1&2|su}}
To ask more precisely: Is there a best practice for scraping data from Wikipedia? What is the smartest way to resolve templates for scraping? Am I missing a third option?
On 01/12/2012 06:56 PM, Platonides wrote:
I don't think so. I think the most similar thing in use is applying regexes to the page, which you may find too powerful/low-level.
Regex is effective, but has its limits. We included it as a tool.
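As a trivial illustration of the regex approach and its limits (the pattern below is ad hoc and breaks as soon as parameters are named or templates nested):

# Ad-hoc regex extraction of a single template parameter from wikitext.
# Good enough for simple, well-known templates; nested templates or named
# parameters quickly defeat patterns like this.
import re

wikitext = "====Inflection====\n{{la-decl-1&2|su}}"
match = re.search(r"\{\{la-decl-1&2\|([^|}]+)\}\}", wikitext)
if match:
    print(match.group(1))  # prints: su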
I hope this has not been TL;DR and thanks again for your answers. All the best, Sebastian
[1] http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/extraction_framework/file/f4...
[2] http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/extraction_framework/file/f4...
[3] http://downloads.dbpedia.org/wiktionary/
Hello Sebastian,
It comes down to these two options:
a) create one scraper configuration for each template, which captures the intention of the template's creator and allows the data to be "correctly" scraped from all pages.
b) load all necessary template definitions into MediaWiki, transform the result to HTML or XML, and use XPath (or jQuery).
On 01/12/2012 03:38 PM, Oren Bochman wrote:
- the only application which (correctly!?) expands templates is MediaWiki itself.
(Thanks for your answer.) I agree that only MediaWiki can "correctly" expand templates, as it can interpret the code on the template pages. The MediaWiki parser can transform wiki markup into XML and HTML. (I am currently not aware of any other transformation options.)
we are currently working on http://www.mediawiki.org/wiki/Parsoid, a JS parser that by now expands templates well and also supports a few parser functions. We need to mark up template parameters for the visual editor in any case, and plan to employ HTML5 microdata or RDFa for this purpose (see http://www.mediawiki.org/wiki/Parsoid/HTML5_DOM_with_microdata). I intend to start implementing this sometime this month. Let us know if you have feedback / ideas on the microdata or RDFa design.
To ask more precisely: Is there a best practice for scraping data from Wikipedia? What is the smartest way to resolve templates for scraping? Am I missing a third option?
AFAIK most scraping is based on parsing the WikiText source. This gets you the top-most template parameters, which might already be good enough for many of your applications.
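As a minimal illustration of this kind of source-level extraction (a real parser has to handle parser functions, comments, nowiki sections and many other corner cases, so treat this as a sketch only):

# Illustration only: scan wikitext for top-level {{...}} transclusions and
# split their parameters at brace depth zero, so nested templates stay intact.

def split_top_level(body, sep="|"):
    parts, depth, current = [], 0, ""
    i = 0
    while i < len(body):
        two = body[i:i + 2]
        if two in ("{{", "}}"):
            depth += 1 if two == "{{" else -1
            current += two
            i += 2
        elif body[i] == sep and depth == 0:
            parts.append(current)
            current = ""
            i += 1
        else:
            current += body[i]
            i += 1
    parts.append(current)
    return parts

def top_level_templates(wikitext):
    templates, depth, start = [], 0, None
    i = 0
    while i < len(wikitext):
        two = wikitext[i:i + 2]
        if two == "{{":
            if depth == 0:
                start = i + 2
            depth += 1
            i += 2
        elif two == "}}":
            depth -= 1
            i += 2
            if depth == 0 and start is not None:
                name, *params = split_top_level(wikitext[start:i - 2])
                templates.append((name.strip(), params))
                start = None
        else:
            i += 1
    return templates

print(top_level_templates("{{la-decl-1&2|su}} text {{a|x={{b|y}}|z}}"))
# -> [('la-decl-1&2', ['su']), ('a', ['x={{b|y}}', 'z'])]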
We try to provide provenance information for expanded content in the HTML DOM produced by Parsoid. Initially this will likely focus on top-level arguments, as that is all we need for the editor. Extending this to nested expansions should be quite straightforward however, as provenance is tracked per-token internally.
Gabriel
On 03/04/2012 02:09 PM, Gabriel Wicke wrote:
Hello Sebastian,
It comes down to these two options:
a) create one scraper configuration for each template, which captures the intention of the template's creator and allows the data to be "correctly" scraped from all pages.
b) load all necessary template definitions into MediaWiki, transform the result to HTML or XML, and use XPath (or jQuery).
On 01/12/2012 03:38 PM, Oren Bochman wrote:
- the only application which (correctly!?) expands templates is MediaWiki itself.
(Thanks for your answer.) I agree that only MediaWiki can "correctly" expand templates, as it can interpret the code on the template pages. The MediaWiki parser can transform wiki markup into XML and HTML. (I am currently not aware of any other transformation options.)
we are currently working on http://www.mediawiki.org/wiki/Parsoid, a JS parser that by now expands templates well and also supports a few parser functions. We need to mark up template parameters for the visual editor in any case, and plan to employ HTML5 microdata or RDFa for this purpose (see http://www.mediawiki.org/wiki/Parsoid/HTML5_DOM_with_microdata). I intend to start implementing this sometime this month. Let us know if you have feedback / ideas on the microdata or RDFa design.
Awesome!! I forwarded it to the DBpedia developers. I think the Parsoid project might interest some of our people. How is it possible to join? Or is it a Wikimedia-internal development? Is there a Parsoid mailing list?
Can JS handle this? I read somewhere that it was several orders of magnitude slower than other languages... Maybe this is not true for node.js.
All the data in our mappings wiki was created to "mark up" Wikipedia template parameters. So please try to reuse it. I think there are almost 200 active users in http://mappings.dbpedia.org/ who have added extra parsing information to thousands of templates in Wikipedia across 20 languages. You can download and reuse it or we can also add your requirements to it.
All the best, Sebastian
On 03/05/2012 08:49 AM, Sebastian Hellmann wrote:
On 03/04/2012 02:09 PM, Gabriel Wicke wrote:
we are currently working on http://www.mediawiki.org/wiki/Parsoid, a JS parser that by now expands templates well and also supports a few parser functions. We need to mark up template parameters for the visual editor in any case, and plan to employ HTML5 microdata or RDFa for this purpose (see http://www.mediawiki.org/wiki/Parsoid/HTML5_DOM_with_microdata). I intend to start implementing this sometime this month. Let us know if you have feedback / ideas on the microdata or RDFa design.
Awesome!! I forwarded it to the DBpedia developers. I think the Parsoid project might interest some of our people. How is it possible to join? Or is it a Wikimedia-internal development? Is there a Parsoid mailing list?
This list (wikitext-l) is also the Parsoid mailing list.
As part of the MediaWiki project, Parsoid development is also open source and available for anyone to help with. We'd welcome your help!
Check out https://www.mediawiki.org/wiki/Parsoid#Getting_started and https://www.mediawiki.org/wiki/Parsoid/Todo for ways to jump in.
Awesome!! I forwarded it to the DBpedia developers. I think the Parsoid project might interest some of our people. How is it possible to join? Or is it a Wikimedia-internal development? Is there a Parsoid mailing list?
You are very welcome to join; http://www.mediawiki.org/wiki/Parsoid has most of the information to get you started. We are using this mailing list for discussions. You can also catch me in the #mediawiki IRC channel as gwicke.
Can JS handle this? I read somewhere that it was several orders of magnitude slower than other languages... Maybe this is not true for node.js.
Competition between JS runtimes has improved performance a lot in recent years. See for example the fun Computer Language Benchmarks Game: http://shootout.alioth.debian.org/u32/which-programming-languages-are-fastes...
It is still hard to beat C or C++ performance for memory-dominated tasks of course.
All the data in our mappings wiki was created to "mark up" Wikipedia template parameters. So please try to reuse it. I think there are almost 200 active users in http://mappings.dbpedia.org/ who have added extra parsing information to thousands of templates in Wikipedia across 20 languages. You can download and reuse it or we can also add your requirements to it.
Our primary requirement is marking up all top-level template arguments (and generated content like image thumbnails) to enable editing in the visual editor. The editor could however also benefit from type information, so refining vocabulary information (and perhaps mapping into an ontology) is also interesting to us. We should definitely collaborate on this.
What do you think about embedding schema information (maybe RDFa profiles?) into the noinclude section of a template page?
Gabriel
Dear Gabriel, I cross-posted this to the dbpedia-developers list. @DBpedia Team: although the text below might seem out of context, the Wikitext list is actually the one that overlaps the most with our main topic: parsing wiki syntax and templates. Please have a look at the http://www.mediawiki.org/wiki/Parsoid project.
@Gabriel: I think we should not include the markup in the <noinclude> section, but on the doc page of the template, so it also helps normal editors to better understand what the templates mean. Back then we actually designed our approach to work in this way and also attempted to add it to WP. Of course, we were using a naive WP:BOLD approach, which got deleted: http://en.wikipedia.org/w/index.php?title=Template:Infobox_person/doc&ol...
But we nevertheless used template syntax, hoping that one day it would be included in Wikipedia.
see http://mappings.dbpedia.org/index.php/Mapping:Infobox_actor
{{TemplateMapping
| mapToClass = Actor
| mappings =
    {{ PropertyMapping | templateProperty = name | ontologyProperty = foaf:name }}
    {{ PropertyMapping | templateProperty = birth_place | ontologyProperty = birthPlace }}
    {{ DateIntervalMapping | templateProperty = yearsactive
       | startDateOntologyProperty = activeYearsStartYear
       | endDateOntologyProperty = activeYearsEndYear }}
    ....
}}
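Just to illustrate the effect of such a mapping in code (a grossly simplified, hypothetical sketch with invented names, not our actual extraction framework):

# Hypothetical, heavily simplified illustration of applying a DBpedia-style
# mapping to template parameters that were already extracted from a page.
MAPPING = {
    "mapToClass": "Actor",
    "properties": {
        "name": "foaf:name",
        "birth_place": "birthPlace",
    },
}

def apply_mapping(template_params, mapping, subject="_:page"):
    # Emit one rdf:type triple plus one triple per mapped template parameter.
    triples = [(subject, "rdf:type", mapping["mapToClass"])]
    for template_property, ontology_property in mapping["properties"].items():
        if template_property in template_params:
            triples.append(
                (subject, ontology_property, template_params[template_property]))
    return triples

params = {"name": "Some Actor", "birth_place": "Leipzig"}
for triple in apply_mapping(params, MAPPING):
    print(triple)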
It kind of helps to interpret the template and parse out the values correctly. It seems that you are trying to do something similar. Maybe we can just modify or adapt our approach so that it also fits your requirements. I will be on holiday for the rest of March, so there will not be any mails from me for a while. Sebastian
On 12/01/12 14:37, Sebastian Hellmann wrote:
Hello all, is there a query language for wiki syntax? (NOTE: I really do not mean the Wikipedia API here.)
I am looking for an easy way to scrape data from Wiki pages. In this way, we could apply a crowd-sourcing approach to knowledge extraction from Wikis.
There must be thousands of data scraping approaches. But is there one amongst them that has developed a "wiki scraper language"? Maybe with some sort of fuzziness involved, if the pages are too messy.
I don't think so. I think the most similar thing in use is applying regexes to the page, which you may find too powerful/low-level.
On 01/12/2012 05:37 AM, Sebastian Hellmann wrote:
Hello all, is there a query language for wiki syntax? (NOTE: I really do not mean the Wikipedia API here.)
I am looking for an easy way to scrape data from Wiki pages. In this way, we could apply a crowd-sourcing approach to knowledge extraction from Wikis.
There must be thousands of data scraping approaches. But is there one amongst them that has developed a "wiki scraper language"? Maybe with some sort of fuzziness involved, if the pages are too messy. I have not yet worked with the XML transformation of the wiki markup:
* action=expandtemplates
  ** generatexml - Generate XML parse tree
Is it any good for issuing XPath queries?
You could use an HTML parser to produce a DOM of the rendered document, and then process that using plain DOM methods or jQuery. An example would be the 'html5' node.js module, which produces a DOM compatible with jQuery. There are also more specialized HTML scraping libraries available in various languages.
Rendered HTML obviously misses some of the information available in the wiki source, so you might have to rely on CSS class / tag pairs to identify template output.
Gabriel