Hi all,
After putting quite some work into improving the DBpedia information extraction framework, we have released a new version of the DBpedia dataset today.
DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia and to link other datasets on the Web to Wikipedia data.
The DBpedia dataset describes 1,950,000 "things", including at least 80,000 persons, 70,000 places, 35,000 music albums, and 12,000 films. It contains 657,000 links to images, 1,600,000 links to relevant external web pages and 440,000 external links into other RDF datasets. Altogether, the DBpedia dataset consists of around 103 million RDF triples.
The dataset has been extracted from the July 2007 dumps of the English, German, French, Spanish, Italian, Portuguese, Polish, Swedish, Dutch, Japanese, Chinese, Russian, Finnish and Norwegian editions of Wikipedia, and it contains descriptions in all of these languages.
Compared to the previous version, we made the following improvements:
1. Improved the Data Quality
We increased the quality of the data by improving the DBpedia information extraction algorithms. So if you decided that the old version of the dataset was too dirty for your application, please have another look; you will be surprised :-)
2. Third Classification Schema Added
We have added a third classification schema to the dataset. Besides the Wikipedia categorization and the YAGO classification, concepts are now also classified by associating them with WordNet synsets.
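For example, the instances of a given synset could be retrieved along the following lines (a minimal sketch; the property name and synset URI here are illustrative assumptions, so please check the dataset documentation for the exact terms):

PREFIX dbpprop: <http://dbpedia.org/property/>
SELECT ?thing
WHERE {
  # assumed property linking a concept to its WordNet synset
  ?thing dbpprop:wordnet_type
         <http://www.w3.org/2006/03/wn/wn20/instances/synset-musician-noun-1> .
}
LIMIT 10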
3. Geo-Coordinates
The dataset contains geo-coordinates for geographic locations. Geo-coordinates are expressed using the W3C Basic Geo Vocabulary. This enables location-based SPARQL queries.
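For example, a bounding-box query along the following lines becomes possible (a minimal sketch; it assumes the coordinates are stored as typed numeric literals):

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT ?thing ?lat ?long
WHERE {
  ?thing geo:lat ?lat .
  ?thing geo:long ?long .
  # rough bounding box around Berlin
  FILTER (?lat > 52.3 && ?lat < 52.7 && ?long > 13.1 && ?long < 13.7)
}
LIMIT 20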
4. RDF Links to Other Open Datasets
We have interlinked DBpedia with further open datasets and ontologies. The dataset now contains 440,000 external RDF links into the Geonames, Musicbrainz, WordNet, World Factbook, EuroStat, Book Mashup, DBLP Bibliography and Project Gutenberg datasets. Altogether, the network of interlinked data sources around DBpedia currently amounts to around 2 billion RDF triples which are accessible as Linked Data on the Web.
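For example, assuming the external links are expressed with owl:sameAs, as is common for this kind of interlinking, the links for a single resource can be listed like this:

PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?same
WHERE {
  # external RDF links pointing from a DBpedia resource into other datasets
  <http://dbpedia.org/resource/Berlin> owl:sameAs ?same .
}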
The DBpedia dataset is licensed under the terms of the GNU Free Documentation License. The dataset can be accessed online via a SPARQL endpoint and as Linked Data. It can also be downloaded in the form of RDF dumps.
Please refer to the DBpedia webpage for more information about the dataset and its use cases.
Many thanks to the following people for their excellent work:
1. Georgi Kobilarov (Freie Universität Berlin), who redesigned and improved the extraction framework and implemented many of the interlinking algorithms.
2. Piet Hensel (Freie Universität Berlin), who improved the infobox extraction code and wrote the unit test suite.
3. Richard Cyganiak (Freie Universität Berlin), for his advice on redesigning the architecture of the extraction framework and for helping to solve many annoying Unicode and URI problems.
4. Zdravko Tashev (OpenLink Software), for his patience in trying several times to import buggy versions of the dataset into Virtuoso.
5. OpenLink Software altogether, for providing the server that hosts the DBpedia SPARQL endpoint.
6. Sören Auer, Jens Lehmann and Jörg Schüppel (Universität Leipzig), for the original version of the infobox extraction code.
7. Tom Heath and Peter Coetzee (Open University), for the RDFS version of the YAGO class hierarchy.
8. Fabian M. Suchanek and Gjergji Kasneci (Max-Planck-Institut Saarbrücken), for allowing us to integrate the YAGO classification.
9. Christian Becker (Freie Universität Berlin), for writing the geo-coordinates and homepage extractors.
10. Ivan Herman, Tim Berners-Lee, Rich Knopman and many others for their bug reports.
Have fun exploring the new dataset :-)
Cheers
Chris
-- Chris Bizer Freie Universität Berlin Phone: +49 30 838 54057 Mail: chris@bizer.de Web: www.bizer.de
This is probably a dumb question, but I'm going to ask it anyway. ;-)
Is there somewhere we can find some simple documentation about how to build queries? And what do we do if we get unexpected results?
For example, I'm trying out an example query, using the query builder at http://wikipedia.aksw.org/index.php? (which I found under the "OnlineAccess" page on dbpedia.org, and which I'm assuming is an interface to the new DBpedia.org data, but please correct me if that's wrong), to find suburbs less than 10 kms from the CBD, using the following query:
Subject    Predicate    Object
?suburb    rdf:type     Category:Suburbs_of_Sydney
?suburb    location     <10
But that doesn't find any matches, which I'm guessing is because it seems to be based on Infoboxes ... so rephrased as an infobox query, from looking at the other examples I'm guessing it is something like this:
Subject              Predicate    Object
?Australian Place    type         suburb
?Australian Place    city         Sydney
?Australian Place    dist1        <10
However, that finds no matches ... but if we increase the distance to 200 kms, like so:
Subject              Predicate    Object
?Australian Place    type         suburb
?Australian Place    city         Sydney
?Australian Place    dist1        <200
(saved as: http://wikipedia.aksw.org/index.php?qid=145 ) ... then it finds 7 matches, which is good because it means the query probably made sense, but it should find many more. For example, why doesn't it match this page: http://en.wikipedia.org/w/index.php?title=Leichhardt%2C_New_South_Wales&... , which contains:

-----------
{{Infobox Australian Place
| type = suburb
| name = Leichhardt
| city = Sydney
...
| dist1 = 5
...
}}
-----------

... which was added some time in March - is the answer that the data on http://wikipedia.aksw.org is old? Or am I just phrasing the query incorrectly? Or should I be using some other site or tool to perform these queries on DBpedia?
Also I tried it at http://dbpedia.org/sparql , with the query below (which was just a wild guess at the syntax, as the 2204-page virtdocs.pdf seemed a bit overwhelming, but it's probably wrong), but it kept giving 503 Service unavailable errors:

-----------
select distinct ?Australian Place
where type = 'suburb' and city = 'Sydney';
-----------
-- All the best, Nick.
Hi Nick,
> Is there somewhere we can find some simple documentation about how to build queries? And what do we do if we get unexpected results?
As you already found out, there are two ways of querying DBpedia:
1. the SPARQL endpoint, which can be queried directly or via the SNORQL or OpenLink JavaScript query builders, and
2. the Leipzig Query Builder.
You can find documentation on how to write SPARQL queries at http://www.w3.org/TR/rdf-sparql-query/ and more links at http://en.wikipedia.org/wiki/SPARQL
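As a starting point, an infobox query like yours might look roughly like this in SPARQL (a sketch only; the property URIs and value forms are guesses from the infobox field names, so please check a resource page on dbpedia.org to confirm them):

PREFIX dbpprop: <http://dbpedia.org/property/>
SELECT ?place ?dist
WHERE {
  # property URIs are guessed from the infobox field names
  ?place dbpprop:type ?type .
  ?place dbpprop:city ?city .
  ?place dbpprop:dist1 ?dist .
  # values may be plain literals or resource links depending on the extraction,
  # so compare on the string form; the distance test assumes a numeric literal
  FILTER (STR(?type) = "suburb" && REGEX(STR(?city), "Sydney") && ?dist < 10)
}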
As we got lots of traffic after the announcement yesterday, the SPARQL endpoint is a bit unstable this morning. We are busy fixing this.
For questions and documentation on how to use the Leipzig Query Builder (http://wikipedia.aksw.org/index.php), please ask Sören Auer (auer@informatik.uni-leipzig.de) and Jens Lehmann (lehmann@informatik.uni-leipzig.de), who implemented the tool.
The queries from the Leipzig Query Builder do not run against the SPARQL endpoint, but against an RDF store in Leipzig. I don't know if this store has already been updated with the new dataset, but I think Sören and Jens will take care of this shortly.
All the best,
Chris
-- Chris Bizer Freie Universität Berlin Phone: +49 30 838 54057 Mail: chris@bizer.de Web: www.bizer.de
This project is very, very interesting for content syndication, but I would also like some documentation for my use case.

We at Wikimedia CH have registered www.wikipedia.ch, but we have no content on it yet.

I thought of including some content about Switzerland in three languages on the same page. This is very difficult to do with MediaWiki.

You understand that if I have the database, I can include the content in any page without MediaWiki.
Ilario Valdelli Wikimedia CH board member
Hi Ilario,
Yes, this can easily be done, as the dataset contains German, Italian and French short and long abstracts as well as links to the original Wikipedia pages in these languages.
The DBpedia data is accessible through a SPARQL query endpoint at http://DBpedia.org/sparql
You can for instance ask queries like:
SELECT ?name ?description_de ?description_fr ?musician
WHERE {
  ?musician skos:subject <http://dbpedia.org/resource/Category:German_musicians> .
  ?musician foaf:name ?name .
  OPTIONAL {
    ?musician rdfs:comment ?description_de .
    FILTER (LANG(?description_de) = 'de')
  }
  OPTIONAL {
    ?musician rdfs:comment ?description_fr .
    FILTER (LANG(?description_fr) = 'fr')
  }
}
to get the German and French abstract for German musicians.
To test this query with the SNORQL query builder, click on:
http://dbpedia.org/snorql/?query=SELECT+%3Fname+%3Fdescription_en+%3Fdescrip...
As the query endpoint is sometimes slow, you can also get all the data from our download page, filter out the German, French and Italian information, and store it locally.
Please refer to:
http://wiki.dbpedia.org/OnlineAccess for information about accessing the dataset via the SPARQL endpoint. http://wiki.dbpedia.org/Downloads for downloading the dumps.
You can find information about the SPARQL query language here: http://www.w3.org/TR/rdf-sparql-query/ and information about toolkits to store and process the dumps here: http://sites.wiwiss.fu-berlin.de/suhl/bizer/toolkits/index.htm
Please feel free to contact me or the dbpedia-discussion@lists.sourceforge.net mailing list with any further questions about how to use the dataset.
Cheers
Chris
-- Chris Bizer Freie Universität Berlin +49 30 838 54057 chris@bizer.de www.bizer.de