In the latest release (2015-10), DBpedia started exploring the citation and reference data from Wikipedia, and we were pleasantly surprised by the rich data we managed to extract (preview: http://downloads.dbpedia.org/preview.php?file=2015-10_sl_core-i18n_sl_en_sl_citation_data_en.ttl.bz2 ).
- citation_data_en.ttl.bz2 http://downloads.dbpedia.org/2015-10/core-i18n/en/citation_data_en.ttl.bz2 (sample http://downloads.dbpedia.org/preview.php?file=2015-10_sl_core-i18n_sl_en_sl_citation_data_en.ttl.bz2 )
- citation_links_en.ttl.bz2 http://downloads.dbpedia.org/2015-10/core-i18n/en/citation_links_en.ttl.bz2 (sample http://downloads.dbpedia.org/preview.php?file=2015-10_sl_core-i18n_sl_en_sl_citation_links_en.ttl.bz2 )
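For anyone who wants a quick look inside the dumps without loading them into a triple store, here is a minimal Python sketch that streams a .ttl.bz2 file. It assumes the file holds one triple per line in N-Triples-compatible syntax with IRI subjects and predicates; the helper names (parse_triple, iter_triples) are made up for illustration, and a real RDF parser is the safer choice for anything beyond exploration.

```python
import bz2
import re
from typing import Iterator, Optional, Tuple

# Assumed line shape: <subject> <predicate> <object-or-literal> .
TRIPLE_RE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(.*?)\s*\.\s*$')

Triple = Tuple[str, str, str]


def parse_triple(line: str) -> Optional[Triple]:
    """Return (subject, predicate, object), or None for blank,
    comment, or unrecognised lines."""
    line = line.strip()
    if not line or line.startswith('#'):
        return None
    match = TRIPLE_RE.match(line)
    if match is None:
        return None
    subj, pred, obj = match.groups()
    # Strip angle brackets from IRI objects; literals are kept verbatim.
    if obj.startswith('<') and obj.endswith('>'):
        obj = obj[1:-1]
    return subj, pred, obj


def iter_triples(path: str) -> Iterator[Triple]:
    """Stream triples from a bz2-compressed dump without unpacking it to disk."""
    with bz2.open(path, 'rt', encoding='utf-8') as handle:
        for line in handle:
            triple = parse_triple(line)
            if triple is not None:
                yield triple
```

Usage would look like `for s, p, o in iter_triples("citation_links_en.ttl.bz2"): ...`, with the path pointing at a locally downloaded copy of the file.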
This data holds huge potential, especially for the Wikidata challenge of providing a reference source for every statement. It describes not only a lot of bibliographical data, but also a lot of web pages and many other sources around the web.
The data we extract at the moment is quite raw and can be improved in many different ways. Some of the potential improvements are:
- Extend the citation extractor to handle other Wikipedia language editions https://github.com/dbpedia/extraction-framework/issues/451; currently only English Wikipedia is supported.
- Map the data to a relevant bibliographic ontology https://github.com/dbpedia/mappings-tracker/issues/79 (there are many candidates and, although BIBO got the most votes, we are open to other ontologies).
- Map the data to existing bibliographic LOD (e.g. TEL has 100M records, WorldCat 300M) or online books (e.g. Google Books). See the citationIri issue https://github.com/dbpedia/extraction-framework/issues/452.
- Find ways to merge / fuse identical citations from multiple articles.
- Use the citation data in the Wikidata primary sources tool https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool
- Surprise us with your ideas!
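As one possible starting point for the merge / fuse idea, citations of books could be grouped by a normalised ISBN. The sketch below is only an illustration under stated assumptions: the input is a list of (citation id, raw ISBN string) pairs that you have already pulled out of the data, the function names are hypothetical, and ISBN-10 values are converted to ISBN-13 with the standard 978 prefix. It does not validate check digits of the input and it ignores citations without an ISBN.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def isbn13_check_digit(first12: str) -> str:
    """Check digit for the first 12 digits of an ISBN-13 (weights 1,3,1,3,...)."""
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(first12))
    return str((10 - total % 10) % 10)


def normalize_isbn(raw: str) -> Optional[str]:
    """Return a bare 13-digit ISBN, or None if raw does not look like an ISBN."""
    s = ''.join(c for c in raw.upper() if c in '0123456789X')
    if len(s) == 13 and s.isdigit():
        return s
    if len(s) == 10 and s[:9].isdigit():
        # ISBN-10 -> ISBN-13: prefix 978, drop the old check digit, recompute.
        core = '978' + s[:9]
        return core + isbn13_check_digit(core)
    return None


def group_by_isbn(citations: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group citation ids that share the same normalised ISBN."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for citation_id, raw_isbn in citations:
        key = normalize_isbn(raw_isbn)
        if key is not None:
            groups[key].append(citation_id)
    return dict(groups)
```

With this, `normalize_isbn("1-904994-10-5")` and `normalize_isbn("978-1-904994-10-7")` both give `"9781904994107"`, so two citations written in different ISBN styles end up in the same group. Similar keys could be built from DOIs or cleaned URLs.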
We welcome contributions that improve the existing citation dataset in any way, and we are open to collaboration and happy to help. Results will be presented at the next DBpedia meeting: 15 September 2016 in Leipzig, co-located with SEMANTiCS 2016. Each participant should submit a short description of their contribution by Monday 12 September 2016 and present their work at the meeting. Comments and questions can be posted on the DBpedia discussion & developer lists or on our new DBpedia ideas page http://wiki.dbpedia.org/ideas/idea/261/dbpedia-citations-reference-challenge/ .
Submissions will be judged by the Organizing Committee and the best two will receive a prize.
Organizing Committee
- Vladimir Alexiev, Ontotext and DBpedia BG
- Anastasia Dimou, Ghent University, iMinds
- Dimitris Kontokostas, KILT/AKSW, DBpedia Association
* Data Update for http://wiki.dbpedia.org/ideas/idea/261/dbpedia-citations-reference-challenge... *
Thanks to your feedback (especially from the WikiCite community), we managed to fix a few bugs and extend the coverage of the extracted citations. The new citation dumps come from the upcoming 2016-04 release and provide *14x more citation data* (from 7.1M triples to 97.5M triples).
We are sharing the results early for the DBpedia challenge here: http://downloads.dbpedia.org/temporary/citations/
For those still not sure what they can do with our data, here's what we managed to calculate at the airport while travelling. Imagine what you can do with more time and a normal desk ;)
Did you know that the most cited Wikipedia...

books are about Football, WW2 and British songs?
* (4853 articles) SEN Encyclopedia of AFL Footballers: Every AFL/VFL Player Since 1897 -> http://books.google.com/books?vid=ISBN978-1-921496-32-5
* (3191 articles) Die Ritterkreuzträger: 1939 - 1945 -> http://books.google.com/books?vid=ISBN978-3-938845-17-2
* (2927 articles) Die Träger des Ritterkreuzes des Eisernen Kreuzes -> http://books.google.com/books?vid=ISBN978-3-7909-0284-6
* (1958 articles) British Hit Singles & Albums -> http://books.google.com/books?vid=ISBN1-904994-10-5
* (1694 articles) Das Deutsche Kreuz -> http://books.google.com/books?vid=ISBN978-3-931533-45-8

scientific articles are about biology & astronomy?
* 5210 http://doi.org/10.1073/pnas.242603899
* 3757 http://doi.org/10.1101/gr.2596504
* 2449 http://doi.org/10.1038/ng1285
* 1667 http://doi.org/10.1051/0004-6361:20078357
* 1445 http://doi.org/10.1007/bf00171763

websites are mostly about census?
* 51328 http://www.stat.gov.pl/broker/access/prefile/listPreFiles.jspa
* 21758 http://www.census.gov/geo/www/gazetteer/gazette.html
* 21741 http://www.census.gov/prod/www/decennial.html
* 11954 http://www.census.gov/popest/data/cities/totals/2014/SUB-EST2014.html
* 10680 http://globiz.pyraloidea.org/Pages/Reports/TaxonReport.aspx

dates (citations with only dates and a reference needed):
* February 2007, 5463 times
* October 2010, 5245 times
* July 2015, 3919 times
* October 2015, 3916 times
* August 2015, 3885 times
(these come from http://citation.dbpedia.org/hash/* IRIs)
See the following files for the complete lists:
* http://downloads.dbpedia.org/temporary/citations/results.same-citations.diff... (we count only references from different pages)
* http://downloads.dbpedia.org/temporary/citations/results.same-citations.all-... (we count all references, even from the same page)
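The difference between the two counting modes can be sketched in a few lines of Python. This is not the script we used; it assumes you already have (article, citation) pairs, for example from the citation links file, and the function name is made up for illustration.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_citations(pairs: Iterable[Tuple[str, str]]) -> Tuple[Counter, Counter]:
    """pairs are (article, citation) tuples.

    Returns (all_counts, distinct_page_counts):
    - all_counts counts every reference, even repeats on the same page;
    - distinct_page_counts counts each (article, citation) pair only once.
    """
    pairs = list(pairs)
    all_counts = Counter(citation for _, citation in pairs)
    distinct_page_counts = Counter(citation for _, citation in set(pairs))
    return all_counts, distinct_page_counts
```

A citation used three times in one article and once in another would get 4 in the first counter and 2 in the second.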
The top 10 domains in Wikipedia references are:
* 1561315 books.google.com
* 1540250 citation.dbpedia.org
* 836371 doi.org
* 154664 news.bbc.co.uk
* 132997 nytimes.com
* 129410 bbc.co.uk
* 101807 census.gov
* 101125 worldcat.org
* 89082 news.google.com
* 76503 ncbi.nlm.nih.gov
See the complete lists in:
* http://downloads.dbpedia.org/temporary/citations/results.domains.count
* http://downloads.dbpedia.org/temporary/citations/results.domains-distinct.co... (counts distinct citations)
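A domain count like the one above can be approximated with the Python standard library. The sketch below is an illustration rather than the script behind the published numbers: it assumes the citation IRIs are available as plain strings, and it drops a leading "www." so that www.census.gov is counted as census.gov, which is my guess at the normalisation and not something confirmed by the result files.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple
from urllib.parse import urlsplit


def domain_of(iri: str) -> Optional[str]:
    """Lower-cased host of an IRI without a leading 'www.', or None if absent."""
    try:
        host = urlsplit(iri).hostname
    except ValueError:
        # urlsplit rejects some malformed IRIs (e.g. broken IPv6 brackets).
        return None
    if host is None:
        return None
    return host[4:] if host.startswith('www.') else host


def top_domains(iris: Iterable[str], n: int = 10) -> List[Tuple[str, int]]:
    """Return the n most frequent domains as (domain, count) pairs."""
    counts = Counter()
    for iri in iris:
        domain = domain_of(iri)
        if domain is not None:
            counts[domain] += 1
    return counts.most_common(n)
```

To count distinct citations only, pass `set(iris)` instead of the raw list.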
The articles with the most citations needed are:
* Football_records_in_Spain (41 citations needed)
* Ahmed_Belbachir_Haskouri (29 citations needed)
* Tree_model (24 citations needed)
* Immigration_to_Chile (21 citations needed)
* Larry_Ryckman (18 citations needed)
See here for the full list: http://downloads.dbpedia.org/temporary/citations/results.articles-with-citat...
We extract data from many templates. Here's the top 10; the complete list can be found here: http://downloads.dbpedia.org/temporary/citations/results.template.count
* 9348109 Cite_web
* 2821628 Cite_news
* 1958270 Cite_book
* 1294760 Cite_journal
* 467933 Citation
* 317309 Citation_needed
* 46264 Cite_press_release
* 37315 Cn
* 36258 Cite_encyclopedia
* 33754 Cite_episode
We also have some basic statistics for templates with properties and for properties alone:
* http://downloads.dbpedia.org/temporary/citations/results.template.count
* http://downloads.dbpedia.org/temporary/citations/results.template-property.c...
Note that the statistics we provide are meant only as a proof of concept and are based on the enwiki-20160305 dump. You can regenerate them using this shell script: http://downloads.dbpedia.org/temporary/citations/generate-basic-citation-sta...
Cheers,
Dimitris, on behalf of the OC
On Tue, Jun 7, 2016 at 10:51 AM, Dimitris Kontokostas < kontokostas@informatik.uni-leipzig.de> wrote:
--
Dimitris Kontokostas
Department of Computer Science, University of Leipzig & DBpedia Association
Projects: http://dbpedia.org, http://rdfunit.aksw.org, http://aligned-project.eu
Homepage: http://aksw.org/DimitrisKontokostas
Research Group: AKSW/KILT http://aksw.org/Groups/KILT