Data update for
http://wiki.dbpedia.org/ideas/idea/261/dbpedia-citations-reference-challeng…
Thanks to your feedback (especially from the WikiCite community), we
managed to fix a few bugs and extend the coverage of the extracted
citations.
The new citation dumps come from the upcoming 2016-04 release and provide
roughly *14x more citation data* (from 7.1M triples to 97.5M triples).
We are sharing the results early for the DBpedia challenge here:
http://downloads.dbpedia.org/temporary/citations/
For those still not sure what they can do with our data, here's what we
managed to calculate at the airport while travelling. Imagine what you can
do with more time and a normal desk ;)
Did you know that the most cited Wikipedia...
books are about Football, WW2 and British songs?:
* (4853 articles) SEN Encyclopedia of AFL Footballers: Every AFL/VFL
Player Since 1897 ->
http://books.google.com/books?vid=ISBN978-1-921496-32-5
* (3191 articles) Die Ritterkreuzträger: 1939 - 1945 ->
http://books.google.com/books?vid=ISBN978-3-938845-17-2
* (2927 articles) Die Träger des Ritterkreuzes des Eisernen Kreuzes ->
http://books.google.com/books?vid=ISBN978-3-7909-0284-6
* (1958 articles) British Hit Singles & Albums ->
http://books.google.com/books?vid=ISBN1-904994-10-5
* (1694 articles) Das Deutsche Kreuz ->
http://books.google.com/books?vid=ISBN978-3-931533-45-8
scientific articles are about biology & astronomy?:
* 5210 http://doi.org/10.1073/pnas.242603899
* 3757 http://doi.org/10.1101/gr.2596504
* 2449 http://doi.org/10.1038/ng1285
* 1667 http://doi.org/10.1051/0004-6361:20078357
* 1445 http://doi.org/10.1007/bf00171763
websites are mostly about census data?:
* 51328 http://www.stat.gov.pl/broker/access/prefile/listPreFiles.jspa
* 21758 http://www.census.gov/geo/www/gazetteer/gazette.html
* 21741 http://www.census.gov/prod/www/decennial.html
* 11954 http://www.census.gov/popest/data/cities/totals/2014/SUB-EST2014.html
* 10680 http://globiz.pyraloidea.org/Pages/Reports/TaxonReport.aspx
Dates (citations with only dates and a reference needed):
* February 2007, 5463 times
* October 2010, 5245 times
* July 2015, 3919 times
* October 2015, 3916 times
* August 2015, 3885 times
(these come from the http://citation.dbpedia.org/hash/* IRIs)
See the following files for the complete lists:
http://downloads.dbpedia.org/temporary/citations/results.same-citations.dif…
(counts only references from different pages)
http://downloads.dbpedia.org/temporary/citations/results.same-citations.all…
(counts all references, even from the same page)
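The difference between the two counting modes can be sketched in a few lines of Python. This is only an illustration, not the shell script we used: the page names and citation identifiers are made up, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical (page, citation) pairs; not taken from the dumps.
references = [
    ("Page_A", "hypothetical-citation-1"),
    ("Page_A", "hypothetical-citation-1"),  # same citation used twice on one page
    ("Page_B", "hypothetical-citation-1"),
    ("Page_B", "hypothetical-citation-2"),
]


def count_all(refs):
    """Count every reference, including repeats on the same page."""
    return Counter(citation for _page, citation in refs)


def count_distinct_pages(refs):
    """Count each citation once per page that uses it."""
    return Counter(citation for _page, citation in set(refs))


all_counts = count_all(references)
page_counts = count_distinct_pages(references)
```

With this sample, the first citation is counted three times when all references count, but only twice when each page counts once.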
The top 10 domains in Wikipedia references are:
* 1561315 books.google.com
* 1540250 citation.dbpedia.org
* 836371 doi.org
* 154664 news.bbc.co.uk
* 132997 nytimes.com
* 129410 bbc.co.uk
* 101807 census.gov
* 101125 worldcat.org
* 89082 news.google.com
* 76503 ncbi.nlm.nih.gov
See the complete lists in:
http://downloads.dbpedia.org/temporary/citations/results.domains.count
http://downloads.dbpedia.org/temporary/citations/results.domains-distinct.c…
(counts distinct citations)
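If you want to reproduce a domain count like this on your own list of citation URLs, a minimal Python sketch could look as follows. It is not the script we used. Dropping a leading "www." is an assumption inferred from the list above (census.gov appears without it), and the function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_of(url):
    """Return the host of a URL, without a leading 'www.' (assumed convention)."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def count_domains(urls):
    """Count every URL occurrence per domain."""
    return Counter(domain_of(url) for url in urls)


def count_domains_distinct(urls):
    """Count each distinct URL only once per domain."""
    return Counter(domain_of(url) for url in set(urls))


# URLs taken from the lists in this mail; the repeated entry is for illustration.
sample = [
    "http://www.census.gov/geo/www/gazetteer/gazette.html",
    "http://www.census.gov/geo/www/gazetteer/gazette.html",
    "http://www.census.gov/prod/www/decennial.html",
    "http://doi.org/10.1038/ng1285",
    "http://books.google.com/books?vid=ISBN1-904994-10-5",
]

domain_counts = count_domains(sample)
distinct_counts = count_domains_distinct(sample)
```

Subdomains are kept as they are, which matches news.bbc.co.uk and bbc.co.uk being listed separately above.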
The articles with the most "citation needed" tags are:
* Football_records_in_Spain (41 citations needed)
* Ahmed_Belbachir_Haskouri (29 citations needed)
* Tree_model (24 citations needed)
* Immigration_to_Chile (21 citations needed)
* Larry_Ryckman (18 citations needed)
See here for the full list:
http://downloads.dbpedia.org/temporary/citations/results.articles-with-cita…
We extract data from many templates. Here are the top 10; the complete list
can be found here:
http://downloads.dbpedia.org/temporary/citations/results.template.count
* 9348109 Cite_web
* 2821628 Cite_news
* 1958270 Cite_book
* 1294760 Cite_journal
* 467933 Citation
* 317309 Citation_needed
* 46264 Cite_press_release
* 37315 Cn
* 36258 Cite_encyclopedia
* 33754 Cite_episode
We also have some basic statistics for templates with properties and
for properties alone:
http://downloads.dbpedia.org/temporary/citations/results.template.count
http://downloads.dbpedia.org/temporary/citations/results.template-property.…
Note that the statistics we provide are meant only as a proof of concept
and are based on the enwiki-20160305 dump.
You can regenerate them using this shell script:
http://downloads.dbpedia.org/temporary/citations/generate-basic-citation-st…
Cheers,
Dimitris on behalf of the OC
On Tue, Jun 7, 2016 at 10:51 AM, Dimitris Kontokostas <
kontokostas(a)informatik.uni-leipzig.de> wrote:
In the latest release (2015-10) DBpedia started exploring the citation and
reference data from Wikipedia, and we were pleasantly surprised by the
rich data
<http://downloads.dbpedia.org/preview.php?file=2015-10_sl_core-i18n_sl_en_sl_citation_data_en.ttl.bz2>
we managed to extract.
- citation_data_en.ttl.bz2
  <http://downloads.dbpedia.org/2015-10/core-i18n/en/citation_data_en.ttl.bz2>
  (sample
  <http://downloads.dbpedia.org/preview.php?file=2015-10_sl_core-i18n_sl_en_sl_citation_data_en.ttl.bz2>)
- citation_links_en.ttl.bz2
  <http://downloads.dbpedia.org/2015-10/core-i18n/en/citation_links_en.ttl.bz2>
  (sample
  <http://downloads.dbpedia.org/preview.php?file=2015-10_sl_core-i18n_sl_en_sl_citation_links_en.ttl.bz2>)
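To get a first feel for the size of one of these files after downloading it, a small Python sketch like the one below may help. It assumes the dump keeps one triple per line, with comment lines starting with "#"; that layout is an assumption on our side here, and the function name is hypothetical.

```python
import bz2


def count_triples(path):
    """Count non-empty, non-comment lines in a bz2-compressed dump.

    Assumes one triple per line, which is an assumption about the file layout.
    """
    total = 0
    with bz2.open(path, mode="rt", encoding="utf-8") as dump:
        for line in dump:
            stripped = line.strip()
            if stripped and not stripped.startswith("#"):
                total += 1
    return total
```

For example, `count_triples("citation_data_en.ttl.bz2")` would stream the downloaded file without unpacking it to disk first.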
This data holds huge potential, especially for the Wikidata challenge of providing
a reference source for every statement. It describes not only a lot of
bibliographical data, but also a lot of web pages and many other sources
around the web.
The data we extract at the moment is quite raw and can be improved in many
different ways. Some of the potential improvements are:
- Extend the citation extractor to handle other Wikipedia language
  editions <https://github.com/dbpedia/extraction-framework/issues/451>;
  currently only English Wikipedia is supported.
- Map the data to a relevant bibliographic ontology
  <https://github.com/dbpedia/mappings-tracker/issues/79> (there are
  many candidates and, although BIBO got the most votes, we are open to other
  ontologies).
- Map the data to existing bibliographic LOD (e.g. TEL has 100M records,
  Worldcat 300M) or online books (e.g. Google Books). See the citationIri
  issue <https://github.com/dbpedia/extraction-framework/issues/452>.
- Find ways to merge or fuse identical citations from multiple articles.
- Use the citation data in the Wikidata primary sources tool
  <https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool>.
- Surprise us with your ideas!
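As one possible starting point for merging identical citations from the list above, book citations could be grouped under a normalised ISBN. The Python sketch below is only an illustration of that idea, not a method we prescribe: the article names are made up and both function names are hypothetical.

```python
from collections import defaultdict


def normalise_isbn(raw):
    """Keep only digits and the letter X, dropping hyphens, spaces and prefixes."""
    return "".join(ch for ch in raw.upper() if ch.isdigit() or ch == "X")


def fuse_by_isbn(citations):
    """Group (article, isbn) pairs under one normalised ISBN key."""
    fused = defaultdict(set)
    for article, isbn in citations:
        fused[normalise_isbn(isbn)].add(article)
    return dict(fused)


# Hypothetical articles citing the same book with differently written ISBNs.
citations = [
    ("Article_A", "978-3-938845-17-2"),
    ("Article_B", "9783938845172"),
    ("Article_C", "1-904994-10-5"),
]

fused = fuse_by_isbn(citations)
```

This only catches spelling variants of the same ISBN. An ISBN-10 and an ISBN-13 for the same book would still end up under different keys, and citations without an ISBN need another strategy.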
We welcome contributions that improve the existing citation dataset in any
way, and we are open to collaboration and happy to help. Results will be
presented at the next DBpedia meeting: 15 September 2016 in Leipzig,
co-located with SEMANTiCS 2016. Each participant should submit a short
description of his/her contribution by Monday 12 September 2016 and present
his/her work at the meeting. Comments and questions can be posted on the
DBpedia discussion & developer lists or on our new DBpedia ideas page
<http://wiki.dbpedia.org/ideas/idea/261/dbpedia-citations-reference-challenge/>.
Submissions will be judged by the Organizing Committee and the best two
will receive a prize.
Organizing Committee
- Vladimir Alexiev, Ontotext and DBpedia BG
- Anastasia Dimou, Ghent University, iMinds
- Dimitris Kontokostas, KILT/AKSW, DBpedia Association
--
Dimitris Kontokostas
Department of Computer Science, University of Leipzig & DBpedia
Association
Projects:
http://dbpedia.org,
http://rdfunit.aksw.org,
http://aligned-project.eu
Homepage:
http://aksw.org/DimitrisKontokostas
Research Group: AKSW/KILT
http://aksw.org/Groups/KILT