Hi,
a short note on DBpedia. As a whole, DBpedia provides:
- data sets for download
- a public SPARQL web service (so you can query the data right away)
- software for parsing wikis (about 10-20 developers, all of them
volunteers except 2 or 3):
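As a quick illustration of that SPARQL service: a request against DBpedia's public endpoint can be built with nothing but the Python standard library. This is a minimal sketch, not anything from DBpedia's own documentation; the endpoint URL and the example query (fetching the English abstract of the article on Wikipedia) are my assumptions.

```python
import urllib.parse

# Assumption: DBpedia's public SPARQL endpoint lives here.
ENDPOINT = "https://dbpedia.org/sparql"

# Illustrative query: the English-language abstract of one resource,
# using full property URIs so no PREFIX declarations are needed.
QUERY = """\
SELECT ?abstract WHERE {
  <http://dbpedia.org/resource/Wikipedia>
      <http://dbpedia.org/ontology/abstract> ?abstract .
  FILTER (lang(?abstract) = "en")
}"""

def build_request_url(query, fmt="application/sparql-results+json"):
    """URL-encode a SPARQL query for an HTTP GET against the endpoint."""
    params = urllib.parse.urlencode({"query": query, "format": fmt})
    return ENDPOINT + "?" + params

url = build_request_url(QUERY)
# The resulting URL can then be fetched with urllib.request.urlopen(url);
# the response is a JSON document with a results/bindings list.
```

The same pattern works for any read-only query; only the `query` parameter changes.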
Hi everyone,
This is a follow-up to a previous thread ("Wikipedia data sets") related
to the Wikipedia literature review (Chitu Okoli). As I mentioned in my
previous email, part of our study is to identify the data collection
methods and data sets used in Wikipedia studies. We therefore searched
for online tools for extracting Wikipedia articles and for pre-compiled
data sets of Wikipedia articles, and were able to compile the following
list. Please let us know of any other sources you are aware of. We would
also like to know whether there is an existing Wikipedia page with such
a list, so we can add to it. Otherwise, where would you suggest posting
this list so that it is noticeable and useful to the community?
http://download.wikimedia.org/ /*
official Wikipedia database dumps */
http://datamob.org/datasets/tag/wikipedia /* Multiple
data sets (English Wikipedia articles that have been transformed into
XML) */
http://wiki.dbpedia.org/Datasets /*
Structured information from Wikipedia*/
http://labs.systemone.at/wikipedia3 /*
Wikipedia³ is a conversion of the English Wikipedia into RDF. It's a
monthly updated dataset containing around 47 million triples.*/
http://www.scribd.com/doc/9582/integrating-wikipediawordnet /*
an article on integrating WordNet and Wikipedia with YAGO */
http://www.infochimps.com/datasets/taxobox-wikipedia-infoboxes-with-taxonom…
http://www.infochimps.com/link_frame?dataset=11043 /* Wikipedia
Datasets for the Hadoop Hack | Cloudera */
http://www.infochimps.com/link_frame?dataset=11166 /* Wikipedia:
Lists of common misspellings/For machines */
http://www.infochimps.com/link_frame?dataset=11028 /* Building a
(fast) Wikipedia offline reader */
http://www.infochimps.com/link_frame?dataset=11004 /* Using the
Wikipedia page-to-page link database */
http://www.infochimps.com/link_frame?dataset=11285 /* List of films */
http://www.infochimps.com/link_frame?dataset=11598 /* MusicBrainz
Database */
http://dammit.lt/wikistats/ /* Wikitech-l page counters */
http://snap.stanford.edu/data/wiki-meta.html /* Complete Wikipedia
edit history (up to January 2008) */
http://aws.amazon.com/datasets/2596?_encoding=UTF8&jiveRedirect=1
/* Wikipedia Page Traffic Statistics */
http://aws.amazon.com/datasets/2506 /* Wikipedia XML Data */
http://www-958.ibm.com/software/data/cognos/manyeyes/datasets?q=Wikipedia+
/* list of Wikipedia data sets */
Examples:
http://www-958.ibm.com/software/data/cognos/manyeyes/datasets/top-1000-acce…
/* Top 1000 Accessed Wikipedia Articles */
http://www-958.ibm.com/software/data/cognos/manyeyes/datasets/wikipedia-hit…
/* Wikipedia Hits */
Tools to extract data from Wikipedia:
http://www.evanjones.ca/software/wikipedia2text.html /*
Extracting Text from Wikipedia */
http://www.infochimps.com/link_frame?dataset=11121 /*
Wikipedia article traffic statistics */
http://blog.afterthedeadline.com/2009/12/04/generating-a-plain-text-corpus-…
/* Generating a Plain Text Corpus from Wikipedia */
http://www.infochimps.com/datasets/wikipedia-articles-title-autocomplete
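For anyone starting from the official dumps listed above: the pages-articles dumps are large bzip2-compressed XML files that are best stream-parsed rather than loaded whole. A minimal sketch with Python's standard library follows; the inline sample is a toy stand-in for a real dump (real dump elements additionally carry the MediaWiki export XML namespace, and the file would be opened with `bz2.open(..., "rt")`).

```python
import io
import xml.etree.ElementTree as ET

# Toy fragment shaped like a pages-articles dump (namespace omitted
# for brevity; this is an illustrative stand-in, not real dump data).
SAMPLE = """<mediawiki>
  <page><title>Alpha</title><revision><text>Alpha text</text></revision></page>
  <page><title>Beta</title><revision><text>Beta text</text></revision></page>
</mediawiki>"""

def iter_pages(fileobj):
    """Yield (title, wikitext) pairs without loading the whole dump."""
    for _, elem in ET.iterparse(fileobj, events=("end",)):
        if elem.tag.endswith("page"):
            title = elem.findtext(".//title")
            text = elem.findtext(".//text")
            yield title, text
            elem.clear()  # free the finished subtree to keep memory flat

pages = list(iter_pages(io.StringIO(SAMPLE)))
print(pages[0])  # ('Alpha', 'Alpha text')
```

The `elem.clear()` call after each page is what keeps memory usage constant even on multi-gigabyte dumps.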
Thank you,
Mohamad Mehdi
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
--
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Homepage: