Best regards,
Andrew Krizhanovsky
On Wed, Apr 20, 2011 at 12:15 PM, Torsten Zesch
<zesch@tk.informatik.tu-darmstadt.de> wrote:
Dear Mohamad,
thanks for compiling this comprehensive list.
You might want to add JWPL:
http://code.google.com/p/jwpl/
and WikipediaMiner:
http://wikipedia-miner.sourceforge.net/
-Torsten
From: wiki-research-l-bounces@lists.wikimedia.org
[mailto:wiki-research-l-bounces@lists.wikimedia.org] On Behalf Of mohamad mehdi
Sent: Monday, April 18, 2011 3:20 PM
To: wiki-research-l@lists.wikimedia.org
Subject: [Wiki-research-l] Wikipedia Literature Review - Tools and Data Sets
Hi everyone,
This is a follow-up to a previous thread (Wikipedia data sets) related to
the Wikipedia literature review (Chitu Okoli). As I mentioned in my previous
email, part of our study is to identify the data collection methods and data
sets used in Wikipedia studies. We therefore searched for online tools for
extracting Wikipedia articles and for pre-compiled data sets of Wikipedia
articles, and we compiled the following list. Please let us know of any other
sources you are aware of. Also, is there an existing Wikipedia page with such
a list that we could add to? If not, where would you suggest posting this
list so that it is noticeable and useful to the community?
http://download.wikimedia.org/ /* official Wikipedia database dumps */
http://datamob.org/datasets/tag/wikipedia /* multiple data sets (English Wikipedia articles transformed into XML) */
http://wiki.dbpedia.org/Datasets /* structured information from Wikipedia */
http://labs.systemone.at/wikipedia3 /* Wikipedia³, a conversion of the English Wikipedia into RDF; a monthly-updated data set containing around 47 million triples */
http://www.scribd.com/doc/9582/integrating-wikipediawordnet /* article about integrating WordNet and Wikipedia with YAGO */
http://www.infochimps.com/datasets/taxobox-wikipedia-infoboxes-with-taxonom…
http://www.infochimps.com/link_frame?dataset=11043 /* Wikipedia Datasets for the Hadoop Hack | Cloudera */
http://www.infochimps.com/link_frame?dataset=11166 /* Wikipedia: Lists of common misspellings/For machines */
http://www.infochimps.com/link_frame?dataset=11028 /* Building a (fast) Wikipedia offline reader */
http://www.infochimps.com/link_frame?dataset=11004 /* Using the Wikipedia page-to-page link database */
http://www.infochimps.com/link_frame?dataset=11285 /* List of films */
http://www.infochimps.com/link_frame?dataset=11598 /* MusicBrainz Database */
http://dammit.lt/wikistats/ /* Wikitech-l page counters */
http://snap.stanford.edu/data/wiki-meta.html /* complete Wikipedia edit history (up to January 2008) */
http://aws.amazon.com/datasets/2596?_encoding=UTF8&jiveRedirect=1 /* Wikipedia Page Traffic Statistics */
http://aws.amazon.com/datasets/2506 /* Wikipedia XML Data */
http://www-958.ibm.com/software/data/cognos/manyeyes/datasets?q=Wikipedia+ /* list of Wikipedia data sets */
Examples:
http://www-958.ibm.com/software/data/cognos/manyeyes/datasets/top-1000-acce… /* Top 1000 Accessed Wikipedia Articles */
http://www-958.ibm.com/software/data/cognos/manyeyes/datasets/wikipedia-hit… /* Wikipedia Hits */
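As an aside, several of the traffic data sets above (dammit.lt/wikistats and the AWS "Wikipedia Page Traffic Statistics" set) are published as hourly plain-text files. A minimal Python sketch of reading them follows; it assumes the usual "project page_title view_count bytes_transferred" line format, and the sample lines here are made up for illustration:

```python
# Hypothetical sample lines in the pagecount format used by the
# dammit.lt/wikistats hourly files (real files are much larger and
# typically gzip-compressed).
SAMPLE_LINES = [
    "en Main_Page 242332 4737756101",
    "en Special:Search 1203 9905011",
    "de Wikipedia 5112 88221011",
]

def parse_counts(lines):
    """Return a {(project, page_title): view_count} dict from pagecount lines."""
    counts = {}
    for line in lines:
        project, title, views, _bytes = line.split(" ")
        counts[(project, title)] = int(views)
    return counts

counts = parse_counts(SAMPLE_LINES)
print(counts[("en", "Main_Page")])  # 242332
```

For the real files you would stream line by line (e.g. through gzip.open) rather than hold everything in memory, since a single hour can contain millions of entries.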
Tools to extract data from Wikipedia:
http://www.evanjones.ca/software/wikipedia2text.html /* Extracting Text from Wikipedia */
http://www.infochimps.com/link_frame?dataset=11121 /* Wikipedia article traffic statistics */
http://blog.afterthedeadline.com/2009/12/04/generating-a-plain-text-corpus-… /* Generating a Plain Text Corpus from Wikipedia */
http://www.infochimps.com/datasets/wikipedia-articles-title-autocomplete
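Since most of the tools above ultimately consume the official XML dumps, here is a minimal Python sketch of the underlying extraction step: streaming (title, text) pairs out of a MediaWiki XML export without loading the whole file. The embedded sample document stands in for a real dump, and the export-schema version in the namespace URI varies between dumps, so treat the namespace string as an assumption:

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a MediaWiki XML export. Real dumps from
# download.wikimedia.org use the same <page>/<title>/<revision>/<text>
# layout, but the export-schema version in the namespace may differ.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.5/">
  <page>
    <title>Example</title>
    <revision><text>Example article text.</text></revision>
  </page>
</mediawiki>"""

NS = "{http://www.mediawiki.org/xml/export-0.5/}"

def iter_pages(stream):
    """Yield (title, text) pairs from a MediaWiki export, streaming."""
    for _event, elem in ET.iterparse(stream):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            text = elem.findtext(f"{NS}revision/{NS}text")
            yield title, text
            elem.clear()  # release already-processed pages from memory

pages = list(iter_pages(io.BytesIO(SAMPLE)))
print(pages)  # [('Example', 'Example article text.')]
```

With a real dump you would pass a (decompressed) file object instead of the BytesIO buffer; iterparse plus elem.clear() keeps memory use roughly constant even on multi-gigabyte files.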
Thank you,
Mohamad Mehdi
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l