"The data set contains triples, each consisting of (i) text, a short, raw natural language string; (ii) url, a related concept, represented by an English Wikipedia article's canonical location; and (iii) count, an integer indicating the number of timestext has been observed connected with the concept's url. Our database thus includes weights that measure degrees of association."
"The database that we are providing was designed for recall. It is large and noisy, incorporating 297,073,139 distinct string-concept pairs, aggregated over 3,152,091,432 individual links".
Published in LREC 2012:
“A Cross-Lingual Dictionary for English Wikipedia Concepts”, Valentin I. Spitkovsky, Angel X. Chang, Eighth International Conference on Language Resources and Evaluation (LREC 2012). http://research.google.com/pubs/archive/38098.pdf
Hi all;
Just a quick notice about a new Google dataset related to Wikipedia.[1][2][3]
Regards,
emijrp
[1] http://googleresearch.blogspot.com.es/2012/05/from-words-to-concepts-and-back.html
[2] http://ebiquity.umbc.edu/blogger/2012/05/19/google-releases-database-linking-strings-and-concepts/
[3] http://www-nlp.stanford.edu/pubs/crosswikis-data.tar.bz2/
--Emilio J. Rodríguez-Posada. E-mail: emijrp AT gmail DOT com
Pre-doctoral student at the University of Cádiz (Spain)
Personal website: https://sites.google.com/site/emijrp/
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l