Hi,
If you want to do NLP research on enwiki and an NLP markup of
Wikipedia is the bottleneck, you should look at the WIKI dataset just
released by Chris Re's team at Stanford, based on a snapshot of enwiki from
late January 2015. You can find this and other interesting datasets
released by the team at
http://deepdive.stanford.edu/doc/opendata/ The data
format is explained at the top of the page.
Making the WIKI dataset required 24K machine hours. The team has access
to more machine hours and is actively soliciting feedback from the NLP
community to guide the generation of more datasets. If you're interested in
the recent release or have suggestions for other datasets the team could
generate from publicly available data, please contact the team.
Best,
Leila