Hi,
If you want to do more NLP research on enwiki and having an NLP markup of Wikipedia is the bottleneck, you should look at the WIKI dataset just released by Chris Re's team at Stanford, based on a snapshot of enwiki from late January 2015. You can find this and other interesting datasets released by the team at http://deepdive.stanford.edu/doc/opendata/ . The data format is explained at the top of the page.
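To give a sense of how a release like this might be consumed, here is a minimal sketch in Python. The file name, the column names ("words", "pos_tags"), and the array-style encoding of the annotation columns are all assumptions made for illustration; the authoritative schema is the one documented at the top of the page above.

    import csv

    # Hypothetical reader: iterate over a tab-separated dump where each row
    # holds one sentence with array-valued NLP annotation columns. The file
    # name and column names here are assumptions, not the released schema.
    def read_sentences(path):
        with open(path, encoding="utf-8") as f:
            reader = csv.DictReader(f, delimiter="\t")
            for row in reader:
                # Assume array columns are encoded like {w1,w2,...}; adjust
                # to whatever delimiter the real format documents.
                words = row["words"].strip("{}").split(",")
                pos_tags = row["pos_tags"].strip("{}").split(",")
                yield list(zip(words, pos_tags))

    # Example use: count proper-noun tokens per sentence.
    for sent in read_sentences("wiki.tsv"):
        print(sum(1 for _, tag in sent if tag.startswith("NNP")))

Adjust the parsing to the documented format before relying on it; this is only meant to show the shape of a sentence-per-row NLP markup file.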
Making the WIKI dataset required 24K machine hours. The team has access to more machine hours and is actively collecting feedback from the NLP community on which datasets to generate next. If you're interested in the recent release or have suggestions for other datasets based on publicly available data, please contact the team.