Hello there,
For research purposes, I would like to retrieve article text and the full revision history
(revision content, timestamps, usernames) for English Wikipedia articles in certain
categories (including subcategories), or possibly for a set of randomly selected articles.
I do not necessarily need the whole English Wikipedia.
I tried the Special:Export page
(
https://en.wikipedia.org/w/index.php?title=Special:Export&action=submit), but it
limits the output to 1000 revisions. It generates an XML document as output.
I have been reading about this online but still don't have a very clear picture.
I know there are downloadable dumps in compressed XML format, and it appears the same
content can also be downloaded in the form of a MySQL database.
I am familiar with Java, and have some experience with MySQL, XML, PHP, and HTML.
My questions are:
What are the best ways for me to get the information I need? Please be specific.
For example, if I download the data in XML format, do I use MediaWiki (PHP) to retrieve
the information from those XML documents, or is there a good Java XML parser for Wikipedia
that can retrieve my desired results?
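To make the Java option concrete, this is roughly the kind of parser I have in mind: a minimal StAX sketch, not a finished tool. It assumes the XML uses <page>, <title>, <revision>, <timestamp>, <contributor> (with <username> or <ip>), and <text> elements, as the Special:Export output appears to. The class and method names (ExportRevisionReader, read, Revision) are made up for illustration.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: streams a MediaWiki-style export and collects revisions.
final class ExportRevisionReader {

    /** One revision: page title, timestamp, username (or IP), and wikitext. */
    static final class Revision {
        final String title;
        final String timestamp;
        final String contributor;
        final String text;

        Revision(String title, String timestamp, String contributor, String text) {
            this.title = title;
            this.timestamp = timestamp;
            this.contributor = contributor;
            this.text = text;
        }
    }

    /** Reads every <revision> from the stream, in document order. */
    static List<Revision> read(Reader in) {
        List<Revision> out = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // DTDs are not needed for this format, so disable them.
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
            XMLStreamReader xml = factory.createXMLStreamReader(in);

            String title = null, timestamp = null, contributor = null, text = null;
            while (xml.hasNext()) {
                int event = xml.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    switch (xml.getLocalName()) {
                        case "title":
                            title = xml.getElementText();
                            break;
                        case "revision":
                            // Reset per-revision fields; the title carries over.
                            timestamp = contributor = text = null;
                            break;
                        case "timestamp":
                            timestamp = xml.getElementText();
                            break;
                        case "username":
                        case "ip":
                            contributor = xml.getElementText();
                            break;
                        case "text":
                            text = xml.getElementText();
                            break;
                        default:
                            break; // ignore ids, comments, sha1, etc.
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "revision".equals(xml.getLocalName())) {
                    out.add(new Revision(title, timestamp, contributor, text));
                }
            }
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed export XML", e);
        }
        return out;
    }
}
```

I would call it as ExportRevisionReader.read(new FileReader("export.xml")), where export.xml is a placeholder file name. Collecting everything into a list is only reasonable for small exports; for a full dump I assume each revision would have to be processed as it is read rather than stored.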
If the whole content can be downloaded to a MySQL database on my local computer, can I
write a Java program with SQL queries to get my desired results from the database, or
is MediaWiki better for retrieving the results from the database?
Thank you,
Ming