Hello there,
For research purposes, I would like to retrieve article text and full revision histories (revision content, timestamps, usernames) for English Wikipedia articles in certain categories (including their sub-categories), or possibly for a set of randomly selected articles, but not necessarily for the whole English Wikipedia.
I tried the Export page (https://en.wikipedia.org/w/index.php?title=Special:Export&action=submit), but it limits exports to 1000 revisions per page, and its output is an XML document.
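For what it's worth, if only a manageable set of articles is needed rather than a full dump, the MediaWiki Action API (api.php with action=query&prop=revisions) can page through a page's complete history using a continuation token, so the 1000-revision export cap does not apply. A minimal Java sketch of building such a request follows; it only constructs the URL (the actual fetch and JSON parsing are left out), and note the server caps each batch lower when revision content is requested:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class RevisionQuery {
    // Build an Action API URL requesting one batch of revisions for a page.
    // Looping with the "rvcontinue" token returned in each response
    // eventually walks the full revision history.
    static String revisionUrl(String title, String rvcontinue) {
        StringBuilder sb = new StringBuilder("https://en.wikipedia.org/w/api.php?action=query")
            .append("&prop=revisions")
            .append("&rvprop=ids%7Ctimestamp%7Cuser%7Ccontent") // ids|timestamp|user|content
            .append("&rvlimit=max")   // server decides the batch size; smaller when content is included
            .append("&format=json")
            .append("&titles=").append(URLEncoder.encode(title, StandardCharsets.UTF_8));
        if (rvcontinue != null) {
            // Token copied verbatim from the previous response's "continue" block.
            sb.append("&rvcontinue=").append(URLEncoder.encode(rvcontinue, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(revisionUrl("Alan Turing", null));
    }
}
```

Fetching each URL (for example with java.net.http.HttpClient) and repeating until the response no longer contains a continuation token yields every revision.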
I have been reading around online but still don't have a clear picture. I know there are downloadable database dumps compressed in XML format, and it appears the same content can also be imported into a MySQL database.
I am familiar with Java, and have some experience with MySQL, XML, PHP, and HTML.
My question is: what would be the best way for me to get this information? Please be specific.
For example, if I download the data in XML format, should I use MediaWiki (PHP) to extract the information from those XML documents, or is there a good Java XML parser for Wikipedia dumps that could produce my desired results?
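In case it helps frame the question: MediaWiki itself isn't needed just to read the XML. The JDK's built-in StAX streaming parser (javax.xml.stream) can walk a dump or export file forward without loading it into memory, which is what makes multi-gigabyte files tractable. A rough sketch, run here on an inline sample shaped like the export format (real dumps add an XML namespace and many more fields, which getLocalName() and this skip-unknown-elements style tolerate):

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.util.ArrayList;
import java.util.List;

public class DumpParser {
    // Stream dump-style XML and collect "title|timestamp|username|textLength"
    // for every <revision> element. StAX never builds a DOM, so memory use
    // stays flat regardless of file size.
    public static List<String> revisions(java.io.Reader in) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(in);
        List<String> out = new ArrayList<>();
        String title = null, ts = null, user = null;
        while (r.hasNext()) {
            if (r.next() == XMLStreamConstants.START_ELEMENT) {
                switch (r.getLocalName()) {
                    case "title":     title = r.getElementText(); break;
                    case "timestamp": ts    = r.getElementText(); break;
                    case "username":  user  = r.getElementText(); break;
                    case "text":
                        out.add(title + "|" + ts + "|" + user + "|" + r.getElementText().length());
                        break;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        String sample = "<mediawiki><page><title>Example</title>"
            + "<revision><timestamp>2020-01-01T00:00:00Z</timestamp>"
            + "<contributor><username>Alice</username></contributor>"
            + "<text>Hello world</text></revision></page></mediawiki>";
        System.out.println(revisions(new java.io.StringReader(sample)));
        // prints [Example|2020-01-01T00:00:00Z|Alice|11]
    }
}
```

For a real dump one would wrap the file stream in a decompressing reader (the dumps come bzip2-compressed) and write records out instead of accumulating them in a list.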
If the whole content can be imported into a MySQL database on my local computer, can I write a Java program with SQL queries to get my desired results from the database, or would MediaWiki be better for retrieving results from the database?
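My current understanding is that plain JDBC from Java would work for read-only queries once the dump is imported, without involving MediaWiki at all. As a sketch, assuming the classic (pre-1.31) MediaWiki schema, where revision wikitext lives in the text table via rev_text_id and the editor name in rev_user_text (newer schema versions move these behind slots/actor tables, so the column names would need checking against the actual import):

```java
public class RevisionSql {
    // Revision history of one article under the classic MediaWiki schema:
    // page -> revision (one row per edit) -> text (the stored wikitext).
    // Column names are assumptions tied to the pre-1.31 schema; verify
    // against the imported database before use.
    public static final String REVISION_HISTORY =
        "SELECT p.page_title, r.rev_timestamp, r.rev_user_text, t.old_text " +
        "FROM page p " +
        "JOIN revision r ON r.rev_page = p.page_id " +
        "JOIN text t ON t.old_id = r.rev_text_id " +
        "WHERE p.page_namespace = 0 AND p.page_title = ? " +
        "ORDER BY r.rev_timestamp";

    public static void main(String[] args) {
        // Intended use over JDBC, e.g.:
        //   PreparedStatement st = conn.prepareStatement(RevisionSql.REVISION_HISTORY);
        //   st.setString(1, "Alan_Turing");  // page titles store spaces as underscores
        System.out.println(REVISION_HISTORY);
    }
}
```

The JDK's java.sql package (DriverManager, PreparedStatement) plus a MySQL driver would be all the Java side needs.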
Thank you,
Ming