On 27/01/2011 14:35, Luigi Assom wrote:
Here is another question, on a different topic:
We would like to examine the network properties of the wiki. There are already some results here and there, but we would like to take a closer look ourselves, with the eventual aim of improving the knowledge base.
To do that, we need access to the wiki's pages (only articles for now), with the article name, abstract, meta keywords, the internal hyperlinks connecting them, and the external hyperlinks.
We found the database dumps as gz files, but they are very large, and here is my question: how can we manipulate them with phpMyAdmin? Is there any other open source tool for handling data files of this size?
An easy way to get first results would be to have the database of articles with the above fields as XML. Even a portion of it would be interesting to work on for a demo project.
Hi Luigi, there are various tools for reading XML dump files and importing them into MySQL, which is probably the best option if you want to handle very large files like the dumps for the English Wikipedia. See here: http://meta.wikimedia.org/wiki/Data_dumps#Tools
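The common trick those tools use is to stream through the dump rather than load it into memory, so file size stops being a problem. Just to illustrate the idea, here is a rough Python sketch (not one of the tools on that page; the dump filename and the link-counting are only placeholders for whatever you actually want to extract):

import gzip
import xml.etree.ElementTree as ET

def localname(tag):
    # Dumps carry an XML namespace that changes between export versions,
    # so compare only the local part of the tag.
    return tag.rsplit("}", 1)[-1]

with gzip.open("enwiki-latest-pages-articles.xml.gz", "rb") as dump:
    title = None
    for event, elem in ET.iterparse(dump, events=("end",)):
        name = localname(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            wikitext = elem.text or ""
            # Internal links appear as [[...]] markup in the wikitext.
            print(title, wikitext.count("[["))
        elif name == "page":
            elem.clear()  # discard the finished page to keep memory use flat

From there you can write the extracted fields into MySQL tables instead of printing them.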
If you're only interested in a subset of the articles, and only in the current revisions, another possibility is crawling the website via the MediaWiki API: http://www.mediawiki.org/wiki/API There are several client libraries; a Google query for your favourite language should return some pointers.
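For a first experiment you don't even need a client library, since the API is plain HTTP returning JSON. A minimal sketch (the page title is just an example; pages with more than 500 links need the continuation parameters described in the API docs):

import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

params = urllib.parse.urlencode({
    "action": "query",
    "prop": "links",
    "titles": "Network science",   # example article
    "plnamespace": 0,              # only links into the article namespace
    "pllimit": "max",
    "format": "json",
})

# Wikimedia asks clients to send a descriptive User-Agent.
req = urllib.request.Request(API + "?" + params,
                             headers={"User-Agent": "wiki-network-demo/0.1"})

with urllib.request.urlopen(req) as resp:
    data = json.load(resp)

for page in data["query"]["pages"].values():
    for link in page.get("links", []):
        print(link["title"])

That gives you the internal link structure page by page, which is enough to build a small network for a demo before committing to the full dumps.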