On 27/01/2011 14:35, Luigi Assom wrote:
Here is another question, on a different topic:
We would like to examine the network properties of the wiki. There are already some results here and there, but we would like to take a closer look ourselves, with the eventual aim of improving the knowledge base.
To do that, we need access to the wiki's pages (only articles for now), with the article name, abstract, meta keywords, the internal hyperlinks connecting them, and the external hyperlinks.
We found the database dumps as gz files, but they are very large, and here is my question: how can we manipulate them with phpMyAdmin? Is there any other open source tool for handling data files of this size?
An easy way to get first results would be to have the database of articles with the above fields as XML. Even a portion of it would be interesting to work on for a demo project.
Hi Luigi, there are various tools for reading XML dump files and importing them into MySQL, which is probably the best option if you want to handle very large files like the dumps for the English Wikipedia. See here: http://meta.wikimedia.org/wiki/Data_dumps#Tools
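The common trick those tools use is to stream through the dump rather than load it into memory, so file size stops being a problem. Just to illustrate the idea, here is a rough Python sketch (not one of the tools on that page; the dump filename and the link-counting are only placeholders for whatever you actually want to extract):

import gzip
import xml.etree.ElementTree as ET

def localname(tag):
    # Dumps carry an XML namespace that changes between export versions,
    # so compare only the local part of the tag.
    return tag.rsplit("}", 1)[-1]

with gzip.open("enwiki-latest-pages-articles.xml.gz", "rb") as dump:
    title = None
    for event, elem in ET.iterparse(dump, events=("end",)):
        name = localname(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            wikitext = elem.text or ""
            # Internal links appear as [[...]] markup in the wikitext.
            print(title, wikitext.count("[["))
        elif name == "page":
            elem.clear()  # discard the finished page to keep memory use flat

From there you can write the extracted fields into MySQL tables instead of printing them.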
If you're only interested in a subset of the articles, and only in the current revisions, another possibility is crawling the website via the MediaWiki API: http://www.mediawiki.org/wiki/API There are several client libraries; a Google query for your favourite language should return some pointers.
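For a first experiment you don't even need a client library, since the API is plain HTTP returning JSON. A minimal sketch (the page title is just an example; pages with more than 500 links need the continuation parameters described in the API docs):

import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

params = urllib.parse.urlencode({
    "action": "query",
    "prop": "links",
    "titles": "Network science",   # example article
    "plnamespace": 0,              # only links into the article namespace
    "pllimit": "max",
    "format": "json",
})

# Wikimedia asks clients to send a descriptive User-Agent.
req = urllib.request.Request(API + "?" + params,
                             headers={"User-Agent": "wiki-network-demo/0.1"})

with urllib.request.urlopen(req) as resp:
    data = json.load(resp)

for page in data["query"]["pages"].values():
    for link in page.get("links", []):
        print(link["title"])

That gives you the internal link structure page by page, which is enough to build a small network for a demo before committing to the full dumps.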