2010/10/13 Paul Houle paul@ontology2.com
Don't be intimidated by working with the data dumps. If you've got
an XML API that does streaming processing (I used .NET's XmlReader) and you use the old Unix trick of piping the output of bunzip2 into your program, it's really pretty easy.
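For readers who want to try the piping approach without .NET, here is a minimal sketch in Python rather than the XmlReader code Paul used. It assumes the dump has the usual MediaWiki export shape, with `<page>` elements containing a `<title>` and a `<text>`; the function names `iter_pages` and `main` are made up for illustration.

```python
import sys
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) for each <page> element read from a binary stream.

    Assumption: the stream holds a MediaWiki-style export where every
    <page> has a <title> and (inside a revision) a <text> element.
    """
    title, text = None, None
    # iterparse reads the stream incrementally instead of loading it whole.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            # Release the finished page's subtree; the root still keeps
            # an empty stub per page, which is small but not zero.
            elem.clear()


def main():
    # Read the decompressed dump from standard input, as fed by a pipe.
    for title, text in iter_pages(sys.stdin.buffer):
        print(title, len(text or ""))
```

With a call to `main()` added at the bottom and the file saved under a hypothetical name such as `read_dump.py`, it could be run as `bunzip2 -c dump.xml.bz2 | python read_dump.py`, so the uncompressed dump never has to be written to disk.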
When I worked on it.source (a small dump, something like 300 MB unzipped), I used a simple do-it-yourself Python string search routine and found it much faster than the Python XML routines. I presume my scripts are too rough to deserve sharing, but I encourage programmers to write a "simple dump reader" that exploits the speed of string search. My personal trick was to build an "index", i.e. a list of article names with pointers to where each article sits in the XML file, so that recovering their content was simple and fast. I used it mainly because I didn't understand the API at all. ;-)
Alex
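Alex's scripts were not shared, so the following is only a rough sketch of what such an index could look like, not a reconstruction of them. It assumes an already uncompressed dump in which `<page>`, `<title>` and `</page>` each appear on their own line (as MediaWiki dumps normally lay them out), and it treats the "pointers" as byte offsets; `build_index` and `read_page` are hypothetical names.

```python
import re

# Plain byte-level pattern matching; no XML parser is involved.
TITLE_RE = re.compile(rb"<title>(.*?)</title>")


def build_index(path):
    """Return {title: byte offset of the line holding that page's <page> tag}.

    Assumption: tags sit on separate lines. Titles are stored as they
    appear in the file, so XML escapes such as &amp; are not decoded.
    """
    index = {}
    page_offset = None
    with open(path, "rb") as f:
        while True:
            offset = f.tell()      # position before reading the line
            line = f.readline()
            if not line:
                break
            if b"<page>" in line:
                page_offset = offset
            elif page_offset is not None:
                match = TITLE_RE.search(line)
                if match:
                    index[match.group(1).decode("utf-8")] = page_offset
                    page_offset = None
    return index


def read_page(path, offset):
    """Return the raw XML of one page, from its offset up to </page>."""
    chunks = []
    with open(path, "rb") as f:
        f.seek(offset)             # jump straight to the article
        for line in f:
            chunks.append(line)
            if b"</page>" in line:
                break
    return b"".join(chunks).decode("utf-8")
```

The index is built in one pass over the file; after that, `read_page(path, index["Some title"])` seeks directly to the article instead of rescanning the dump. Whether this beats an XML parser on a given dump is something to measure rather than assume.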