Revision: 5775
Author: wikipedian
Date: 2008-08-01 13:51:21 +0000 (Fri, 01 Aug 2008)
Log Message:
-----------
Added the ability to work directly on the compressed .bz2 file. This is somewhat slower, of course.
The rationale is that the XML dumps are becoming difficult to unpack. For example, the current pages-articles.xml for
the German Wikipedia is larger than 4 GB, so it can no longer be unpacked to a FAT32 partition. Even on another file system,
the unpacked dump takes up a lot of disk space.
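
For illustration only, here is a minimal standalone sketch of the same idea: hand the parser a file object that decompresses on the fly instead of a file name. It is not the code in xmlreader.py. It assumes Python 3 and the standard-library xml.etree.ElementTree (the commit itself uses cElementTree), and the helper names open_dump and iter_titles are hypothetical.

```python
import bz2
from xml.etree.ElementTree import iterparse


def open_dump(filename):
    """Return a binary file object for a plain or bz2-compressed XML dump.

    Hypothetical helper, not part of xmlreader.py.
    """
    if filename.endswith(".bz2"):
        # BZ2File decompresses incrementally as the parser reads from it,
        # so the unpacked dump never has to exist on disk
        return bz2.BZ2File(filename)
    # assume it's an uncompressed XML file
    return open(filename, "rb")


def iter_titles(filename):
    """Yield the text of every <title> element, whatever its namespace.

    Hypothetical helper, not part of xmlreader.py.
    """
    with open_dump(filename) as source:
        for event, elem in iterparse(source, events=("end",)):
            # tags look like "{namespace}title" when the dump declares one
            if elem.tag.rsplit("}", 1)[-1] == "title":
                yield elem.text
            # drop the content of elements that have already been seen
            elem.clear()
```

Calling iter_titles("dump.xml.bz2") and iter_titles("dump.xml") should then yield the same titles; only the way the file is opened differs.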
Modified Paths:
--------------
trunk/pywikipedia/xmlreader.py
Modified: trunk/pywikipedia/xmlreader.py
===================================================================
--- trunk/pywikipedia/xmlreader.py 2008-07-29 17:42:54 UTC (rev 5774)
+++ trunk/pywikipedia/xmlreader.py 2008-08-01 13:51:21 UTC (rev 5775)
@@ -254,8 +254,13 @@
     def new_parse(self):
         """Generator using cElementTree iterparse function"""
-
-        context = iterparse(self.filename, events=("start", "end", "start-ns"))
+        if self.filename.endswith('.bz2'):
+            import bz2
+            source = bz2.BZ2File(self.filename)
+        else:
+            # assume it's an uncompressed XML file
+            source = open(self.filename)
+        context = iterparse(source, events=("start", "end", "start-ns"))
         root = None
         for event, elem in context: