Hi again,
I finally got importDump.php to run, and it has already imported _161_ GB of the English Wikipedia into the database!
That seems to be far too much.
Here are the row counts: page: 204876, revision: 8263577, text: 22768801, user: 1, interwiki: 124.
Does this seem reasonable? 161 GB and still running? 22 million text entries? Only 1 user entry?
I would appreciate it if someone could give me a hint as to whether those numbers are correct!
One final comment: the 161 GB are in the InnoDB database (as seen under mysql/data).
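For reference, the per-table sizes can be read straight out of MySQL; a minimal sketch, assuming the standard MediaWiki table names:

  -- approximate on-disk footprint: Data_length + Index_length per table
  -- (for InnoDB these are estimates, but they show where the 161 GB is going)
  SHOW TABLE STATUS LIKE 'text';
  SHOW TABLE STATUS LIKE 'revision';
  SHOW TABLE STATUS LIKE 'page';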
Thank you and best regards,
Martina
On 12/5/05, Martina Greiner martina.greiner@gmx.de wrote:
> Hi again,
> I finally got importDump.php to run, and it has already imported _161_ GB of the English Wikipedia into the database!
Shouldn't the database size be roughly the size of the uncompressed xml file? You can look at that and see if it is similar.
Are you using the pages_full.xml dump, or pages_current.xml? The former contains all the history for all pages, while the latter only has the latest version and should be on the order of 1 million entries (articles + images + other namespaces).
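If you're not sure which one you imported, a rough check against the database itself (assuming the standard table names) is to compare the two counts; a ratio near 1 means current versions only, a much larger ratio means full history:

  SELECT COUNT(*) FROM page;
  SELECT COUNT(*) FROM revision;
  -- revision/page close to 1   -> pages_current
  -- revision/page much larger  -> pages_full (all history)

With the counts you posted, that ratio is about 40, which points to the full-history dump.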
Alfio
Martina Greiner wrote:
>> Shouldn't the database size be roughly the size of the uncompressed xml file? You can look at that and see if it is similar.
> The size of the uncompressed files is roughly 100 GB.
The actual database will include indexes, padding, etc., which increases the amount of space used.
Currently the dump importer doesn't support compressed text storage in the database; if added, this could reduce the amount of disk space required (but of course it'll be harder to work with).
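If you want to see how the imported text is actually stored, a rough check is the old_flags column (the text table keeps its old_* column names; a 'gzip' entry there would mark a compressed row, its absence means plain uncompressed text):

  -- group the stored revisions by storage flags; note this scans the whole table
  SELECT old_flags, COUNT(*) AS row_count
  FROM text
  GROUP BY old_flags;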
-- brion vibber (brion @ pobox.com)
On 12/5/05, Brion Vibber brion@pobox.com wrote:
> Martina Greiner wrote:
>>> Shouldn't the database size be roughly the size of the uncompressed xml file? You can look at that and see if it is similar.
>> The size of the uncompressed files is roughly 100 GB.
> The actual database will include indexes, padding, etc., which increases the amount of space used.
> Currently the dump importer doesn't support compressed text storage in the database; if added, this could reduce the amount of disk space required (but of course it'll be harder to work with).
Ah, this explains why the MySQL imports are so huge for me.
FWIW, the only metric I can provide is how much the full dump takes in PostgreSQL: it's about 91 GB in my past experience (a dump or two ago), with all indexes, link tables, etc. But it's not directly comparable, because PG auto-compresses fat fields (over about 2k) and the importer doesn't compress...
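If you're curious how much of that sits in PG's compressed TOAST storage, a rough estimate from the system catalog (relpages counts 8 kB blocks and is only refreshed by VACUUM/ANALYZE):

  SELECT relname, relpages * 8 / 1024 AS approx_mb
  FROM pg_class
  WHERE relname LIKE 'pg_toast%'
  ORDER BY relpages DESC
  LIMIT 10;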
On 12/5/05, Alfio Puglisi alfio.puglisi@gmail.com wrote:
> Shouldn't the database size be roughly the size of the uncompressed xml file? You can look at that and see if it is similar.
No, it's not that simple. There is not-insubstantial overhead from the database, but also gzip compression on the larger revisions.
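As a very rough illustration of what gzip buys on revision text, MySQL 4.1+ ships COMPRESS() (zlib, the same family of compression MediaWiki uses for its gzip-flagged text rows), so you can compare raw and compressed sizes on a small sample:

  -- compare raw vs zlib-compressed size for a few stored revisions
  SELECT old_id,
         LENGTH(old_text) AS raw_bytes,
         LENGTH(COMPRESS(old_text)) AS compressed_bytes
  FROM text
  LIMIT 10;

The larger the revision, the bigger the relative saving, which is why compressed text storage matters most for the full-history dump.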