Hi,
currently there is quite a mess in how dump files are handled. They are scattered across several locations with no system behind it, some public, some private, so there are inevitably duplicates that eat up space, and their naming is inconsistent.
Hence I've got this proposal:
I would like to set up a system for the overall handling of dumps, covering their storage, naming/linking and updating.
That would help reduce the used space, make it easier to move the entire dump storage (e.g. in the future we could dedicate HDD(s) to dumps only), simplify maintenance, etc.
The design of the storage system, naming and linking is nearly complete; the maintenance scripts work, though they might need tweaking in some cases, and some additional scripts might be necessary.
I think this could be a multi-maintainer project for a couple of people (if one is not around, another can step in), so is anybody active interested in joining?
---- FOR THOSE WHO USE DUMPS ----
Moving to the new system would mean for you: 1) you would have to submit a list of the dumps you use; 2) you would have to update your tools which use dumps to point at the shared dumps.
For some time during the transition we would keep symlinks at the old locations (instead of the files themselves), but the final step is to have the dumps in one place only.
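For illustration only, a rough sketch of what that transition step could look like (the paths, dump name and shared location below are made up; the real layout is still to be decided):

    #!/usr/bin/perl
    # Replace a user's private copy of a dump with a symlink into the shared
    # store, but only if both files exist and are identical.
    use strict;
    use warnings;
    use File::Compare;

    # hypothetical old (private) and new (shared) locations
    my $old = "$ENV{HOME}/dumps/dewiki-20111128-pages-articles.xml.bz2";
    my $new = "/mnt/user-store/dumps/dewiki/20111128/dewiki-20111128-pages-articles.xml.bz2";

    if (-f $old && -f $new && compare($old, $new) == 0) {
        unlink $old        or die "unlink $old: $!";
        symlink $new, $old or die "symlink $new -> $old: $!";
    }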
Questions, comments, suggestions?
Kind regards
Danny B.
On 09.12.2011 12:41, Danny B. wrote:
Questions, comments, suggestions?
When you have data to share, the main problem is usually finding someone who is able and willing to store multi-gigabyte files on their server and provide the necessary bandwidth for downloaders.
Collecting all dumps in one place begins with building a hosting-location with some terabytes of storage and a fast connection.
Peter
------------ Original message ------------ From: Peter Körner osm-lists@mazdermind.de
When you have data to share, the main problem is usually finding someone who is able and willing to store multi-gigabyte files on their server and provide the necessary bandwidth for downloaders.
Collecting all dumps in one place begins with building a hosting-location with some terabytes of storage and a fast connection.
We already have these dumps stored in /mnt/user-storage/<various places>, and a lot of people also have them in their ~.
The purpose is to have them in one place only, since at the moment they are very often duplicated across many places.
Also, only those dumps which are actually used by TS users are supposed to be stored; the proposal is not about mirroring dumps.wikimedia.org...
Danny B.
On 12/09/2011 12:52 PM, Danny B. wrote:
Also, only those dumps which are actually used by TS users are supposed to be stored; the proposal is not about mirroring dumps.wikimedia.org...
This is stupid. I suggest we change the ambition and start to actually mirror all of dumps.wikimedia.org.
On 09/12/11 12:52, Danny B. wrote:
We already have these dumps stored in /mnt/user-storage/<various places>, and a lot of people also have them in their ~.
The purpose is to have them in one place only, since at the moment they are very often duplicated across many places.
Also, only those dumps which are actually used by TS users are supposed to be stored; the proposal is not about mirroring dumps.wikimedia.org...
Danny B.
While /mnt/user-store/dump is a mess, we have a bit of organization at /mnt/user-store/dumps, where the files are kept in folders by dbname. They should additionally be categorised into folders by date, though.
I'm surprised by the number of uncompressed files there (i.e. .xml or .sql). In many cases it wouldn't even be necessary to decompress them.
On 12/09/2011 05:52 PM, Platonides wrote:
I'm surprised by the number of uncompressed files there (i.e. .xml or .sql). In many cases it wouldn't even be necessary to decompress them.
The popular pywikipediabot framework has an -xml: option, and I used to believe that it required the filename of an uncompressed XML file. But I was wrong. The following works just fine:
python replace.py -lang:da \
    -xml:../dumps/dawiki/dawiki-20110404-pages-articles.xml.bz2 \
    dansk svensk
If the following would also work (but it does not), we wouldn't have to worry about disk space at all:
python replace.py -lang:da \
    -xml:http://dumps.wikimedia.org/dawiki/20111202/dawiki-20111202-pages-articles.xm... \
    dansk svensk
On 12/09/2011 05:52 PM, Platonides wrote:
If the following would also work (but it does not), we wouldn't have to worry about disk space at all:
python replace.py -lang:da \
    -xml:http://dumps.wikimedia.org/dawiki/20111202/dawiki-20111202-pages-articles.xm... \
    dansk svensk
Would that not put a burden on the bandwidth, especially with repeated use of the same file? Unless the files were automatically cached... in the user-store?
Darkdadaah
On 09.12.2011 17:52, Platonides wrote:
While /mnt/user-store/dump is a mess, we have a bit of organization at /mnt/user-store/dumps, where the files are kept in folders by dbname. They should additionally be categorised into folders by date, though.
I'm surprised by the number of uncompressed files there (i.e. .xml or .sql). In many cases it wouldn't even be necessary to decompress them.
When I created the directory "dump" there was no directory "dumps". Today we can easily merge these two directories. In the future I will download the dumps into the directory "dumps" under the right project directory, like "dewiki" or so. I work with Perl and need the uncompressed XML file to read the dump. I have no idea how to read a compressed file with Perl. I only need the newest dump, so at the moment my script deletes all other dumps of a project and keeps only the newest and the second newest in the directory "dump".
Stefan (sk)
On Sat, Dec 10, 2011 at 14:18, Stefan Kühn kuehn-s@gmx.net wrote:
I work with Perl and need the uncompressed XML file to read the dump. I have no idea how to read a compressed file with Perl.
Is it sufficient to receive the XML on stdin or do you need to be able to seek?
It is trivial to give you XML on stdin e.g. $ < path/to/bz2 bzip2 -d | perl script.pl
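For concreteness, a minimal sketch of what such a script.pl could look like on the receiving end (the <page> counting is just a placeholder task):

    #!/usr/bin/perl
    # Reads the already-decompressed XML line by line from stdin,
    # so memory use stays small regardless of the dump size.
    use strict;
    use warnings;

    my $pages = 0;
    while (my $line = <STDIN>) {
        $pages++ if $line =~ /<page>/;
    }
    print "pages: $pages\n";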
-Jeremy
On 10.12.2011 20:52, Jeremy Baron wrote:
Is it sufficient to receive the XML on stdin or do you need to be able to seek?
It is trivial to give you XML on stdin e.g. $ < path/to/bz2 bzip2 -d | perl script.pl
Hmm, stdin is possible, but I think this would need a lot of RAM on the server. I don't think this is an option for the future. Every language wiki grows every day, and the dumps will grow too. The next problem is parallel use of a compressed file: if several users read the compressed file this way, then bzip2 will crash the server IMHO.
I think it is no problem to store the uncompressed XML files for easy usage. We should make rules about where they are kept and for how long, or we need a list where every user can say "I need only the two newest dumps of enwiki, dewiki, ...". If a dump is not needed, then we can delete it.
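As an aside, the "keep only the two newest dumps per project" rule is easy to script; a rough sketch, with a hypothetical directory layout and naming:

    #!/usr/bin/perl
    # Delete all but the two newest dumps in one (hypothetical) project directory.
    use strict;
    use warnings;

    my $dir   = "/mnt/user-store/dumps/dewiki";      # hypothetical path
    my @dumps = sort { $b cmp $a }                   # YYYYMMDD in the name, so string sort = date sort
                glob("$dir/dewiki-*-pages-articles.xml.bz2");
    my @stale = @dumps > 2 ? @dumps[2 .. $#dumps] : ();
    unlink @stale if @stale;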
Stefan (sk)
I hope at least a couple of you are subscribed to the xmldatadumps-l list to keep track of developments with the dumps. Last month I started running a sort of poor person's incremental dumps, very experimental at this point, but perhaps that will turn out to be useful for folks who are just looking to parse through the latest content.
Ariel
On Sun, 11-12-2011 at 10:45 +0100, Stefan Kühn wrote:
On 10.12.2011 20:52, Jeremy Baron wrote:
Is it sufficient to receive the XML on stdin or do you need to be able to seek?
It is trivial to give you XML on stdin e.g. $ < path/to/bz2 bzip2 -d | perl script.pl
Hmm, stdin is possible, but I think this would need a lot of RAM on the server. I don't think this is an option for the future. Every language wiki grows every day, and the dumps will grow too. The next problem is parallel use of a compressed file: if several users read the compressed file this way, then bzip2 will crash the server IMHO.
I think it is no problem to store the uncompressed XML files for easy usage. We should make rules about where they are kept and for how long, or we need a list where every user can say "I need only the two newest dumps of enwiki, dewiki, ...". If a dump is not needed, then we can delete it.
Stefan (sk)
On 11/12/11 10:45, Stefan Kühn wrote:
On 10.12.2011 20:52, Jeremy Baron wrote:
Is it sufficient to receive the XML on stdin or do you need to be able to seek?
It is trivial to give you XML on stdin e.g. $ < path/to/bz2 bzip2 -d | perl script.pl
Hmm, stdin is possible, but I think this would need a lot of RAM on the server. I don't think this is an option for the future. Every language wiki grows every day, and the dumps will grow too. The next problem is parallel use of a compressed file: if several users read the compressed file this way, then bzip2 will crash the server IMHO.
I think it is no problem to store the uncompressed XML files for easy usage. We should make rules about where they are kept and for how long, or we need a list where every user can say "I need only the two newest dumps of enwiki, dewiki, ...". If a dump is not needed, then we can delete it.
Stefan (sk)
You seem to think that piping the output from bzip2 will hold the XML dump uncompressed in memory until your script processes it. That's wrong: bzip2 begins uncompressing and writing to the pipe; when the pipe fills, it blocks. As your Perl script reads from the pipe, space is freed and the decompression can continue.
On Sun, Dec 11, 2011 at 10:47 AM, Platonides platonides@gmail.com wrote:
You seem to think that piping the output from bzip2 will hold the XML dump uncompressed in memory until your script processes it. That's wrong: bzip2 begins uncompressing and writing to the pipe; when the pipe fills, it blocks. As your Perl script reads from the pipe, space is freed and the decompression can continue.
This is correct, but the overall memory usage depends on the XML library and programming technique being used. For XML that is too large to comfortably fit in memory, there are techniques to allow for the script to process the data before the entire XML file is parsed (google "SAX" or "stream-oriented parsing"). But this requires more advanced programming techniques, such as callbacks, compared to the more naive method of parsing all the XML into a data structure and then returning the data structure. That naive technique can result in large memory use if, say, the program tries to create a memory array of every page revision on enwiki.
Of course, if the Perl script is doing the parsing itself by just matching regular expressions, this is not hard to do in a stream-oriented way.
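To make the stream-oriented approach concrete, here is a minimal SAX-style sketch using XML::Parser (assumed to be installed on the toolserver); it prints page titles from a dump fed on stdin without ever building the whole tree in memory:

    use strict;
    use warnings;
    use XML::Parser;

    my ($in_title, $title, $pages) = (0, '', 0);

    # Expat-based callbacks: each handler sees only the current element,
    # so memory use is independent of the dump size.
    my $parser = XML::Parser->new(Handlers => {
        Start => sub {
            my (undef, $elem) = @_;
            if ($elem eq 'title') { $in_title = 1; $title = ''; }
        },
        Char  => sub {
            my (undef, $text) = @_;
            $title .= $text if $in_title;
        },
        End   => sub {
            my (undef, $elem) = @_;
            $in_title = 0 if $elem eq 'title';
            if ($elem eq 'page') { $pages++; print "$title\n"; }
        },
    });

    # e.g.: bzip2 -dc dewiki-pages-articles.xml.bz2 | perl titles.pl
    $parser->parse(\*STDIN);
    print STDERR "pages: $pages\n";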
- Carl
On 12/12/11 13:59, Carl (CBM) wrote:
This is correct, but the overall memory usage depends on the XML library and programming technique being used. For XML that is too large to comfortably fit in memory, there are techniques to allow for the script to process the data before the entire XML file is parsed (google "SAX" or "stream-oriented parsing"). But this requires more advanced programming techniques, such as callbacks, compared to the more naive method of parsing all the XML into a data structure and then returning the data structure. That naive technique can result in large memory use if, say, the program tries to create a memory array of every page revision on enwiki.
Of course, if the Perl script is doing the parsing itself by just matching regular expressions, this is not hard to do in a stream-oriented way.
- Carl
Obviously. Whether it's read from a .xml or a .xml.bz2, if it tried to build an XML tree in memory, the memory usage would be enormous. I would expect such an app to get killed for that.
On 12/11/2011 10:45 AM, Stefan Kühn wrote:
Hmm, stdin is possible, but I think this would need a lot of RAM on the server. I don't think this is an option for the future. Every language wiki grows every day, and the dumps will grow too.
No, Stefan, it's not a matter of RAM, but of CPU. When your program reads from a pipe, the decompression program (bunzip2 or gunzip) consumes a few extra processor cycles every time your program reads the next kilobyte or megabyte of input. Most often, these CPU cycles are cheaper than storing the uncompressed XML file on disk.
Sometimes, reading compressed data and decompressing it is also faster than reading the larger uncompressed data from disk.
If you read the entire compressed file into RAM and decompress it in RAM before starting to use it, then a lot of RAM will be needed. But there is no reason to do this for an XML file, which is always processed like a stream or sequence. (Remember that UNIX pipes were invented in a time when streaming data from one tape station to another was common, and a PDP-11 had 32 Kbyte of RAM.)
Here's how I read the *.sql.gz files in Perl:
    my $page = "enwiki-20111128-page.sql.gz";
    if ($page =~ /\.gz$/) {
        open(PAGE, "gunzip <$page |");
    } else {
        open(PAGE, "<$page");
    }
    while (<PAGE>) {
        chomp;
        ...
Hi there, I have experience with this topic.
Here is a simple read function:

    use Compress::Bzip2 qw(:all);
    use IO::Uncompress::Bunzip2 qw($Bunzip2Error);
    use IO::File;

    sub ReadFile {
        my $filename = shift;
        my $html = "";
        my $fh;
        if ($filename =~ /\.bz2/) {
            $fh = IO::Uncompress::Bunzip2->new($filename)
                or die "Couldn't open bzipped input file: $Bunzip2Error\n";
        } else {
            $fh = IO::File->new($filename)
                or die "Couldn't open input file: $!\n";
        }
        while (<$fh>) {
            $html .= $_;
        }
        return $html;
    }
I have examples of how to process the huge bz2 file in parts, without downloading the whole thing: http://bazaar.launchpad.net/~jamesmikedupont/+junk/openstreetmap-wikipedia/v...
Basically you can download a partial file over HTTP (http://bazaar.launchpad.net/~jamesmikedupont/+junk/openstreetmap-wikipedia/v...):

    $req->init_header('Range' => sprintf("bytes=%s-%s", $startpos, $endpos - 1));
Then use bzip2recover to extract data from that block.
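For reference, a rough sketch of such a ranged request with LWP::UserAgent; the URL, byte offsets and output file name below are made up for illustration:

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTTP::Request;

    # hypothetical dump URL and byte range
    my $url = 'http://dumps.wikimedia.org/dawiki/20111202/dawiki-20111202-pages-articles.xml.bz2';
    my ($startpos, $endpos) = (0, 10 * 1024 * 1024);    # first 10 MB

    my $req = HTTP::Request->new(GET => $url);
    $req->init_header('Range' => sprintf("bytes=%s-%s", $startpos, $endpos - 1));

    my $res = LWP::UserAgent->new->request($req);
    die "request failed: " . $res->status_line
        unless $res->is_success;                        # 206 Partial Content counts as success

    open(my $out, '>', 'partial.bz2') or die "partial.bz2: $!";
    binmode $out;
    print {$out} $res->content;
    close $out;
    # afterwards, bzip2recover partial.bz2 splits out the intact bz2 blocks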
Let me know if you have any questions.
On Sat, Dec 10, 2011 at 8:52 PM, Jeremy Baron jeremy@tuxmachine.com wrote:
On Sat, Dec 10, 2011 at 14:18, Stefan Kühn kuehn-s@gmx.net wrote:
I work with Perl and need the uncompressed XML file to read the dump. I have no idea how to read a compressed file with Perl.
Is it sufficient to receive the XML on stdin or do you need to be able to seek?
It is trivial to give you XML on stdin e.g. $ < path/to/bz2 bzip2 -d | perl script.pl
-Jeremy
On 12/09/2011 12:46 PM, Peter Körner wrote:
Collecting all dumps in one place begins with building a hosting-location with some terabytes of storage and a fast connection.
To me, that sounds like -- the toolserver! I'm sorry if this suggestion is naive. Why is the toolserver short on disk space? When I downloaded some dumps, why did I sometimes get only 200 kbytes/second? Are we on an ADSL line?