Hi,
currently there is quite a mess in how dump files are handled. They are scattered across several locations with no system behind it, some public, some private, so there are inevitably duplicates that eat up space, and their naming is inconsistent.
Hence I've got this proposal:
I would like to set up a system for the overall handling of dumps, covering their storage, naming/linking and updating.
That would help reduce the used space, make it easier to move the entire dump storage (e.g. in the future we could dedicate HDD(s) to dumps only), simplify maintenance, etc.
The design of the storage system, naming and linking is nearly complete; the maintenance scripts work, though they might need tweaking in some cases, and some additional scripts might be necessary.
I think this could be a multi-maintainer project for a couple of people (if one is not around, another can step in), so is anybody active interested in joining?
---- FOR THOSE WHO USE DUMPS ----
Moving to the new system would mean for you: 1) you would have to submit a list of the dumps you use; 2) you would have to update your tools which use dumps to point at the shared dumps.
For some time during the transition we would keep symlinks at the old locations (instead of the files themselves), but the final step is to have the dumps in one place only.
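For illustration only, a rough sketch of what that transition step could look like (the paths, dump name and shared location below are made up; the real layout is still to be decided):

    #!/usr/bin/perl
    # Replace a user's private copy of a dump with a symlink into the shared
    # store, but only if both files exist and are identical.
    use strict;
    use warnings;
    use File::Compare;

    # hypothetical old (private) and new (shared) locations
    my $old = "$ENV{HOME}/dumps/dewiki-20111128-pages-articles.xml.bz2";
    my $new = "/mnt/user-store/dumps/dewiki/20111128/dewiki-20111128-pages-articles.xml.bz2";

    if (-f $old && -f $new && compare($old, $new) == 0) {
        unlink $old        or die "unlink $old: $!";
        symlink $new, $old or die "symlink $new -> $old: $!";
    }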
Questions, comments, suggestions?
Kind regards
Danny B.
On 09.12.2011 12:41, Danny B. wrote:
Questions, comments, suggestions?
When you have data to share, the main problem is usually finding someone who is able and willing to store multi-gigabyte files on their server and provide the necessary bandwidth for downloaders.
Collecting all dumps in one place begins with building a hosting-location with some terabytes of storage and a fast connection.
Peter
------------ Original message ------------ From: Peter Körner osm-lists@mazdermind.de
When you have data to share, the main problem is usually finding someone who is able and willing to store multi-gigabyte files on their server and provide the necessary bandwidth for downloaders.
Collecting all dumps in one place begins with building a hosting-location with some terabytes of storage and a fast connection.
We already have these dumps stored in /mnt/user-storage/<various places>, and a lot of people also have them in their ~.
The purpose is to have them in one place only, since at the moment they are very often duplicated across many places.
Also, only those dumps which are actually used by TS users are supposed to be stored; the proposal is not about mirroring dumps.wikimedia.org...
Danny B.
On 12/09/2011 12:52 PM, Danny B. wrote:
Also, only those dumps which are actually used by TS users are supposed to be stored; the proposal is not about mirroring dumps.wikimedia.org...
This is stupid. I suggest we change the ambition and start to actually mirror all of dumps.wikimedia.org.
On 09/12/11 12:52, Danny B. wrote:
We already have these dumps stored in /mnt/user-storage/<various places>, and a lot of people also have them in their ~.
The purpose is to have them in one place only, since at the moment they are very often duplicated across many places.
Also, only those dumps which are actually used by TS users are supposed to be stored; the proposal is not about mirroring dumps.wikimedia.org...
Danny B.
While /mnt/user-store/dump is a mess, we have a bit of organization at /mnt/user-store/dumps, where the files are kept in folders by dbname. They should additionally be categorised into folders by date, though.
I'm surprised by the number of uncompressed files there (i.e. .xml or .sql). In many cases it wouldn't even be necessary to decompress them.
On 12/09/2011 05:52 PM, Platonides wrote:
I'm surprised by the number of uncompressed files there (i.e. .xml or .sql). In many cases it wouldn't even be necessary to decompress them.
The popular pywikipediabot framework has an -xml: option, and I used to believe that it required the filename of an uncompressed XML file. But I was wrong. The following works just fine:
python replace.py -lang:da \
    -xml:../dumps/dawiki/dawiki-20110404-pages-articles.xml.bz2 \
    dansk svensk
If the following would also work (but it does not), we wouldn't have to worry about disk space at all:
python replace.py -lang:da \
    -xml:http://dumps.wikimedia.org/dawiki/20111202/dawiki-20111202-pages-articles.xm... \
    dansk svensk
On 12/09/2011 05:52 PM, Platonides wrote:
If the following would also work (but it does not), we wouldn't have to worry about disk space at all:
python replace.py -lang:da \
    -xml:http://dumps.wikimedia.org/dawiki/20111202/dawiki-20111202-pages-articles.xm... \
    dansk svensk
Would that not put a burden on the bandwidth, especially with repeated use of the same file? Unless the files were automatically cached... in the user-store?
Darkdadaah
On 09.12.2011 17:52, Platonides wrote:
While /mnt/user-store/dump is a mess, we have a bit of organization at /mnt/user-store/dumps, where the files are kept in folders by dbname. They should additionally be categorised into folders by date, though.
I'm surprised by the number of uncompressed files there (i.e. .xml or .sql). In many cases it wouldn't even be necessary to decompress them.
When I created the directory "dump" there was no directory "dumps". Today we can easily merge these two directories. In the future I will download the dumps into the directory "dumps" under the right project directory, like "dewiki" or so. I work with Perl and need the uncompressed XML file to read the dump. I have no idea how to read a compressed file with Perl. I only need the newest dump, so at the moment my script deletes all other dumps of a project and keeps only the newest and the second newest in the directory "dump".
Stefan (sk)
On Sat, Dec 10, 2011 at 14:18, Stefan Kühn kuehn-s@gmx.net wrote:
I work with Perl and need the uncompressed XML file to read the dump. I have no idea how to read a compressed file with Perl.
Is it sufficient to receive the XML on stdin or do you need to be able to seek?
It is trivial to give you XML on stdin e.g. $ < path/to/bz2 bzip2 -d | perl script.pl
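For concreteness, a minimal sketch of what such a script.pl could look like on the receiving end (the <page> counting is just a placeholder task):

    #!/usr/bin/perl
    # Reads the already-decompressed XML line by line from stdin,
    # so memory use stays small regardless of the dump size.
    use strict;
    use warnings;

    my $pages = 0;
    while (my $line = <STDIN>) {
        $pages++ if $line =~ /<page>/;
    }
    print "pages: $pages\n";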
-Jeremy
On 10.12.2011 20:52, Jeremy Baron wrote:
Is it sufficient to receive the XML on stdin or do you need to be able to seek?
It is trivial to give you XML on stdin e.g. $ < path/to/bz2 bzip2 -d | perl script.pl
Hmm, stdin is possible, but I think this would need a lot of RAM on the server. I don't think this is an option for the future. Every language wiki grows every day, and the dumps will grow too. The next problem is parallel use of a compressed file: if several users read the compressed file this way, then bzip2 will crash the server IMHO.
I think it is no problem to store the uncompressed XML files for easy usage. We should make rules about where they are kept and for how long, or we need a list where every user can say "I need only the two newest dumps of enwiki, dewiki, ...". If a dump is not needed, then we can delete it.
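As an aside, the "keep only the two newest dumps per project" rule is easy to script; a rough sketch, with a hypothetical directory layout and naming:

    #!/usr/bin/perl
    # Delete all but the two newest dumps in one (hypothetical) project directory.
    use strict;
    use warnings;

    my $dir   = "/mnt/user-store/dumps/dewiki";      # hypothetical path
    my @dumps = sort { $b cmp $a }                   # YYYYMMDD in the name, so string sort = date sort
                glob("$dir/dewiki-*-pages-articles.xml.bz2");
    my @stale = @dumps > 2 ? @dumps[2 .. $#dumps] : ();
    unlink @stale if @stale;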
Stefan (sk)
I hope at least a couple of you are subscribed to the xmldatadumps-l list to keep track of developments with the dumps. Last month I started running a sort of poor person's incremental dumps, very experimental at this point, but perhaps that will turn out to be useful for folks who are just looking to parse through the latest content.
Ariel
On Sun, 11-12-2011 at 10:45 +0100, Stefan Kühn wrote:
On 10.12.2011 20:52, Jeremy Baron wrote:
Is it sufficient to receive the XML on stdin or do you need to be able to seek?
It is trivial to give you XML on stdin e.g. $ < path/to/bz2 bzip2 -d | perl script.pl
Hmm, stdin is possible, but I think this would need a lot of RAM on the server. I don't think this is an option for the future. Every language wiki grows every day, and the dumps will grow too. The next problem is parallel use of a compressed file: if several users read the compressed file this way, then bzip2 will crash the server IMHO.
I think it is no problem to store the uncompressed XML files for easy usage. We should make rules about where they are kept and for how long, or we need a list where every user can say "I need only the two newest dumps of enwiki, dewiki, ...". If a dump is not needed, then we can delete it.
Stefan (sk)
On 11/12/11 10:45, Stefan Kühn wrote:
On 10.12.2011 20:52, Jeremy Baron wrote:
Is it sufficient to receive the XML on stdin or do you need to be able to seek?
It is trivial to give you XML on stdin e.g. $ < path/to/bz2 bzip2 -d | perl script.pl
Hmm, stdin is possible, but I think this would need a lot of RAM on the server. I don't think this is an option for the future. Every language wiki grows every day, and the dumps will grow too. The next problem is parallel use of a compressed file: if several users read the compressed file this way, then bzip2 will crash the server IMHO.
I think it is no problem to store the uncompressed XML files for easy usage. We should make rules about where they are kept and for how long, or we need a list where every user can say "I need only the two newest dumps of enwiki, dewiki, ...". If a dump is not needed, then we can delete it.
Stefan (sk)
You seem to think that piping the output from bzip2 will hold the XML dump uncompressed in memory until your script processes it. That's wrong: bzip2 begins uncompressing and writing to the pipe; when the pipe fills, it blocks. As your Perl script reads from the pipe, space is freed and the decompression can continue.
On Sun, Dec 11, 2011 at 10:47 AM, Platonides platonides@gmail.com wrote:
You seem to think that piping the output from bzip2 will hold the XML dump uncompressed in memory until your script processes it. That's wrong: bzip2 begins uncompressing and writing to the pipe; when the pipe fills, it blocks. As your Perl script reads from the pipe, space is freed and the decompression can continue.
This is correct, but the overall memory usage depends on the XML library and programming technique being used. For XML that is too large to comfortably fit in memory, there are techniques to allow for the script to process the data before the entire XML file is parsed (google "SAX" or "stream-oriented parsing"). But this requires more advanced programming techniques, such as callbacks, compared to the more naive method of parsing all the XML into a data structure and then returning the data structure. That naive technique can result in large memory use if, say, the program tries to create a memory array of every page revision on enwiki.
Of course, if the Perl script is doing the parsing itself by just matching regular expressions, this is not hard to do in a stream-oriented way.
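To make the stream-oriented approach concrete, here is a minimal SAX-style sketch using XML::Parser (assumed to be installed on the toolserver); it prints page titles from a dump fed on stdin without ever building the whole tree in memory:

    use strict;
    use warnings;
    use XML::Parser;

    my ($in_title, $title, $pages) = (0, '', 0);

    # Expat-based callbacks: each handler sees only the current element,
    # so memory use is independent of the dump size.
    my $parser = XML::Parser->new(Handlers => {
        Start => sub {
            my (undef, $elem) = @_;
            if ($elem eq 'title') { $in_title = 1; $title = ''; }
        },
        Char  => sub {
            my (undef, $text) = @_;
            $title .= $text if $in_title;
        },
        End   => sub {
            my (undef, $elem) = @_;
            $in_title = 0 if $elem eq 'title';
            if ($elem eq 'page') { $pages++; print "$title\n"; }
        },
    });

    # e.g.: bzip2 -dc dewiki-pages-articles.xml.bz2 | perl titles.pl
    $parser->parse(\*STDIN);
    print STDERR "pages: $pages\n";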
- Carl
On 12/12/11 13:59, Carl (CBM) wrote:
This is correct, but the overall memory usage depends on the XML library and programming technique being used. For XML that is too large to comfortably fit in memory, there are techniques to allow for the script to process the data before the entire XML file is parsed (google "SAX" or "stream-oriented parsing"). But this requires more advanced programming techniques, such as callbacks, compared to the more naive method of parsing all the XML into a data structure and then returning the data structure. That naive technique can result in large memory use if, say, the program tries to create a memory array of every page revision on enwiki.
Of course, if the Perl script is doing the parsing itself by just matching regular expressions, this is not hard to do in a stream-oriented way.
- Carl
Obviously. Whether it's read from a .xml or a .xml.bz2, if it tried to build an XML tree in memory, the memory usage would be enormous. I would expect such an app to get killed for that.
On 12/11/2011 10:45 AM, Stefan Kühn wrote:
Hmm, stdin is possible, but I think this would need a lot of RAM on the server. I don't think this is an option for the future. Every language wiki grows every day, and the dumps will grow too.
No, Stefan, it's not a matter of RAM, but of CPU. When your program reads from a pipe, the decompression program (bunzip2 or gunzip) consumes a few extra processor cycles every time your program reads the next kilobyte or megabyte of input. Most often, these CPU cycles are cheaper than storing the uncompressed XML file on disk.
Sometimes, reading compressed data and decompressing it is also faster than reading the larger uncompressed data from disk.
If you read the entire compressed file into RAM and decompress it in RAM before starting to use it, then a lot of RAM will be needed. But there is no reason to do this for an XML file, which is always processed like a stream or sequence. (Remember that UNIX pipes were invented in a time when streaming data from one tape station to another was common, and a PDP-11 had 32 Kbyte of RAM.)
Here's how I read the *.sql.gz files in Perl:
    my $page = "enwiki-20111128-page.sql.gz";
    if ($page =~ /\.gz$/) {
        open(PAGE, "gunzip <$page |");
    } else {
        open(PAGE, "<$page");
    }
    while (<PAGE>) {
        chomp;
        ...
Hi there, I have experience with this topic.
Here is a simple read function:

    use Compress::Bzip2 qw(:all);
    use IO::Uncompress::Bunzip2 qw($Bunzip2Error);
    use IO::File;

    sub ReadFile {
        my $filename = shift;
        my $html = "";
        my $fh;
        if ($filename =~ /\.bz2/) {
            $fh = IO::Uncompress::Bunzip2->new($filename)
                or die "Couldn't open bzipped input file: $Bunzip2Error\n";
        } else {
            $fh = IO::File->new($filename)
                or die "Couldn't open input file: $!\n";
        }
        while (<$fh>) {
            $html .= $_;
        }
        return $html;
    }
I have examples of how to process the huge bz2 file in parts, without downloading the whole thing: http://bazaar.launchpad.net/~jamesmikedupont/+junk/openstreetmap-wikipedia/v...
Basically you can download a partial file over HTTP (http://bazaar.launchpad.net/~jamesmikedupont/+junk/openstreetmap-wikipedia/v...):

    $req->init_header('Range' => sprintf("bytes=%s-%s", $startpos, $endpos - 1));
Then use bzip2recover to extract data from that block.
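For reference, a rough sketch of such a ranged request with LWP::UserAgent; the URL, byte offsets and output file name below are made up for illustration:

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTTP::Request;

    # hypothetical dump URL and byte range
    my $url = 'http://dumps.wikimedia.org/dawiki/20111202/dawiki-20111202-pages-articles.xml.bz2';
    my ($startpos, $endpos) = (0, 10 * 1024 * 1024);    # first 10 MB

    my $req = HTTP::Request->new(GET => $url);
    $req->init_header('Range' => sprintf("bytes=%s-%s", $startpos, $endpos - 1));

    my $res = LWP::UserAgent->new->request($req);
    die "request failed: " . $res->status_line
        unless $res->is_success;                        # 206 Partial Content counts as success

    open(my $out, '>', 'partial.bz2') or die "partial.bz2: $!";
    binmode $out;
    print {$out} $res->content;
    close $out;
    # afterwards, bzip2recover partial.bz2 splits out the intact bz2 blocks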
Let me know if you have any questions.
On Sat, Dec 10, 2011 at 8:52 PM, Jeremy Baron jeremy@tuxmachine.com wrote:
On Sat, Dec 10, 2011 at 14:18, Stefan Kühn kuehn-s@gmx.net wrote:
I work with Perl and need the uncompressed XML file to read the dump. I have no idea how to read a compressed file with Perl.
Is it sufficient to receive the XML on stdin or do you need to be able to seek?
It is trivial to give you XML on stdin e.g. $ < path/to/bz2 bzip2 -d | perl script.pl
-Jeremy
On 12/09/2011 12:46 PM, Peter Körner wrote:
Collecting all dumps in one place begins with building a hosting-location with some terabytes of storage and a fast connection.
To me, that sounds like -- the toolserver! I'm sorry if this suggestion is naive. Why is the toolserver short on disk space? When I downloaded some dumps, why did I sometimes get only 200 kbytes/second? Are we on an ADSL line?