Hi,
I don't know if this issue has come up already - in case it did and was
dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804) together with some test
results obtained a few hours before that.
The results indicate the following:
bzip2 and pbzip2 are mutually compatible - each one can create
archives that the other one can read. But when it comes to decompressing,
only pbzip2-compressed archives are good for pbunzip2.
I propose compressing the archives with pbzip2 for the following
reasons:
1) If your archiving machines are SMP systems, this could lead to
better usage of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should run for these people as usual.
3) pbzip2-compressed archives can be decompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
host (see the example commands below).
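To illustrate, this is roughly what it would look like on the command
line (just a sketch - the filename is a placeholder, and -p sets the
number of CPUs to use):

  pbzip2 -p8 pages-articles.xml          # compress using 8 CPUs
  pbunzip2 -p8 pages-articles.xml.bz2    # parallel decompression
  bunzip2 pages-articles.xml.bz2         # plain bunzip2 still works, just single-threaded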
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that
interesting? :-)
cheers,
--
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com   Managing Director: Richard Jelinek
Human Language Technology Experts   Registered office: Fürth
69216618 Mind Units   Register court: AG Fürth, HRB-9201
Hi,
after a month of work on my GSoC project Incremental Dumps [1], I think I
now have something worth sharing and talking about, though it's still far
from complete.
What the code can do now is to read a pages-history XML dump and create the
various kinds of dumps (pages/stub, current/history) in the new format from
that.
It can then convert a dump in the new format back to XML.
The XML output is almost the same as existing XML dumps, but there are some
differences [2].
The new format now also has a detailed specification [3] (this describes
the current version; the format is still in flux and can change daily).
If you want, you can also try running the code. [4]
It's not production-quality yet (e.g. it doesn't report errors properly),
but it should work.
Compilation instructions are in the README file.
Any comments or questions are welcome.
Petr Onderka
User:Svick
[1]: http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps
[2]:
http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps/File_format/XML_…
[3]:
http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps/File_format/Spec…
[4]: https://github.com/wikimedia/operations-dumps-incremental/tree/gsoc
Hi everyone!
I'm trying to load the dumps into my mysql database, but so far for the
revision and page tables, I get either null or zero values for the rev_len
and page_len columns. Does anybody have any idea why? What dump file
contains this information?
Thanks!
--
Xavi
Apart from the problems I have with the speed of inserting page table data from the SQL dump, there is one thing I don't understand. Why are there about 30 million "INSERT INTO page" entries in page.sql but only about 10 million in the page-articles.sql dump?
My goal is to have all pages with their correct page_len.
On 23-08-2013 01:27, A B wrote:
> So, how could I do that?
One way would be to edit the first part of the sql file which creates
the table before using it.
Another way which gives better control would be to write your own
program to parse the sql file and insert the data into the database. See
http://toolserver.org/~byrial/wikidata-programs/read_page_table.c for an
example of that. The program only inserts some of the columns into the
table, and also only selected rows depending on the namespace.
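For example, a rough sketch of the first approach (the key names below
are the ones from a standard MediaWiki tables.sql - check the CREATE
TABLE statement in your own dump before copying them):

  # 1. In enwiki-20130503-page.sql, delete the UNIQUE KEY and KEY lines
  #    from the CREATE TABLE `page` (...) statement near the top.
  # 2. Load the edited file, so the INSERTs go into a keyless table:
  mysql -u <usr> -p<pwd> enwiki < enwiki-20130503-page.sql
  # 3. Build the keys once, after all rows are in:
  mysql -u <usr> -p<pwd> enwiki -e "ALTER TABLE page
    ADD UNIQUE KEY name_title (page_namespace, page_title),
    ADD KEY page_random (page_random),
    ADD KEY page_len (page_len);"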
Regards,
- Byrial
> On 23/08/2013, at 1:16, Byrial Jensen <byrial(a)vip.cybercity.dk> wrote:
>
>> On 23-08-2013 00:54, A B wrote:
>>> I'm trying to install enwiki-20130503-page.sql into an empty database and the process is too slow (it takes days), so I think it isn't normal. What I do is:
>>>
>>> 1. Installing tables.sql
>>> 2. Executing a script to disable foreign key checks
>>> 3. Installing enwiki-20130503-page.sql (mysql -u <usr> -p<pwd> enwiki < enwiki-20130503-page.sql &).
>>>
>>> I don't know if the problem could be my MySQL configuration:
>>>
>>> Any idea?
>>
>> I don't know about the configuration, but I would first read all rows into a table without any keys at all, and then create the keys I need afterwards. It is much faster to do the necessary sorting once instead of updating the keys for each inserted row. It is not enough to do an "alter table page disable keys" because unique keys will still be updated and checked for uniqueness at each inserted row.
>>
>> Regards,
>> - Byrial
Hi, data dumps people.
I've been working on a tool to generate smaller update files from XML dumps
by just saving the diff between the previous dump and the latest revision.
It doesn't aim to do everything the new dump format does, and indeed it
looks like the dumps project will make it obsolete at some point, but I'm
putting this out there to see if it's useful in the interim.
You can get it at https://github.com/twotwotwo/dltp
After downloading one of the binaries, you can put it in a directory
with enwiki-20130805-pages-articles.xml.bz2 and run:
./dltp
http://www.rfarmer.net/dltp/en/enwiki-20130805-20130812-cut-merged.xml.dltp…
That download is ~50 MB, but it expands to about 4GB of XML, consisting of
the latest revision's text for every page that was updated 8/5-8/12. (A
diff for 7/9-8/5 was 440MB.) You can pipe the XML output to something,
rather than saving it to a file, using the -c flag.
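For example, something like this should let you count the updated pages
without ever writing the 4 GB of XML to disk (a sketch - I'm assuming -c
can be combined with an already-downloaded .dltp file in the same way as
with the URL form):

  ./dltp -c <downloaded .dltp file> | grep -c '<page>'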
This all sounds great, and maybe it is if bandwidth is your bottleneck.
There are lots of important caveats, though:
- *It looks likely to be obsoleted by official WMF diff dumps.* I started
on this a while back, and at the time, from what I'd read, I thought diff
dumps weren't a high priority for the official project, so it might be worth
implementing unofficially. More recently it sounds more like official diff
dumps actually might not be all that far off, so if you invest in using
this, it might not pay off for all that long.
- These diffs give you *only the latest revision* of each page in namespace
0. No full history.
- You have to keep the old dump file around.
- The tool reads and *unzips* the old dump each time. bunzip2 is slow, so *you
may want to keep reference XML in uncompressed form*, or (better)
compressed with gzip or lzop (see the example command after this list).
- On Windows, dltp unzips at slower than native speed, so there's even more
reason to store your source file uncompressed.
- No matter what, the old file can take a while to read through; *5-25
minutes to expand the file, depending on how your reference XML is stored,
is entirely plausible.*
- It uses adds-changes dumps, so it doesn't do anything to account for
deletions or oversighting.
- *It's new, non-battle-tested software, so caveat emptor.*
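For instance, if you have the disk space, a one-time conversion of the
reference dump from bzip2 to gzip should speed up later runs considerably
(per the caveat above; filename taken from the example earlier):

  bunzip2 -c enwiki-20130805-pages-articles.xml.bz2 | gzip > enwiki-20130805-pages-articles.xml.gz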
So, I'm curious:
- Does this help you, given the caveats? Would you start using, e.g.,
weekly deltas if I posted them?
- If you actually ran the dltp command above, did you have any trouble? How
long did it take?
- If you'd use this, what sort of project are you thinking of on what sort
of machine? (A public site hosting Wiki content or something else? Big
server, VPS, desktop? Linux or another OS?)
- Which wiki (enwiki, etc.) would you want diffs for?
Other than the fact that the source is out there and anyone can use it, I'm not
promising to post deltas or anything right now (since I don't know if folks would
rather wait on official delta dumps, etc.). But I'm interested to see if
there's any potential use here.
Best,
Randall
Nice work on the program! To answer your questions:
>- Does this help you, given the caveats? Would you start using, e.g., weekly deltas if I posted them?
It is helpful in that it is ready for use now, and is very easy to
use. I would probably need some time to automate it for my
application, but it is possible that I could consume weekly deltas
within a week or two. I'm not sure if it'd be worth your while. See
more below.
>- If you actually ran the dltp command above, did you have any trouble? How long did it take?
I gave it a run with the two latest pages-articles dumps from
simplewiki (http://dumps.wikimedia.org/simplewiki/) and it worked well
with pack [1] and unpack [2]. Simplewiki is about 89 MB and dltp took
about 10 seconds to generate a file of 4.2 MB (the dump files were
already unzipped). Unpacking took about 12 seconds to generate the
450 MB XML file. For reference, this is a Windows 7 machine with a
dual-core 2.2 GHz processor and 2 GB memory. The zipped versions took
longer: about 2 minutes. I'll try the English Wikipedia counterparts
tomorrow, but I'd guess it wouldn't take more than 25 min for unzipped.
I also tried cut / merge, but didn't really have anything meaningful
to use, so I can only testify that they didn't fail.
[1] dltp386.exe simplewiki-20130813-pages-articles.xml
simplewiki-20130724-pages-articles.xml
[2] dltp386.exe simplewiki-20130813-pages-articles.xml.dltp.gz
>- If you'd use this, what sort of project are you thinking of on what sort of machine?
I'd use it for XOWA: https://sourceforge.net/projects/xowa/. This is a
desktop app that runs on an individual user's machine for reading
Wikipedia offline. I've had a few users ask about not having to
download the entire dump every time, so you may see some delta usage
there. Unfortunately, the user base is currently small (700 total
downloads per month = ??? unique users), so it would not be worth your
time to post the deltas solely for my benefit.
>- Which wiki (enwiki, etc.) would you want diffs for?
Certainly, the 3 largest ones: enwiki, dewiki, frwiki. If possible,
the 8 wikis with more than 1 million articles. XOWA can read any wiki,
but it wouldn't make sense to post diffs for a small Wikipedia (Latin)
when there would be no one to download them.
Also, this is probably outside the scope of dltp, but is there any way
for the diff file to be self-contained? For example, if the old dump
had 10 articles and the new dump had 11 articles with 1 new article
and 1 changed article, then the diff file would only have 2 articles:
the 1 new article and the 1 changed one. This may not save as much
space, but it'd be easier for users to work with the delta file than
to have to remember to keep the original dump file around.
Hope my feedback is useful. Good luck.
On Tue, Aug 13, 2013 at 6:48 PM, Randall Farmer <randall(a)wawd.com> wrote:
> [Adding wikitech once more with feeling. Sorry for all of the copies,
> Xmldatadumps-l.]
>
> Hi, everyone.
>
> I've been working on a tool to generate smaller update files from XML dumps
> by just saving the diff between the previous dump and the latest revision.
> It doesn't aim to do everything the new dump format does, and indeed it
> looks like the dumps project will make it obsolete at some point, but I'm
> putting this out there to see if it's useful in the interim.
>
> You can get it at https://github.com/twotwotwo/dltp
>
> After downloading one of the binaries, you can put it in a directory
> with enwiki-20130805-pages-articles.xml.bz2 and run:
>
> ./dltp
> http://www.rfarmer.net/dltp/en/enwiki-20130805-20130812-cut-merged.xml.dltp…
>
> That download is ~50 MB, but it expands to about 4GB of XML, consisting of
> the latest revision's text for every page that was updated 8/5-8/12. (A
> diff for 7/9-8/5 was 440MB.) You can pipe the XML output to something,
> rather than saving it to a file, using the -c flag.
>
> This all sounds great, and maybe it is if bandwidth is your bottleneck.
> There are lots of important caveats, though:
>
> - *It looks likely to be obsoleted by official WMF diff dumps. *I started
> on this a while back, and at the time, from what I'd read, I thought diff
> dumps weren't a high priority for the official project so might be worth
> implementing unofficially [ed: looking back, I probably should've been more
> cautious about this]. More recently it sounds more like official diff dumps
> actually might not be all that far off, so if you invest in using this, it
> might not pay off for all that long.
> - These diffs give you *only the latest revision* of each page in namespace
> 0. No full history.
> - You have to keep the old dump file around.
> - The tool reads and *unzips* the old dump each time. bunzip2 is slow, so *you
> may want to keep reference XML in uncompressed form*, or (better)
> compressed with gzip or lzop.
> - On Windows, dltp unzips at slower than native speed, so there's even more
> reason to store your source file uncompressed.
> - No matter what, the old file can take a while to read through; *5-25
> minutes to expand the file, depending on how your reference XML is stored,
> is entirely plausible.*
> - It uses adds-changes dumps, so it doesn't do anything to account for
> deletions or oversighting.
> - *It's new, non-battle-tested software, so caveat emptor.*
>
> So, I'm curious:
>
> - Does this help you, given the caveats? Would you start using, e.g.,
> weekly deltas if I posted them?
> - If you actually ran the dltp command above, did you have any trouble? How
> long did it take?
> - If you'd use this, what sort of project are you thinking of on what sort
> of machine? (A public site hosting Wiki content or something else? Big
> server, VPS, desktop? Linux or another OS?)
> - Which wiki (enwiki, etc.) would you want diffs for?
>
> Other than that the source is out there and anyone can use it, I'm not
> promising to post deltas or anything now (since perhaps many folks would
> rather wait on official delta dumps, etc.). But interested to see if
> there's any potential use here.
>
> Best,
> Randall
There are some datasets (for example, the pageview stats) that first get
copied to the host in Tampa from the server where they are
produced, but all dumps now originate in Ashburn. We're running 6
processes for the smaller wikis as promised, so we'll see how that does
in terms of frequency.
Things not visible on the index html page are a manual run of the
problem step of wikidata, and a rerun of two pieces of the de history
dumps.
Sometime later I'll schedule a move of all those miscellaneous items
other people produce, and then finally a move of the hostnames (download
and dumps.wikimedia.org).
Ariel