Xmldatadumps-l July 2013

xmldatadumps-l@lists.wikimedia.org

20 participants
13 discussions

by Richard Jelinek

Hi, don't know if this issue came up already - in case it did and has been dismissed, I beg your pardon. In case it didn't... I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used to compress the xml dumps instead of bzip2. Why? Because its sibling (pbunzip2) has a bug bunzip2 hasn't. :-) Strange? Read on. A few hours ago, I filed a bug report for pbzip2 (see https://bugs.launchpad.net/pbzip2/+bug/922804) together with some test results obtained a few hours before that. The results indicate that bzip2 and pbzip2 are mutually compatible: each one can create archives that the other one can read. But when it comes to uncompressing, only pbzip2-compressed archives are good for pbunzip2. I propose compressing the archives with pbzip2 for the following reasons: 1) If your archiving machines are SMP systems this could lead to a better usage of system resources (i.e. faster compression). 2) Compression with pbzip2 is harmless for regular users of bunzip2, so everything should run for these people as usual. 3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a speedup that scales nearly linearly with the number of CPUs in the host. So to sum up: It's a no-lose and two-win situation if you migrate to pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that interesting? :-) cheers, -- Dipl.-Inf. Univ. Richard C. Jelinek PetaMem GmbH - www.petamem.com Geschäftsführer: Richard Jelinek Human Language Technology Experts Sitz der Gesellschaft: Fürth 69216618 Mind Units Registergericht: AG Fürth, HRB-9201

8 years, 3 months
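
A note on the compatibility claim in the message above: pbzip2 compresses blocks of the input independently and writes the resulting bzip2 streams one after another, which is why a regular bzip2 reader can still decode the file while a parallel reader can split the work. The sketch below imitates that layout with Python's standard `bz2` module. It is only an illustration under that assumption, not pbzip2's actual implementation, and the names `compress_in_blocks` and `decompress_streams` are hypothetical helpers made up for this example.

```python
import bz2


def compress_in_blocks(data: bytes, block_size: int) -> bytes:
    """Compress each block as its own bzip2 stream and concatenate the streams.

    This mimics the multi-stream layout; a real parallel compressor would
    hand each block to a different CPU.
    """
    streams = []
    for start in range(0, len(data), block_size):
        streams.append(bz2.compress(data[start:start + block_size]))
    return b"".join(streams)


def decompress_streams(blob: bytes) -> bytes:
    """Decode concatenated bzip2 streams one stream at a time."""
    parts = []
    while blob:
        decompressor = bz2.BZ2Decompressor()
        parts.append(decompressor.decompress(blob))
        # Whatever follows the end of this stream belongs to the next one.
        blob = decompressor.unused_data
    return b"".join(parts)


# Made-up sample data standing in for a dump.
sample = b"<page><title>Example</title></page>\n" * 2000
archive = compress_in_blocks(sample, block_size=16_384)

# The ordinary one-shot reader accepts the concatenated streams ...
roundtrip_plain = bz2.decompress(archive)
# ... and so does a reader that walks the streams individually.
roundtrip_streams = decompress_streams(archive)
```

Because every stream is self-contained, a parallel reader could decode the streams on separate CPUs and join the results in order; a file written as a single stream offers no such boundaries.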

First preview version of incremental dumps

by Petr Onderka

Hi, after a month of work on my GSoC project Incremental Dumps [1], I think I now have something worth sharing and talking about, though it's still far from complete. What the code can do now is read a pages-history XML dump and create the various kinds of dumps (pages/stub, current/history) in the new format from that. It can then convert a dump in the new format back to XML. The XML output is almost the same as existing XML dumps, but there are some differences [2]. The current state of the new format also now has a detailed specification [3] (this describes the current version; the format is still in flux and can change daily). If you want, you can also try running the code. [4] It's not production-quality yet (e.g. it doesn't report errors properly), but it should work. Compilation instructions are in the README file. Any comments or questions are welcome. Petr Onderka User:Svick [1]: http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps [2]: http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps/File_format/XML_… [3]: http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps/File_format/Spec… [4]: https://github.com/wikimedia/operations-dumps-incremental/tree/gsoc

10 years, 7 months

Extracted page abstracts for Yahoo

by Andreas Meier

Hello, there is a problem with the extracted page abstracts for Yahoo on the big wikis moved to the new infrastructure. During generation everything seems to be fine, but it ended with a 159 kB file. Another question: Why is this step not parallelized? Best regards Andreas Meier

10 years, 8 months

Suggested file format of new incremental dumps

by Petr Onderka

For my GSoC project Incremental data dumps [1], I'm creating a new file format to replace Wikimedia's XML data dumps. A sketch of how I imagine the file format will look is at http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps/File_format. What do you think? Does it make sense? Would it work for your use case? Any comments or suggestions are welcome. Petr Onderka [[User:Svick]] [1]: http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps

10 years, 8 months

migration to eqiad (ashburn dc) continued

by Ariel T. Glenn

Hello XML dump users, The workers that handle the rest of the wikis will finish what they are working on (whichever runs are in progress now) and then will terminate. When all are stopped I will be starting up these processes on hosts in the other data center. I'll be running 6 workers at a time which should put us in the time frame of 8-9 days between runs per wiki. We'll see how that goes. Ariel

10 years, 9 months

Namespaces in pages-meta-history.xml

by Johannes Daxenberger

Hi, I was wondering about the order/sorting of revisions inside the pages-meta-history dumps, especially with respect to the namespaces. Does the order of revisions in the dumps account for namespaces (e.g. are revisions from the Template namespace located towards the end of the dump?) or is the order bound to any other parameter which potentially influences the location of revisions from certain namespaces? I'm currently processing the (March 2013) dewiki dump. Regards, Johannes --- Johannes Daxenberger Doctoral Researcher | IT Administration Ubiquitous Knowledge Processing (UKP Lab) FB 20 Computer Science Department Technische Universität Darmstadt Hochschulstr. 10, D-64289 Darmstadt, Germany email: daxenberger(at)ukp.informatik.tu-darmstadt.de phone: [+49] (0)6151 16-6227, fax: -5455, room: S2/02/B111 www.ukp.tu-darmstadt.de Web Research at TU Darmstadt (WeRC) www.werc.tu-darmstadt.de

10 years, 9 months

Import logging

by A B

Hi guys, I'm trying to import the enwiki-pages-logging.xml into a MySQL database and I'm having a lot of trouble converting the XML into SQL statements. I'm using importDump.php to do this conversion, but I'm getting an error when the script tries to import a record with the following data: <logtitle>�0”1r¨©m¨¡l¨¡dev¨©-si�5õ1ha-n¨¡da-s¨±tra</logtitle> It seems to be an encoding problem, but I think I have everything correct. Does anybody have a suggestion? Thanks in advance. Mel

10 years, 9 months
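
For readers hitting a similar symptom: one common way for a title to end up looking like the one quoted above is that its UTF-8 bytes were decoded with a different character encoding somewhere between the dump and the database. The sketch below reproduces that effect. It is an assumption about the general failure mode, not a diagnosis of the message above; the title used is an illustrative string rather than data from the dump, and Latin-1 is chosen only because it makes the round trip easy to show (the encoding actually involved there is unknown).

```python
# Illustrative title with diacritics (not taken from the logging dump).
title = "nāda-sūtra"

# Correct bytes as they would appear in a UTF-8 encoded XML dump.
utf8_bytes = title.encode("utf-8")

# Decoding those bytes with the wrong encoding yields mojibake:
# every multi-byte character turns into two unrelated characters.
garbled = utf8_bytes.decode("latin-1")

# As long as the wrong decoding lost no bytes, it can be reversed.
restored = garbled.encode("latin-1").decode("utf-8")

print(repr(garbled))
print(restored)
```

If the stored text round-trips like this, the bytes themselves are intact and the thing to check is the character set configured for the connection or the tables, not the dump.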

wikipedia dump contents

by Xavier Vinyals Mirabent

Hi all, I've imported the data dumps into MySQL and after running some queries I've noticed the column "rev_len" in revision is empty and "page_len" in the pages table is always equal to 0 for every row. Can anybody tell me anything about this? I was really hoping to use this information. Best, Xavi -- Xavier Vinyals-Mirabent ----------------------------------------------------- Mail: 1925, 4th Street South 4-101 Hanson Hall Minneapolis, MN 55455-0462 Email: vinya002(a)umn.edu Office: 3-157 Hanson Hall Phone: +1 6126257837 Homepage: www.econ.umn.edu/~vinya002

10 years, 9 months

Namespace names

by Byrial Jensen

Hi, is there a file somewhere with a list of all namespace names and numbers for all the Wikimedia wikis? The list should at least have the canonical names, but preferably also any aliases. This is useful to find out which namespaces interwiki links and language links go to. The canonical names are in the siteinfo section of the XML dumps of each wiki, but not the aliases - and it is not practical to download complete dumps for all projects just to get namespace names. Regards, - Byrial

10 years, 9 months
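
As the message above notes, the namespace names (without aliases) sit in the siteinfo section at the very start of each XML dump, so they can be read without processing the rest of the file. The following is a minimal sketch of that idea using Python's standard library. The inline sample is a made-up, heavily shortened dump header written for this example, and `read_namespaces` is a hypothetical helper name.

```python
import io
import xml.etree.ElementTree as ET

# Made-up, shortened dump header used only for this illustration.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/" xml:lang="da">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <namespaces>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">Diskussion</namespace>
      <namespace key="10" case="first-letter">Skabelon</namespace>
    </namespaces>
  </siteinfo>
  <page><title>Forside</title></page>
</mediawiki>"""


def read_namespaces(stream):
    """Return {namespace number: name} from the siteinfo header of a dump.

    Parsing stops as soon as </siteinfo> is seen, so the pages that follow
    are never read.
    """
    namespaces = {}
    for _event, elem in ET.iterparse(stream, events=("end",)):
        # Drop the XML namespace prefix, which varies with the export version.
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag == "namespace":
            # The main namespace has an empty element, hence the fallback.
            namespaces[int(elem.get("key"))] = elem.text or ""
        elif tag == "siteinfo":
            break
    return namespaces


namespaces = read_namespaces(io.BytesIO(SAMPLE))
for number, name in sorted(namespaces.items()):
    print(number, repr(name))
```

With a real dump, the same function can be given a stream opened with `bz2.open(path, "rb")`; since reading stops after the header, only the first compressed block needs to be fetched and decoded. Aliases are not available this way, as the original message points out.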

dumps of "big" wikis now moved to other datacenter

by Ariel T. Glenn

The following wikis are now running out of our datacenter in Ashburn: eswiki, ptwiki, plwiki, ruwiki, jawiki, dewiki, frwiki, nlwiki, itwiki. Two worker processes handle these but each wiki is dumped in 4 parallel jobs. This means that we'll be back to shorter and more frequent runs for them all. A reminder that we will not be recombining the full history dumps into one file, as that would undercut the speed gains we make by parallelizing. As with the enwiki move, you won't see up-to-the-minute updates for these wikis on the html page, since the data must be synced over before updates can show up. In a few days we'll start working on the move of the rest of the wikis. Ariel

10 years, 9 months
