Xmldatadumps-l July 2014

xmldatadumps-l@lists.wikimedia.org

7 participants
7 discussions

by Richard Jelinek

Hi,

I don't know if this issue has come up already. In case it did and has been dismissed, I beg your pardon. In case it didn't...

I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used to compress the XML dumps instead of bzip2. Why? Because its sibling (pbunzip2) has a bug bunzip2 hasn't. :-)

Strange? Read on.

A few hours ago, I filed a bug report for pbzip2 (see https://bugs.launchpad.net/pbzip2/+bug/922804) together with some test results obtained a few hours before that.

The results indicate that bzip2 and pbzip2 are mutually compatible: each one can create archives that the other one can read. But when it comes to uncompressing, only pbzip2-compressed archives are good for pbunzip2.

I propose compressing the archives with pbzip2 for the following reasons:

1) If your archiving machines are SMP systems, this could lead to better usage of system resources (i.e. faster compression).

2) Compression with pbzip2 is harmless for regular users of bunzip2, so everything should run for these people as usual.

3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a speedup that scales nearly linearly with the number of CPUs in the host.

So to sum up: it's a no-lose, two-win situation if you migrate to pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that interesting? :-)

cheers,

--
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com
Managing Director: Richard Jelinek
Human Language Technology Experts
Registered office: Fürth
69216618 Mind Units
Register court: AG Fürth, HRB-9201
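The compatibility claim in the message above can be illustrated with a minimal Python sketch. It assumes (this is not stated in the message) that a pbzip2-style archive is a plain concatenation of independently compressed bzip2 streams. The helper names `compress_in_chunks` and `decompress_stream_by_stream` are hypothetical, and the sketch only mimics the file layout; it does not call pbzip2 itself.

```python
import bz2


def compress_in_chunks(data: bytes, chunk_size: int) -> bytes:
    """Compress each chunk as its own bzip2 stream and concatenate the streams.

    Assumption: this mimics the layout a parallel compressor would produce.
    """
    streams = [
        bz2.compress(data[start:start + chunk_size])
        for start in range(0, len(data), chunk_size)
    ]
    return b"".join(streams)


def decompress_stream_by_stream(blob: bytes) -> list:
    """Decode one bzip2 stream at a time and return the decoded chunks.

    Each stream is self-contained, which is what would let a parallel
    decompressor hand different streams to different CPUs.
    """
    chunks = []
    while blob:
        decompressor = bz2.BZ2Decompressor()
        chunks.append(decompressor.decompress(blob))
        if not decompressor.eof:
            raise ValueError("truncated bzip2 stream")
        # Bytes after the end of this stream belong to the next stream.
        blob = decompressor.unused_data
    return chunks
```

A reader that understands concatenated streams (Python's `bz2.decompress` does) recovers the original bytes in one call, while the per-stream helper shows that the pieces can also be decoded independently. An archive written as one single stream offers no such split points, which is consistent with the message's observation that only chunk-compressed archives benefit from parallel decompression.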

8 years, 3 months

ETA for dumps?

by Dennis During

The listing of dumps shows several recent aborted dumps. This is worrying as I am heavily committed to a program of new-entry creation to eliminate redlinks that depends on enwikt's dump. There seems to have been little progress in clearing up the problem. When might a smooth, predictable flow of dumps resume?

9 years, 9 months

"Extracted page abstracts for Yahoo" for Wikidata

by Amir Ladsgroup

Hello,

Wikidata dumps (e.g. this one <http://dumps.wikimedia.org/wikidatawiki/20140612/>) include an annoying extra file, the Yahoo abstracts. It is larger than 16 GB (mainly because it is not compressed), and because the content of Wikidata pages is stored as numbers and codes instead of wikitext (e.g. this <https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q42>), the abstract is usually the same as the whole page. Please fix it.

Best
--
Amir

9 years, 9 months

Re: [Xmldatadumps-l] incremental dump issues

by Federico Leva (Nemo)

wp mirror, 05/07/2014 06:11:
> Dear Federico,
>
> Thanks for the links. The advice on
> <https://meta.wikimedia.org/wiki/Data_dumps/ImportDump.php> I have
> already implemented. Bits of
> <https://www.mediawiki.org/wiki/Manual:Performance_tuning> are also
> implemented.
>
> I am not clear about ``setting proper l10n cache (on CDB).''

<https://www.mediawiki.org/wiki/Manual:Performance_tuning#Output_caching>
<https://www.mediawiki.org/wiki/Localisation#Caching>
<https://www.mediawiki.org/wiki/Manual:$wgCacheDirectory>
which I assume you already set as you did everything in [[Manual:Performance tuning]].

Nemo

> 1) Extension:LocalisationUpdate. I use the `LocalisationUpdate'
> extension, which populates `/cache/' with 295 files having names like
> `l10nupdate-<languagecode>.cache'. Is there something I should do with
> these?
>
> 2) interwiki.cdb. I am not sure what you mean by CDB. Are you referring
> to the `interwiki.cdb' file? This I just learned about today while
> trying to figure out what CDB means for `mediawiki'. I downloaded
> <http://noc.wikimedia.org/interwiki/interwiki.cdb> to see what it is. I
> guess that it is intended to replace the `interwiki' database table. Is
> there something I should do with this?
>
> Sincerely Yours,
> Kent

9 years, 9 months

incremental dump issues

by wp mirror

Dear Ariel,

I have begun to make use of the Incremental XML Data Dumps, and have a few questions.

0) Acronyms

For brevity, I shall coin two terms:

    xdump - XML Data Dump
    xincr - Incremental XML Data Dump

1) Checksums

The checksum files for the `xincr's are not formatted correctly, causing `md5sum' to throw an error. The correct format is:

    <checksum><two spaces><filename><newline>

    (shell)$ cat simplewiki-20140703-md5sums.txt
    d03f3a91ef0273eb814f39a1d13788cb c51f2bd5ef6bd42ce65cf4a7fca72400
    (shell)$ md5sum --check simplewiki-20140703-md5sums.txt
    md5sum: simplewiki-20140703-md5sums.txt: no properly formatted MD5 checksum lines found

2) maintenance/importDump.php

Since no incremental SQL files are provided, I cannot use `mwxml2sql' and must instead use `importDump.php'. However, I have encountered a few issues when using `importDump.php' on `xincr's.

2.1) Speed: Importation proceeds at less than 0.1 pages/sec. This means that, for the largest wikis (commonswiki, enwiki, wikidatawiki), importation cannot be completed before the `xincr' for the next day is posted.

2.2) Pauses: Normally, when running `top', I can see at least one CPU at near 100% for `php' and `mysql'. However, sometimes importation pauses for several minutes, with no apparent CPU or disk activity. I assume that there is a time-out somewhere that allows importation to proceed again. Any comments on this phenomenon would be most welcome.

2.3) Fails: Sometimes importation fails. I see this often with the `xincr's from `betawikiversity'. I have not yet isolated specific records that cause failure. But it raises the question: Is `importDump.php' still supported?

3) Tools

Can you please advise as to the best method for importing `xincr's? Is there another importation tool that you would recommend (one that is both supported and fast)?

Sincerely Yours,
Kent
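The checksum format described in point 1 can be checked with a small Python sketch. It follows only the format given in the message (32 hex digits, two spaces, filename); the helper names `parse_md5sums` and `md5_of_file` are hypothetical, and the filename used in the usage note is an illustrative placeholder, not a real dump file.

```python
import hashlib
import re

# Format from the message: <checksum><two spaces><filename>
LINE_RE = re.compile(r"^([0-9a-f]{32})  (.+)$")


def parse_md5sums(text: str) -> list:
    """Return (checksum, filename) pairs, or raise ValueError on a bad line."""
    entries = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"line {number} is not '<md5>  <filename>': {line!r}")
        entries.append((match.group(1), match.group(2)))
    return entries


def md5_of_file(path: str, block_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file without loading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

With these helpers, a line such as `d03f3a91ef0273eb814f39a1d13788cb  example.xml.bz2` parses into a pair, while the content shown in the message (two checksums and no filename) is rejected with a `ValueError`, mirroring the `md5sum --check` failure.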

9 years, 9 months

Stability of English Wikipedia PageIds

by Johannes Daxenberger

Hi, I recently came across several articles from the English Wikipedia for which the (page)ids in the dumps apparently have changed over time. E.g., in a dump from April 2011 as well as in another one from January 2012, the article "Marseille" had the id 71486, while the article with the same name currently (according to the MediaWiki API as well as in the May 2014 dump) has the id 40888948. Does anybody have an idea how this might have happened and whether this is a frequent phenomenon? Thanks, Johannes --- Johannes Daxenberger Doctoral Researcher | IT Administration Ubiquitous Knowledge Processing (UKP Lab) FB 20 Computer Science Department Technische Universität Darmstadt Hochschulstr. 10, D-64289 Darmstadt, Germany email: daxenberger(at)ukp.informatik.tu-darmstadt.de phone: [+49] (0)6151 16-6227, fax: -5455, room: S2/02/B111 www.ukp.tu-darmstadt.de Web Research at TU Darmstadt (WeRC) www.werc.tu-darmstadt.de
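The comparison described above (page id in an old dump versus the id the MediaWiki API reports now) could be automated roughly as follows. This is a sketch only: it assumes the standard `action=query` JSON response layout of the MediaWiki API, does not perform the HTTP request itself, and the helper names `build_pageid_query`, `extract_page_ids`, and `changed_ids` are hypothetical.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def build_pageid_query(titles: list) -> str:
    """Build an API URL asking for basic page info for the given titles."""
    params = {"action": "query", "titles": "|".join(titles), "format": "json"}
    return API_URL + "?" + urlencode(params)


def extract_page_ids(response: dict) -> dict:
    """Map title -> pageid from a decoded action=query JSON response.

    Pages without a 'pageid' key (e.g. missing pages) are skipped.
    """
    ids = {}
    for page in response.get("query", {}).get("pages", {}).values():
        if "pageid" in page:
            ids[page["title"]] = page["pageid"]
    return ids


def changed_ids(dump_ids: dict, current_ids: dict) -> dict:
    """Return title -> (old id, new id) for titles whose id differs."""
    return {
        title: (old_id, current_ids[title])
        for title, old_id in dump_ids.items()
        if title in current_ids and current_ids[title] != old_id
    }
```

Feeding `changed_ids` the ids from the message, `{"Marseille": 71486}` from the older dumps and `{"Marseille": 40888948}` from the API, flags the title as changed. Running this over all titles of two dumps would give a rough measure of how frequent the phenomenon is.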

9 years, 9 months

change to generated dumps index page

by Ariel T. Glenn

Folks will have noticed that the dumps index.html page generated by the monitor has changed. A bunch of content has been added, at the request of the legal team, and the css has changed, stealing from the static html page above it. If the new font sizes or whatever are too hard on the eyes or you want to tweak the layout a bit to make it more readable, see the file in this gerrit change: https://gerrit.wikimedia.org/r/#/c/143645/ and submit a patchset to puppet. Changes merged in the repository will take effect by the next puppet run, i.e. in about half an hour. Ariel

9 years, 9 months
