Hi,
I don't know if this issue has come up already - in case it has and was
dismissed, I beg your pardon. In case it hasn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804) together with some test
results gathered a few hours before that.
The results indicate that bzip2 and pbzip2 are mutually compatible:
each one can create archives that the other can read. But when it comes
to decompressing, only pbzip2-compressed archives work well with
pbunzip2.
I propose compressing the archives with pbzip2 for the following
reasons:
1) If your archiving machines are SMP systems, this could lead to
better use of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should run for these people as usual.
3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
host.
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that
interesting? :-)
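For illustration, a minimal sketch of the compatibility check behind
point 2 (assuming Python 3 with pbzip2 on the PATH; the file name is
just a placeholder): compress with pbzip2, then read the result back
with a plain single-threaded bzip2 decompressor.

#!/usr/bin/env python3
"""Sketch: compress a file with pbzip2 and verify that a standard
(single-threaded) bzip2 decompressor can read it back unchanged.
Assumes pbzip2 is installed and on the PATH; SOURCE is a placeholder."""
import bz2
import subprocess

SOURCE = "sample-dump.xml"        # hypothetical input file
COMPRESSED = SOURCE + ".bz2"

# Compress in parallel; -k keeps the original, -f overwrites old output.
subprocess.run(["pbzip2", "-k", "-f", SOURCE], check=True)

# Read the result back with Python's bundled bzip2 code, i.e. the same
# single-stream-capable decompressor ordinary bunzip2 users rely on.
with open(SOURCE, "rb") as original, bz2.open(COMPRESSED, "rb") as roundtrip:
    assert original.read() == roundtrip.read(), "round-trip mismatch"
print("pbzip2 output is readable by a standard bzip2 decompressor")

The reason pbunzip2 can only parallelize its own output is that pbzip2
writes many concatenated bzip2 streams (one per block of input), which
gives pbunzip2 split points to spread across CPUs; a single-stream
bzip2 archive offers no such split points.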
cheers,
--
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com | Managing Director: Richard Jelinek
Human Language Technology Experts | Registered office: Fürth
69216618 Mind Units | Commercial register: AG Fürth, HRB-9201
We've seen some kernel errors in the logs on the server that hosts the
dumps. In order to upgrade and reboot, we need the currently running
dumps to exit as soon as they complete the current phase of their runs.
I expect to complete this before the end of the week; if necessary I
will shoot existing processes to get it done. In the meantime as dumps
complete, new ones will not be started; thanks for your patience.
Ariel
Create a script that makes a request to Special:Export, using this category
as the feed:
https://en.wikipedia.org/wiki/Category:Candidates_for_speedy_deletion
More info: https://www.mediawiki.org/wiki/Manual:Parameters_to_Special:Export
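A minimal sketch of such a script (assuming the Python `requests`
package; it lists the category members through the API and then posts
the titles to Special:Export, with no continuation handling, so it only
fetches the first batch of members - treat the details as a starting
point, not a finished tool):

#!/usr/bin/env python3
"""Sketch: export the pages of a category via Special:Export.
Uses the public API to list category members, then posts the titles to
Special:Export. Parameter names follow the manual page linked above."""
import requests

WIKI = "https://en.wikipedia.org"
CATEGORY = "Category:Candidates_for_speedy_deletion"

def category_members(limit=50):
    """Yield page titles in CATEGORY using the MediaWiki API."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": CATEGORY,
        "cmlimit": limit,
        "format": "json",
    }
    resp = requests.get(WIKI + "/w/api.php", params=params, timeout=30)
    resp.raise_for_status()
    for member in resp.json()["query"]["categorymembers"]:
        yield member["title"]

def export(titles):
    """POST the titles to Special:Export and return the XML dump as text."""
    data = {
        "pages": "\n".join(titles),  # newline-separated page titles
        "curonly": "1",              # current revisions only
        "wpDownload": "1",
    }
    resp = requests.post(WIKI + "/wiki/Special:Export", data=data, timeout=60)
    resp.raise_for_status()
    return resp.text

if __name__ == "__main__":
    titles = list(category_members())
    with open("speedy-deletion-export.xml", "w", encoding="utf-8") as out:
        out.write(export(titles))
    print("exported %d pages" % len(titles))

Running it as-is should produce speedy-deletion-export.xml with the
current revisions of the first batch of pages in that category.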
2012/5/21 Mike Dupont <jamesmikedupont(a)googlemail.com>
> Well, I would be happy for items like this:
> http://en.wikipedia.org/wiki/Template:Db-a7
> would it be possible to extract them easily?
> mike
>
> On Thu, May 17, 2012 at 2:23 PM, Ariel T. Glenn <ariel(a)wikimedia.org> wrote:
> > There are a few other reasons articles get deleted: copyright issues,
> > personal identifying data, etc. This makes maintaining the sort of
> > mirror you propose problematic, although a similar mirror is here:
> > http://deletionpedia.dbatley.com/w/index.php?title=Main_Page
> >
> > The dumps contain only data publicly available at the time of the run,
> > without deleted data.
> >
> > The articles aren't permanently deleted, of course. The revision texts
> > live on in the database, so a query on toolserver, for example, could be
> > used to get at them, but that would need to be for research purposes.
> >
> > Ariel
> >
> > On Thu, 17-05-2012, at 13:30 +0200, Mike Dupont wrote:
> >> Hi,
> >> I am thinking about how to collect articles deleted based on the "not
> >> notable" criterion.
> >> Is there any way we can extract them from the MySQL binlogs? How are
> >> these mirrors working? I would be interested in setting up a mirror of
> >> deleted data, at least that which is not spam/vandalism, based on tags.
> >> mike
> >>
> >> On Thu, May 17, 2012 at 1:09 PM, Ariel T. Glenn <ariel(a)wikimedia.org> wrote:
> >> > We now have three mirror sites, yay! The full list is linked to from
> >> > http://dumps.wikimedia.org/ and is also available at
> >> > http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Curren…
> >> >
> >> > Summarizing, we have:
> >> >
> >> > C3L (Brazil) with the last 5 known good dumps,
> >> > Masaryk University (Czech Republic) with the last 5 known good dumps,
> >> > Your.org (USA) with the complete archive of dumps, and
> >> >
> >> > for the latest version of uploaded media, Your.org with http/ftp/rsync
> >> > access.
> >> >
> >> > Thanks to Carlos, Kevin and Yenya respectively at the above sites for
> >> > volunteering space, time and effort to make this happen.
> >> >
> >> > As people noticed earlier, a series of media tarballs per-project
> >> > (excluding commons) is being generated. As soon as the first run of
> >> > these is complete we'll announce its location and start generating them
> >> > on a semi-regular basis.
> >> >
> >> > As we've been getting the bugs out of the mirroring setup, it is getting
> >> > easier to add new locations. Know anyone interested? Please let us
> >> > know; we would love to have them.
> >> >
> >> > Ariel
> >> >
> >> >
> >>
> >>
> >>
> >
> >
> >
>
>
>
> --
> James Michael DuPont
> Member of Free Libre Open Source Software Kosova http://flossk.org
> Contributor FOSM, the CC-BY-SA map of the world http://fosm.org
> Mozilla Rep https://reps.mozilla.org/u/h4ck3rm1k3
>
>
--
Emilio J. Rodríguez-Posada. E-mail: emijrp AT gmail DOT com
Pre-doctoral student at the University of Cádiz (Spain)
Projects: AVBOT <http://code.google.com/p/avbot/> |
StatMediaWiki <http://statmediawiki.forja.rediris.es> |
WikiEvidens <http://code.google.com/p/wikievidens/> |
WikiPapers <http://wikipapers.referata.com> |
WikiTeam <http://code.google.com/p/wikiteam/>
Personal website: https://sites.google.com/site/emijrp/
Over the next few days you'll see the number of processes drop as jobs
on each host complete. I'll be restarting them on each host as they
finish; this is part of my work to get deployment to suck less.
Ariel
We now have three mirror sites, yay! The full list is linked to from
http://dumps.wikimedia.org/ and is also available at
http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Curren…
Summarizing, we have:
C3L (Brazil) with the last 5 known good dumps,
Masaryk University (Czech Republic) with the last 5 known good dumps,
Your.org (USA) with the complete archive of dumps, and
for the latest version of uploaded media, Your.org with http/ftp/rsync
access.
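As a rough illustration of scripting against a mirror, here is a sketch
of pulling one dump file over HTTP. The path layout follows the primary
dumps.wikimedia.org site; the mirror base URL and wiki name are
placeholders to swap for a mirror from the list above, and the Python
`requests` package is assumed.

#!/usr/bin/env python3
"""Sketch: download a pages-articles dump over HTTP from a mirror.
MIRROR and WIKI are placeholders; adjust them to the mirror and wiki
you actually want."""
import requests

MIRROR = "https://dumps.wikimedia.org"   # swap in a mirror base URL
WIKI = "simplewiki"                      # a small wiki, good for testing
PATH = "/%s/latest/%s-latest-pages-articles.xml.bz2" % (WIKI, WIKI)

with requests.get(MIRROR + PATH, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    with open(PATH.rsplit("/", 1)[-1], "wb") as out:
        for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
            out.write(chunk)
print("download complete")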
Thanks to Carlos, Kevin and Yenya respectively at the above sites for
volunteering space, time and effort to make this happen.
As people noticed earlier, a series of media tarballs per-project
(excluding commons) is being generated. As soon as the first run of
these is complete, we'll announce its location and start generating them
on a semi-regular basis.
As we've been getting the bugs out of the mirroring setup, it is getting
easier to add new locations. Know anyone interested? Please let us
know; we would love to have them.
Ariel
It's amazing there are already so many years available for download.
Especially the larger zips must have been somewhat time-consuming to
compile! It would be great if 2008 or pre-2006 packages became available in
the near future. It is a really interesting development to see
http://dumps.wikimedia.org interested not only in compiling large
packaged databases of various wikis as they stand today, but also more
historic content. In the past, the Internet Archive was the sole
distributor of older (historic) wiki packages.
Eventually, in the far future, there will have to be some sort of viable
mechanism for cloning all the images stored on Wikimedia, though for now
the Picture of the Year packages are very interesting for those more
interested in the pretty images of Wikipedia. The POTY images also make
great wallpaper packs!