Xmldatadumps-l December 2012

xmldatadumps-l@lists.wikimedia.org

5 participants
6 discussions

by Richard Jelinek

Hi,

I don't know if this issue has come up already; if it did and was dismissed, I beg your pardon. If it didn't...

I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used to compress the XML dumps instead of bzip2. Why? Because its sibling (pbunzip2) has a bug bunzip2 hasn't. :-) Strange? Read on.

A few hours ago, I filed a bug report for pbzip2 (see https://bugs.launchpad.net/pbzip2/+bug/922804) together with some test results obtained a few hours before that.

The results indicate that bzip2 and pbzip2 are mutually compatible: each can create archives that the other can read. But when it comes to uncompressing, only pbzip2-compressed archives are good for pbunzip2.

I propose compressing the archives with pbzip2 for the following reasons:

1) If your archiving machines are SMP systems, this could lead to better usage of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2, so everything should run as usual for these people.
3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a speedup that scales nearly linearly with the number of CPUs in the host.

So to sum up: it's a no-lose and two-win situation if you migrate to pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that interesting? :-)

cheers,

--
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com
Geschäftsführer: Richard Jelinek
Human Language Technology Experts
Sitz der Gesellschaft: Fürth
69216618 Mind Units
Registergericht: AG Fürth, HRB-9201
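The compatibility claim in the message above can be illustrated with a small sketch. It assumes (this is not stated in the message) that pbzip2 gets its parallelism by compressing fixed-size chunks independently and concatenating the resulting bzip2 streams. The sketch uses only Python's standard `bz2` module, not pbzip2 itself; the function name `parallel_bz2_compress` and its parameters are hypothetical.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def parallel_bz2_compress(data: bytes, chunk_size: int = 900_000, workers: int = 4) -> bytes:
    """Compress `data` as a sequence of independent, concatenated bzip2 streams.

    Assumption: this mimics the general approach of a parallel bzip2
    compressor; it is not pbzip2's actual implementation.
    """
    # Split the input into fixed-size chunks (one empty chunk for empty input).
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    # Compress every chunk on its own; pool.map preserves the input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        streams = list(pool.map(bz2.compress, chunks))
    # Concatenated bzip2 streams still form a valid multi-stream bzip2 file.
    return b"".join(streams)


if __name__ == "__main__":
    sample = b"<page><title>Example</title></page>\n" * 100_000
    blob = parallel_bz2_compress(sample, chunk_size=500_000)
    # A standard, single-threaded decoder reads the multi-stream result.
    assert bz2.decompress(blob) == sample
```

Because each stream is self-contained, an ordinary decoder simply reads them one after another, which would explain why point 2 holds, while a parallel decoder can hand separate streams to separate CPUs, as in point 3.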

8 years, 3 months

Processing French dump

by Benoit Lelong

Hi all, I am currently planning to process the latest French dump. I would like to ask if somebody has already found or used a good OpenNLP French sentence detection model. If so, please let me know where to find one. Thanks in advance, Best regards, Benoit.

11 years, 1 month

Re: [Xmldatadumps-l] [WP-MIRROR] Questions regarding Metalink and SPDY

by wp mirror

Dear List Members,

Does anyone know if the Wikimedia Foundation plans to support Metalink or SPDY for its dump files and/or image files? See the RFC references below.

WP-MIRROR downloads dump and image files to build a mirror of a set of wikipedias. WP-MIRROR 0.5 is feature complete. I am now looking for ways to optimize performance (i.e. reduce mirror build time). Were the WMF to support the above two protocols, downloads would be faster and require less time spent on validation.

Sincerely Yours,
Kent

On 12/29/12, Sumana Harihareswara <sumanah(a)wikimedia.org> wrote:
> Hello! I'm sorry, but I don't know the answer to these questions;
> perhaps you could email the dumps mailing list
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l ? My
> apologies.
>
>
> Sumana Harihareswara
> Engineering Community Manager
> Wikimedia Foundation
>
>
> On Sun, Dec 16, 2012 at 6:14 AM, wp mirror <wpmirrordev(a)gmail.com> wrote:
>> Dear Sumana,
>>
>> 1) Metalink. Does the Wikimedia Foundation have any plans to support
>> metalink for either its dump files or its image files?
>>
>> Documentation:
>> <http://tools.ietf.org/html/rfc5854>, "The Metalink Download Description
>> Format"
>> <http://tools.ietf.org/html/rfc6249>, "Metalink/HTTP: Mirrors and Hashes"
>>
>> 2) SPDY. Does the Wikimedia Foundation have any plans to support SPDY?
>>
>> Documentation: <http://www.chromium.org/spdy>
>>
>> 3) WP-MIRROR. We last communicated 2012-01-06 in regards to WP-MIRROR.
>>
>> Status: WP-MIRROR 0.5 is `feature complete', and works
>> `out-of-the-box' for the GNU/Linux distributions: Debian 7.0 (wheezy)
>> and Ubuntu 12.10 (quantal).
>>
>> Future: Attention is turning towards performance enhancement and
>> porting to other distributions.
>>
>> Homepage: <http://www.nongnu.org/wp-mirror/>
>>
>> Please give it a try. Feedback is most welcome.
>>
>> Sincerely Yours,
>> Kent
>
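To make the validation point concrete, here is a minimal sketch of how a client might read a Metalink (RFC 5854) description and check a downloaded file against the hash it carries. Everything specific in it is an assumption for illustration: the file name, the mirror URLs, and the helper names `parse_metalink` and `verify_sha256` are hypothetical, and Wikimedia is not known to publish such a document.

```python
import hashlib
import xml.etree.ElementTree as ET

# XML namespace defined by RFC 5854 for Metalink documents.
NS = {"ml": "urn:ietf:params:xml:ns:metalink"}

# Hypothetical payload standing in for a downloaded dump file.
EXAMPLE_PAYLOAD = b"example dump contents"

# Hypothetical Metalink document; file name and mirror URLs are placeholders.
EXAMPLE_METALINK = f"""<metalink xmlns="urn:ietf:params:xml:ns:metalink">
  <file name="example-dump.xml.bz2">
    <size>{len(EXAMPLE_PAYLOAD)}</size>
    <hash type="sha-256">{hashlib.sha256(EXAMPLE_PAYLOAD).hexdigest()}</hash>
    <url>http://mirror-a.example.org/example-dump.xml.bz2</url>
    <url>http://mirror-b.example.org/example-dump.xml.bz2</url>
  </file>
</metalink>"""


def parse_metalink(xml_text):
    """Return one dict per <file> element: name, SHA-256 hash, mirror URLs."""
    root = ET.fromstring(xml_text)
    files = []
    for file_el in root.findall("ml:file", NS):
        hash_el = file_el.find("ml:hash[@type='sha-256']", NS)
        files.append({
            "name": file_el.get("name"),
            "sha256": hash_el.text.strip() if hash_el is not None else None,
            "urls": [u.text.strip() for u in file_el.findall("ml:url", NS)],
        })
    return files


def verify_sha256(data, expected_hex):
    """Check downloaded bytes against the hash listed in the Metalink file."""
    return hashlib.sha256(data).hexdigest() == expected_hex.lower()


if __name__ == "__main__":
    entry = parse_metalink(EXAMPLE_METALINK)[0]
    print(entry["name"], entry["urls"])
    print("hash ok:", verify_sha256(EXAMPLE_PAYLOAD, entry["sha256"]))
```

With a description like this, a downloader could pick any listed mirror and confirm the result with a single hash comparison instead of fetching and parsing separate checksum files.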

11 years, 3 months

Only two dumps

by Andreas Meier

Hello, at the moment only 2 dumps of the bigger wikis are being produced. Best regards, Andreas

11 years, 4 months

encouraging mirror use (making list more visible)

by Ariel T. Glenn

I had an email exchange with one of the folks at our mirror sites about the low volume of traffic they are getting. Clearly we need to publicize this list better, bearing in mind that files on our mirrors may be a day behind the live site. I wouldn't think that a day's delay is very important in the grand scheme of things though.

So I'm looking for suggestions on how to best make the list of mirrors visible to dumps users/downloaders. This includes changes to [1] and [2] among other things. Bear in mind that 'best' also implies 'easy to do' or 'here is a patch' :-D

Ariel

[1] https://gerrit.wikimedia.org/r/gitweb?p=operations/dumps.git;a=blob;f=xmldu… (download page for all dumps showing each dump in order of completion)
[2] https://gerrit.wikimedia.org/r/gitweb?p=operations/dumps.git;a=blob;f=xmldu… (download page for a given dump)

11 years, 4 months

snapshot1 server outage

by Ariel T. Glenn

Snapshot1, which was running several dumps for 'big' wikis, fell over due to swapdeath today. While we investigate the issue, those jobs will be stalled. I'll send an update as soon as we have more info. Ariel

11 years, 4 months
