Xmldatadumps-l January 2016

xmldatadumps-l@lists.wikimedia.org

16 participants
5 discussions

Extracting featured article meta-history dumps

by Anmol Dalmia

Hello. I wish to extract the meta-history dumps of the articles currently listed under the various quality categories, such as Featured, Good, Stub, etc. Is there a list of such articles and their article IDs, or a tool that can crawl the site to produce these lists? Any help is highly appreciated. I have already spent months on this.

--
With Regards
ANMOL DALMIA
M Tech (Dual) Information Security, 2017
National Institute of Technology, Rourkela, India
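One possible way to build such a list is the MediaWiki API's `list=categorymembers` query, which returns page IDs and titles for a category. The following is only a sketch: the endpoint (English Wikipedia), the category name `Category:Featured articles`, and the helper names `category_members_url` and `parse_members` are assumptions for illustration, and no network request is made here.

```python
from urllib.parse import urlencode

# Assumption: English Wikipedia; other wikis expose the same API at their own host.
API_URL = "https://en.wikipedia.org/w/api.php"


def category_members_url(category, cmcontinue=None, limit=500):
    """Build a categorymembers query URL (hypothetical helper)."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": str(limit),
        "format": "json",
    }
    if cmcontinue:
        # Continuation token taken from the previous response, if any.
        params["cmcontinue"] = cmcontinue
    return API_URL + "?" + urlencode(params)


def parse_members(response):
    """Return ([(pageid, title), ...], next_token) from a decoded JSON response."""
    members = [
        (m["pageid"], m["title"])
        for m in response.get("query", {}).get("categorymembers", [])
    ]
    next_token = response.get("continue", {}).get("cmcontinue")
    return members, next_token


# Assumed category name; quality categories are named differently per wiki.
url = category_members_url("Category:Featured articles")

# Hand-written sample in the shape the API returns (not real data).
sample = {
    "continue": {"cmcontinue": "page|TOKEN|1", "continue": "-||"},
    "query": {"categorymembers": [{"pageid": 1, "ns": 0, "title": "Example"}]},
}
members, token = parse_members(sample)
```

Fetching `url`, decoding the JSON, and repeating with the returned token until none is left would yield the full list of page IDs, which could then be matched against the page IDs in a meta-history dump.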

8 years, 2 months

Old Dumps

by Christian Morbidoni

Hi all,

My name is Christian Morbidoni. I am a researcher at the University of Ancona, Italy, and I'm new to this list. We are using Wikimedia pageviews and dumps in our research experiments on temporal mining, and we are looking for an old dump (summer 2014) to match our testing period. In particular, we would like a complete list of all Wikipedia titles as of summer 2014. I see that the archived dumps (https://dumps.wikimedia.org/enwiki/) do not reach back to that date. Is there some way to obtain the old dumps, maybe upon request?

Any help is appreciated.

best,
Christian

8 years, 3 months

Re: [Xmldatadumps-l] [Wikitech-l] Wikipedia dumps

by Tilman Bayer

On Sun, Jan 10, 2016 at 4:05 PM, Bernardo Sulzbach <mafagafogigante(a)gmail.com> wrote:
> On Sun, Jan 10, 2016 at 9:55 PM, Neil Harris <neil(a)tonal.clara.co.uk> wrote:
> > Hello! I've noticed that no enwiki dump seems to have been generated so far
> > this month. Is this by design, or has there been some sort of dump failure?
> > Does anyone know when the next enwiki dump might happen?
>
> I would also be interested.
>
> --
> Bernardo Sulzbach
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l

CCing the Xmldatadumps mailing list <https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l>, where someone has already posted <https://lists.wikimedia.org/pipermail/xmldatadumps-l/2016-January/001214.ht…> about what might be the same issue.

--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB

8 years, 3 months

pbzip2 proposal

by Richard Jelinek

Hi,

I don't know if this issue has come up already; if it has and was dismissed, I beg your pardon. In case it hasn't...

I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used to compress the XML dumps instead of bzip2.

Why? Because its sibling (pbunzip2) has a bug that bunzip2 doesn't have. :-) Strange? Read on.

A few hours ago, I filed a bug report for pbzip2 (see https://bugs.launchpad.net/pbzip2/+bug/922804) together with some test results obtained a few hours before that. The results indicate that bzip2 and pbzip2 are mutually compatible: each one can create archives that the other one can read. But when it comes to uncompressing, only pbzip2-compressed archives are good for pbunzip2.

I propose compressing the archives with pbzip2 for the following reasons:

1) If your archiving machines are SMP systems, this could lead to better use of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2, so everything should work as usual for these people.
3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a speedup that scales nearly linearly with the number of CPUs in the host.

So to sum up: it's a no-lose, two-win situation if you migrate to pbzip2. And that's just because pbunzip2 is slightly buggy. Isn't that interesting? :-)

cheers,

--
Dipl.-Inf. Univ. Richard C. Jelinek PetaMem GmbH - www.petamem.com Geschäftsführer: Richard Jelinek Human Language Technology Experts Sitz der Gesellschaft: Fürth 69216618 Mind Units Registergericht: AG Fürth, HRB-9201
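The compatibility described above can be illustrated with a small sketch. It assumes (this is not stated in the post) that pbzip2 achieves its parallelism by compressing fixed-size chunks as independent bzip2 streams and concatenating them. The helper name `compress_multistream`, the chunk size, and the sample payload are hypothetical; Python's standard `bz2` module stands in for the command-line tools.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def compress_multistream(data, chunk_size=900_000, workers=4):
    """Compress each chunk as its own bzip2 stream and concatenate the results.

    Illustration only: mimics the assumed pbzip2 layout, not its actual code.
    """
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # Each chunk is independent, so the chunks can be compressed concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        streams = list(pool.map(bz2.compress, chunks))
    return b"".join(streams)


# Made-up payload standing in for an XML dump.
payload = b"<page><title>Example</title></page>\n" * 100_000
archive = compress_multistream(payload)

# An ordinary bzip2 decoder reads the concatenated streams one after another.
restored = bz2.decompress(archive)
```

Under that assumption, a plain decoder sees the archive as a sequence of valid streams, which would explain point 2, while a parallel decoder can hand each stream to a different CPU, which is the speedup in point 3. An archive written by plain bzip2 is a single stream, so there is nothing to split.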

8 years, 3 months

2016-01 dumps halted?

by gnosygnu

Hi all. Sorry if this is a known issue, but dumps look like they've stopped after 2016-01-04. See https://dumps.wikimedia.org/backup-index.html Is there an ETA on resumption? Or is something else at stake? Let me know if you need more info. Thanks.

8 years, 3 months
