Xmldatadumps-l April 2015

xmldatadumps-l@lists.wikimedia.org

4 participants
6 discussions

by Richard Jelinek

Hi,

don't know if this issue came up already - in case it did and has been dismissed, I beg your pardon. In case it didn't...

I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used to compress the XML dumps instead of bzip2. Why? Because its sibling (pbunzip2) has a bug bunzip2 hasn't. :-) Strange? Read on.

A few hours ago, I filed a bug report for pbzip2 (see https://bugs.launchpad.net/pbzip2/+bug/922804) together with some test results obtained a few hours before that.

The results indicate that bzip2 and pbzip2 are mutually compatible: each one can create archives the other one can read. But when it comes to uncompressing, only pbzip2-compressed archives are good for pbunzip2.

I propose compressing the archives with pbzip2 for the following reasons:

1) If your archiving machines are SMP systems, this could lead to better usage of system resources (i.e. faster compression).

2) Compression with pbzip2 is harmless for regular users of bunzip2, so everything should run for these people as usual.

3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a speedup that scales nearly linearly with the number of CPUs in the host.

So to sum up: it's a no-lose and two-win situation if you migrate to pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that interesting? :-)

cheers,

--
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com
Human Language Technology Experts
69216618 Mind Units
Managing Director: Richard Jelinek
Registered office: Fürth
Register court: AG Fürth, HRB-9201
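The compatibility described in the message above can be illustrated with a small sketch. It assumes (this is not stated in the message) that a pbzip2 archive is essentially a concatenation of independently compressed bzip2 streams, which is why a standard decompressor can read it and why the pieces can be decompressed in parallel. The sketch uses Python's standard `bz2` module rather than the pbzip2 or bzip2 command-line tools; the function names and the chunk size are hypothetical.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor
from typing import List


def compress_in_chunks(data: bytes, chunk_size: int = 900_000) -> List[bytes]:
    """Compress fixed-size chunks independently, one bzip2 stream per chunk.

    This mimics the assumed pbzip2 approach; chunk_size is a hypothetical value.
    """
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(bz2.compress, chunks))


def decompress_streams_in_parallel(streams: List[bytes]) -> bytes:
    """Decompress independent bzip2 streams concurrently and rejoin them in order."""
    with ThreadPoolExecutor() as pool:
        return b"".join(pool.map(bz2.decompress, streams))


# Hypothetical sample data standing in for a dump file.
sample = b"<page><title>Example</title></page>\n" * 50_000
streams = compress_in_chunks(sample)

# Concatenating the streams gives one multi-stream archive that a
# single-threaded decompressor reads as usual.
archive = b"".join(streams)
assert bz2.decompress(archive) == sample

# The same streams can also be decompressed independently of each other.
assert decompress_streams_in_parallel(streams) == sample
```

Note that the parallel path here works only because the stream boundaries are already known. A single stream written by plain bzip2 offers no such boundaries in this sketch, which matches the observation that only pbzip2-compressed archives benefit from parallel decompression.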

8 years, 3 months

Add/Change dumps skipped for April 19, 2015

by Hydriz Scholz

Hi all,

The Add/Change dumps for April 19, 2015 seem to be missing for all wikis. [1] Can someone have a look at what went wrong? The dumps are working fine for April 18, 2015 and April 20, 2015.

Thank you!

[1]: http://dumps.wikimedia.org/other/incr/

--
Best regards,
Hydriz Scholz

9 years

Re: [Xmldatadumps-l] [Analytics] Missing media file request counts dataset for April 14, 2015

by Christian Aistleitner

Hi,

[ re-arranged due to top-posting ]

On Sat, Apr 18, 2015 at 09:13:47AM -0400, Andrew Otto wrote:
> > On Apr 18, 2015, at 04:04, Hydriz Scholz <admin(a)alphacorp.tk> wrote:
> >
> > The media file request count files for upload.wikimedia.org has
> > got a missing file for April 14, 2015. [1] There should be a file
> > called "mediacounts.top1000.2015-04-14.v00.csv.zip", but it was
> > apparently not generated and skipped.
> >
> > Can someone look into this? Thank you.
>
> I have been fighting with some cluster issues all week, and will get this sorted out this coming week.

It seems the file appeared in the meantime.

Have fun,
Christian

--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3
4293 Gutau, Austria
Email: christian(a)quelltextlich.at
Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
---------------------------------------------------------------

9 years

Missing media file request counts dataset for April 14, 2015

by Hydriz Scholz

Hi all,

The media file request count files for upload.wikimedia.org are missing a file for April 14, 2015. [1] There should be a file called "mediacounts.top1000.2015-04-14.v00.csv.zip", but it was apparently skipped and never generated.

Can someone look into this? Thank you.

[1]: https://dumps.wikimedia.org/other/mediacounts/daily/2015/

--
Best regards,
Hydriz Scholz
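A gap like the one reported above can be spotted mechanically. The following is a minimal sketch that assumes every daily file follows the naming pattern quoted in the message (`mediacounts.top1000.YYYY-MM-DD.v00.csv.zip`). The function names are hypothetical, and the directory listing is a made-up in-memory example, not real data from the server.

```python
from datetime import date, timedelta
from typing import Iterable, List


def expected_names(start: date, end: date) -> List[str]:
    """Build the file name expected for each day from start to end, inclusive."""
    names = []
    day = start
    while day <= end:
        names.append(f"mediacounts.top1000.{day.isoformat()}.v00.csv.zip")
        day += timedelta(days=1)
    return names


def missing_names(present: Iterable[str], start: date, end: date) -> List[str]:
    """Return the expected file names that do not appear in the listing."""
    present_set = set(present)
    return [name for name in expected_names(start, end) if name not in present_set]


# Hypothetical listing for April 12-16, 2015 with April 14 left out.
listing = [
    name
    for name in expected_names(date(2015, 4, 12), date(2015, 4, 16))
    if "2015-04-14" not in name
]
print(missing_names(listing, date(2015, 4, 12), date(2015, 4, 16)))
# prints ['mediacounts.top1000.2015-04-14.v00.csv.zip']
```

In practice the listing would come from the directory index at the URL in [1]; fetching and parsing that index is left out here because its exact format is not described in this thread.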

9 years

english wikipedia dumps failed

by Alex Druk

Hi,

Many enwiki dumps are stuck: http://dumps.wikimedia.org/enwiki/20150403/

Could someone check and rerun them?

--
Thank you.
Alex Druk
alex.druk(a)gmail.com
(775) 237-8550 Google voice

9 years

Archived Dumps from 2008

by Yeshwanth C

Hi Everyone,

We are a couple of undergraduate students at IIT Bombay working on the entity linking problem, which is the process of annotating a piece of text with entities from a knowledge base. A common test set for this task comes from the Knowledge Base Population task of the Text Analysis Conference (TAC). The reference knowledge base for the task was extracted from an October 2008 dump of Wikipedia.

Unfortunately, when the TAC knowledge base was created, a lot of important information about the Wikipedia category hierarchy was lost, since it retains only the links between entity pages. Beyond this, the TAC knowledge base does not include the PageIDs of the entities extracted from Wikipedia, which makes matching the entities in TAC with the current version of Wikipedia hard due to renames and deletions.

We were wondering if there is any way we could gain access to a dump from October 2008. We found that the dump from January 2008 is not complete as far as the TAC knowledge base is concerned.

Any help will be greatly appreciated.

Thanks,
C. Yeshwanth

9 years
